Those of us who deal with big data have a tendency to describe working with it in cavalier terms. “Oh, I just grabbed the XML file, wrote a quick parser to turn it into CSV, bulk loaded it into MySQL, laid an API on top of it, and I was done.” The truth is that things very rarely go so well.


Real-world data is messy. Data doesn’t convert correctly the first time (or, often, the 10th time). File formats are invalid. The provided data turns out to be incomplete. Parser code that was so straightforward when written for the abstract concept of this data quickly turns into a series of conditionals to deal with all of the oddities of the real data.

I’ve been developing a state law parser for The State Decoded, which will store all U.S. state-level legal codes in a common format, smoothing over their collective differences to provide them via a standardized API. Or so goes the theory of how such a system could work. The reality is that state laws themselves are too messy to standardize entirely, and the data formats in which states provide those laws are, in turn, too messy to import easily.

As a case study of the messiness of real-world data, here are some of the challenges I’ve encountered in parsing state legal codes.

Encoding Errors

A precious few states provide their state codes as bulk data. While it might seem like a real gift to get an XML file of every state law, if the XML is invalid, then that’s really more of a white elephant. One state provided me with SGML that LexisNexis had, in turn, provided to them. It was riddled with hundreds of errors, and could not be parsed automatically. After hours of attempting to fix problems by hand, I finally threw in the towel. Weeks later, LexisNexis provided a corrected version, and my work could continue.
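A cheap first step with any bulk file is to find out whether it parses at all before writing any import code. Here is a minimal sketch in Python, using the standard library's xml.etree.ElementTree. It assumes the input is XML rather than SGML, and it only checks well-formedness, not validity against a schema. The function name check_well_formed and the sample strings are invented for illustration; this is not the tooling used on The State Decoded.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses, else a (line, column, message) tuple."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # err.position holds the (line, column) where the parser gave up.
        line, column = err.position
        return (line, column, str(err))
    return None


# Invented sample inputs: the second one never closes its <section> tag.
good = "<law><section>18.2-1</section></law>"
bad = "<law><section>18.2-1</law>"

print(check_well_formed(good))
print(check_well_formed(bad))
```

The parser stops at the first problem, so a file with hundreds of errors means hundreds of fix-and-rerun cycles, which is about how that afternoon felt.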

Often, working with big data means doing path-breaking work. Sometimes nobody has ever before attempted to do anything with the data sets in question. Assembled by well-meaning but inexperienced people, those data sets may consequently be encoded incorrectly.

Changing Realities

State laws are occasionally restructured, renumbering huge portions of the code. Virginia’s entire criminal code, Title 18.1, became Title 18.2 about 15 years ago. No redirects exist in 18.1, no pointers to the new location, no sign of what was. One must simply know that it changed. Court cases, articles from legal journals, and attorneys general’s opinions that cited sections of code within 18.1 are thus either useless or must be passed through a handcrafted filter to point each citation to the new section number.
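To make "handcrafted filter" concrete, here is one rough way such a filter might look in Python. The mapping table is entirely hypothetical: the section pairs below are invented for illustration and are not real Virginia renumberings, and the names RENUMBERED and update_citations are my own. The sketch rewrites citations it knows about and reports the ones it doesn't, so that gaps in the table get noticed rather than silently passed through.

```python
import re

# Hypothetical old-to-new section mapping. These pairs are invented for
# illustration and are NOT real Virginia renumberings.
RENUMBERED = {
    "18.1-10": "18.2-30",
    "18.1-55": "18.2-119",
}

# Matches section numbers such as 18.1-10 or 18.1-55.2 within Title 18.1.
CITATION = re.compile(r"\b18\.1-\d+(?:\.\d+)?\b")


def update_citations(text):
    """Rewrite known old citations; collect the ones with no mapping."""
    unmapped = []

    def swap(match):
        old = match.group(0)
        if old in RENUMBERED:
            return RENUMBERED[old]
        unmapped.append(old)  # leave the text alone, but report it
        return old

    return CITATION.sub(swap, text), unmapped


new_text, unmapped = update_citations("See § 18.1-10 and § 18.1-99.")
print(new_text)
print(unmapped)
```

The hard part is not the code. It is building the mapping table, which somebody has to assemble by hand.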

It would be nice if reality would consent to remain static to ease the process of cataloging it. But the world changes, and data reflects those changes. That can make it awfully frustrating to parse and apply that data, but that’s just the price of admission.

Inconsistencies in the Data

There are at least a few states that violate their standard state code structure. They might structure their code by dividing it into titles, each title into chapters, and each chapter into sections — except, sometimes, when chapters are called “articles.” Why do they do this? I have no idea. If lawmakers consulted with database developers prior to recodifying their state’s laws, no doubt our legal codes would be normalized properly.

These inconsistencies might be illogical, but they’re how it is, and must be reflected in the final application of the data. This can be particularly frustrating if the provided bulk data doesn’t record these inconsistencies internally, requiring the gathering of external information to be applied to the data, as is often the case.
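One way to apply that external information is to keep the exceptions in a small table beside the parser rather than scattering conditionals through it. The following Python sketch assumes a state whose code normally runs title, chapter, section, with certain titles using “article” for the second level. The title number, identifiers, and names (DEFAULT_LEVELS, LABEL_OVERRIDES, describe_path) are all hypothetical.

```python
# The structure the state's code is supposed to follow.
DEFAULT_LEVELS = ["title", "chapter", "section"]

# Hypothetical exceptions gathered from outside the bulk data: within
# these titles, the second level is labeled "article", not "chapter".
LABEL_OVERRIDES = {
    "12": {"chapter": "article"},
}


def describe_path(title, identifiers):
    """Pair each identifier with the label that applies within this title."""
    overrides = LABEL_OVERRIDES.get(title, {})
    labels = [overrides.get(level, level) for level in DEFAULT_LEVELS]
    return list(zip(labels, identifiers))


print(describe_path("5", ["5", "1", "5-1-2"]))
print(describe_path("12", ["12", "3", "12-3-1"]))
```

Each newly discovered oddity then becomes a new row in the table instead of another branch in the parser.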

Missing Data

One state’s code contains periodic parenthetical asides like “See Editor’s note.” What does the editor’s note say? There’s no way to tell: it’s not part of the bulk data or the state’s official website for its legal code. Those editor’s notes will have to be obtained from the state’s code commission, and they are likely to be delivered in the form of a bunch of Word files attached to an e-mail.
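Until the missing notes arrive, the least one can do is know exactly which sections are waiting on them. Here is a small Python sketch of that bookkeeping. It assumes the asides appear in parentheses with roughly the wording quoted above, and that sections and notes are held in dictionaries keyed by section number. The sample data and the name sections_missing_notes are invented for illustration.

```python
import re

# Matches parenthetical asides such as "(See Editor’s note.)", allowing
# either a straight or a curly apostrophe. The exact wording is an assumption.
NOTE_REF = re.compile(r"\(\s*See Editor['’]s note\.?\s*\)", re.IGNORECASE)


def sections_missing_notes(sections, notes):
    """Return IDs of sections that cite an editor's note we don't have."""
    return [
        section_id
        for section_id, text in sections.items()
        if NOTE_REF.search(text) and section_id not in notes
    ]


# Invented sample data.
sections = {
    "1-1": "Text of the law. (See Editor’s note.)",
    "1-2": "Text with no aside.",
    "1-3": "More text (see Editor's note)",
}
notes = {"1-3": "Note text obtained from the code commission."}

print(sections_missing_notes(sections, notes))
```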

Not all data exists electronically, and not all data that exists electronically exists in a single location. Often, piecing together a meaningful data set requires gathering information from disparate sources, sometimes in awkward ways. And sometimes the last few bits of data just aren’t available, and the data set is going to have to be incomplete.

All of these problems are solvable, in one way or another, but those solutions are often time-consuming. The Pareto principle applies here: one is liable to get 80% of the data set whipped into shape in the first 20% of the time spent on the project. The remaining 20% of the data will require the remaining 80% of the time. That first 80% feels magical, with everything just falling into place, but that last 20% is just plain hard work.

Real-world data is messy. Working with big data means cleaning it up.
