In my intro news writing class at UNC-Chapel Hill we ding students 50 points for allowing a fact error to creep into a story. If the story says the fire happened at 123 W. Main St. and it really happened at 123 E. Main St., it's an instant F. But I've learned from working on OpenBlock that online maps seem to have been put into the same category as horseshoes and hand grenades.
Which creates an interesting problem for copy editors and AP style sticklers alike -- how do you ensure that, in a county with 38,000 addresses, each shows up on the map "close enough?"
For several months earlier this year I struggled with increasing our precision and automation of mapping addresses using OpenBlock in rural areas. We've made several technical adjustments to the application's code, but we're now at the point where spending more time with the code will be more expensive than turning the remaining errors over to a human to fix one-by-one.
What's "close enough" for us? Well, when an address of a news event goes into OpenBlock it will fail 5-25 percent of the time, depending on the source of the data. So the question is not whether we can weed out all errors -- as it is in my news writing class -- but how to make sure that when addresses fail they do so in the right way.
hit rate and precision
Geocoding accuracy has two components -- the hit rate and precision. The hit rate measures how often you type an address and get back any sort of location on a map. Google Maps, for example, has a very high hit rate. But it's not always accurate. In fact, if you're looking for the Burger King in Chadbourn, N.C., Google will erroneously send you across the state line into South Carolina.
So that's a hit, but not a very precise one. Google puts the Burger King in another state because it uses the address to guess where the restaurant might fall along a very long road. OpenBlock is much less willing to guess like that, so it has a lower hit rate. And we want any address that fails to end up in the hands of an editor -- so that she or he can check the problem and correct it.
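The hit rate described above can be sketched in a few lines of Python. This is only an illustration, not OpenBlock's code: the `geocode` callable, the `toy_geocoder` stand-in and its coordinates are all hypothetical. The sketch assumes a geocoder that returns `None` when it declines to guess, so that misses can be handed to an editor.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


def split_hits_and_misses(
    addresses: Iterable[str],
    geocode: Callable[[str], Optional[Point]],
) -> Tuple[Dict[str, Point], List[str]]:
    """Run every address through the geocoder and separate hits from misses."""
    hits: Dict[str, Point] = {}
    misses: List[str] = []
    for address in addresses:
        point = geocode(address)
        if point is None:
            misses.append(address)  # goes to a human editor for review
        else:
            hits[address] = point
    return hits, misses


def hit_rate(hits: Dict[str, Point], misses: List[str]) -> float:
    """Share of addresses that came back with any location at all."""
    total = len(hits) + len(misses)
    return len(hits) / total if total else 0.0


def toy_geocoder(address: str) -> Optional[Point]:
    """Hypothetical stand-in geocoder with one made-up known address."""
    known = {"123 E. Main St.": (34.33, -78.70)}  # made-up coordinates
    return known.get(address)


hits, misses = split_hits_and_misses(
    ["123 E. Main St.", "999 Nowhere Rd."], toy_geocoder
)
rate = hit_rate(hits, misses)
print(f"hit rate: {rate:.0%}; for the editor: {misses}")
```

Note that the hit rate says nothing about whether the hits are in the right place. That is the precision question.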
The higher the hit rate, the more editors have to be concerned with precision. This isn't a topic I've heard widely discussed at conferences or in newsrooms. I suspect that if they are anything like I was a few months ago, most editors are blissfully unaware of just how imprecise their geocoding is and how many unchecked fact errors might be surfacing on their sites.
We checked our address precision for the pilot installation of OpenBlock Rural by getting a shapefile from Columbus County's GIS office that provides latitude and longitude coordinates for all addresses in the county. We then took the list of addresses from that file and ran them through OpenBlock's geocoder, which gave us another set of latitude and longitude coordinates. We ran the two sets of points through a tool for calculating the distance between each pair.
With that dataset we found, for example, that the point the county says is "235 Hammond Drive" is about 3 miles away from the point that OpenBlock's geocoder thinks is "235 Hammond Drive." OpenBlock puts that address on a map, but not with very much precision.
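For readers who want to try the comparison themselves, here is a minimal sketch of the distance step. It is not the tool we used. It assumes both point sets are plain latitude/longitude pairs in decimal degrees and uses the haversine great-circle formula, which is accurate enough at county scale. The function name and constant are illustrative.

```python
import math

# Mean Earth radius (about 6,371 km) expressed in feet.
EARTH_RADIUS_FEET = 20_902_231


def haversine_feet(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in feet between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_FEET * math.asin(math.sqrt(a))


# Sanity check: one degree of latitude is roughly 69 miles.
one_degree = haversine_feet(34.0, -78.7, 35.0, -78.7)
print(f"{one_degree / 5280:.1f} miles")
```

Feeding it the county's point and the geocoder's point for the same address gives the error for that address in feet.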
I'll come back to Hammond Drive in a bit, but first I want to describe some of the other ways we've tried to determine whether our automatic-F-in-intro-newswriting level of accuracy is tolerable.
Of our more than 38,000 addresses ...
- Only 9 percent of the points are within a foot or less of the point given by the county for that address. That doesn't seem very good.
- Garmin GPS advertises accuracy within about 50 feet. Only 16 percent of our geocoding is that accurate. Our audience isn't driving around, so maybe we can afford to be less precise than that.
- Google's point for "235 Hammond Drive" is 482 feet away from the front doorstep of the house at that address. About 94 percent of our addresses are at least that close to the points at which the county places them.
- Our median error was 118 feet, which is better than the median error of 127 feet found in a similar test of rural addresses using ESRI's ArcGIS software, the gold standard of geocoding.
So I'm not going to fuss over each address like I would over each fact in a story. Using these industry benchmarks we're going to triage the most imprecise addresses and worry first about those.
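The benchmarks and the triage can be expressed as a short script. This is a sketch under stated assumptions: it takes a mapping of address to error in feet (however that was computed), and the sample values are made up for illustration rather than drawn from the Columbus County data. The thresholds of 1, 50 and 482 feet mirror the benchmarks listed above.

```python
from statistics import median
from typing import Dict, List, Sequence, Tuple


def share_within(errors_feet: Sequence[float], threshold_feet: float) -> float:
    """Fraction of addresses whose error is at or under the threshold."""
    if not errors_feet:
        return 0.0
    close = sum(1 for error in errors_feet if error <= threshold_feet)
    return close / len(errors_feet)


def summarize(
    errors_feet: Sequence[float],
    thresholds: Sequence[float] = (1, 50, 482),
) -> Dict[str, object]:
    """Median error plus the share of addresses within each benchmark distance."""
    return {
        "median_feet": median(errors_feet),
        "share_within": {t: share_within(errors_feet, t) for t in thresholds},
    }


def worst_first(
    errors_by_address: Dict[str, float], limit: int
) -> List[Tuple[str, float]]:
    """The least precise addresses, largest error first, for an editor to check."""
    ranked = sorted(errors_by_address.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Made-up sample errors in feet, keyed by placeholder address labels.
sample = {"A": 0.5, "B": 40, "C": 118, "D": 300, "E": 15840}
summary = summarize(list(sample.values()))
queue = worst_first(sample, limit=2)
print(summary)
print(queue)
```

Sorting by error puts the 3-mile misses at the top of the editor's list and leaves the 40-foot misses for later, if ever.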
And all errors won't be created equal. Going back to our "235 Hammond Drive" example, a user who searches for that address would get two results: "235 Hammond St., Fair Bluff, NC 28439" and "235 Hammond, Cerro Gordo environs, NC 28439."
The first problem, you might quickly point out, is that the search term itself is ambiguous -- why wouldn't someone simply search again and include the city this time? The bottom line is we have no data on how often people search on street alone, as opposed to other permutations of the complete address. A good copy editor should be able to work with developers to produce a test that would give us that data.
The other difficulty -- and this is going to be trouble more often in rural areas than urban -- is that the second address is outside the Fair Bluff city limits, but has a Fair Bluff mailing address. So the most likely alternate search that a user would perform seems like it would be "235 Hammond Drive, Fair Bluff, NC, 28439." That search would yield only the first result shown here. And it would be the wrong one.
So copy editors who work with online maps have to have enough local knowledge so that when they look at a map of 235 Hammond Drive, they know it's out in the country and not downtown.
the copy editor/debugger
Copy editors also have to be good debuggers. Again, they have to think about breaking the problem down into testable parts. The first problem is that deep in the OpenBlock code, a human made the correct assumption that sometimes people make mistakes -- and that when they type "Hammond Drive" they really might mean "Hammond Street." What in this case looks like a bug is often a feature. Fix this problem and you'll break a solution.
The other possible problem is that the underlying geographic data is incorrect. The Census Bureau's TIGER/Line Shapefiles are the most common source for geographic data, but in rural areas the Census omits many address ranges for street segments. In this particular case, it had "Hammond Drive" in the wrong ZIP code. The true ZIP code is 28430.
So instead of using Census data -- which is a key assumption of OpenBlock -- we're now using a shapefile given to us by Columbus County. But only 49 of North Carolina's 100 counties provide such information on their websites. This local data source improves the completeness and accuracy of our underlying geographic data, but will make it more expensive to deploy OpenBlock. We won't be able to rely on a standardized script to import geographic data if each county's file format is different.
Geocoding in OpenBlock has been a good warning to me, and a reminder to newsrooms everywhere that the copy editor/debugger is going to be at least as valuable as the reporter/programmer. And that means I need to figure out how to teach that skill to my students at UNC.
How does your newsroom provide quality control for its online maps? What's your hit rate? And precision? Share your experiences in the comments here or by tweeting to @OpenRural.