Jonathan Stray

by Jonathan Stray

There are some amazing algorithms coming out of the computer science community which promise to revolutionize how journalists deal with large quantities of information. But building a tool that journalists can use to get stories done takes a lot more than algorithms. Closing this gap has been one of the most challenging and rewarding aspects of [...]

Before computers, all document-driven stories started with a big stack of paper. Often, the first task was to organize all that paper by sorting individual documents into piles by type. This gives journalists a high-level idea of “what’s in there” and helps them decide what to read more closely and, just as importantly, what [...]

Overview produces intricate visualizations of large document sets: beautiful, but what do they mean? These visualizations are saying something about the documents, which you can interpret if you know a little about how they’re plotted.

Same documents, different visualizations

There are two visualizations in the current prototype version of Overview, and both are based [...]

Reporting on an incident where private security contractors fired at civilians in Iraq is one thing, but reporting on all such incidents is something else entirely. That’s the situation we were faced with when, in reporting on the role of private security firms in Iraq, we wanted to analyze 4,500 pages of recently declassified material [...]

Overview is a project to create an open-source document-mining system for investigative journalists and other curious people. We’ve written before about the goals of the project, and we’re developing some new technology, but mostly we’re stealing it from other fields. The following are some of the best ideas we saw in 2011, the data-mining work [...]

The Overview project is an attempt to create a general-purpose document set exploration system for journalists. But that’s a pretty vague description. To focus the project, it’s important to have a set of test cases: real-world problems that we can use to evaluate our developing system. In many ways, the test cases define the [...]

Over the last year, my colleagues and I at The Associated Press have been exploring visualizations of very large collections of documents. We’re trying to solve a pressing problem: We have far more text than hours to read it. Sometimes a single Freedom of Information request will produce a thousand pages, to say nothing of [...]