Over the last year, my colleagues and I at The Associated Press have been exploring visualizations of very large collections of documents. We're trying to solve a pressing problem: We have far more text than hours to read it. Sometimes a single Freedom of Information request will produce a thousand pages, to say nothing of the increasingly common WikiLeaks-sized dumps of hundreds of thousands of documents, or huge databases of public documents.
Because reading every word is impossible, a large data set is only as good as the tools we use to access it. Search can help us find what we're looking for, but only if we know what we are looking for. Instead, we've been trying to make "maps" of large data sets, visualizations of the topics or locations or the interconnections between people, dates, and places. We've had a few notable successes, such as our visualization of the Iraq war logs.
But frankly, this has been a slow process, because the tools for large-scale text analysis are terrible. Existing programs break when faced with more than a few thousand documents. More powerful software exists, but only in component form. It requires lots of programming to get a useful result.
All eyes on data visualization
Meanwhile, DIY visualization thrives. At the Eyeo festival in Minneapolis this summer, I was overwhelmed by the vibrant community that has formed around data visualization. Several hundred people sat in a room and listened raptly to talks by data artist Jer Thorp, social justice visualizer Laura Kurgan, the measurement-obsessed Nick Felton, and many others. Suddenly, a great many people are enthusiastically making images from code and data.
The weapon of choice for this community is Processing, a language designed specifically for interactive graphics by Ben Fry and Casey Reas (both of whom were at Eyeo). Creative communities thrive on good tools; think of Instagram, Instructables, or Wikipedia.
We want Overview to be the creative tool for people who want to explore text visualization -- "investigative journalists and other curious people," as our grant application put it.
The algorithms that our prototypes use are old by tech standards, dating mostly from information retrieval research in the '80s. But then, the algorithms that the resurgent visualization community is implementing in Processing are mostly old, too; I coded many of them in C++ in the early 1990s when I was learning computer graphics programming. Today, one doesn't have to learn C++ to make pictures with algorithms. The Processing programming environment takes care of all the hard and boring parts and provides a simple, lightweight syntax. It's a visualization "sketching" system, tailor-made for the rapid expression of visual ideas in code.
No such programming environment exists for visualizing the text content of large document sets. First, you have to extract some sort of meaning from the language. Natural language processing has a long history and is advancing rapidly, but the available toolkits still require a huge amount of specialist knowledge and programming skill.
Big data also requires many computers running in parallel, and while there are now wonderful components such as distributed NoSQL stores and the Hadoop map-reduce framework, it's a lot of work to assemble all the pieces. The current state of the art simply doesn't lend itself to experimentation. I'd love for people with modest technical ability to be able to play around with document set visualizations, but we don't have the right tools.
This is the hole that we'd like Overview to fill. There are certain key problems, such as email visualization, that we know Overview has to solve. But we'd like to solve them by building a sort of text visualization programming system. The idea is to provide basic text processing operations as building blocks, letting the user assemble them into algorithms. It should be easy to recreate classic techniques, or invent new ones by trial and error. The distributed storage and data flow should be handled automatically behind the scenes, as much as possible.
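To make the building-block idea concrete, here is a rough sketch in Python of what assembling small text operations into an algorithm might look like. Nothing here is Overview's actual design or API: the function names (tokenize, term_counts, tfidf, cosine, pipeline, per_document) are hypothetical, the sample documents are invented, and TF-IDF weighting with cosine similarity is simply one classic information retrieval technique chosen for illustration. It runs in a single process and ignores the distributed storage question entirely.

```python
import math
import re
from collections import Counter
from functools import reduce


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_counts(tokens):
    """Count how often each term appears in one document."""
    return Counter(tokens)


def tfidf(count_vectors):
    """Reweight raw counts so terms found in fewer documents count for more."""
    n_docs = len(count_vectors)
    doc_freq = Counter(term for vec in count_vectors for term in vec)
    return [
        {term: count * math.log(n_docs / doc_freq[term]) for term, count in vec.items()}
        for vec in count_vectors
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pipeline(*steps):
    """Chain building blocks into one function, applied left to right."""
    return lambda data: reduce(lambda value, step: step(value), steps, data)


def per_document(step):
    """Lift a single-document block so it runs over a whole collection."""
    return lambda docs: [step(doc) for doc in docs]


# Assemble the blocks into one "algorithm": raw text in, weighted vectors out.
vectorize = pipeline(per_document(tokenize), per_document(term_counts), tfidf)

# Invented sample documents, for illustration only.
docs = [
    "Convoy attacked near Baghdad",
    "Roadside bomb hits convoy near Baghdad",
    "City council approves school budget",
]
vectors = vectorize(docs)

print(cosine(vectors[0], vectors[1]))  # the two documents share several terms
print(cosine(vectors[0], vectors[2]))  # these two share no terms
```

The point of the sketch is the last few lines: swapping tfidf for a different weighting, or adding a stemming step between tokenize and term_counts, means changing one argument to pipeline instead of rewriting the program. That kind of cheap trial and error is what we'd like to make possible.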
That's an ambitious project, and we are going to have to scale it down. Perhaps the first version of Overview won't be as expressive or efficient as we'd like; we are explicitly prioritizing useful solutions to real problems over elegant tools that can't be used for actual analysis. By the end of our Knight Foundation grant, Overview has to solve at least one difficult and essential problem in data journalism.
But ultimately, what we intend to build is a sketching system for visualizing the content and meaning of large collections of text documents -- big text, as opposed to big data. Just as the Processing language has been a great enabler of the DIY visualization community, we hope that Overview will give interested folks a simple way to play with lots of different text processing techniques -- and that we'll all learn some interesting things from mining our ever-increasing store of public documents.