Reporting on an incident where private security contractors fired at civilians in Iraq is one thing, but reporting on all such incidents is something else entirely.

That’s the situation we faced when, while reporting on the role of private security firms in Iraq, we wanted to analyze 4,500 pages of recently declassified material: the raw reports generated every time a security contractor working for the U.S. Department of State fired a weapon in Iraq from 2005 to 2007. There was more material than we could possibly read on deadline, so we used our prototype Overview document-mining system to visualize its major topics and themes.

Overview analyzes the text of each document, extracts key words, and then tries to cluster documents together based on similar key words, which usually indicates similar topics. You can see how the process works in this video:
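To make the idea concrete, here is a minimal sketch of keyword-based clustering in Python. It is not Overview's actual code or algorithm: it assumes TF-IDF weighting for the key words, cosine similarity for comparing documents, and a simple greedy grouping rule. The function names, the stopword list and the `threshold` value are all illustrative choices.

```python
import math
import re
from collections import Counter

# A tiny illustrative stopword list; a real system would use a much longer one.
STOPWORDS = {"the", "a", "an", "at", "of", "and", "to", "in", "was", "were", "on", "by", "for"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one {word: weight} dict per document, scaled to unit length."""
    counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter(word for c in counts for word in c)
    n = len(docs)
    vectors = []
    for c in counts:
        # Smoothed inverse document frequency: rarer words get larger weights.
        vec = {w: tf * (math.log((1 + n) / (1 + doc_freq[w])) + 1) for w, tf in c.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors.append({w: v / norm for w, v in vec.items()} if norm else {})
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def cluster(docs, threshold=0.2):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    vectors = tfidf_vectors(docs)
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Calling `cluster(list_of_texts)` returns groups of document indices. Documents that share distinctive words land in the same group, which is the intuition behind "similar key words usually indicate similar topics."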

Using the Overview prototype for document mining from Jonathan Stray on Vimeo.

As we reported in our full story, the documents show that these contractors mostly fired at approaching civilian vehicles to protect U.S. motorcades from the threat of suicide bombers. The documents also show how often shots were fired, and provide a window into how State Department oversight of security contractors tightened during the war. We found 14 incidents in which an Iraqi was injured by shots fired by a security contractor, including 10 deaths, of which we suspect six were previously unreported.

Reporting was a multi-stage process. First, we used a small custom Ruby script to split the large PDF files into individual documents by detecting cover pages. (This is one of the major document-mining problems that Overview wants to solve.) Then we used Overview to quickly visualize, explore and tag the resulting 666 documents. We also did some random sampling to determine the prevalence of various kinds of events, such as how often there was a follow-up investigation. Finally, we followed up with interviews and data from the State Department to put these events in context and answer some unresolved questions.
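The splitting and sampling steps can be sketched roughly as follows. This is an illustration in Python, not the Ruby script we used. It assumes the PDF text has already been extracted into a list of page strings, and the cover-page pattern is hypothetical, since this post does not describe what the real cover pages look like. In our reporting a person read each sampled document; the `has_property` function below stands in for that human judgment.

```python
import random
import re

# Hypothetical cover-page marker, chosen only for illustration.
COVER_PATTERN = re.compile(r"^\s*INCIDENT REPORT\b", re.IGNORECASE | re.MULTILINE)


def split_on_cover_pages(pages, cover_pattern=COVER_PATTERN):
    """Group page texts into documents, starting a new document at each cover page."""
    documents = []
    for page in pages:
        # Pages that appear before the first cover page form their own document.
        if cover_pattern.search(page) or not documents:
            documents.append([page])
        else:
            documents[-1].append(page)
    return documents


def estimate_prevalence(documents, has_property, sample_size, seed=0):
    """Estimate the fraction of documents with a property from a simple random sample."""
    if not documents:
        raise ValueError("cannot sample from an empty document list")
    rng = random.Random(seed)  # fixed seed so the same sample can be re-checked
    sample = rng.sample(documents, min(sample_size, len(documents)))
    hits = sum(1 for doc in sample if has_property(doc))
    return hits / len(sample)
```

Sampling gives only an estimate. The smaller the sample, the wider the margin of error on the resulting fraction.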

A week later, at the annual NICAR conference for data journalists, we released the prototype software, and you can now download and install Overview for yourself. If you can get your document text into a CSV file, with one row per document, you can analyze it with Overview. The current system works well for up to about 20,000 or 30,000 documents.
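As a rough sketch, a one-row-per-document CSV file could be produced like this in Python. The column names `id` and `text` are assumptions made for illustration; this post does not specify the exact columns Overview expects, so check the installation instructions before relying on them.

```python
import csv


def write_documents_csv(documents, out_file, columns=("id", "text")):
    """Write one CSV row per document; `columns` holds hypothetical header names."""
    writer = csv.writer(out_file)
    writer.writerow(columns)
    for doc_id, text in enumerate(documents, start=1):
        # The csv module quotes commas, quotes and line breaks inside the text.
        writer.writerow([doc_id, text])
```

When writing to disk, open the file with `open("documents.csv", "w", newline="", encoding="utf-8")` so that line breaks inside document text are preserved correctly.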

Our next step is to make Overview easier to install and use, including training materials and tools to help wrangle your documents into the right input format. But in the long term, we need to build the technology into a web interface, to avoid the installation problem entirely. The plan is to interface Overview with DocumentCloud, for an integrated system that combines document upload, OCR (optical character recognition), storage, viewing and search (DocumentCloud) with advanced visualization, analysis and tagging (Overview).

If you want to try Overview on your own documents, contact us @overviewproject and we’ll work with you to get your story going. The code is on GitHub, and you’re welcome to hack on it, but if this is the sort of project that excites you, we’re hiring Java and JavaScript developers to work on it full-time. More than just creating a web version of this prototype, we’ll soon be coding up the next generation of document-mining tools and techniques. For example, we haven’t done any work yet on visualizing email archives, which have incredibly rich social structure.