What does it mean to work on a project where open-source principles are written into the founding contract? A little over a month after receiving a 2009 Knight News Challenge grant, DocumentCloud released its first open-source component.
The system, called CloudCrowd, performs the distributed computing that helps process the vast quantities of documents that will eventually be stored in DocumentCloud. It might seem premature to release code so early (in the past, some Knight grantees have chosen to wait until the end of their grant), but the larger part of open source is community, not code. We plan to release portions of DocumentCloud as we build them, so that we can take advantage of the contributions the open-source community can provide.
When finished, DocumentCloud will be a software system, a website, and a set of open standards that will make it possible to read, search, and organize primary source documents across the web. As a journalist or researcher, you will be able to run filtered searches across the library of documents, and embed your source documents right alongside an article or blog post. All the aspects of the system — the search engine, the document viewer, the journalist workspace — will be open-sourced during the course of our grant.
The Value of Open Source
Going open source is often viewed by skeptics as a sort of feckless altruism: a free handout of valuable intellectual property. I couldn’t disagree more, especially with respect to nonprofit organizations. A small team with limited resources benefits greatly from the ideas, bug reports, and patches that a community can provide.
Since releasing CloudCrowd a month ago, we’ve fixed a handful of bugs and added dozens of features as a direct result of input from the community of contributors. As of this writing, 375 developers follow the project on GitHub, which means that they pay attention to the project and are notified any time changes are made. Twelve of them have ‘forked’ the project, pushing CloudCrowd in the directions that matter most to them.
Needless to say, this community of contributors dwarfs DocumentCloud as an organization, and is invaluable in helping to increase the quality of the software. CloudCrowd is already being used to process biomedical data and align gene sequences across strains of influenza virus, an application that’s far afield from our original use. We’re looking forward to hearing more about how it holds up in other arenas.
Most of our work over the past few months has been on the internal DocumentCloud prototype, which is a complete first draft of what the system will become. We hope to extract additional portions of the prototype for release in the near future, so stay tuned as more bits of DocumentCloud come online.