
Underwritten by John S. and James L. Knight Foundation

Idea Lab is a group blog by innovators who are reinventing community news for the Digital Age.

  • Each Idea Lab blogger is a winner of the Knight News Challenge grant to reshape community news.


    ScraperWiki Digs Up Dirty Data So You Don't Have To

    Knight 2011 News Challenge Winner

    The best journalism comes from digging. Not phoning up press officers but speaking with those in the know. Not seeking comments from experts but going out onto the streets. The real stories, the scoops and the breakthroughs don't come prepackaged. So why limit data-driven journalism to the relatively few sources of clean, prepackaged and nicely delivered data?

    This is where ScraperWiki comes in. ScraperWiki is a developer platform that aims to liberate data from the web, build upon this information to make useful applications, and get journalists and developers working together in true HacksHackers fashion!

    WHAT WE DO


    We let you think big when it comes to dirty data. Think global corporate data. OpenCorporates did. So we put out a call and gathered data on 10 percent of the world's companies in just two weeks.

    We let you think quick and dirty for timely news stories. The Texas Tribune scraped the Department of Criminal Justice to build an interactive of executions on Gov. Rick Perry's watch.

    We let you dig up dirt with dirty data. Even a simple little scraper can uncover facts for a front-page scoop, as James Ball from The Guardian found out.

    Or even fish for a story, like @OJCstatements, which tweets out reports on judicial complaints and attaches a hashtag to each one.

    HOW WE DO IT

    We're an open wiki for coding scripts called "scrapers." These extract data from the web -- be it HTML, CSV or PDF files, or even data stored behind online forms.
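
    To give a feel for what a scraper does, here is a minimal sketch in Python using only the standard library. It is an illustration rather than ScraperWiki's own code: the `TableScraper` class, the `scrape_table` helper and the sample HTML are all made up for this example, and a real scraper would fetch the page from the web first.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Turn an HTML table into a list of dicts keyed by the header row."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample page; the names and dates are placeholders.
SAMPLE_HTML = """
<table>
  <tr><th>name</th><th>date</th></tr>
  <tr><td>Example A</td><td>2011-01-05</td></tr>
  <tr><td>Example B</td><td>2011-02-17</td></tr>
</table>
"""

records = scrape_table(SAMPLE_HTML)
for record in records:
    print(record)
```

    Each row comes out as a dictionary, which is the structured, reusable form the rest of a data project can build on.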

    Our platform is structured so that lots of developers can work on many scrapers feeding a data store, making large-scale data projects feasible for a deadline-driven newsroom.
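
    The idea of many scrapers feeding one data store can be sketched with SQLite, which ships with Python. This is an assumption-laden illustration, not a description of our actual storage layer: the `records` table, the `open_store` and `save` helpers, and the scraper names are hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store shared by every scraper."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        " source TEXT NOT NULL,"
        " key TEXT NOT NULL,"
        " value TEXT,"
        " PRIMARY KEY (source, key))"
    )
    return conn


def save(conn, source, rows):
    """Insert or overwrite rows, so re-running a scraper never duplicates data."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO records (source, key, value) VALUES (?, ?, ?)",
            [(source, key, value) for key, value in rows],
        )


store = open_store()

# Two hypothetical scrapers write to the same store.
save(store, "scraper_a", [("row-1", "first"), ("row-2", "second")])
save(store, "scraper_b", [("row-1", "other")])

# Running scraper_a again updates its row instead of adding a copy.
save(store, "scraper_a", [("row-2", "second, updated")])

count = store.execute("SELECT COUNT(*) FROM records").fetchone()[0]
print(count)
```

    Keying each row on its source and an identifier is one simple way to let scrapers run over and over without the data drifting out of shape.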

    Our structure helps keep these projects from decaying, meaning you can make interactives that stay up to date with the data rather than being used to wrap up tomorrow's virtual fish and chips.

    Because our platform is built on code, you can integrate other services, be it Mechanical Turk, Refine or any service with an interface to the data. You can also feed the data output into other services such as RSS, Twitter or email.
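
    As one example of pushing data out to another service, scraped rows can be turned into an RSS feed. The sketch below uses Python's standard `xml.etree.ElementTree`; the `rows_to_rss` function, the `title` and `summary` field names and the example.org link are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def rows_to_rss(title, link, rows):
    """Build a minimal RSS 2.0 document from a list of row dicts."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Latest scraped records"
    for row in rows:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = row["title"]
        ET.SubElement(item, "description").text = row["summary"]
    return ET.tostring(rss, encoding="unicode")


# Placeholder rows standing in for whatever a scraper has collected.
rows = [
    {"title": "Record 1", "summary": "First scraped record"},
    {"title": "Record 2", "summary": "Second scraped record"},
]

feed = rows_to_rss("Scraped data", "http://example.org/feed", rows)
print(feed)
```

    The same rows could just as easily be formatted as tweets or email bodies; the point is that once the data is structured, the output format is a small amount of code.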

    We have a data-digging community using ScraperWiki every day, so if you don't have programmers in your newsroom or you need many more for a specific project, we have the right people you're looking for.

    WHAT WE'RE GOING TO DO

    Thanks to funding from the Knight Foundation, here's what we're plugging into our system for you:


    • Data embargo, so journalists can keep their stories secret until going to print, but publish the data in a structured, reusable, public form with the story.
    • Data-on-demand service. Often journalists need the right data ordered quickly; we're going to create a smooth process for this.
    • News application hosting. We'll make it scalable and easier.
    • Data alerts. Automatically get leads from changing data. For example, watch bridge repair schedules, and get an email when a bridge isn't being maintained.
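
    The data alerts idea boils down to comparing what a scraper found last time with what it finds now. Here is a rough sketch of that comparison, assuming each snapshot is a dictionary keyed by record ID. The `find_changes` and `build_alert` helpers, the bridge IDs, the dates and the email address are all hypothetical, and the message is only built, not sent.

```python
from email.message import EmailMessage


def find_changes(previous, current):
    """Compare two snapshots and report added, removed and changed keys."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = sorted(
        key for key in previous.keys() & current.keys()
        if previous[key] != current[key]
    )
    return {"added": added, "removed": removed, "changed": changed}


def build_alert(changes, recipient):
    """Build (but do not send) an email summarizing the changes."""
    message = EmailMessage()
    message["To"] = recipient
    message["Subject"] = "Data alert: the scraped data has changed"
    lines = [f"{kind}: {', '.join(keys)}" for kind, keys in changes.items() if keys]
    message.set_content("\n".join(lines))
    return message


# Made-up repair schedules from two runs of the same scraper.
previous = {"bridge-1": "2011-03-01", "bridge-2": "2011-04-15"}
current = {
    "bridge-1": "2011-03-01",
    "bridge-2": "2011-09-30",  # repair date pushed back
    "bridge-3": "2011-05-20",  # newly listed bridge
}

changes = find_changes(previous, current)
alert = build_alert(changes, "reporter@example.org")
print(alert.get_content())
```

    A date that keeps slipping, like the one for the second bridge here, is the kind of lead a reporter would want to hear about automatically.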

    If you want to explore our platform, I suggest you try out this tutorial for non-programmers and check your answers here.

    You can also keep track of our progress by finding us on Twitter and Facebook and checking out the blog.

    We'll be at ONA (we're nominated for an award!) and Strata. Catch us if you can!

