Underwritten by John S. and James L. Knight Foundation

Idea Lab is a group blog by innovators who are reinventing community news for the Digital Age.

  • Each Idea Lab blogger is a winner of the Knight News Challenge grant to reshape community news.

    The Top 10 Data-Mining Links of 2011

    Knight 2011 News Challenge Winner

    Overview is a project to create an open-source document-mining system for investigative journalists and other curious people. We've written before about the goals of the project, and we're developing some new technology, but mostly we're stealing it from other fields.

    The following are some of the best ideas we saw in 2011, the data-mining work that we found most inspirational. Many of these links are educational resources for learning about specific technologies. Some of this work illuminates how algorithms and humans treat information differently. Others are just amazing, mind-bending work.

    1. What do your connections say about you? A lot. It is possible to accurately predict your political orientation solely on the basis of your network on Twitter. You can also work out gender and other personal attributes from public information.

    2. Free textbooks from Stanford University. "Introduction to Information Retrieval" teaches you how a search engine works, in great detail. "Mining Massive Data Sets" covers a variety of big-data principles that apply to different types of information.

    3. We're not above having a list of lists. Here's the Data Mining Blog's top 5 articles. Most of these are foundational, covering basic philosophy and technique such as choosing variables, finding clusters, and deciding what you're looking for.

    4. The MINE technique looks for patterns between hundreds or thousands of variables -- say, patterns of gene expression inside a single cell. It's very general, and finds not only individual relationships but networks of cause and effect. Here's a nifty video, here's the original paper, and here's one statistician's review.

    5. This is one of those papers that really changed the way I look at things. How do we know when a data visualization shows us something that is "actually there," as opposed to an artifact of the numbers? "Graphical Inference for Infovis" provides one excellent answer, based on a clever analogy with numerical statistics.

    6. Lots of text-mining work uses "clustering" or "classification" techniques to sort documents into topics. But doesn't a categorization algorithm impose its own preconceptions? This is a deep issue, which you might think of as "framing" in code. To explore this question Justin Grimmer and Gary King went meta with a system that visualizes all possible categorizations of a document set, and how they relate.

    7. A few years ago Google showed that the number of searches for "flu" was a great predictor of the actual number of outbreaks in a given location -- faster and more specific than the Centers for Disease Control and Prevention's own surveillance data. The team has now expanded the technique into Google Correlate, which instantly scans through petabytes of data to find search terms that follow any user-supplied time series. Here's New Scientist taking it for a test drive.

    8. Not content with free professional textbooks, Stanford has created two free online courses for machine learning and natural language processing. Both are live-streamed lecture series taught by experts, with homework. Learning these intricate technologies has never been easier.

    9. Lots of people have speculated about the role of social media in protest movements. A team of researchers looked at the data, analyzing a huge set of tweets from the "May 15" protests in Spain last year. How do protests spread through social media? Now we have at least one solid answer.

    10. And the craziest data-mining link we ran across in 2011: IBM's DeepQA project, which beat human Jeopardy champions. This project looks into an unstructured database to correctly answer about 80% of all general questions posed to it, in just a few seconds. Here's a TED talk, and here's the technical paper that explains how it works. I can't tell you how badly I want one of these in the newsroom. If enough journalist hackers build on each other's work, maybe one day ...
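To make the idea in item 7 a little more concrete: finding search terms that "follow" a user-supplied time series can be thought of as ranking every candidate series by how strongly it correlates with the target. The sketch below is only a toy illustration of that idea, not a description of how Google Correlate is actually built. It uses brute-force Pearson correlation, the function names `pearson` and `rank_terms` are our own, and the weekly counts are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series has no trend to match
    return cov / sqrt(var_x * var_y)


def rank_terms(target, term_series):
    """Return (term, correlation) pairs, best match to the target first."""
    scores = [(term, pearson(target, series))
              for term, series in term_series.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical weekly counts, invented purely for illustration.
flu_cases = [10, 12, 30, 55, 40, 15]
searches = {
    "flu symptoms": [100, 130, 290, 560, 410, 160],
    "sunscreen": [500, 450, 200, 100, 150, 400],
    "pizza": [300, 310, 295, 305, 300, 298],
}

ranking = rank_terms(flu_cases, searches)
for term, score in ranking:
    print(f"{term}: {score:+.3f}")
```

With this made-up data, the term whose search volume rises and falls with the case counts ranks first, and the one that moves in the opposite direction ranks last. A real system would need approximate search rather than a full scan to cover millions of terms quickly, and a high correlation on its own says nothing about cause.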

    Happy data mining! We'll be releasing our own prototype document-mining system, and the source, at the NICAR conference next month. If these are the sorts of algorithms you like to play with, we're also hiring programmers who want to bring these sorts of advanced techniques within everyone's reach.
