Underwritten by John S. and James L. Knight Foundation

Idea Lab is a group blog by innovators who are reinventing community news for the Digital Age.

  • Each Idea Lab blogger is a winner of the Knight News Challenge grant to reshape community news.

    Machine-Generated News a Threat to Journalists? I Think Not

    Knight 2007 News Challenge Winner

    Software that writes baseball game stories from box scores and play-by-play information now has a name: StatsMonkey. And it's making some journalists nervous -- needlessly.

    The software, the first version of which was developed this spring by a team of computer science and journalism students at Northwestern University, has evolved significantly since then. John Templon and Nick Allen (a "programmer-journalist" attending the Medill School of Journalism on a Knight News Challenge scholarship) were two of the students who worked on the initial version of the software, which has been made available on an open-source basis. John and Nick, both Medill grad students, developed the software with Tian Huang of Medill and Thu Cung, a student in the McCormick School of Engineering and Applied Science.

    The software, then called "Machine Generated Sports Stories," was one of five projects developed in an experimental collaboration with the McCormick School's Intelligent Information Laboratory, or InfoLab. The class brought together students from Medill's Interactive Innovation Project and from McCormick's practicum in intelligent information systems. Two professors from Medill (me and Jeremy Gilbert) and two from McCormick (Kris Hammond and Larry Birnbaum) led the collaboration.

    If you want to know more about the class and the software the students developed, you can read the class blog, watch the students' final presentation, or download their comprehensive report that includes recommendations for journalists, media companies and journalism education.

    Since June, Nick and John have kept working on the baseball project as paid interns at the InfoLab. They've reconstructed the code, built a greater variety of game narratives and begun to incorporate details about trends in player and team performance over time.

    Earlier this month, an article in Medill's alumni magazine brought StatsMonkey to the attention of a lot of journalists. A couple of them didn't like it:

    • "Soon enough, sports reporters could be obsolete," wrote Andrew Greiner at NBCChicago.com.
    • Rick Green of the Hartford Courant asked, "... isn't something lost when the reporter isn't there at the games, talking to players, paying attention to what's not said and feeling the mood?"

    These weren't the first journalists to express concerns about StatsMonkey. Back in August, Gregory Hardy of CBSSports.com worried about what might happen "if robot sportswriters take over."

    Given the turmoil in the news business these days, it's understandable that journalists -- especially sports journalists -- would be nervous about StatsMonkey. But I don't think sportswriters need to be worried -- if StatsMonkey becomes a commercial product, it is highly unlikely to put sports journalists out of work.

    To understand why, let's start by explaining what StatsMonkey actually does:

    • It imports the box score and play-by-play information, data that is routinely captured for games in professional leagues, college baseball and some lower levels (high school, youth leagues, etc.).
    • It uses some baseball-geek stats (leverage index and win probability added) to identify high-stakes at-bats and key plays that significantly change the probability that one team will win.
    • It determines a game narrative -- for instance, a come-from-behind win, a pitcher's duel, etc. -- from these key at-bats and plays.
    • It constructs a headline and story from the options for game narratives and incorporates key events from the play-by-play.
    • It uses historical data -- about teams and players -- to add context (for instance, that a particular player's hit broke a 5-game hitless streak, or that this was the team's third win in a row).
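
The steps above can be sketched in a few lines of Python. This is only an illustration of the general approach, not StatsMonkey's actual code: the `Play` record, the thresholds, the narrative labels and the headline templates are all hypothetical, and the win probabilities are assumed to be supplied with the play-by-play data rather than computed here. The change in win probability on each play stands in for the leverage index and win probability added stats.

```python
from dataclasses import dataclass


@dataclass
class Play:
    """One play-by-play event (hypothetical record layout)."""
    inning: int
    batter: str
    description: str
    home_win_prob_before: float  # assumed to come with the data
    home_win_prob_after: float

    @property
    def swing(self) -> float:
        """Change in the home team's win probability on this play."""
        return self.home_win_prob_after - self.home_win_prob_before


def key_plays(plays, threshold=0.15):
    """Keep plays that moved the win probability by at least `threshold`."""
    return [p for p in plays if abs(p.swing) >= threshold]


def classify_narrative(plays, home_runs, away_runs):
    """Pick a narrative from a small fixed set (labels are illustrative)."""
    home_won = home_runs > away_runs
    if plays:
        probs = [p.home_win_prob_before for p in plays]
        probs += [p.home_win_prob_after for p in plays]
        # Lowest win probability the eventual winner faced during the game.
        winner_low = min(probs) if home_won else 1 - max(probs)
        if winner_low <= 0.25:
            return "come-from-behind win"
    if home_runs + away_runs <= 3:
        return "pitchers' duel"
    if abs(home_runs - away_runs) >= 6:
        return "blowout"
    return "routine win"


# Hypothetical headline templates, one per narrative.
HEADLINES = {
    "come-from-behind win": "{winner} rally past {loser}, {score}",
    "pitchers' duel": "{winner} edge {loser} in pitchers' duel, {score}",
    "blowout": "{winner} rout {loser}, {score}",
    "routine win": "{winner} beat {loser}, {score}",
}


def write_story(home, away, home_runs, away_runs, plays):
    """Return (headline, body) built from the narrative and the key plays."""
    winner, loser = (home, away) if home_runs > away_runs else (away, home)
    score = f"{max(home_runs, away_runs)}-{min(home_runs, away_runs)}"
    narrative = classify_narrative(plays, home_runs, away_runs)
    headline = HEADLINES[narrative].format(winner=winner, loser=loser, score=score)
    # Take the two biggest swings, then tell them in game order.
    top = sorted(key_plays(plays), key=lambda p: abs(p.swing), reverse=True)[:2]
    top.sort(key=lambda p: p.inning)
    body = " ".join(f"In inning {p.inning}, {p.batter} {p.description}." for p in top)
    return headline, body


# Made-up game data for illustration only.
plays = [
    Play(3, "Ruiz", "hit a three-run homer", 0.48, 0.18),
    Play(5, "Chen", "singled", 0.20, 0.24),
    Play(8, "Lee", "doubled in two runs", 0.30, 0.71),
]
headline, body = write_story("Wildcats", "Bears", 4, 3, plays)
print(headline)
print(body)
```

Even in this toy form, the limits discussed below are visible: the program can only choose among the narratives it has been given, and it knows nothing that is not in the data.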

    Of course, the program will have a limited number of possible game narratives, and it cannot account for events that don't show up in the box score or play-by-play (for instance, the infamous play in a 2003 Chicago Cubs playoff game in which a fan caught a foul ball that might otherwise have been fielded for an out). A StatsMonkey story will be a very poor substitute for the richly textured narrative a professional sportswriter can produce.

    But think of a few ways StatsMonkey could add to what professional journalists do:

    • It could instantly write a game story as soon as the last out is made, freeing a reporter to go down to the field or the locker room to do interviews.
    • It could generate a story about any game in progress, at any point during the game -- just what someone might want when checking on a favorite team during the work day.
    • It could create stories about games -- for instance, college baseball -- that are not routinely covered by professional journalists.
    • It could generate stories about each player in a game for people who are especially interested in particular players (not hard to imagine for college baseball).
    • If Little League coaches start to enter game information through a mobile device (and there already is at least one "app for that"), it could generate game stories about Little League games, which have a passionate following but will never be covered by professional journalists.

    Beyond that, even given my background and identity as a journalist, I would have to say to any sportswriter: If your game story CAN be generated by a computer, at some point it WILL be generated by a computer. Human journalists will do -- and should do -- the kind of reporting and storytelling that computers can't.

    More broadly, StatsMonkey is just a first experiment in identifying formulaic stories that could conceivably be generated by software rather than people. Some other possible examples: corporate earnings reports, obituaries, even accounts of what City Council did last night. As with StatsMonkey, software that generates these kinds of stories most likely wouldn't replace journalists. The software would create stories that would otherwise not be written, or free up journalists to do more important work that can only be done by humans.

    Got any other ideas for topics that would be a good fit for computer-generated stories? Post in the comments below.
