This all began at Joe’s Coffee on the Columbia University campus one morning last spring. Mark Hansen was then teaching journalism students who didn’t know much about data or computing. I was working with engineers and skilled data people, who didn’t understand how important storytelling is to problem-solving, especially because different choices, which might seem minute at the time, can lead to very different conclusions. We agreed broadly on the need to bridge that gap: Just as stories are richer when supported by data, the best data analysis is context-laden and done with an eye to the bigger picture. Mark and I talked about the kinds of tools that would enable more thoughtful and creative data-driven storytelling.

Cathy O'Neil.


A year later, he was looking for someone to put those ideas to work. Together, we’ve built the Lede Program, a post-bac certification course in the Columbia School of Journalism that will offer hands-on training in data, code and algorithms to journalists, as well as researchers, designers, policy analysts, and others with interest in the digital humanities. We hope to get beyond the hype surrounding data journalism and positively affect students and the movement as a whole.

The program’s initial aim was to equip students with technical skills needed to enroll in the dual Journalism/Computer Science master’s program that Columbia launched in 2010. Since then, though, it has taken on a life of its own. Basic scripting and data science skills have never been more in demand, and we saw a real opportunity to cut past the noise and train a crop of students who’d have both technical chops and a creative understanding of what their skills can more broadly mean.

A paired set of goals

I am setting a dual mandate for the Lede Program. On a practical level, we want Lede students to develop the hard scripting and analytical skills that they can use to build new investigative tools, whether for journalism or for policy, humanities-based, and other forms of research. Philosophically, we have high ambitions as well: we want them to understand the creative, expressive and critical aspects of this work, and to question existing models and designs and the incentive systems behind them.

Big Data models and practices aren’t just describing the world; they’re affecting it. We need to be able to understand, investigate and communicate those effects. There’s a tendency these days for people to have a kind of blind faith in “data,” with data-driven decision models replacing human accountability. We want students to learn to look behind that computational complexity at the politics and incentives shaping those models: not only to mine data sets, but also to question the models that others use to mine them.

An intense core and sustained program

When students show up for our first class on May 27, they’ll begin an intense summer of daily coursework in computing, databases, algorithms and platform studies, guided by classroom lectures and term-length investigative projects. I’m especially excited about the cross-disciplinary set of instructors who’ve come aboard to team-teach our four core courses, including Columbia academics from the Applied Math, English, and History departments, as well as people who have taught an approach to data and to programming that is explicitly creative and open-ended.

Students who stay on for the full “Lede-24” program, which we hope most of them do, will spend the Fall semester taking classes at Columbia’s Department of Computer Science and Institute for Data Sciences and Engineering, learning advanced concepts and skills in Computer Science, Statistics, or other digital humanities areas with support from our team. All students will graduate with a portfolio of digital work. Lede-24 students also get a post-bac certification degree.

Data-driven journalism in larger contexts

A screenshot of FiveThirtyEight.com.


There’s no question that the hype around data journalism is reaching a fever pitch, thanks in large part to a handful of heavy-hitting new ventures including 538 and Vox. It’s an exciting time, and while there will be a lot of different approaches, we have our own flavor. Specifically, we will be emphasizing the creative and storytelling aspects of data journalism, and we are also explicitly requiring high standards of reproducibility and transparency.

We want to tell stories with data as evidence, as the best data journalism does, but that isn’t all I mean by storytelling. The practice of modeling and working with data has its own internal form of “narrative”: the description that explains the choices you’ve made in building any given model or other representation, the story of why you chose to use certain data sets or analysis tools over others, and how you’ve framed them. Data people need to learn to present their modeling choices this way more often, and that includes data journalists.

A great example of what I’d like us to aspire to is a very cool IPython notebook that was recently built by computational social scientist Brian Keegan. The notebook exposed and reproduced the underlying analysis that a 538 article used to look at women in film, making it available for others to understand, scrutinize and modify. In a recent blog post, Keegan emphasized why that kind of openness is so important for data journalism, and it’s a philosophy I hope to impart to students.

We want to set a standard of transparency in this emerging field and to encourage, or someday even host, a live, interactive platform where anyone can see, share and learn from the tools that others have built: a kind of GitHub just for data journalism. I see that as the way to further the conversation, to really bridge the gap between journalists and data people and make sure we keep finding and telling new kinds of stories.

Cathy O’Neil is director of The Lede Program: An Introduction to Data Practices at Columbia Journalism School. She earned a Ph.D. in math from Harvard, completed a postdoc at MIT, and served as a math professor at Barnard College. She did quantitative work for the hedge fund D.E. Shaw during the credit crisis and then for RiskMetrics. O’Neil also writes a blog at mathbabe.org and facilitates Occupy Wall Street’s Alternative Banking group.
