Terabytes of Tweets Headed to Library of Congress

BY kgardiner  April 15, 2010 at 9:57 AM EST

More than 6 billion public tweets — the entire ‘corpus’ of the micro-blogging site Twitter — are headed to the Library of Congress in the next few months.

Martha Anderson, the director of the Library’s National Digital Information Infrastructure and Preservation Program, says Twitter’s representatives came to the Library with an eye towards storing the information for social scientists who want to track social change over time.

“We’re hoping to engage the community of scholarly researchers in ways that they want to use the content,” Anderson said. “In the short term, we think the archive could have an application for humanities and social science scholars.”

“In the future, they may be more interested in the content,” she said. “But today people are looking at the mass of tweets… and trying to understand the viral transmission of reaction to public figures.”

On the company’s blog, Twitter co-founder Biz Stone wrote, “Over the years, tweets have become part of significant global events around the world — from historic elections to devastating disasters…It is our pleasure to donate access to the entire archive of public Tweets to the Library of Congress for preservation and research.”

Anderson says accepting Twitter’s archives is one step in a multi-level project to preserve parts of the Internet.

“We have more than 167 terabytes of information already saved,” she said. “We first started collecting tweets during the nomination for Supreme Court Justice [Sonia] Sotomayor. Since then, we’ve started thinking about ways to collect more information.”

Anderson says the Library is considering working with a group of Stanford students to make the data useful to researchers.

“We’d be looking into ways to index, extract data, mining, disambiguate proper names,” she said. “We’re hoping to get it in a usable format in three or four months. It may be optimistic, but we’re hopeful.”

In a related development, Twitter and Google announced a new tool that lets users replay “social media events,” such as President Obama’s election tweet and the plane landing in the Hudson River last year.

Google’s replay feature was released alongside new uses of data feeds that will shift how information shared on Twitter and Facebook is indexed over time, toward an even more instantaneous aggregate model.

Stone also announced Wednesday that Twitter.com has registered more than 105 million users since its inception. Speaking at his company’s developer conference, Chirp, in San Francisco, he reported the website sees more than 55 million 140-character tweets per day.

Stone reported a third round of venture capital funding netted his company an additional $35 million in February. Twitter is valued in excess of $250 million. It announced its first-ever revenue stream — promotional posts paid for by advertisers — this week.