Researchers hope search engine will shed light on dark data
As much as 90 percent of information on the Internet is “dark,” locked away in clunky or outdated formats that make it difficult, sometimes impossible, to access.
Kenton McHenry gets frustrated just talking about what he had to go through to open a research paper in the now-obsolete format PostScript. That was in 2000, when he was still a college student. First he had to download a viewer and then uncompress the document before he could read the article — all to determine if it even had information he could use.
“It would drive me nuts,” he said. “I don’t want the tools to uncompress the thing. I just want the data.”
McHenry, now a senior research scientist at the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign, believes in accessing data as quickly and efficiently as possible, whether that’s your old wedding video or a massive scientific dataset. And he’s developed a search engine to do just that.
Brown Dog is designed to convert defunct computer files into accessible formats, preserving information in those files for generations to come. This means that one may no longer need a patchwork of computer applications to use scientific datasets, read old thesis papers or access family videos uploaded onto the Internet.
The search engine has two functions, McHenry says. First, the user can feed a file saved in an outdated format into Brown Dog’s “Data Access Proxy,” which is bookmarked in their web browser. In the cloud, stored code converts the file into a format the browser can read and the user can open. Brown Dog’s other function, its Data Tilling Service, lets users look at otherwise inaccessible data while cloud-based code assigns metadata to previously unreadable images, audio, video and other uncurated data. With that metadata in place, users can search a collection, such as a set of photos, by keyword to find what they need.
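To make the keyword-search idea concrete, here is a minimal Python sketch of how descriptive tags attached to files could be indexed and queried. It is an illustration only, not Brown Dog’s code: the `FileRecord` type, the `build_index` and `search` helpers, the file names and the tags are all hypothetical, and the tags are written by hand here rather than extracted automatically.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """A file plus the descriptive tags attached to it (hypothetical type)."""
    name: str
    tags: set[str] = field(default_factory=set)


def build_index(records):
    """Map each lowercased tag to the names of the files that carry it."""
    index = {}
    for record in records:
        for tag in record.tags:
            index.setdefault(tag.lower(), set()).add(record.name)
    return index


def search(index, *keywords):
    """Return the names of files tagged with every given keyword."""
    if not keywords:
        return set()
    matches = [index.get(keyword.lower(), set()) for keyword in keywords]
    return set.intersection(*matches)


# Made-up files in older formats, with hand-written tags for illustration.
records = [
    FileRecord("wedding_001.tif", {"wedding", "people", "outdoor"}),
    FileRecord("field_site_17.raw", {"river", "outdoor"}),
    FileRecord("baby_2003.pcx", {"baby", "people"}),
]
index = build_index(records)
print(sorted(search(index, "outdoor", "people")))  # ['wedding_001.tif']
```

The sketch shows only the search half: once each file carries tags, finding a photo becomes a set lookup rather than a matter of opening every file by hand.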
Brown Dog developers demonstrate how to use the search engine’s two primary functions at a workshop this year. Video courtesy of YouTube/ISDA Software
While that may sound just as cumbersome as opening PostScript, McHenry promises it’s not. The whole process, when working correctly, should take only a few clicks.
Brown Dog fits into the emerging field of cyberinfrastructure. It received a $10 million, five-year grant from the National Science Foundation* in 2013 as part of the foundation’s Data Infrastructure Building Blocks program. This program is intended to complement several services that preserve and power cyber-based data, said Robert Chadduck, who directs the NSF program. For example, one program called Wrangler at the University of Texas at Austin focuses on constructing a data resource big enough to support scientific data analysis on a national scale. Another project uses geospatial data collected from maps, satellite images and more to analyze information ranging from the effects of climate change to the density of population centers.
Brewster Kahle, digital librarian and founder of the Internet Archive, knows the hassle of keeping digital files up to date in a time of rapidly changing technology. The San Francisco-based archive’s web collection, he said, contains more than 1 million video files alone, which have been moved into new file formats six times over the last decade. The process, Kahle said, is intended to ensure that the data remains relevant and accessible to people with “different devices and different expectations.”
“It’s an active job. You can’t just sit around,” Kahle said.
Technology, Chadduck said, should “proceed at the speed of app development which frankly is how we conduct our lives.”
“If you have baby pictures, wedding pictures, pictures of departed family members that are invaluable to all of us, the thing is…those images may be encoded or are on digital cameras that no longer exist or have become obsolete,” Chadduck said. “The images of all of our lives are invaluable to each of us. That’s the time machine part of Brown Dog.”
It’s still unclear when Brown Dog will be operational, but McHenry expects it to be available on a limited basis for demonstrations and testing in March 2015.
*For the record, the National Science Foundation is a funder of the NewsHour.