Internet history is fragile. This archive is making sure it doesn’t disappear

January 2, 2017 at 6:20 PM EDT
What’s online doesn’t necessarily last forever. Content on the Internet is revised and deleted all the time. Hyperlinks “rot,” and with them goes history, lost in space. With that in mind, Brewster Kahle set out to develop the Internet Archive, a digital library with the mission of preserving all the information on the World Wide Web, for all who wish to explore. Jeffrey Brown reports.

WILLIAM BRANGHAM: We’re increasingly cataloging our lives online, Facebook, Twitter, Instagram, seemingly endless YouTube videos. It’s a digital-first and often digital-only world.

The advantages of this unlimited digital storage seem obvious, but how permanent are some of those records? How do we preserve digital history?

Jeffrey Brown has the story, part of our ongoing series Culture at Risk.

JEFFREY BROWN: So, this is like an ancient temple come to life in a modern age.

BREWSTER KAHLE, Founder, Internet Archive: In a modern day. It’s a Greek-style building, which we loved, because the whole idea is the Library of Alexandria reborn now.

JEFFREY BROWN: It’s an ancient idea: to gather and preserve the world’s knowledge. But now that library will look like this.

These stacks of servers, Brewster Kahle told me recently, represent a 20-year and running effort to build a kind of digital library, and to essentially back up the ever-expanding World Wide Web.

BREWSTER KAHLE: In one of these would be 100 years of a channel of television. Or this much is all of the words in the Library of Congress.

We need to be able to preserve our digital history.

JEFFREY BROWN: Kahle was an early Internet entrepreneur who in 1996 founded the Internet Archive, a nonprofit that operates out of an old Christian Science church in San Francisco.

It was designed to address a fundamental flaw in the original creation of the World Wide Web by Tim Berners-Lee in 1989.

BREWSTER KAHLE: The wonder of it is, it’s very, very simple. Anybody could go and set up a web server on their computer and make it available to the world.

Unfortunately, it’s too simple. It’s fragile, that if something happens to that piece of equipment, that Web site just, blink, is gone.

JEFFREY BROWN: If it’s online, it lives forever, right? Well, no.

Kahle says the average lifespan of a Web page is just 92 days. Information is altered and deleted all the time for all kinds of reasons.

A 2013 Harvard study, for example, found that half the hyperlinks in Supreme Court cases, today’s equivalent of footnotes, are broken, a phenomenon known as link rot. Government agencies remove documents, and companies fail, and with them the sites they host. Think of GeoCities, Yahoo! Video, and, more recently, the news site Gawker.

ABBY SMITH RUMSEY, Author, “When We Are No More”: People mistake the fact that the Internet is ubiquitous with the fact that it’s permanent.

JEFFREY BROWN: Abby Smith Rumsey is the author of “When We Are No More: How Digital Memory Is Shaping Our Future.”

She began her scholarly career studying how information was purposely deleted in the totalitarian Soviet system. These days, she thinks, we have a new kind of storage and retrieval problem.

ABBY SMITH RUMSEY: It isn’t permanent at all. And, in fact, the thing about digital technology is, you can inscribe something onto a computer, but you can’t put it on a shelf and expect to pick it out at random at 50, let alone 500, years, and be able to read it.

In fact, you won’t have the hardware or the software to do that. So, it’s very fragile, indeed.

JEFFREY BROWN: And while there might be plenty online not worth saving, Rumsey sees much higher stakes.

ABBY SMITH RUMSEY: I think we’re losing obviously the past, but by saying that we’re losing the past, the record of the past, we’re saying that, in a sense, we’re losing our own memory and sense of who we are.

JEFFREY BROWN: Brewster Kahle’s answer? The Wayback Machine, fancifully named for a feature on the old “Rocky and Bullwinkle Show.”

There, a genius dog named Mr. Peabody took his adopted boy, Sherman, back in time to better understand key historical events.

MARK GRAHAM, Director, The Wayback Machine: We have been collecting captures of the public Web for the last 20 years.

JEFFREY BROWN: Mark Graham, director of the new Wayback Machine, says it’s already saved more than 500 billion Web captures in its 20-year history.

MARK GRAHAM: We have software that’s referred to as crawlers or spiders that go out and go to individual Web pages, look at those Web pages, look at all of the links on those pages and then go to those pages, look at all the links on those pages, and then goes to those pages, et cetera, et cetera. So, kind of like a spider crawling on the Web, the software goes out and discovers what’s available.
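The crawling process Graham describes (visit a page, collect its links, then visit those pages in turn) can be illustrated with a minimal sketch. This is not the Internet Archive's actual crawler. The `crawl` function, the `fetch` callable, the `LinkExtractor` class, and the `TOY_WEB` dictionary of made-up `example.test` pages are all assumptions introduced for illustration, and the toy dictionary stands in for real network requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl: visit a page, queue its links, repeat.

    `fetch` is a hypothetical callable that returns a page's HTML,
    or None when the page cannot be retrieved.
    """
    seen = {seed_url}          # URLs already queued, so none is visited twice
    queue = deque([seed_url])  # pages still waiting to be visited
    captured = []              # pages successfully visited, in visit order
    while queue and len(captured) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        captured.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return captured


# A toy in-memory "web" standing in for real HTTP requests (assumed data).
TOY_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.test/b": '<a href="/">home</a>',
    "http://example.test/c": "no links here",
}

pages = crawl("http://example.test/", TOY_WEB.get)
print(pages)
```

Starting from the single seed page, the sketch discovers all four toy pages by following links, and the `seen` set keeps it from revisiting a page that several others link to. A production crawler would also need politeness delays, robots.txt handling, and storage of each capture, none of which is shown here.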

JEFFREY BROWN: Users can then visit Web pages at different points in their history, seeing how they looked before they were altered or deleted.

MARK GRAHAM: We’re a library, and so we don’t really try to like figure out what it is people may want. We just want to have as much as we can of what’s available, because we know that people are probably going to want these things.

JEFFREY BROWN: Exactly what the archive saves is determined by popularity, the number of references or links to that page, and by some 1,000 librarians and experts around the globe working in concert with the Internet Archive.

Case in point, in July 2014, Russian-backed rebels claimed to have shot down a military plane over Ukraine, until it became clear the jet was a passenger airliner, with 283 people killed.

MARK GRAHAM: So, what we’re seeing here is a capture, actually one out of the 38 captures of a post made on a Russian social media site by a pro-Russian rebel who was boasting about shooting down a plane at the same time that MH-17 was shot down.

And this was removed within a few hours after it was posted. And as far as we know, these are the only captures of this Web page that exist.

JEFFREY BROWN: These kinds of pages have in some cases even been used as evidence in courts.

The Internet Archive’s home is a strange world: several generations of media and technology, stained glass windows and church pews, and almost eerie rows of sculptures of the many people who’ve worked on the project. Its motto is universal access to all knowledge, and Kahle’s aspirations could hardly reach higher.

BREWSTER KAHLE: The idea is to build the Library of Alexandria, version two. Could we make all the published works of humankind, books, music, video, Web pages, software, available to anybody who wanted to have access to them anywhere in the world?

JEFFREY BROWN: Today, the project is digitizing films, books, video games, software, even round-the-clock television news channels.

The original Library of Alexandria is thought to date to around 295 B.C. When and how it was destroyed is still much debated, but the loss of so much of the classical world’s greatest works is beyond debate.

BREWSTER KAHLE: The best thing to learn from the Library of Alexandria, version one, is don’t just have one copy. If we had had another copy in India, or in China, we’d have the other works of Aristotle, the other plays of Euripides. But we don’t.

JEFFREY BROWN: Not to mention so many other things that were just lost. We don’t even know what we don’t have.

BREWSTER KAHLE: We don’t even know what they were. And they’re just — they’re — they’re gone. And some people think it’s — say it’s good to forget. And I’m sure there’s good things to forget, but there’s a lot that we should have remembered and kept alive.

JEFFREY BROWN: There’s a lot more to be done.

Kahle and his group are teaming with companies like Mozilla and Wikipedia to make preservation of Web pages more automatic. They’re also working on ways to make the Wayback Machine more easily searchable.

And, of course, copyright reform remains an important ongoing question in any attempt to create a true global library. Twenty years in, Kahle is impatient.

BREWSTER KAHLE: Why aren’t all of the books in all of the libraries already digital?

JEFFREY BROWN: And why aren’t they?

BREWSTER KAHLE: I think it’s that institutions don’t know what roles they’re supposed to play going forward. They knew what it was when they were supposed to buy books and put them on shelves, but now, do they do their own digital services? Do they wait for somebody else to do it and subscribe to it?

I’m hoping that, by at least 2020, all right, so in three, four years from now, we’re not talking about, wouldn’t it be great to build a complete digital library of the Library of Congress online? We say, OK, that’s done. Now what do we do? How do we go and make the next better services? How do we make a global brain? How do we go and make it so that Nobel scientists are using these vast resources to go and make new discoveries?

I think we only have pieces now.

JEFFREY BROWN: Billions of pieces, with billions more being collected all the time.

From San Francisco, I’m Jeffrey Brown for the “PBS NewsHour.”