Nothing but Net

Preserving the Internet, 1 terabyte at a time.

Feb 28, 1997, 3:30 AM

The Internet is a moving target. Every minute, thousands of Web pages are updated or abandoned. Messages sent to newsgroups replace older postings. All but a fraction of the chat-room conversations and digital images that streak across the Net vanish after they’re displayed.

Seeking to preserve the chaos of the Net for posterity is Brewster Kahle, a man with a mission, a server, and a lot of magnetic tape. Kahle, who once designed computers for Thinking Machines Corp., founded the Internet Archive in 1996 to collect and store all the disparate bits of the Internet. From offices in the Presidio, the former Army base adjacent to San Francisco’s Golden Gate Bridge, the Internet Archive’s powerful computer crawls the Internet at high speeds. Consulting intelligent algorithms about what information to store and how often, the archive’s computer copies data to tape cassettes on a Quantum DLT4500 recorder. When each cassette is full, a robotic arm removes it, stores it in a carousel, and replaces it with a blank one.

The Internet may seem impossibly vast to users, but in fact it’s quite finite. The entire World Wide Web is currently estimated to contain about 1.5 terabytes (or 1.5 million megabytes) of data. Newsgroups and other Internet subsystems account for another 5.5 or so terabytes. (Compare these numbers with the 20 terabytes of ASCII data contained in the Library of Congress’s 20 million books or the 8 terabytes of data at the average video store.) With tape-cassette storage costing only $20 per gigabyte (1 billion bytes), archiving the Internet is practically economical. Already, the archivists have stockpiled more than 2 terabytes of the Net, and currently they’re storing about 100 gigabytes of data every month. Faster connections to the Net promise to speed things up, and Kahle estimates that his group will be done by the end of 1997.
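For readers who want to check the arithmetic, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above and assumes decimal units (1 terabyte = 1,000 gigabytes), which matches the article's "1.5 million megabytes." The function and constant names are made up for illustration.

```python
# Figures quoted in the article; names are illustrative, not from any real tool.
GB_PER_TB = 1000            # assumption: decimal units
WEB_TB = 1.5                # estimated size of the World Wide Web
OTHER_TB = 5.5              # newsgroups and other Internet subsystems
COST_PER_GB_USD = 20        # tape-cassette storage cost
STORED_TB = 2.0             # already stockpiled ("more than 2 terabytes")
RATE_GB_PER_MONTH = 100     # current archiving rate


def total_size_gb():
    """Total estimated size of the Internet, in gigabytes."""
    return (WEB_TB + OTHER_TB) * GB_PER_TB


def tape_cost_usd():
    """Cost of enough tape to hold one full copy."""
    return total_size_gb() * COST_PER_GB_USD


def months_remaining():
    """Months left at the current rate, ignoring any future speedup."""
    remaining_gb = total_size_gb() - STORED_TB * GB_PER_TB
    return remaining_gb / RATE_GB_PER_MONTH


print(total_size_gb())     # 7000.0 gigabytes
print(tape_cost_usd())     # 140000.0 dollars
print(months_remaining())  # 50.0 months
```

By these numbers, one copy of the Net fits on about $140,000 worth of tape. At 100 gigabytes a month, though, the remaining 5 terabytes would take roughly 50 months, so finishing by the end of 1997 presumably depends on those faster connections.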

Storing the Internet once is only the beginning. As experienced Web surfers know, things change rapidly on the Net. The archive doesn’t have the computer muscle to store the publicly available Internet every week, but even if it did, a lot of stuff would still fall through the cracks. On sites like MSNBC and CNN, breaking news comes and goes every minute, which means pages disappear faster than they can currently be squirreled away. Slate is updated daily. Shifting faster still are Web sites generated by databases, such as the online bookstore Amazon.com. Because the information these sites produce is specific to a user’s experience, they can generate a literally infinite number of different pages. Finally, much of the traffic on the Internet is dynamic–chat rooms, instant messages, and now even phone conversations. To archive the Internet with absolute fidelity would require cloning not only every computer on the Internet, but also every person using every computer.

Many responsible netizens already archive themselves for selfish reasons. Archiving is a no-brainer for publication sites like the San Francisco Chronicle’s The Gate, which collects the contents of the daily newspaper and connects them to a good search engine. And other sites like Deja News already assemble postings from the Internet newsgroups.

Where the Internet Archive trumps these archives, of course, is in its sheer comprehensiveness. While it isn’t a replica of the Internet, it’s a start. And it’s not useful just to historians. Suppose your Web browser allowed you to specify not only an address but also a date. Remember that headline you saw on Wired News, but have been unable to find since? The headline was posted for only a day, and you haven’t had much luck using the site’s search tool to locate the piece. But using the Internet Archive to turn back the hands of time will uncover it for you. And what about your teen-age cousin’s Web page, with that cute picture of her Mohawk? Cousin’s mother cancelled her ISP account, and now the site is gone. But an intelligent browser could catch the “no such site” error and look it up on the archive instead, displaying the last-known version. Did your favorite politician really just flip-flop on your hot-button issue? Compare last year’s campaign Web site with today’s. These are just a few of the many valuable services that promise to keep the nonprofit Internet Archive richly endowed.
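The date-plus-address idea and the "no such site" fallback can be sketched in a few lines of Python. This is a toy, in-memory illustration under stated assumptions, not a description of how the Internet Archive or any browser actually works. Every name here (`ToyArchive`, `SiteGone`, `fetch_with_fallback`) is hypothetical.

```python
import bisect
from datetime import date


class SiteGone(Exception):
    """Hypothetical stand-in for a browser's "no such site" error."""


class ToyArchive:
    """Keeps dated snapshots of pages, keyed by address."""

    def __init__(self):
        self._snapshots = {}  # address -> sorted list of (date, content)

    def store(self, url, when, content):
        # Keep each page's snapshots sorted by date.
        bisect.insort(self._snapshots.setdefault(url, []), (when, content))

    def lookup(self, url, as_of):
        """Return the newest snapshot taken on or before as_of, or None."""
        entries = self._snapshots.get(url, [])
        dates = [when for when, _ in entries]
        i = bisect.bisect_right(dates, as_of)
        return entries[i - 1][1] if i else None

    def latest(self, url):
        """Return the last-known version of a page, or None."""
        entries = self._snapshots.get(url, [])
        return entries[-1][1] if entries else None


def fetch_with_fallback(url, live_fetch, archive):
    """Try the live site first; if it is gone, show the archived copy."""
    try:
        return live_fetch(url)
    except SiteGone:
        return archive.latest(url)
```

A browser built along these lines would call `lookup` when you type a date next to the address, and `fetch_with_fallback` when your cousin's page has vanished. The hard part, of course, is not the lookup but having taken the snapshot in the first place.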

Useful though it might be, the idea of archiving the Internet is assailed by all sides. David Berreby argued last year in Slate that exhaustive documentation of our world threatens to box us into a corner. The recent “Documenting the Digital Age” conference gathered experts from the computing, telecommunication, and archiving worlds to explore these issues. Corporate executives complained that because their archives are routinely subpoenaed by plaintiffs’ attorneys, they have every incentive to shred their data instead of preserving them. Lawyers worried aloud about privacy and copyright concerns. Should you have the right to exclude your public page from the archive? (Consensus opinion: Yes.) Should we be saving usage logs, which detail every page a person sees? (Probably not.) Doesn’t this whole thing violate current copyright laws left and right? (Almost certainly.) Should those laws be amended to allow such an archive? (Probably.)

Professional archivists argue that it’s a waste of time to store the Internet without providing a proper historical context. Others say that having too much information about the Web at our disposal will be as bad as not having enough. They add that finding things promptly on the Web with a search engine is hard enough, that using it as a historical research tool would be incredibly painful. They advocate an orderly weeding, assembling, and categorizing of digital records. Microsoft’s chief technical officer (and Slate contributor), Nathan Myhrvold, whose “Save the Web” memo last year helped start the archive movement, counters that we don’t know now what will be important later. Your cousin might grow up to be president, at which point her teen-age Mohawk Web site will become substantially more important than it is now. Myhrvold adds that it’s better to start saving today’s Internet now, even if it is badly collected and organized, rather than lose it forever.

And to think that Brewster Kahle thought he was just solving a problem by starting the Internet Archive, and not introducing lots of new ones.