Webhead

The Archivist

Brewster Kahle made a copy of the Internet. Now, he wants your files.

I’m a few minutes late for lunch at the Internet Archive, but they know what kept me. The view of San Francisco Bay outside the archive’s digs at the Presidio is captivating even if you already live here. Just up the road, the Golden Gate Bridge rises, impossibly huge and unbelievably beautiful, to straddle the bay. (Check out the satellite photo if you’re prepared to weep with envy.) The wraparound splendor inspires fanciful thought. No wonder Gene Roddenberry conjured the Starfleet Academy right where I’m standing.

Thanks to the ruthless hippies who run local politics, the Presidio’s former Army barracks are filled by nonprofits rather than condos. Search-engine wiz and dot-com multimillionaire Brewster Kahle founded the archive here in 1996 with a dream as big as the bridge: He wanted to back up the Internet. There were only 50 million or so URLs back then, so the idea only seemed half-crazy. As the Web ballooned to more than 10 billion pages, the archive’s main server farm—hidden across town in a data center beneath the city’s other big bridge—grew to hold a half-million gigabytes of compressed and indexed pages.

Kahle is less the Internet’s crazy aunt—the tycoon who can’t stand to throw anything away—than its evangelical librarian. “The history of digital materials in companies’ hands is one of … loss,” he tells me in a rushed meeting. Like it or not, the Web is the world’s library now, and Kahle doesn’t trust the guys who shelve the books. They’re obsessed with posting new pages, not preserving old ones. Every day, Kahle laments, mounds of data get purged from the Web: government documents, personal sites, corporate communications, message boards, news reports that weren’t printed on paper. For most surfers, once a page disappears from Google’s cache it no longer exists.

Instead of creating another startup that crawls the Web to make money, Kahle used his millions to preserve as much knowledge as possible and, just as important, make it accessible to anyone who can get to a computer. The archive’s Wayback Machine has captured only a fraction of the Internet’s history, but it still holds 40 billion pages from 50 million sites. With a couple of clicks, you can revisit CNN’s home page from the day the U.S. began bombing Iraq and learn that salon.com was once a hairdressers’ site.

As a time-travel device, the Wayback Machine is far from perfect. Many sites blocked Kahle from crawling them—thanks for nothing, Hotwired—and lots of copyrighted material has been removed at the owner’s request. You can search old nytimes.com front pages, for instance, but the articles themselves are locked up in the Times’ paid archive. My biggest gripe is that there’s no way to run a simple keyword search over all 40 billion pages. Instead, you have to type in a specific URL and a date range and then click through a list of preserved copies of that page. Maybe someday they’ll add a search box, but serving queries on a Web cache five times the size of Google’s would take lots more hardware than what they’ve got under the bridge.

The Internet Archive isn’t just the Wayback Machine—the nonprofit’s two dozen or so employees have filled an equal amount of disk space with uploaded film collections, presidential debates, Bugs Bunny cartoons, and news broadcasts from the Middle East. The archive is especially keen on books. They’ve scanned about 25,000 of them so far as part of the Million Book Project, a collaboration with Indian and Chinese agencies to create an online library in the place of bricks-and-mortar reading rooms.

I test out the books project by spending an afternoon searching, reading, and printing pages from old tomes like Dion Clayton Calthrop’s English Costume, a 1907 coffee-table book on Brit dandies through the ages. Some of the scans look like awkward, off-center Xeroxes, but others let you search inside, just like on Amazon.com, or cut and paste passages into your homework. You’d better spell-check that homework, though. When I copy a passage from Calthrop’s book that begins, “Here you see the coat,” it comes out as, “Here $’ou see tlae coat.”

In cost and complexity, scanning a million books is as big a challenge as hosting a million gigabytes (they designed custom servers to solve that problem). Cheap is the watchword for everything here. I sit in on a demo of a home-brew book scanner designed to cost a fraction of automated commercial models. It’s a black-curtained box about 4 square feet, with two store-bought digital SLR cameras hung to point at the pages of a book cradled below. Software on a nearby workstation turns the photos into XML-enhanced files that go into a searchable, sharable database.

The final step in building the archive into a true global library: getting you to contribute. Ourmedia, a project launched two weeks ago, offers free, unlimited, permanent storage of your videos, photos, Word files, podcasts—anything that’s not porn and not covered by someone else’s copyright. The one catch: The files, stored on Internet Archive servers, will be freely available to anyone in the world.

Sure, you could store your files for free—and in private—in your Gmail account or share your photos on Flickr. But Kahle thinks you should trust him, not Internet companies that have a habit of disappearing along with their customers’ data. Remember MP3.com? After convincing thousands of indie bands to create pages and upload music, the site’s owners sold the company and the whole thing got trashed on short notice. Google may have a zillion dollars, a do-gooder management team, and a wide-open future, but the same was once true of Netscape.

An A-list of big-brain bloggers like Lawrence Lessig and Howard Rheingold is supplying the ideas for Ourmedia, but Kahle’s superfat server setup is what makes the whole thing possible. After a day at the archive, I have no doubts about his sincerity or his team’s dedication. What I worry about is his $5 million budget—that’s a lot closer to mine than to Google’s. And I wonder who could replace Kahle’s brains, drive, and connections if he gets hit by a Presidio bus. The archive has already outlasted both MP3.com and Netscape, though. Maybe that’s because, unlike the other guys, Kahle planned on nonprofitability from the start.