The Internet Archive wants your files.

Inside the Internet.
April 7 2005 1:39 PM

The Archivist

Brewster Kahle made a copy of the Internet. Now, he wants your files.

Illustration by Mark Alan Stamaty

I'm a few minutes late for lunch at the Internet Archive, but they know what kept me. The view of San Francisco Bay outside the archive's digs at the Presidio is captivating even if you already live here. Just up the road, the Golden Gate Bridge rises, impossibly huge and unbelievably beautiful, to straddle the bay. (Check out the satellite photo if you're prepared to weep with envy.) The wraparound splendor inspires fanciful thought. No wonder Gene Roddenberry conjured the Starfleet Academy right where I'm standing.

Thanks to the ruthless hippies who run local politics, the Presidio's former Army barracks are filled by nonprofits rather than condos. Search-engine wiz and dot-com multimillionaire Brewster Kahle founded the archive here in 1996 with a dream as big as the bridge: He wanted to back up the Internet. There were only 50 million or so URLs back then, so the idea only seemed half-crazy. As the Web ballooned to more than 10 billion pages, the archive's main server farm—hidden across town in a data center beneath the city's other big bridge—grew to hold a half-million gigabytes of compressed and indexed pages.


Kahle is less the Internet's crazy aunt—the tycoon who can't stand to throw anything away—than its evangelical librarian. "The history of digital materials in companies' hands is one of … loss," he tells me in a rushed meeting. Like it or not, the Web is the world's library now, and Kahle doesn't trust the guys who shelve the books. They're obsessed with posting new pages, not preserving old ones. Every day, Kahle laments, mounds of data get purged from the Web: government documents, personal sites, corporate communications, message boards, news reports that weren't printed on paper. For most surfers, once a page disappears from Google's cache it no longer exists.

Instead of creating another startup that crawls the Web to make money, Brewster used his millions to preserve as much knowledge as possible and—just as important—make it accessible to anyone who can get to a computer. The archive's Wayback Machine has captured only a fraction of the Internet's history, but it still holds 40 billion pages from 50 million sites. With a couple of clicks, you can revisit CNN's home page from the day the U.S. began bombing Iraq and learn that was once a hairdressers' site.

As a time-travel device, the Wayback Machine is far from perfect. Many sites blocked Kahle from crawling them—thanks for nothing, Hotwired—and lots of copyrighted material has been removed at the owner's request. You can search the Times' old front pages, for instance, but the articles themselves are locked up in the paper's paid archive. My biggest gripe is that there's no way to run a simple keyword search over all 40 billion pages. Instead, you have to type in a specific URL and a date range and then click through a list of preserved copies of that page. Maybe someday they'll add a search box, but serving queries on a Web cache five times the size of Google's would take lots more hardware than what they've got under the bridge.

The Internet Archive isn't just the Wayback Machine—the nonprofit's two dozen or so employees have filled an equal amount of disk space with uploaded film collections, presidential debates, Bugs Bunny cartoons, and news broadcasts from the Middle East. The archive is especially keen on books. They've scanned about 25,000 of them so far as part of the Million Book Project, a collaboration with Indian and Chinese agencies to create an online library in place of bricks-and-mortar reading rooms.

I test out the books project by spending an afternoon searching, reading, and printing pages from old tomes like Dion Clayton Calthrop's English Costume, a 1907 coffee-table book on Brit dandies through the ages. Some of the scans look like awkward, off-center Xeroxes, but other ones let you search inside, just like on, or cut and paste passages into your homework. You'd better spell-check that homework, though. When I copy a passage from Calthrop's book that begins, "Here you see the coat," it comes out as, "Here $'ou see tlae coat."

In cost and complexity, scanning a million books is as big a challenge as hosting a million gigabytes (they designed custom servers to solve that problem). Cheap is the watchword for everything here. I sit in on a demo of a home-brew book scanner designed to cost a fraction of the price of automated commercial models. It's a black-curtained box about 4 square feet, with two store-bought digital SLR cameras hung to point at the pages of a book cradled below. Software on a nearby workstation turns the photos into XML-enhanced files that go into a searchable, sharable database.

The final step in building the archive into a true global library: getting you to contribute. Ourmedia, a project launched two weeks ago, offers free, unlimited, permanent storage of your videos, photos, Word files, podcasts—anything that's not porn and not covered by someone else's copyright. The one catch: The files, stored on Internet Archive servers, will be freely available to anyone in the world.

Sure, you could store your files for free—and in private—in your Gmail account or share your photos on Flickr. But Kahle thinks you should trust him, not Internet companies that have a habit of disappearing along with their customers' data. Remember After convincing thousands of indie bands to create pages and upload music, the site's owners sold the company and the whole thing got trashed on short notice. Google may have a zillion dollars, a do-gooder management team, and a wide-open future, but the same was once true of Netscape.

