Search Me

Why it’s still harder to find things on the Web than at the library.

April 25, 1997, 3:30 AM

Imagine walking into your local library to look for a book. Hoping for a librarian to guide you, you are confronted instead by a bewildering array of entrepreneurs, each offering to find what you’re looking for. But none has cataloged the whole library, each has cataloged a different part of it, each uses a different system, and none is terribly satisfactory. That is the situation on the Internet today.

The old-fashioned librarian still has an edge over the vast resources and computing power of the Internet in several ways. First, a library offers more valuable stuff. Certainly there is more raw information on the Web than in any library. But the kind of stuff people are actually willing to pay for is sadly lacking. So long as publishers fear cannibalizing their sales–or aren’t able to make money maintaining Web sites–there will be books and periodicals that are only available offline, or online for a fee. You can read the Wall Street Journal free at your library, but you have to pay to read it on the Web. And of course there are for-a-fee search services like LEXIS-NEXIS that allow you to read documents online, but for the time being, serious research–especially for the budgetarily challenged–still usually leads to paper.

Libraries also have a distributed cataloging infrastructure. This means that everyone shares a common cataloging system. When a new book gets published, a single library will do the work to abstract and catalog it, then share that work with all the others. This effort (coordinated by the Online Computer Library Center) is assisted by standards for cataloging information. The Dewey Decimal System and the Library of Congress Classification are both schemes that help guarantee that a book can be found the same way in different libraries.

Most important, though, libraries invite reduced expectations. No one expects to walk into a library and get a list of every book that contains the word “poker” organized by subject, title, and author. We’re just happy to look up “poker” in the (online?) card catalog and find the books that are actually about poker. And typically we’d be just as happy not to find the references to “red-hot poker” or “poker faced.”

Of the many Internet search services, Yahoo! comes the closest to this card-catalog approach. It does it the old-fashioned way, hiring people to look at each site and assign it an abstract and one or more categories. This is easy for people to understand, but it’s not comprehensive. Even very diverse sites seldom appear in more than a few categories. For instance, a search for Slate (click here) shows it somewhere under the Politics category, but not the Movie Reviews or Economics categories. Yet Slate runs a movie review every week and has published many articles on economics.

Even if Yahoo! wanted to be this comprehensive, the humans cataloging the materials cannot possibly keep up with the ever changing nature of the Web. To stay current, they would have to read every page of every site every day. To solve this problem, most other search services use computers instead of humans to do their cataloging. Their machines scan every Web page they can find by a process known as crawling (see my discussion of crawling in a previous “Webhead” column) and put every word on every page into a giant index.
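For readers who want to see what "a giant index" amounts to, here is a minimal sketch in Python. It assumes the crawling has already happened and the page text is sitting in a dictionary; the page addresses, the sample text, and the names tokenize and build_index are invented for illustration and are not taken from any real search service.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every word to the set of page addresses on which it appears."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


# Hypothetical crawled pages: address -> text found on the page.
pages = {
    "slate.example/home": "Slate magazine: politics, movie reviews, economics",
    "roofing.example": "Slate roofing tiles. Slate roof repair.",
    "cards.example": "Poker rules and poker strategy",
}

index = build_index(pages)
print(sorted(index["slate"]))
```

Looking up a word is then a single dictionary access, which is why this kind of index can answer quickly no matter how many pages it covers. Note that the index records only which words appear where, not what any page is about.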

One of the first sites to take that approach was Infoseek. Click here to search for “slate.” The naive user might expect to see a link to our home page. But the problem is that our home page includes few instances of the word “slate”–many fewer than, say, the home page of a dealer of roofing materials or pool tables is likely to. That’s the limitation of a system based on text indexing. It doesn’t really know what a page is about, just what words appear on it. Using a little artificial intelligence, the computer tries to decide which pages are more “about” a given word than others, but it’s not always successful.
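The roofing-dealer problem is easy to reproduce. The sketch below ranks pages by the share of their words that match the query, which is a deliberately crude stand-in for whatever Infoseek actually does (the column does not say). The pages and the function names score and rank are made up for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(text, word):
    """Fraction of the words on the page that are the query word."""
    words = tokenize(text)
    if not words:
        return 0.0
    return Counter(words)[word] / len(words)


def rank(pages, word):
    """Return matching page addresses, highest score first."""
    scored = [(score(text, word), url) for url, text in pages.items()]
    return [url for s, url in sorted(scored, reverse=True) if s > 0]


# Hypothetical pages: the magazine mentions its own name only once.
pages = {
    "magazine.example": "Slate: politics, culture, and news on the Web",
    "roofing.example": "Slate roofing: slate tiles, slate roof repair",
    "pooltables.example": "Pool tables with slate beds",
}

print(rank(pages, "slate"))
```

Under this scoring the roofer comes first and the magazine comes last, simply because the roofer repeats the word more often on a shorter page.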

HotBot is another search site. Try the “slate” search by clicking here. By the way, as you try out these searches, you may see an advertisement for Slate. That means we bought the rights to the word “slate.” Whenever someone searches for it, our ad shows up. On the Web you can buy words–isn’t that great?

Excite’s site uses some of that artificial intelligence to help you refine your searches. Try searching for “slate” by clicking here. If you find a page that you like, you can click on “more like this.” Excite will return a list of pages that “look like” the page you clicked on. That means that similar words appear on the pages with similar frequency. Sometimes this works; more often it returns pages that seem impressively unrelated. For instance, this search for pages like the Slate parody Stale yields not only Slate (a good guess) but also the Steel Lunch Boxes Web page (not a good guess, but entertaining nonetheless).
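One common way to measure whether "similar words appear with similar frequency" is cosine similarity between word-count vectors. The sketch below uses it as an assumption; Excite's real method is not described here and may well differ. The page text and the names word_counts, cosine_similarity, and more_like_this are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count how often each word appears on a page."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """1.0 for identical word proportions, 0.0 for no words in common."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def more_like_this(target, pages):
    """Return the other pages, most similar to the target first."""
    target_counts = word_counts(pages[target])
    others = [
        (cosine_similarity(target_counts, word_counts(text)), url)
        for url, text in pages.items()
        if url != target
    ]
    return [url for _, url in sorted(others, reverse=True)]


# Hypothetical page text, invented for the example.
pages = {
    "stale.example": "stale magazine parody politics news",
    "slate.example": "slate magazine politics news culture",
    "lunchbox.example": "steel lunch boxes collectors magazine",
}

print(more_like_this("stale.example", pages))
```

The lunch-box page still turns up in the list because it shares a single word with the parody. A measure like this has no notion of subject, only of overlapping vocabulary.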

AltaVista’s site takes a weirder approach to this idea of refining searches. Search for Slate by clicking here. Adding or subtracting words from your search criteria can help find what you’re looking for. For instance adding “Kinsley” and subtracting “roofing” would probably increase your chances of finding our home page. Their “LiveTopics” technology attempts to help you do this. They look up “slate” in their index and see that it often occurs on the same page as “roof,” so they suggest this as a possible refinement of the search. But since these suggestions are generated by computer, they can be very weird. (To see this in action, click on one of the “LiveTopics” links on the AltaVista search-results page.) How did “Adaptec” or “Skadden” get on the list? Your guess is as good as mine. (Well, for Skadden it’s not such a mystery, considering that “Slate” is part of the name of the D.C. law firm Skadden, Arps, Slate, Meagher & Flom.)
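Both halves of this idea, adding and subtracting words and letting the computer suggest refinements from words that share a page, can be sketched in a few lines. This is an illustration of the general technique under simple assumptions, not AltaVista's LiveTopics code; the pages and the names search and suggest_refinements are invented.

```python
import re
from collections import Counter


def tokenize(text):
    """Return the set of distinct lowercase words on a page."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(pages, require, exclude=()):
    """Pages containing every required word and none of the excluded ones."""
    results = []
    for url, text in pages.items():
        words = tokenize(text)
        if all(w in words for w in require) and not any(w in words for w in exclude):
            results.append(url)
    return sorted(results)


def suggest_refinements(pages, word, top=3):
    """Words that most often share a page with the query word."""
    counts = Counter()
    for text in pages.values():
        words = tokenize(text)
        if word in words:
            counts.update(words - {word})
    return [w for w, _ in counts.most_common(top)]


# Hypothetical pages, invented for the example.
pages = {
    "roofer-a.example": "Slate roofing and roof tiles",
    "roofer-b.example": "Slate roof repair",
    "magazine.example": "Slate magazine, edited by Kinsley",
}

print(search(pages, ["slate", "kinsley"], exclude=["roofing"]))
print(suggest_refinements(pages, "slate", top=1))
```

Because the suggestions come from raw co-occurrence counts, any word that happens to sit beside the query on enough pages will be offered, whether or not a person would see the connection.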

The best solutions for searching will probably result from a combination of humans and computers. If AltaVista’s list of search refinements were generated by a human, for instance, it might be more helpful. If the producers of every site cataloged it themselves, then Yahoo! wouldn’t have a hard time keeping up with them. Of course, everyone would have to agree on standard ways to do this, and if everyone agreed, for-profit search sites like Yahoo! probably wouldn’t be necessary. So don’t be surprised if the status quo lasts just a little while longer.