Technology

Speak English, Stupid Computer!

Finally, a search engine that understands you. But don’t ditch Google yet.

We’ve gotten used to searching the Web like robots. Rather than talking to the search engine as we would speak to a person, we throw it a bunch of keywords (“John Lennon shot”) and expect it to know what we’re looking for (Mark David Chapman). How would life be different if our search engines were more human? A new piece of software called Powerset will give you some idea. Unlike the Googles or Yahoos of the world, Powerset “reads” the sites it crawls, parsing sentences for meaning with a lot of complicated algorithms. The end result is that Powerset allows you to express yourself conversationally. Want to know who shot John Lennon? Just ask: Who shot John Lennon? Mark David Chapman, Powerset replies.

In theory, this is better than keyword searching. If you have a specific question, it’s convenient to ask your browser the same way you’d ask a reference librarian. But does it work that way in practice?

Not quite yet. As of now, Powerset searches only two sites: Wikipedia and Freebase, a giant database of user-generated information. Let’s use Rudy Giuliani, someone with a robust Wikipedia presence, as our first guinea pig.

I’ll start at the beginning: “Where was Rudy Giuliani born?” “Brooklyn,” Powerset answers in large type. This information comes from Freebase, which takes what’s called a “bottom-up” approach to content. Each fact in the database is compartmentalized into a particular category—Giuliani’s Freebase page lists “Brooklyn” in the slot for “Place of birth.”

So far, so good. But that was an easy one that Google can answer in the same number of clicks. Now for a slightly tougher question: “Who did Rudy Giuliani defeat?” The search engine returns this sentence at the top of its results: “In late 1993, David Dinkins was defeated by Rudolph Giuliani in his bid for reelection.” Impressive: Powerset scores points for understanding the passive voice in the original phrase and recognizing it as an appropriate response to my active sentence. By contrast, a Google search for that specific query, “Who did Rudy Giuliani defeat,” returns nothing very useful. A more typical Google keyword search, “Rudy Giuliani defeat,” returns similarly scattered pages. Several of the top results are stories asking whether Rudy can defeat Hillary Clinton in 2008. (We don’t need a search engine to know the answer to that question.) Google also retrieves a lot of random news articles that just happen to contain Giuliani’s name and the word defeat.

Clicking on Rudy Giuliani’s name in Powerset’s search results takes us to the site’s enhanced version of his Wikipedia page. Flip the switch on the “Article Outline” box in the upper right to “Show Factz,” which produces a list of the article’s subject-verb-object statements: “Giuliani served term,” “Giuliani practiced law,” “Giuliani indicted figures.” The philosophy here is clear: By studying the relationships between words on a page, Powerset can unearth facts that you’d have to dig for on a traditional search engine.
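
To make the idea of subject-verb-object statements concrete, here is a toy sketch in Python. It is not Powerset’s method, which is described here only in general terms. It assumes two hand-written regular-expression patterns, and the names PASSIVE, ACTIVE, and extract_fact are invented for illustration. It copes with only the simplest active and passive sentences.

```python
import re

# Toy patterns only (an assumption for illustration): a real semantic
# parser analyzes full grammar rather than matching regular expressions.
PASSIVE = re.compile(r"^(?P<obj>.+?) was (?P<verb>\w+) by (?P<subj>.+?)\.?$")
ACTIVE = re.compile(r"^(?P<subj>\w+) (?P<verb>\w+) (?P<obj>.+?)\.?$")


def extract_fact(sentence):
    """Return a (subject, verb, object) triple, or None if no pattern fits."""
    sentence = sentence.strip()
    # Try the passive pattern first so "X was defeated by Y" is read as
    # "Y defeated X" rather than as an active sentence about X.
    for pattern in (PASSIVE, ACTIVE):
        match = pattern.match(sentence)
        if match:
            return (match.group("subj"), match.group("verb"), match.group("obj"))
    return None


print(extract_fact("David Dinkins was defeated by Rudolph Giuliani."))
# ('Rudolph Giuliani', 'defeated', 'David Dinkins')
print(extract_fact("Giuliani practiced law"))
# ('Giuliani', 'practiced', 'law')
print(extract_fact("Rudolph Giuliani defeated David Dinkins"))
# ('Rudolph', 'Giuliani', 'defeated David Dinkins')
```

The first call turns a passive sentence into the same subject-verb-object shape an active one would produce. The third call shows how quickly a crude pattern goes wrong: a two-word name is enough to make it treat “Giuliani” as a verb.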

The problem, as any ESL student will tell you, is that the English language is extremely difficult to parse. That means Powerset spits out a lot of garbage. Here are a few other “factz” that Powerset has culled on Rudy: “Giuliani patented walk,” “vote votes gains,” “grounds send children.”

While Powerset is a neat demo, it’s nowhere close to an improvement over any of the current titans in the industry, particularly when you consider that it works best when you’re asking for very basic information. It can field search queries phrased as questions, but the results aren’t that different from what Google turns up if you limit it to searching Wikipedia. For Powerset, Wikipedia is an ideal testing ground due to its homogeneity, breadth, and familiarity. It’s not a space, though, that really requires a better search engine. (Jimmy Wales agrees.)

Powerset’s limited scope is, in part, a matter of resources. The site’s general manager, Scott Prevost, told me that because it takes so much time to parse the grammar of individual Web pages, it takes Powerset much longer to build its index than it takes Google or Microsoft’s Live Search to crawl the Web. Only in the last few years has computing power become cheap enough to make even a modest semantic search engine feasible.

We should have sympathy for anyone who tries to improve the way humans communicate with computers. Years of trial and error have made people very skilled at constructing Google searches. Most Web users now have a stable of basic tricks (putting phrases in quotes, limiting searches to individual domains) and have learned to pick out quickly what they’re looking for from a long page of results. Because we’re so good at Googling, a natural-language search engine has a high bar to clear. Sure, Powerset gives me the right answer when I ask, “Who wrote The Godfather?” But so long as I can Google “godfather author” and get CNN’s obituary of Mario Puzo as the first result, I’m not about to become a Google apostate.

As Powerset begins to trawl the Web beyond Wikipedia, it needs to find a niche in which keyword searches are insufficient. One thing Powerset does have in its pocket is an enormous encyclopedia of synonyms. To show off the technology, Prevost asked me to type in, “What did Al Gore say?” The results include “Gore stated,” “Gore argued,” and even “a bill created and introduced by then Senator Al Gore.” A similar Google search is a lot less coherent.
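
The synonym matching can be illustrated with a small Python sketch. Everything in it is an assumption for illustration: the synonym table is hand-written and tiny, standing in for Powerset’s far larger one, the function name matches_verb is invented, and the sample sentences are made up rather than taken from Powerset’s results.

```python
import re

# Hypothetical, hand-written synonym table (an assumption for illustration);
# a real system would draw on a much larger resource.
SYNONYMS = {
    "say": {"say", "says", "said", "stated", "argued", "claimed"},
}


def matches_verb(query_verb, sentence):
    """Return True if the sentence contains the query verb or a listed synonym."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    # Fall back to the verb itself when it has no entry in the table.
    candidates = SYNONYMS.get(query_verb, {query_verb})
    return bool(candidates & words)


# Made-up sample sentences, not real search results.
sentences = [
    "Gore stated that the climate is changing.",
    "Gore argued for the bill.",
    "Gore won the Nobel Peace Prize.",
]
hits = [s for s in sentences if matches_verb("say", s)]
print(hits)
# ['Gore stated that the climate is changing.', 'Gore argued for the bill.']
```

A plain keyword search for “say” would miss both hits, since neither sentence contains that word. The sketch does not attempt looser matches such as the “introduced” result mentioned above.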

There are a bunch of potential uses for what I’ll call “approximate search.” I wrote a rather tortured paper in college about how Mark Twain is quoted in modern media. A Powerset-like news search would have been really helpful—it would have saved me from having to think up all the different phrases you could possibly use to invoke Twain, and then conduct an individual Nexis search for each one.

Perhaps this is the best that Powerset and other semantic search engines can hope for: We’ll continue to Google for most things but use specialized search engines for the nooks and crannies where keyword search fails to reach. These niche sites are commonly called “vertical search engines,” as they focus on one area in much more depth than a broad, “horizontal” search engine ever could. For a few examples, check out a people-search product called Spock or the travel-oriented Kayak.

For a Web startup, it’s expensive to go horizontal. Jay Bhatti, the co-founder of Spock, estimates that any site with aspirations to take on Google would require $30 million to $40 million to get off the ground, with a good chunk of that going to the thousands of servers needed to crawl the entire Web. Until Powerset can scrape that kind of money together (plus a few million more for the extra bandwidth to parse everything it retrieves), it’ll have to be content to work at the margins. So long as Google directs its energies toward improving its universal search, this coexistence might even be productive, particularly if Powerset starts to crawl something more interesting than Wikipedia.