
Who Checks the Spell-Checkers?
Microsoft Word's dictionary is old and outdated. Here's how to fix it.


Google's system relies heavily on word data gathered from the Web itself. As tech staff member Pandu Nayak explained to me recently, Google tries to determine proper spelling algorithmically. While Nayak was unable to look up exactly when Barack Obama entered the lexicon, he predicted that the president-elect was in there well prior to his 2004 convention speech, when even local attention would have produced a substantial online footprint. As soon as a word starts showing up on the Web with any appreciable frequency, it becomes a candidate for a spelling suggestion. Take a very obscure academic term like theothanatology—the study of the death of God—which returns all of 829 results as of this writing. Not only does Google recognize the word, it gets you there from a close misspelling like theotanatalegy. (Live Search is a little behind here. It returns 103 results but can't correct a misspelling that's even one letter off.)

Google's process is wholly automated, which generates a natural set of challenges. The correct spelling of a word is usually more frequent than its incorrect permutations, but there are exceptions. Dalmation, for example, is such a common misspelling of Dalmatian that it can trip up the algorithms. The best search-engine spelling models look at the other words in the query for clues. A search for Sasha Baron Cohen automatically corrects to Sacha, since that spelling of the first name is heavily associated with the latter two. The best algorithms can identify a mistake even when each individual word is spelled correctly: a Google search for golf war returns some results for Gulf war as well.

What would happen if Google's search technology were ported into a word processor? First, the spell-checker would recognize the bulk of any document's proper nouns (no more squiggly red line under DiCaprio) as well as any new terms the kids are using these days. (Urban Dictionary tells me, for example, that overchicked is an adjective used to describe a man who is significantly less attractive than his female companion. A word processor powered by search-engine spelling could handle overchicked just fine.)

I also suspect the search-engine model would do a better job at suggesting the right word when you really did make an error. Most word processors make suggestions using the concept of "edit distance"—basically the number of letters you have to change, add, delete, or switch to transform one word into another. Duck has an edit distance of one from luck, and trial and trail are also just one edit away. (For the nitty-gritty on this, see Google research director Peter Norvig's paper on how to write a spell-check program.) While edit distance usually works pretty well for word processors, it can produce some funny suggestions, like Boatman for Obama. (The edit distance there is three; just switch the b and o, add a t, add an n.) Most search engines, by comparison, complement the edit-distance method with a huge amount of data on common mistakes. Given the complexity of the English language, this real-world information is a tremendous spell-checking boon.
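For the curious, here is a rough sketch of the edit-distance idea in Python. It is not what Word or Google actually runs; it is one textbook variant (sometimes called optimal string alignment) that counts changing, adding, deleting, or switching adjacent letters as one edit each. The function name edit_distance and the choice to ignore capitalization are my own assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Count the single-letter edits needed to turn a into b.

    Allowed edits: change a letter, add a letter, delete a letter,
    or switch two adjacent letters. Case is ignored (an assumption
    made for this sketch, not something the article specifies).
    """
    a, b = a.lower(), b.lower()
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = edits needed to turn the first i letters of a
    # into the first j letters of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i letters
    for j in range(cols):
        d[0][j] = j  # add all j letters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # add
                d[i - 1][j - 1] + cost,  # change (or keep)
            )
            # Switch two adjacent letters, as in trial -> trail.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


print(edit_distance("duck", "luck"))      # one changed letter
print(edit_distance("trial", "trail"))    # one switched pair
print(edit_distance("Obama", "Boatman"))  # one switch plus two added letters
```

A checker built only on this number has no idea which nearby word is the likelier one, which is how a suggestion like Boatman can outrank nothing at all.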

The search-engine method does have drawbacks. People have faith that Microsoft Word won't mislead them spellingwise. Perhaps because those red squigglies are so quietly reprimanding, we do anything we can to avoid them. In that last sentence, I originally wrote reprimatory, which is not a real word. Microsoft suggested respiratory. I appealed that verdict to Google, which returned this blog post in which someone uses the word in a comment, plus a bunch of Italian pages with reprimatori. So even though reprimatory isn't a bona fide word, Google found it often enough that it didn't return an error. Relying on Web users for your dictionary does have its perils.

Because it is guided by humans, the Word dictionary is full of words that Microsoft thinks you should be using—it's "prescriptive" instead of "descriptive," to use the lexicographer's parlance. Microsoft will tolerate a few FCC violations in your copy, but damned if it will ever suggest one. Just watch what it does with "siht."

While New Yorker critic Louis Menand has written movingly about Word's hijacking of the writing process, there is something to be said for steering people toward basic literacy. If Microsoft Office's core dictionary becomes a creation of the Web, we'll be handing the keys to a bunch of people who often wield the language clumsily. This clumsiness may be the parent of linguistic evolution, but it's going to make for some rocky spelling suggestions.

Some of these problems could be solved algorithmically, such that a minor word like reprimatory returns an error if it fails to meet a certain frequency in the index. At the very least, Microsoft could give Word a supplemental online dictionary, to ensure that its words are always up-to-date. (Google Docs, too, should take a few hints from the Google search engine.) Eventually, a spell-check based on Web data will be the way to go. Sure, we will see a few more naughty words and Dalmations in our Word documents, but the end product will be something that resembles the way people use language in the present day. Tally it up as one more victory for the pragmatists in the language wars.
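The frequency cutoff could look something like the Python sketch below. Everything in it is illustrative: the function name is_known_word, the cutoff of 100, and the count given for reprimatory are assumptions made up for the example. Only the 829 figure for theothanatology comes from the search results quoted earlier.

```python
def is_known_word(word: str, web_counts: dict[str, int], min_count: int) -> bool:
    """Accept a word only if it appears at least min_count times in the index."""
    return web_counts.get(word.lower(), 0) >= min_count


# Toy index. The reprimatory count is a placeholder, not a measured figure.
web_counts = {
    "theothanatology": 829,  # result count quoted in this article
    "reprimatory": 3,        # hypothetical
}
MIN_COUNT = 100  # hypothetical cutoff

for word in ("theothanatology", "reprimatory"):
    verdict = "ok" if is_known_word(word, web_counts, MIN_COUNT) else "flag as error"
    print(word, "->", verdict)
```

Picking the cutoff is the hard part. Set it too high and obscure but legitimate terms get the red squiggle; set it too low and Dalmation sails through.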

Chris Wilson is an assistant editor at Slate in Washington, D.C.
Illustration by Mark Alan Stamaty.