Who Checks the Spell-Checkers?

Microsoft Word’s dictionary is old and outdated. Here’s how to fix it.

Dec 31, 2008, 6:55 AM

Spell checking Obama in Outlook

On April 30, 2007, with all the usual fanfare that accompanies a software update, Microsoft added Barack and Obama to Office’s dictionary. It was a fairly quick canonization for the Illinois senator. His surname had been on Microsoft’s candidate list for new words since Jan. 5 of that year, and his first name followed three days later, in the same recruiting class as Zune, Klum, and Friendster. Three months later, it was official—no longer would Microsoft suggest Boatman as a replacement for the future president’s last name.

Of course, by April 2007 Obama was already a figure of some renown. He’d announced his bid for the Democratic nomination in mid-January and had been an object of intense fascination since his July 2004 speech at the Democratic Convention. But escaping the shackles of Microsoft Word’s red corrugated line is no small feat, and the list of those who’ve made the cut can seem arbitrary: Why does it recognize the surnames of Matthew Broderick and Susan Sarandon but trip over DiCaprio and Blanchett? They’ve heard of Friendster, but not Facebook? Does Microsoft really want to start something with Mark Wahlberg? (Or, speaking of Entourage, with Jeremy Piven?)

There’s no reason why spell-check dictionaries need to be so behind the times. All the technology to build a relevant, timely spelling database already exists in search engines like Google and Microsoft’s own Live Search, which have a vast vocabulary of words and names and update their dictionaries in near real time. Microsoft Word may not have heard of Marky Mark, but a Live Search or a Google query for Mark Walberg includes results for the actor, who has an “h” in his last name.

For another example, take a reasonably new tech neologism like pharming. Neither Microsoft Word nor the Google Docs spell-checker, the latter of which is based on an open-source tool called GNU Aspell, has heard of the word. Live and Google recognize the term just fine, however, and can retrieve it as a correction for a basic misspelling like pharmung.

What’s behind this disparity? Word processors and search engines have different goals. Search engines have to field queries as broad and varied as the Internet itself, so they need a very large vocabulary in order to differentiate spelling mistakes from legitimate search terms. Word processors are much more conservative, limiting their lexicon to words that are definitely legitimate. This way, a program like Word can catch virtually every typo, even if it means misidentifying some proper names and newer words. In other words, search engines put breadth first and spelling accuracy second, while word processors do it the other way around. If you type in Monkees, Google will assume you’re searching for the band; Word will give you a red squiggly line, thinking you’ve screwed up the word monkeys.

Not surprisingly, search engines and word processors build their dictionaries differently. A search engine’s lexicon is typically put together using words gathered from Web pages or old search queries—a huge corpus of real-world data that constitutes a list of valid words and their frequency in the language. Word-processing lexicons are more heavily chaperoned, and the pace at which new terms enter the dictionary is much slower.
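
To make the idea of a frequency-based lexicon concrete, here is a minimal Python sketch. It is not how Google or Live Search actually build their dictionaries; the function name build_lexicon and the three-sentence corpus are invented for illustration, and a real system would work from billions of pages or queries.

```python
import re
from collections import Counter


def build_lexicon(documents):
    """Count how often each lowercase word appears across the documents."""
    counts = Counter()
    for text in documents:
        # Treat runs of letters (and apostrophes) as words.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


# Toy corpus, invented for illustration only.
corpus = [
    "Obama spoke in Chicago",
    "Senator Obama announced his bid",
    "The senator spoke again",
]
lexicon = build_lexicon(corpus)

# Words that appear in the corpus have a nonzero count; unseen words count as 0.
print(lexicon["obama"], lexicon["boatman"])
```

In a lexicon like this, a name earns its place simply by showing up often enough, with no editor involved.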

Microsoft is beginning to incorporate more natural-language detection into its Office products, though. Ten years ago, they kept candidate words on a single Excel sheet for review by a higher-up. Mike Calcagno, a member of Microsoft’s Natural Language Group, says the company now scans through trillions of words, including anonymized text from Hotmail messages, in the hunt for dictionary candidates. On top of this, they monitor words that people manually instruct Word to recognize. “It’s becoming rarer and rarer that anything that comes to us ad hoc isn’t already on our list” from Hotmail or user data, Calcagno says. According to a July 14, 2006, bug report, for example, the Natural Language Group harvested the following words that had appeared more than 10 times in Hotmail user dictionaries: Netflix, Radiohead, Lipitor, glucosamine, waitressing, taekwondo, and all-nighter.

Incorporating user data is a huge step in the right direction for Word, but the process is still sluggish compared with search engines. Google and Live Search generate dictionaries that approach real-time models of language. In a fascinating paper (PDF), two Microsoft researchers explain that a stream of previous search queries can be used to maintain an up-to-date lexicon capable of correcting a high percentage of mistakes, even when 10 or 15 percent of your searches have errors. This purely statistical approach is much timelier than any involving human editors and has far fewer biases. When it comes to fixing errors, the researchers write, “the actual language in which the web queries are expressed becomes less important than the query-log data.”

Google’s system relies heavily on word data gathered from the Web itself. As tech staff member Pandu Nayak explained to me recently, Google tries to determine proper spelling algorithmically. While Nayak was unable to look up exactly when Barack Obama entered the lexicon, he predicted that the president-elect was in there well prior to his 2004 convention speech, when even local attention would have produced a substantial online footprint. As soon as a word starts showing up on the Web with any appreciable frequency, it becomes a candidate for a spelling suggestion. Take a very obscure academic term like theothanatology—the study of the death of God—which returns all of 829 results as of this writing. Not only does Google recognize the word, it gets you there from a close misspelling like theotanatalegy. (Live Search is a little behind here. It returns 103 results but can’t correct a misspelling that’s even one letter off.)

Spell checking Obama in Firefox

Google’s process is wholly automated, which generates a natural set of challenges. The correct spelling of a word is usually more frequent than its incorrect permutations, but there are exceptions. Dalmation, for example, is such a common misspelling of Dalmatian that it can trip up the algorithms. The best search-engine spelling models look at the other words in the query for clues. A search for Sasha Baron Cohen automatically corrects to Sacha, since that spelling of the first name is heavily associated with the latter two. The best algorithms can identify a mistake even when each individual word is spelled correctly—a Google search for golf war returns some results for Gulf war as well.
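
A rough Python sketch of that context trick follows. Everything in it is an assumption made for illustration: the counts in QUERY_LOG are invented, the CONFUSABLE word groups are hypothetical, and real search engines use far richer statistical models than a lookup of whole queries.

```python
from collections import Counter

# Invented counts standing in for a query log; the numbers are not real data.
QUERY_LOG = Counter({
    "sacha baron cohen": 900,
    "sasha baron cohen": 150,
    "sasha obama": 400,
    "gulf war": 800,
    "golf war": 20,
    "golf clubs": 700,
})

# Hypothetical groups of words that are easily confused with one another.
CONFUSABLE = [{"sasha", "sacha"}, {"golf", "gulf"}]


def variants(word):
    """Return the word plus any alternatives from its confusion group."""
    for group in CONFUSABLE:
        if word in group:
            return sorted(group)
    return [word]


def correct_query(query):
    """Pick the whole-query variant seen most often in the log."""
    words = query.lower().split()
    original = " ".join(words)
    candidates = [[]]
    for word in words:
        candidates = [c + [v] for c in candidates for v in variants(word)]
    phrases = [" ".join(c) for c in candidates]
    # On a tie in frequency, prefer what the user actually typed.
    return max(phrases, key=lambda p: (QUERY_LOG[p], p == original))


print(correct_query("Sasha Baron Cohen"))
print(correct_query("golf war"))
print(correct_query("golf clubs"))
```

Because the score belongs to the whole query rather than to each word, golf survives next to clubs but gives way to gulf next to war.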

What would happen if Google’s search technology were ported into a word processor? First, the spell-checker would recognize the bulk of any document’s proper nouns (no more squiggly red line under DiCaprio) as well as any new terms the kids are using these days. (Urban Dictionary tells me, for example, that overchicked is an adjective used to describe a man who is significantly less attractive than his female companion. A word processor powered by search-engine spelling could handle overchicked just fine.)

I also suspect the search-engine model would do a better job at suggesting the right word when you really did make an error. Most word processors make suggestions using the concept of “edit distance”—basically the number of letters you have to change, add, delete, or switch to transform one word into another. Duck has an edit distance of one from luck, and trial and trail are also just one edit away. (For the nitty-gritty on this, see Google research director Peter Norvig’s paper on how to write a spell-check program.) While edit distance usually works pretty well for word processors, it can produce some funny suggestions, like Boatman for Obama. (The edit distance there is three; just switch the b and o, add a t, add an n.) Most search engines, by comparison, complement the edit-distance method with a huge amount of data on common mistakes. Given the complexity of the English language, this real-world information is a tremendous spell-checking boon.
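
The edit-distance arithmetic above can be checked with a short Python sketch. This is one common variant, often called optimal string alignment, which counts a swap of two adjacent letters as a single edit; it is an assumption that this matches what any particular word processor does, and the function name edit_distance is made up for the example. The comparison is case-sensitive, so the words are lowercased first.

```python
def edit_distance(a, b):
    """Fewest edits turning a into b: add, delete, change, or swap adjacent letters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every letter of a
    for j in range(cols):
        d[0][j] = j  # add every letter of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete a letter
                d[i][j - 1] + 1,         # add a letter
                d[i - 1][j - 1] + cost,  # change a letter (or keep a match)
            )
            # Swap two adjacent letters, as in trial -> trail.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("duck", "luck"))      # 1
print(edit_distance("trial", "trail"))    # 1
print(edit_distance("obama", "boatman"))  # 3
```

Without the swap rule, which is how plain Levenshtein distance works, trial and trail would sit two edits apart rather than one.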

The search-engine method does have drawbacks. People have faith that Microsoft Word won’t mislead them spellingwise. Perhaps because those red squigglies are so quietly reprimanding, we do anything we can to avoid them. In that last sentence, I originally wrote reprimatory, which is not a real word. Microsoft suggested respiratory. I appealed that verdict to Google, which returned this blog post in which someone uses the word in a comment, plus a bunch of Italian pages with reprimatori. So even though reprimatory isn’t a bona fide word, Google found it often enough that it didn’t return an error. Relying on Web users for your dictionary does have its perils.

Because it is guided by humans, the Word dictionary is full of words that Microsoft thinks you should be using—it’s “prescriptive” instead of “descriptive,” to use the lexicographer’s parlance. Microsoft will tolerate a few FCC violations in your copy, but damned if it will ever suggest one. Just watch what it does with “siht.”

While New Yorker critic Louis Menand has written movingly about Word’s hijacking of the writing process, there is something to be said for steering people toward basic literacy. If Microsoft Office’s core dictionary becomes a creation of the Web, we’ll be handing the keys to a bunch of people who often wield the language clumsily. This clumsiness may be the parent of linguistic evolution, but it’s going to make for some rocky spelling suggestions.

Some of these problems could be solved algorithmically, such that a minor word like reprimatory returns an error if it fails to meet a certain frequency in the index. At the very least, Microsoft could give Word a supplemental online dictionary, to ensure that its words are always up-to-date. (Google Docs, too, should take a few hints from the Google search engine.) Eventually, a spell-check based on Web data will be the way to go. Sure, we would see a few more naughty words and Dalmations in our Word documents, but the end product would be something that resembles the way people use language in the present day. Tally it up as one more victory for the pragmatists in the language wars.