Who Checks the Spell-Checkers?
Microsoft Word's dictionary is old and outdated. Here's how to fix it.
On April 30, 2007, with all the usual fanfare that accompanies a software update, Microsoft added Barack and Obama to Office's dictionary. It was a fairly quick canonization for the Illinois senator. His surname had been on Microsoft's candidate list for new words since Jan. 5 of that year, and his first name followed three days later, in the same recruiting class as Zune, Klum, and Friendster. Three months later, it was official—no longer would Microsoft suggest Boatman as a replacement for the future president's last name.
Of course, by April 2007 Obama was already a figure of some renown. He'd formed a presidential exploratory committee in mid-January, formally announced his bid for the Democratic nomination in February, and had been an object of intense fascination since his July 2004 speech at the Democratic Convention. But escaping the shackles of Microsoft Word's red corrugated line is no small feat, and the list of those who've made the cut can seem arbitrary: Why does it recognize the surnames of Matthew Broderick and Susan Sarandon but trip over DiCaprio and Blanchett? It's heard of Friendster, but not Facebook? Does Microsoft really want to start something with Mark Wahlberg? (Or, speaking of Entourage, with Jeremy Piven?)
There's no reason why spell-check dictionaries need to be so behind the times. All the technology to build a relevant, timely spelling database already exists in search engines like Google and Microsoft's own Live Search, which have a vast vocabulary of words and names and update their dictionaries in near real time. Microsoft Word may not have heard of Marky Mark, but a Live Search or a Google query for Mark Walberg includes results for the actor, who has an "h" in his last name.
For another example, take a reasonably new tech neologism like pharming. Neither Microsoft Word nor the Google Docs spell-checker, the latter of which is based on an open-source tool called GNU Aspell, has heard of the word. Live Search and Google recognize the term just fine, however, and can retrieve it as a correction for a basic misspelling like pharmung.
What's behind this disparity? Word processors and search engines have different goals. The latter has to field queries as broad and varied as the Internet itself, so it needs a very large vocabulary in order to differentiate spelling mistakes from legitimate search terms. Word processors are much more conservative, limiting their lexicon to words that are definitely legitimate. This way, a program like Word can catch virtually every typo, even if it means misidentifying some proper names and newer words. In other words, search engines put breadth first and spelling accuracy second while word processors are the other way around. If you type in Monkees, Google will assume you're searching for the band; Word will give you a red squiggly line, thinking you've screwed up the word monkeys.
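To make the trade-off concrete, here is a minimal sketch in Python. The two word lists and the function name are hypothetical, invented for illustration; neither reflects how Word or any search engine actually stores its vocabulary.

```python
# Hypothetical word lists, for illustration only.
CURATED_LEXICON = {"the", "band", "monkey", "monkeys"}
BROAD_LEXICON = CURATED_LEXICON | {"monkees", "netflix", "radiohead"}


def flagged_words(text, lexicon):
    """Return the words in `text` that the given lexicon does not recognize."""
    return [word for word in text.split() if word.lower() not in lexicon]


# The conservative list flags the band's name as a likely typo.
print(flagged_words("the Monkees", CURATED_LEXICON))

# The broad list accepts it, so a writer who really meant "monkeys"
# would get no warning at all.
print(flagged_words("the Monkees", BROAD_LEXICON))
```

The same lookup serves both programs; only the size of the list changes which kind of mistake gets through.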
Not surprisingly, search engines and word processors build their dictionaries differently. A search engine's lexicon is typically put together using words gathered from Web pages or old search queries—a huge corpus of real-world data that constitutes a list of valid words and their frequency in the language. Word-processing lexicons are more heavily chaperoned, and the pace at which new terms enter the dictionary is much slower.
Microsoft is beginning to incorporate more natural-language detection into its Office products, though. Ten years ago, they kept candidate words on a single Excel sheet for review by a higher-up. Mike Calcagno, a member of Microsoft's Natural Language Group, says the company now scans through trillions of words, including anonymized text from Hotmail messages, in the hunt for dictionary candidates. On top of this, they monitor words that people manually instruct Word to recognize. "It's becoming rarer and rarer that anything that comes to us ad hoc isn't already on our list" from Hotmail or user data, Calcagno says. According to a July 14, 2006, bug report, for example, the Natural Language Group harvested the following words that had appeared more than 10 times in Hotmail user dictionaries: Netflix, Radiohead, Lipitor, glucosamine, waitressing, taekwondo, and all-nighter.
Incorporating user data is a huge step in the right direction for Word, but the process is still sluggish compared with search engines. Google and Live Search generate dictionaries that approach real-time models of language. In a fascinating paper, two Microsoft researchers explain that a stream of previous search queries can be used to maintain an up-to-date lexicon capable of correcting a high percentage of mistakes, even when 10 or 15 percent of your searches have errors. This purely statistical approach is much timelier than any involving human editors and has far fewer biases. When it comes to fixing errors, the researchers write, "the actual language in which the web queries are expressed becomes less important than the query-log data."
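The general idea can be sketched in a few lines of Python. This is not the researchers' method, which is considerably more sophisticated; it is a simplified stand-in that counts words in a made-up query log and, for any typed word, picks the most frequent logged word within a single edit. The log contents and function names are assumptions for illustration.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def build_lexicon(queries):
    """Count how often each word appears across a log of search queries."""
    counts = Counter()
    for query in queries:
        counts.update(query.lower().split())
    return counts


def edits1(word):
    """Return every string one delete, swap, replace, or insert away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + swaps + replaces + inserts)


def correct(word, lexicon):
    """Pick the most frequent logged word among `word` and its one-edit neighbors."""
    word = word.lower()
    known = [w for w in edits1(word) | {word} if w in lexicon]
    if not known:
        return word  # Nothing similar in the log; leave the word alone.
    return max(known, key=lambda w: lexicon[w])


# A made-up log in which 3 of 20 queries (15 percent) contain the typo.
query_log = ["pharming attack"] * 17 + ["pharmung attack"] * 3
lexicon = build_lexicon(query_log)

print(correct("pharmung", lexicon))
print(correct("pharming", lexicon))
```

Because the correct spelling outnumbers the typo in the log, frequency alone settles the question, and no editor ever has to approve pharming as a word.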
Chris Wilson is a Slate contributor.
Illustration by Mark Alan Stamaty.