How to fix Microsoft Word's spell-checker.

Language and how we use it.
Dec. 31 2008 6:55 AM

Who Checks the Spell-Checkers?

Microsoft Word's dictionary is old and outdated. Here's how to fix it.


On April 30, 2007, with all the usual fanfare that accompanies a software update, Microsoft added Barack and Obama to Office's dictionary. It was a fairly quick canonization for the Illinois senator. His surname had been on Microsoft's candidate list for new words since Jan. 5 of that year, and his first name followed three days later, in the same recruiting class as Zune, Klum, and Friendster. Three months later, it was official—no longer would Microsoft suggest Boatman as a replacement for the future president's last name.

Of course, by April 2007 Obama was already a figure of some renown. He'd announced his bid for the Democratic nomination in mid-January and had been an object of intense fascination since his July 2004 speech at the Democratic Convention. But escaping the shackles of Microsoft Word's red corrugated line is no small feat, and the list of those who've made the cut can seem arbitrary: Why does it recognize the surnames of Matthew Broderick and Susan Sarandon but trip over DiCaprio and Blanchett? They've heard of Friendster, but not Facebook? Does Microsoft really want to start something with Mark Wahlberg? (Or, speaking of Entourage, with Jeremy Piven?)


There's no reason why spell-check dictionaries need to be so behind the times. All the technology to build a relevant, timely spelling database already exists in search engines like Google and Microsoft's own Live Search, which have a vast vocabulary of words and names and update their dictionaries in near real time. Microsoft Word may not have heard of Marky Mark, but a Live Search or a Google query for Mark Walberg includes results for the actor, who has an "h" in his last name.

For another example, take a reasonably new tech neologism like pharming. Neither Microsoft Word nor the Google Docs spell-checker, the latter of which is based on an open-source tool called GNU Aspell, has heard of the word. Live Search and Google recognize the term just fine, however, and can retrieve it as a correction for a basic misspelling like pharmung.

What's behind this disparity? Word processors and search engines have different goals. Search engines have to field queries as broad and varied as the Internet itself, so they need a very large vocabulary in order to differentiate spelling mistakes from legitimate search terms. Word processors are much more conservative, limiting their lexicon to words that are definitely legitimate. This way, a program like Word can catch virtually every typo, even if it means misidentifying some proper names and newer words. In other words, search engines put breadth first and spelling accuracy second, while word processors do the reverse. If you type in Monkees, Google will assume you're searching for the band; Word will give you a red squiggly line, thinking you've screwed up the word monkeys.

Illustration by Mark Alan Stamaty.

Not surprisingly, search engines and word processors build their dictionaries differently. A search engine's lexicon is typically put together using words gathered from Web pages or old search queries—a huge corpus of real-world data that constitutes a list of valid words and their frequency in the language. Word-processing lexicons are more heavily chaperoned, and the pace at which new terms enter the dictionary is much slower.
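To make the search-engine approach concrete, here is a minimal Python sketch of a lexicon built from a corpus, with each word stored alongside its frequency. The sample pages, the function name build_lexicon, and the min_count cutoff are all assumptions made for illustration; nothing here describes how Google or Live Search actually builds its dictionary.

```python
import re
from collections import Counter

def build_lexicon(documents, min_count=2):
    """Count word occurrences across documents and keep the words
    seen at least min_count times (an illustrative cutoff)."""
    counts = Counter()
    for text in documents:
        # Lowercase the text and pull out runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return {word: n for word, n in counts.items() if n >= min_count}

# Hypothetical stand-ins for crawled Web pages.
pages = [
    "Pharming attacks redirect users to fake sites.",
    "Experts warn that pharming is on the rise.",
    "A typo like pharmung appears only once.",
]

lexicon = build_lexicon(pages)
print(lexicon)
```

With so small a corpus, only a word that recurs survives the cutoff. A real corpus would need a far higher threshold, and the frequencies themselves are what let a search engine rank one spelling against another.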

Microsoft is beginning to incorporate more natural-language detection into its Office products, though. Ten years ago, they kept candidate words on a single Excel sheet for review by a higher-up. Mike Calcagno, a member of Microsoft's Natural Language Group, says the company now scans through trillions of words, including anonymized text from Hotmail messages, in the hunt for dictionary candidates. On top of this, they monitor words that people manually instruct Word to recognize. "It's becoming rarer and rarer that anything that comes to us ad hoc isn't already on our list" from Hotmail or user data, Calcagno says. According to a July 14, 2006, bug report, for example, the Natural Language Group harvested the following words that had appeared more than 10 times in Hotmail user dictionaries: Netflix, Radiohead, Lipitor, glucosamine, waitressing, taekwondo, and all-nighter.
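The harvesting step described in that bug report can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name harvest_candidates, the sample data, and the choice to count each user's dictionary once per word are hypothetical, and Microsoft's actual pipeline is not described in detail here.

```python
from collections import Counter

def harvest_candidates(user_dictionaries, known_words, threshold=10):
    """Return words that appear in more than `threshold` user dictionaries
    and are not already in the shipped lexicon."""
    counts = Counter()
    for words in user_dictionaries:
        # Assumption: a word counts once per user, however often it is listed.
        counts.update(set(words))
    return sorted(
        word for word, n in counts.items()
        if n > threshold and word not in known_words
    )

# Hypothetical custom dictionaries from individual users.
user_dictionaries = (
    [["Netflix", "taekwondo"]] * 11   # added by 11 users
    + [["Netflx"]] * 3                # a rare typo, added by 3 users
    + [["monkeys"]] * 12              # common, but already known
)
known_words = {"monkeys"}

candidates = harvest_candidates(user_dictionaries, known_words)
print(candidates)
```

Words that clear the threshold would still go to a human reviewer in the process Calcagno describes; the count only decides which words make the candidate list.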

Incorporating user data is a huge step in the right direction for Word, but the process is still sluggish compared with search engines. Google and Live Search generate dictionaries that approach real-time models of language. In a fascinating paper, two Microsoft researchers explain that a stream of previous search queries can be used to maintain an up-to-date lexicon capable of correcting a high percentage of mistakes, even when 10 or 15 percent of searches contain errors. This purely statistical approach is much timelier than any involving human editors and has far fewer biases. When it comes to fixing errors, the researchers write, "the actual language in which the web queries are expressed becomes less important than the query-log data."
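A toy version of the idea can be written in Python. It is a simplified sketch, not the researchers' method: it looks only at single-word queries, considers candidates one edit away, and accepts a correction when the candidate is at least `ratio` times more common in the log. The function names, the sample log, and the ratio of 5 are assumptions chosen for illustration.

```python
import string
from collections import Counter

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(term, query_counts, ratio=5):
    """Suggest a nearby term only if it is far more frequent in the log."""
    term = term.lower()
    own = query_counts.get(term, 0)
    candidates = [w for w in edits1(term) if w in query_counts]
    if not candidates:
        return term
    best = max(candidates, key=lambda w: query_counts[w])
    return best if query_counts[best] >= ratio * max(own, 1) else term

# Hypothetical query log in which 10 percent of "pharming" searches are misspelled.
log = ["pharming"] * 90 + ["pharmung"] * 10 + ["monkeys"] * 60 + ["monkees"] * 40
query_counts = Counter(log)

print(correct("pharmung", query_counts))
print(correct("monkees", query_counts))
```

In this sample log, pharmung is replaced by pharming because the correct spelling is nine times more common, while monkees is left alone because it is searched nearly as often as monkeys. That mirrors the trade-off described earlier: frequency, not an editor, decides what counts as a word.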

