Explainer

Jes’ Talkin’ to Mah Ah-Phone

How does Apple’s speech-recognition software handle accents?

Apple’s Eddy Cue introduces features of the iPhone’s new 4S model on Oct. 4.

Photograph by Kevork Djansezian/Getty Images.

The new iPhone 4S, unveiled on Tuesday, takes spoken commands and talks back. It can also switch between GSM and CDMA networks, which means people all over the world will be able to use it. Will the phone be able to understand people with different accents?

Yes. Speech-recognition software is designed to understand multiple pronunciations of each phoneme. Programmers “train” the system by feeding it many hours of recorded speech along with transcriptions of what was said. During this process, the software learns that the same part of a word can be pronounced in many different ways. Take, for example, the plosive consonant T, which sounds one way in the word tree and another way in the word plate—and that’s just within a single dialect. When software engineers are working on a product that will be used by people around the world, they include recordings from different dialects and from non-native speakers of English in the training. To stick with the T example: British people tend to pronounce the T sound in butter much more clearly than Americans, who swallow it. Eventually, the program establishes a kind of bell curve for each phoneme, and it will interpret any sound whose frequencies and other physical characteristics fall within the parameters of that curve as a possible attempt to produce that phoneme.
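To make the "bell curve" idea concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that each variant of the T sound can be summarized by a single acoustic measurement; the labels, numbers, and function names are invented, and real recognizers use far richer features and statistical models than a one-dimensional Gaussian.

```python
# Toy version of the "bell curve" described above: each phoneme variant is
# modeled as a Gaussian over one acoustic feature, estimated from many
# example pronunciations. All values are made up for illustration.
import math
from statistics import mean, stdev

# Hypothetical training data: one acoustic measurement per utterance
training = {
    "t_tree":   [5200, 5100, 5350, 4900, 5000],  # crisp, aspirated T
    "t_butter": [3800, 3600, 3900, 4100, 3700],  # swallowed/flapped T
}

# "Train": fit a bell curve (mean and spread) for each phoneme variant
models = {p: (mean(v), stdev(v)) for p, v in training.items()}

def likelihood(x, mu, sigma):
    """Height of the Gaussian bell curve at measurement x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def classify(x):
    """Score a new sound against every curve and normalize to confidences."""
    scores = {p: likelihood(x, mu, sigma) for p, (mu, sigma) in models.items()}
    total = sum(scores.values())
    return {p: s / total for p, s in scores.items()}

print(classify(5050))  # falls under the "tree"-style T curve
print(classify(4000))  # closer to the swallowed "butter" T
```

A sound that falls comfortably under one curve gets a high confidence for that phoneme; a sound that lands between curves gets split, ambiguous scores, which is where the contextual cues described below come in.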

When the computer thinks it has heard a phoneme, it produces a confidence estimate to express its certainty. (Watson, the computer that defeated game-show legend and Slate contributor Ken Jennings on Jeopardy!, used the same process to decide whether to buzz in.) The software then combines this guess with contextual information to bolster its confidence. If a user says “The president of the United States of America” but badly mispronounces the word “United,” the software should be able to interpret the word based on the context. Better programs can learn from such incidents: If a user—the same person or another person, depending on the sophistication of the program—later pronounces united the same way, the software becomes more confident in its guess. There is a trade-off, though. The more flexible the program becomes in accepting varied pronunciations, the more likely it is to stumble when it encounters sounds that overlap between phonemes.
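Here is a correspondingly minimal sketch of how context can rescue a mispronounced word. The acoustic and context scores are invented numbers standing in for what an acoustic model and a language model would supply; nothing here reflects Siri’s or Voice Search’s actual internals.

```python
# Toy illustration of combining acoustic confidence with contextual
# information. Scores are hypothetical, not from any real system.

# How well the audio matched each candidate word (poor pronunciation of "United")
acoustic_score = {"United": 0.2, "Untied": 0.8}

# How likely each word is in the frame "The president of the ___ States"
context_score = {"United": 0.95, "Untied": 0.05}

def best_guess(acoustic, context):
    """Multiply the two kinds of evidence and normalize to confidences."""
    combined = {w: acoustic[w] * context[w] for w in acoustic}
    total = sum(combined.values())
    return {w: s / total for w, s in combined.items()}

print(best_guess(acoustic_score, context_score))
# Context overwhelms the poor pronunciation: "United" wins with roughly 83% confidence.
```

Multiplying the two scores is the standard way of weighing independent pieces of evidence: a word only wins if it is both a plausible match for the audio and a plausible fit for the sentence around it.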

Apple has been characteristically tight-lipped about the training behind Siri, its speech-recognition program. Google’s Voice Search, however, has been available on Android phones and iPhones for more than two years. When using that feature, users can adjust a dialect setting so the software knows to expect a British-accented version of English rather than standard American. Other settings include Australian, Indian, and South African forms of English. There is no setting for regional American accents, however, that might distinguish a Southern drawl, say, from a West Texas twang. There are multiple settings for other languages, too: Spanish comes in Argentine and Mexican versions, for example.

Got a question about today’s news? Ask the Explainer.

Explainer thanks Nadja Blagojevic of Google, Jim Glass of the MIT Computer Science and Artificial Intelligence Laboratory, Karen Livescu of the Toyota Technological Institute at Chicago, and Nelson Morgan of the International Computer Science Institute and UC Berkeley.