Now You're Talking!
Google has developed speech-recognition technology that actually works.
"In the 1970s, there were two camps that didn't really talk to one another," Cohen says. The linguists believed, more or less, that the number of distinct sounds in human speech could ultimately be analyzed and turned into a set of computational rules. All you needed to do, they thought, was listen to enough human speech and then map, in painstaking detail, the frequencies of the sounds you heard. Once all the different sounds were analyzed and stored in a reference library, a computer would be able to recognize a given sound just by looking it up.
While this seemed to make intuitive sense, the engineers saw one glaring problem: It would never scale. The engineers believed they could get much further with computational analysis—if you gave a powerful computer enough audio samples, it would eventually be able to find all sorts of nuances that human linguists would never be able to identify. This was best expressed in a famous quote by Frederick Jelinek, one of the field's pioneering computer scientists: "Every time I fire a linguist, the performance of the speech recognizer goes up."
Over the years both sides bridged their differences, Cohen says, and today's speech-recognition systems use deep insights from linguistics and engineering. Still, it turned out that the engineers were right on the fundamental problem: There are too many different possible sounds in human speech to be described by explicit linguistic rules. Cohen points out one small example. To most people the "a" sound in the words "map," "tap," and "cat" seems identical. In fact, there are very subtle differences. To create the "m" sound in "map," you bring your lips together, forming a long closed tube in your vocal tract. This affects the "a" sound that follows: Since your throat is transitioning from the low-frequency "m" sound, the first 10 to 30 milliseconds of the "a" in "map" include many low-frequency notes that aren't found in the early part of the "a" in "tap." Now imagine how many such nuances there are for all the different words and combinations of words in every different language. "There's no way we could do it by writing explicit rules," Cohen says. The only way to find all these differences is through large-scale data analysis, by having lots of computers scrutinize lots and lots of examples of human speech.
But where to get all that speech? "A big bottleneck in the field has been data," Cohen says. For many years researchers knew the theoretical process for building speech-recognition systems, but they had no idea how to get enough human chatter, or enough computing power, to actually do it. Then came Google. It turns out that the very same infrastructure that Google needed to build a fantastic search engine—acres and acres of data centers to store and analyze Web sites, and a range of internal processes that are specifically tuned to managing large amounts of information—would also be effective for solving speech recognition and other artificial intelligence problems, Cohen says.
There's a lot of overlap between search and speech. To decipher your speech, Google's system doesn't just use recorded voices. It also relies on a host of other data, including billions of written search queries that it uses to predict the words you're most probably saying. If you say "33rd and Sixth, NYC," your "NYC" might sound like "and I see," but Google knows that you're probably saying "NYC," because that's what a lot of other people mean when they say that phrase. Altogether, Google's speech-recognition program comprises many billions of pieces of text and audio; Cohen says that building just one part of the speech-recognition system required "roughly 70 CPU-years" of computer time. Google's cloud of processors can do that amount of crunching in a single day. "This is one of the things that brought me to Google," Cohen says. "We can now iterate much more quickly, experiment much more quickly, to train these enormous models and see what works."
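For readers who want to see the idea in miniature, here is a rough sketch of how written-query statistics can tip the balance between two transcriptions that sound alike. It is not Google's system. The scores, the counts, and the function names (rescore, best_candidate) are all invented for illustration, and the combination rule, adding a log acoustic score to a smoothed log frequency, is just one common textbook approach.

```python
import math

# Hypothetical acoustic scores: how well each candidate transcription
# matches the audio. These numbers are made up; on sound alone,
# "and i see" narrowly wins.
acoustic_log_prob = {
    "and i see": math.log(0.55),
    "nyc": math.log(0.45),
}

# Hypothetical counts of how often each phrase follows "33rd and sixth"
# in written search queries. Also made up.
query_counts = {
    "and i see": 2,
    "nyc": 998,
}


def rescore(acoustic, counts, lm_weight=1.0):
    """Combine each candidate's acoustic score with a text-based prior."""
    total = sum(counts.values())
    scores = {}
    for phrase, acoustic_score in acoustic.items():
        # Add-one smoothing so an unseen phrase never gets log(0).
        prior = (counts.get(phrase, 0) + 1) / (total + len(acoustic))
        scores[phrase] = acoustic_score + lm_weight * math.log(prior)
    return scores


def best_candidate(acoustic, counts, lm_weight=1.0):
    """Return the candidate with the highest combined score."""
    scores = rescore(acoustic, counts, lm_weight)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Audio alone (text prior switched off) versus audio plus query data.
    print(best_candidate(acoustic_log_prob, query_counts, lm_weight=0.0))
    print(best_candidate(acoustic_log_prob, query_counts))
```

With the text prior switched off (lm_weight=0.0), the sketch picks whichever candidate sounds closest. With the prior on, the phrase that people actually type far more often wins, even though its acoustic score is slightly lower.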
Speech recognition is still a very young field. "We don't do well enough at anything right now," Cohen says. He notes that the system keeps getting better, and more and more people keep using Android's voice search, but we're still many years (and maybe even decades) away from what Cohen says is Google's long-term vision for speech recognition. "We want it to be totally ubiquitous," he says. "No matter what the application is, no matter what you're trying to do with your phone, we want you to be able to talk to your phone."
Farhad Manjoo is Slate's technology columnist and the author of True Enough: Learning To Live in a Post-Fact Society. You can email him at firstname.lastname@example.org and follow him on Twitter.