Why Speech Recognition Technology Isn’t Very Good (Yet)

How to understand your data
May 5 2014 4:08 PM

You Just Don’t Understand

Speech recognition needs to get a lot better before it’s really useful.

Jean-Luc Picard demands an explanation for the faulty voice recognition device his crew has given him.

Photo illustration by Juliana Jiménez Jaramillo. Photos via CBS/Matthew Yohe/Creative Commons

I live in a very connected house, and increasingly I spend as much time talking to devices as to family members. I can command my Android phone entirely by voice, even if the phone isn’t actively being used, by saying, “OK Google,” and then issuing an order: “Show me my schedule for the weekend.” I can also ask it a question, like “What time does the MoMA close?” It’s the same for Google Glass. With the headset on, just say, “OK Google. Take a photo.” And it will, even if you have a stuffy nose or sore throat. Google’s voice interface is incredibly smart: It hears past accents and dialects to process what I’m saying.

Hold down the home button on an iPhone and you’ll get Siri, Apple’s slick-sounding (though, as it turns out, not so smart) personal assistant. Cortana, who will soon live inside Windows phones, will listen to your conversation if you let her, and she’ll make suggestions and phone calls, and essentially act as your very own personal assistant.

We have an Ivee clock in the bathroom. She’ll tell us an interesting fact anytime we ask her, but only if we’re extremely polite. If we don’t first greet her with “Hello, Ivee,” she ignores us. If we’re nice to her, speaking slowly and in a quiet tone, she might use her soothing voice to tell us that a butterfly cannot fly if its body temperature is less than 86 degrees. When the controller for our Xbox gets lost between the couch cushions, we can still get the console working using our voices. Unlike Ivee, the Xbox doesn’t require pleasantries. It prefers that we just get right down to business: A simple “Xbox on” is all it needs to initiate action.


Clive (that’s the name we gave to our Garmin navigation system) is always there, passively making inferences and decisions, and sometimes he makes suggestions that we don’t necessarily agree with. Mostly, he’s innocuous, simply telling us where to go using his dapper British voice. We run into problems when Clive overhears the voice of Marketplace’s Kai Ryssdal on the radio. Maybe it’s a shared kinship for the years Kai spent in Europe or just his clever turns of phrase, but when Kai’s on the radio, Clive assumes that we want to stop our car and listen to him. “Speak a command,” Clive will admonish us, and if we don’t respond quickly, he’ll cancel our route and send us home.

All voice-activated technology, in fact, seems to cause confusion. The other day, I asked my 4-year-old daughter to retrieve the weather forecast while we were getting ready for school. She walked up to Ivee and said, “OK Ivee, what’s the weather?” Ivee flashed a bit of blue light, then refused to talk. “OK! IVEE!” my daughter said, louder this time. She rolled her eyes. Ivee blinked blue sparkles back at her, unfazed.

“It’s HELLO Ivee and OK Google,” I reminded her.

So she looked out the window instead. “The forecast is sunny today. I’m wearing the purple dress. No socks!”

To be fair, a few years ago we would have had to type our request into the Weather Channel’s website and wait two seconds for a result, and 25 years ago, forget about it. That we can now use our voices to talk to machines is as exciting as it is inspiring. This isn’t sci-fi anymore. You really can tell your house to turn its lights on, your coffeemaker to get an espresso brewing, and your thermostat to crank up the air conditioning without knowing much tech at all. In the next few years, voice-controlled technologies will become more and more ubiquitous, not just in our homes, but in our cars, offices, banks, schools, and grocery stores.

But this means we’re on the cusp of a serious problem. Star Trek posited a Universal Translator, crowdsourced by developers from all around the galaxy, who designed the complex artificial intelligence and machine learning algorithms necessary to immediately listen, infer, and respond to thousands of indigenous languages. The Universal Translator didn’t merely translate human to alien language in real time: It also acted as an interface between the humans and the computers they used. 

Unfortunately, we have no single federation of developers and linguists contributing to a gigantic matrix of standard human-machine language. The people working on this can’t even decide on an acronym. Depending on which researcher is talking, and about what, it could be called SR (speech recognition), or STT (speech to text), or ASR (automatic speech recognition). Voice controls are being developed independently by entrepreneurs and large corporations. There’s a push to get more uniformity across platforms, but for the most part that kind of standardization happens only within a company, such as Google or Microsoft, not across all the platforms and devices that are coming into existence.

Even if there were a move to standardize voice interfaces, and we could do away with varied commands, how would each device know I was talking to it, specifically? I’ve seen mirrors that recognize faces, offer information, and allow video calling. If I simply say, “Take a photo,” what if the phone on my nightstand hears me too? Or the mirror, and my phone and the tablet in my closet? I can imagine an annoying future hashtag: #AccidentalSelfie.

If you haven’t experienced a translation problem just yet, you will soon. Computers must wade through all of our “ums” and “ahs,” account for different melodic patterns when we talk, and try to determine the correct meaning for monosyllabic words that sound alike but have completely different definitions. At some point artificial intelligence will make machines faster and smarter, better able to make correct inferences. We probably won’t need to greet some machines with idle chitchat or have to remember three distinctly different ways to ask about the weather. But for now, we’re stuck with digital devices whose languages we haven’t quite yet mastered.

We need a better Universal Translator. Make it so.

Amy Webb writes a column about data for Slate. She's the head of Webbmedia Group, a digital strategy agency, the author of Data, A Love Story and the co-founder of Spark Camp.


