Why Speech Recognition Technology Isn’t Very Good (Yet)

How to understand your data
May 5 2014 4:08 PM

You Just Don’t Understand

Speech recognition needs to get a lot better before it’s really useful.

Jean-Luc Picard demands an explanation for the faulty voice recognition device his crew has given him.

Photo illustration by Juliana Jiménez Jaramillo. Photos via CBS/Matthew Yohe/Creative Commons

I live in a very connected house, and increasingly I spend as much time talking to devices as I do to family members. I can command my Android phone entirely by voice, even if the phone isn’t actively being used, by saying, “OK Google,” and then issuing an order: “Show me my schedule for the weekend.” I can also ask it a question, like “What time does the MoMA close?” It’s the same for Google Glass. With the headset on, just say, “OK Google. Take a photo.” And it will, even if you have a stuffy nose or sore throat. Google’s voice interface is incredibly smart: It hears past accents and dialects to process what I’m saying.

Hold down the home button on an iPhone and you’ll get Siri, Apple’s slick-sounding (though, as it turns out, not so smart) personal assistant. Cortana, who will soon live inside Windows phones, will listen to your conversation if you let her, and she’ll make suggestions and phone calls, and essentially act as your very own personal assistant.

We have an Ivee clock in the bathroom. She’ll tell us an interesting fact anytime we ask her, but only if we’re extremely polite. If we don’t first greet her with “Hello, Ivee,” she ignores us. If we’re nice to her, speaking slowly and in a quiet tone, she might use her soothing voice to tell us that a butterfly cannot fly if its body temperature is less than 86 degrees. When the controller for our Xbox gets lost between the couch cushions, we can still get the console working using our voices. Unlike Ivee, the Xbox doesn’t require pleasantries. It prefers that we just get right down to business: A simple “Xbox on” is all it needs to initiate action.


Clive (that’s the name we gave to our Garmin navigation system) is always there, passively making inferences and decisions, and sometimes he makes suggestions that we don’t necessarily agree with. Mostly, he’s innocuous, simply telling us where to go using his dapper British voice. We run into problems when Clive overhears the voice of Marketplace’s Kai Ryssdal on the radio. Maybe it’s a shared kinship for the years Kai spent in Europe or just his clever turns of phrase, but when Kai’s on the radio, Clive assumes that we want to stop our car and listen to him. “Speak a command,” Clive will admonish us, and if we don’t respond quickly, he’ll cancel our route and send us home.

All voice-activated technology, in fact, seems to cause confusion. The other day, I asked my 4-year-old daughter to retrieve the weather forecast while we were getting ready for school. She walked up to Ivee and said, “OK Ivee, what’s the weather?” Ivee flashed a bit of blue light, then refused to talk. “OK! IVEE!” my daughter said, louder this time, and rolled her eyes. Ivee blinked blue sparkles back at her, unfazed.

“It’s HELLO Ivee and OK Google,” I reminded her.

So she looked out the window instead. “The forecast is sunny today. I’m wearing the purple dress. No socks!”

To be fair, a few years ago we would have had to type our request into the Weather Channel’s website and wait two seconds for a result, and 25 years ago, forget about it. That we can now use our voices to talk to machines is as exciting as it is inspiring. This isn’t sci-fi anymore. You really can tell your house to turn its lights on, your coffeemaker to get an espresso brewing, and your thermostat to crank up the air conditioning without knowing much tech at all. In the next few years, voice-controlled technologies will become more and more ubiquitous, not just in our homes, but in our cars, offices, banks, schools, and grocery stores.

But this means we’re on the cusp of a serious problem. Star Trek posited a Universal Translator, crowdsourced by developers from all around the galaxy, who designed the complex artificial intelligence and machine learning algorithms necessary to immediately listen, infer, and respond to thousands of indigenous languages. The Universal Translator didn’t merely translate human to alien language in real time: It also acted as an interface between the humans and the computers they used. 

Unfortunately, we have no single federation of developers and linguists contributing to a gigantic matrix of standard human-machine language. The people working on this can’t even decide on an acronym. Depending on which researcher is talking, and about what, it could be called SR (speech recognition), or STT (speech to text), or ASR (automatic speech recognition). Voice controls are being developed independently by entrepreneurs and large corporations. There’s a push to get more uniformity across platforms, but for the most part that kind of standardization is only within a company, such as Google or Microsoft, not across all the platforms and devices that are coming into existence.

Even if there were a move to standardize voice interfaces, and we could do away with varied commands, how would each device know I was talking to it, specifically? I’ve seen mirrors that recognize faces, offer information, and allow video calling. If I simply say, “Take a photo,” what if the phone on my nightstand hears me too? Or the mirror, my phone, and the tablet in my closet? I can imagine an annoying future hashtag: #AccidentalSelfie.

If you haven’t experienced a translation problem just yet, you will soon. Computers must wade through all of our “ums” and “ahs,” account for different melodic patterns when we talk, and try to determine the correct meaning for monosyllabic words that sound alike but have completely different definitions. At some point artificial intelligence will make machines faster and smarter, better able to make correct inferences. We probably won’t need to greet some machines with idle chitchat or have to remember three distinctly different ways to ask about the weather. But for now, we’re stuck with digital devices whose languages we haven’t quite yet mastered.

We need a better Universal Translator. Make it so.

Amy Webb writes a column about data for Slate. She’s the head of Webbmedia Group, a digital strategy agency, the author of Data, A Love Story, and the co-founder of Spark Camp.
