Why Speech Recognition Technology Isn’t Very Good (Yet)

How to understand your data
May 5 2014 4:08 PM

You Just Don’t Understand

Speech recognition needs to get a lot better before it’s really useful.

Jean-Luc Picard demands an explanation for the faulty voice recognition device his crew has given him.

Photo illustration by Juliana Jiménez Jaramillo. Photos via CBS/Matthew Yohe/Creative Commons

I live in a very connected house, and increasingly I spend as much time talking to devices as I do talking to family members. I can command my Android phone entirely by voice, even if the phone isn’t actively being used, by saying, “OK Google,” and then issuing an order: “Show me my schedule for the weekend.” I can also ask it a question, like “What time does the MoMA close?” It’s the same for Google Glass. With the headset on, just say, “OK Google. Take a photo.” And it will, even if you have a stuffy nose or sore throat. Google’s voice interface is incredibly smart: It hears past accents and dialects to process what I’m saying.

Hold down the home button on an iPhone and you’ll get Siri, Apple’s slick-sounding (though, as it turns out, not so smart) personal assistant. Cortana, who will soon live inside Windows phones, will listen to your conversation if you let her, and she’ll make suggestions and phone calls, and essentially act as your very own personal assistant.

We have an Ivee clock in the bathroom. She’ll tell us an interesting fact anytime we ask her, but only if we’re extremely polite. If we don’t first greet her with “Hello, Ivee,” she ignores us. If we’re nice to her, speaking slowly and in a quiet tone, she might use her soothing voice to tell us that a butterfly cannot fly if its body temperature is less than 86 degrees. When the controller for our Xbox gets lost between the couch cushions, we can still get the console working using our voices. Unlike Ivee, the Xbox doesn’t require pleasantries. It prefers that we just get right down to business: A simple “Xbox on” is all it needs to initiate action.

Clive (that’s the name we gave to our Garmin navigation system) is always there, passively making inferences and decisions, and sometimes he makes suggestions that we don’t necessarily agree with. Mostly, he’s innocuous, simply telling us where to go using his dapper British voice. We run into problems when Clive overhears the voice of Marketplace’s Kai Ryssdal on the radio. Maybe it’s a shared kinship for the years Kai spent in Europe or just his clever turns of phrase, but when Kai’s on the radio, Clive assumes that we want to stop our car and listen to him. “Speak a command,” Clive will admonish us, and if we don’t respond quickly, he’ll cancel our route and send us home.

All voice-activated technology, in fact, seems to cause confusion. The other day, I asked my 4-year-old daughter to retrieve the weather forecast while we were getting ready for school. She walked up to Ivee and said, “OK Ivee, what’s the weather?” Ivee flashed a bit of blue light, then refused to talk. “OK! IVEE!” my daughter said, louder this time, and rolled her eyes. Ivee blinked blue sparkles back at her, unfazed.

“It’s HELLO Ivee and OK Google,” I reminded her.

So she looked out the window instead. “The forecast is sunny today. I’m wearing the purple dress. No socks!”

To be fair, a few years ago we would have had to type our request into the Weather Channel’s website and wait two seconds for a result, and 25 years ago, forget about it. That we can now use our voices to talk to machines is as exciting as it is inspiring. This isn’t sci-fi anymore. You really can tell your house to turn its lights on, your coffeemaker to get an espresso brewing, and your thermostat to crank up the air conditioning without knowing much tech at all. In the next few years, voice-controlled technologies will become more and more ubiquitous, not just in our homes, but in our cars, offices, banks, schools, and grocery stores.

But this means we’re on the cusp of a serious problem. Star Trek posited a Universal Translator, crowdsourced by developers from all around the galaxy, who designed the complex artificial intelligence and machine learning algorithms necessary to immediately listen, infer, and respond to thousands of indigenous languages. The Universal Translator didn’t merely translate human to alien language in real time: It also acted as an interface between the humans and the computers they used. 

Unfortunately, we have no single federation of developers and linguists contributing to a gigantic matrix of standard human-machine language. The people working on this can’t even decide on an acronym. Depending on which researcher is talking, and about what, it could be called SR (speech recognition), or STT (speech to text), or ASR (automatic speech recognition). Voice controls are being developed independently by entrepreneurs and large corporations. There’s a push to get more uniformity across platforms, but for the most part that kind of standardization happens only within a single company, such as Google or Microsoft, not across all the platforms and devices that are coming into existence.

Even if there were a move to standardize voice interfaces, and we could do away with varied commands, how would each device know I was talking to it, specifically? I’ve seen mirrors that recognize faces, offer information, and allow video calling. If I simply say, “Take a photo,” what if the phone on my nightstand hears me too? Or the mirror, my phone, and the tablet in my closet? I can imagine an annoying future hashtag: #AccidentalSelfie.

If you haven’t experienced a translation problem just yet, you will soon. Computers must wade through all of our “ums” and “ahs,” account for different melodic patterns when we talk, and try to determine the correct meaning for monosyllabic words that sound alike but have completely different definitions. At some point artificial intelligence will make machines faster and smarter, better able to make correct inferences. We probably won’t need to greet some machines with idle chitchat or have to remember three distinctly different ways to ask about the weather. But for now, we’re stuck with digital devices whose languages we haven’t quite yet mastered.

We need a better Universal Translator. Make it so.

Amy Webb writes a column about data for Slate. She's the head of Webbmedia Group, a digital strategy agency, the author of Data, A Love Story, and the co-founder of Spark Camp.
