How we strain to make voice-recognition systems understand us.

The Odd Ways We Twist Our Speech to Make Computers Understand Us

Future Tense
The Citizen's Guide to the Future
June 30 2015 2:45 PM

Mitt Romney dictates an email using Siri on his iPhone as he rides on his 2012 presidential campaign bus.

Photo by Joe Raedle/Getty Images

In Christopher Nolan's Interstellar, a super-Siri-like technology allows artificial intelligences called CASE and TARS to conduct seamless conversations with humans. The contrast between their hulking box shapes and agreeable verbal dispositions is charming; they speak with the cadence, diction, and emotion of the human astronauts who engage them.

When it comes to speaking to technology, the present is more complicated than the futuristic fiction of Interstellar. Contemporary speech-activated devices—like Amazon Echo, Ford’s Sync 3, Google Now, Apple’s Siri, and Microsoft’s Cortana—are uneven and finicky. With Siri, it’s better to take a conversational approach (“How do I know if I have strep throat?”). For Google Now, speak in search terms (“symptoms of strep throat”). Neither works flawlessly. However, if the proliferation of voice-activated devices is any indication of mass appeal and market competition, consumers have a real interest in talking to and hearing from computers. How well do these devices listen, and what are they listening to or for? Here’s an average voice-query:

There is nothing conversational about this particular way of talking to computers: it’s drained of intonation and cadence. Users speak more Google-y when addressing a formal query to a voice-activated system. We strain to enunciate every syllable while holding our phones at (what we hope is) the optimal angle for them to hear us. Humans speak like computers so that they might better talk to their computers, which are designed to communicate more like humans.

The video of a man asking Siri to find restaurants suggests that when people speak to voice-activated systems that mimic human speech, they mirror the systems’ constraints, becoming less conversational. So are voice-activated technologies changing how humans speak?

Consider the following examples of speaking English to computers, which should seem eerily familiar:

And, here, the Scottish sketch comedy show Burnistoun lampoons the inability of speech recognition technologies to understand accents:

When you speak to your car, your phone, or your home as if there's a good chance it won't understand you, it alters your perception of speech. Ford’s Sync 3, Cortana, Siri: each offers consumers the appearance of a level of interactivity that imposes some limitations (no mumbling!) even as it releases us from others (no more fussing with printed-out directions!).

The limitations, even demands, of voice recognition systems have a profound impact on how we communicate with machines. Since the 19th century, writers, scientists, and technologists have imagined machines that respond to conversational speech. But they have always paid far more attention to the ways machines accommodate people than to the ways people accommodate machines.

Some of the most transformative human-computer interactive technologies in our time speak with crisp monotones. Currently, voice recognition systems do not translate human speech into machine-readable forms any more than keyboards translate human writing into data. Just as you have to learn how to type, you have to get a feeling for how to coax your desired response from a listening computer.

Companies may advertise “natural speech” to appeal to our want of conversational ease, but there is no single “natural speech.” Cadence, intonation, pronunciation, and other factors vary with social class, geography, and speech community. Contextual, social, and linguistic factors change over time and vary across the globe. People who live not too far away from each other may pronounce and use words differently. Even the Burnistoun sketch contains a transcript, in the description, “[f]or those having trouble with the accent.” There is no “natural” speech, because there is no universal speech.

Just as dialects within human languages emerge from a complex interplay of social, cultural, and historical factors, the way we speak to computers is fast becoming a dialect of our technological present.

Future Tense is a partnership of Slate, New America, and Arizona State University.

Ethel Hazard is an affiliated researcher with CLACS at the University of Illinois, Urbana-Champaign and the Women's Entrepreneurial Opportunity Project in Atlanta.

Michael Simeone is the director of the IHR Nexus Lab at Arizona State University.