Why computer voices still don't sound human.

Innovation, the Internet, gadgets, and more.
March 3 2009 3:40 PM

Read Me a Story, Mr. Roboto

Amazon's Kindle 2

When Amazon's new Kindle debuted a month ago, Jeff Bezos proudly showed off a killer new feature: a robotic voice that can read back any passage from any book, like an automatic audiobook. The company sees the feature as a way for busy readers to catch up on books while driving or making dinner; the publishing industry saw it as a lost opportunity for revenue. The Authors Guild argued that if an e-book could be turned into an audiobook, authors should get an extra fee from each sale. On Friday, Amazon relented, agreeing to let publishers turn off the text-to-speech feature on any e-books published on the Kindle.

There's an interesting legal tussle over whether Amazon's audiobook function really creates a new right that authors might charge for. But anyone who's listened to the Kindle read a book might regard that discussion as wholly beside the point. The Kindle has a pretty awful voice. Imagine Gilbert Gottfried laid up with a tubercular cough. No, that would still be more pleasant than listening to the Kindle, which sounds like a dyslexic robot who spent his formative years in Eastern Europe.


This wasn't a surprise. Modern text-to-speech systems are incredibly complex, and they're improving rapidly. But reading a book with anything near the expressiveness of an actual human voice is an enormously difficult computational task—the pinnacle of speech synthesis research. At the moment, text-to-speech programs are found in much simpler applications—customer-service phone lines and GPS navigators, for example. In these situations, you hear the computer's voice in short bursts, so it's easy to forgive its odd intonations and suspicious speech rhythms. But when listening to long passages, you can't help but compare the computer's voice with a human's—and the computer shrinks in the comparison.

Over the last week, I tried the Kindle's text-to-speech feature on a variety of books, newspapers, and magazines. Not once could I stand listening for more than about a minute. The Kindle pauses at unusual moments in the text, it mis-emphasizes parts of sentences, it can't adjust its intonation when reading quotations, and it has a hell of a time pronouncing proper nouns. To get what I mean, listen to this clip of the Kindle orating a passage from the easiest-to-read book I could think of, Dan Brown's The Da Vinci Code.

Here's the text if you want to follow along:

Only 15 feet away, outside the sealed gate, the mountainous silhouette of his attacker stared through the iron bars. He was broad and tall, with ghost-pale skin and thinning white hair. His irises were pink with dark red pupils. The albino drew a pistol from his coat and aimed the barrel through the bars, directly at the curator. "You should not have run." His accent was not easy to place. "Now tell me where it is."

"I told you already," the curator stammered, kneeling defenseless on the floor of the gallery. "I have no idea what you are talking about!"

"You are lying." The man stared at him, perfectly immobile except for the glint in his ghostly eyes. "You and your brethren possess something that is not yours."

Notice how the Kindle pronounces "mountainous silhouette"—it jams the words together: mountnousilwet. Iron becomes i-ron, curator is guraytor, and idea is i-dee-ay. And when the curator tells the albino that he's got no i-dee-ay what the guy's talking about, he's supposed to be yelling—after all, a pistol has been drawn. But as voiced by the Kindle, the exchange reads more like a pleasant disagreement over correct change.

Why is Amazon's text-to-speech system so bad? Because human speech is extremely varied, too complex and subtle for computers to understand and replicate. Researchers can get computers to read words as they appear on the page, but because machines don't understand what they're reading, they can't infuse the speech with necessary emotion and emphasis.

Consider this simple exchange:

I'm going to ace this test.
Yeah, right.

A human reader would understand that the second sentence is meant sarcastically. So would a duplicitous machine like HAL 9000. But today's computers wouldn't get it; a robot would think the guy really was going to ace that test. Andy Aaron, a text-to-speech researcher at IBM's Watson Research Center in New York, gave me another scenario. Imagine that we learn near the end of a book that something that an obscure character said in Chapter 1 had come true. "How is a computer going to understand that—to know that it's got to pause there for dramatic effect?" Aaron asks. "I'm not saying it's impossible," he adds, "but I would say it's very far off to have an automatic system read a book as well as a professional actor. It's not on the horizon. I would say it's many, many years off—there are many hurdles between now and then."

Still, text-to-speech machines have come a long way since the 1970s, when they were first invented. The earliest systems, known as "formant synthesizers," reproduced speech by mimicking the varying resonances of a human voice. (The process is similar to how synths can ape a variety of musical instruments.) In 1978, Texas Instruments released the Speak & Spell, the first mainstream product to rely on this method of synthesis. The machine's voice was distorted and mechanical-sounding, but you could make it out; when it said a word, you could usually recognize it well enough to spell it. (Play along with a demo of Speak & Spell here.)

In 1982, Mark Barton and Joseph Katz, two software engineers, used formant synthesis to produce the first commercial program that could make your computer talk. That program, called Software Automatic Mouth, ran on Apple, Atari, and Commodore machines. Apple liked the program so much that it asked Barton and Katz to help build a text-to-speech system into the company's new Macintosh computer. "For the first time ever, I'd like to let Macintosh speak for itself," Steve Jobs crowed at the Mac's unveiling in 1984. And then, to gasps in the crowd, the computer began to talk.


