Why computer voices still don't sound human.

March 3 2009 3:40 PM

Read Me a Story, Mr. Roboto


When Amazon's new Kindle debuted a month ago, Jeff Bezos proudly showed off a killer new feature: a robotic voice that can read back any passage from any book, like an automatic audiobook. The company sees the feature as a way for busy readers to catch up on books while driving or making dinner; the publishing industry saw it as a lost opportunity for revenue. The Authors Guild argued that if an e-book could be turned into an audiobook, authors should get an extra fee from each sale. On Friday, Amazon relented, agreeing to let publishers turn off the text-to-speech feature on any e-books published on the Kindle.

There's an interesting legal tussle over whether Amazon's audiobook function really creates a new right that authors might charge for. But anyone who's listened to the Kindle read a book might regard that discussion as wholly beside the point. The Kindle has a pretty awful voice. Imagine Gilbert Gottfried laid up with a tubercular cough. No, that would still be more pleasant than listening to the Kindle, which sounds like a dyslexic robot who spent his formative years in Eastern Europe.


This wasn't a surprise. Modern text-to-speech systems are incredibly complex, and they're improving rapidly. But reading a book with anything near the expressiveness of an actual human voice is an enormously difficult computational task—the pinnacle of speech synthesis research. At the moment, text-to-speech programs are found in much simpler applications—customer-service phone lines and GPS navigators, for example. In these situations, you hear the computer's voice in short bursts, so it's easy to forgive its odd intonations and suspicious speech rhythms. But when listening to long passages, you can't help but compare the computer's voice with a human's—and the computer shrinks in the comparison.

Over the last week, I tried the Kindle's text-to-speech feature on a variety of books, newspapers, and magazines. Not once could I stand listening for more than about a minute. The Kindle pauses at unusual moments in the text, it mis-emphasizes parts of sentences, it can't adjust its intonation when reading quotations, and it has a hell of a time pronouncing proper nouns. To get what I mean, listen to this clip of the Kindle orating a passage from the easiest-to-read book I could think of, Dan Brown's The Da Vinci Code.

Here's the text if you want to follow along:

Only 15 feet away, outside the sealed gate, the mountainous silhouette of his attacker stared through the iron bars. He was broad and tall, with ghost-pale skin and thinning white hair. His irises were pink with dark red pupils. The albino drew a pistol from his coat and aimed the barrel through the bars, directly at the curator. "You should not have run." His accent was not easy to place. "Now tell me where it is."

"I told you already," the curator stammered, kneeling defenseless on the floor of the gallery. "I have no idea what you are talking about!"

"You are lying." The man stared at him, perfectly immobile except for the glint in his ghostly eyes. "You and your brethren possess something that is not yours."

Notice how the Kindle pronounces "mountainous silhouette"—it jams the words together: mountnousilwet. Iron becomes i-ron, curator is guraytor, and idea is i-dee-ay. And when the curator tells the albino that he's got no i-dee-ay what the guy's talking about, he's supposed to be yelling—after all, a pistol has been drawn. But as voiced by the Kindle, the exchange reads more like a pleasant disagreement over correct change.

Why is Amazon's text-to-speech system so bad? Because human speech is extremely varied, too complex and subtle for computers to understand and replicate. Researchers can get computers to read words as they appear on the page, but because machines don't understand what they're reading, they can't infuse the speech with necessary emotion and emphasis.

Consider this simple exchange:

I'm going to ace this test.
Yeah, right.

A human reader would understand that the second sentence is meant sarcastically. So would a duplicitous machine like HAL 9000. But today's computers wouldn't get it; a robot would think the guy really was going to ace that test. Andy Aaron, a text-to-speech researcher at IBM's Watson Research Center in New York, gave me another scenario. Imagine that we learn near the end of a book that something that an obscure character said in Chapter 1 had come true. "How is a computer going to understand that—to know that it's got to pause there for dramatic effect?" Aaron asks. "I'm not saying it's impossible," he adds, "but I would say it's very far off to have an automatic system read a book as well as a professional actor. It's not on the horizon. I would say it's many, many years off—there are many hurdles between now and then."

Still, text-to-speech machines have come a long way since the 1970s, when they were first invented. The earliest systems, known as "formant synthesizers," reproduced speech by mimicking the varying resonances of a human voice. (The process is similar to how synths can ape a variety of musical instruments.) In 1978, Texas Instruments released the Speak & Spell, the first mainstream product to rely on this method of synthesis. The machine's voice was distorted and mechanical-sounding, but you could make it out; when it said a word, you could usually recognize it well enough to spell it. (Play along with a demo of Speak & Spell here.)

In 1982, Mark Barton and Joseph Katz, two software engineers, used formant synthesis to produce the first commercial program that could make your computer talk. That program, called Software Automatic Mouth, ran on Apple, Atari, and Commodore machines. Apple liked the program so much that it asked Barton and Katz to help build a text-to-speech system into the company's new Macintosh computer. "For the first time ever, I'd like to let Macintosh speak for itself," Steve Jobs crowed at the Mac's unveiling in 1984. And then, to gasps in the crowd, the computer began to talk.

