When Amazon's new Kindle debuted a month ago, Jeff Bezos proudly showed off a killer new feature: a robotic voice that can read back any passage from any book, like an automatic audiobook. The company sees the feature as a way for busy readers to catch up on books while driving or making dinner; the publishing industry saw it as a lost opportunity for revenue. The Authors Guild argued that if an e-book could be turned into an audiobook, authors should get an extra fee from each sale. On Friday, Amazon relented, agreeing to let publishers turn off the text-to-speech feature on any e-books published on the Kindle.
There's an interesting legal tussle over whether Amazon's audiobook function really creates a new right that authors might charge for. But anyone who's listened to the Kindle read a book might regard that discussion as wholly beside the point. The Kindle has a pretty awful voice. Imagine Gilbert Gottfried laid up with a tubercular cough. No, that would still be more pleasant than listening to the Kindle, which sounds like a dyslexic robot who spent his formative years in Eastern Europe.
This wasn't a surprise. Modern text-to-speech systems are incredibly complex, and they're improving rapidly. But reading a book with anything near the expressiveness of an actual human voice is an enormously difficult computational task—the pinnacle of speech synthesis research. At the moment, text-to-speech programs are found in much simpler applications—customer-service phone lines and GPS navigators, for example. In these situations, you hear the computer's voice in short bursts, so it's easy to forgive its odd intonations and suspicious speech rhythms. But when listening to long passages, you can't help but compare the computer's voice with a human's—and the computer shrinks in the comparison.
Over the last week, I tried the Kindle's text-to-speech feature on a variety of books, newspapers, and magazines. Not once could I stand listening for more than about a minute. The Kindle pauses at unusual moments in the text, it mis-emphasizes parts of sentences, it can't adjust its intonation when reading quotations, and it has a hell of a time pronouncing proper nouns. To get what I mean, listen to this clip of the Kindle orating a passage from the easiest-to-read book I could think of, Dan Brown's The Da Vinci Code.
Here's the text if you want to follow along:
Only 15 feet away, outside the sealed gate, the mountainous silhouette of his attacker stared through the iron bars. He was broad and tall, with ghost-pale skin and thinning white hair. His irises were pink with dark red pupils. The albino drew a pistol from his coat and aimed the barrel through the bars, directly at the curator. "You should not have run." His accent was not easy to place. "Now tell me where it is."
"I told you already," the curator stammered, kneeling defenseless on the floor of the gallery. "I have no idea what you are talking about!"
"You are lying." The man stared at him, perfectly immobile except for the glint in his ghostly eyes. "You and your brethren possess something that is not yours."
Notice how the Kindle pronounces "mountainous silhouette"—it jams the words together: mountnousilwet. Iron becomes i-ron, curator is guraytor, and idea is i-dee-ay. And when the curator tells the albino that he's got no i-dee-ay what the guy's talking about, he's supposed to be yelling—after all, a pistol has been drawn. But as voiced by the Kindle, the exchange reads more like a pleasant disagreement over correct change.
Why is Amazon's text-to-speech system so bad? Because human speech is extremely varied, too complex and subtle for computers to understand and replicate. Researchers can get computers to read words as they appear on the page, but because machines don't understand what they're reading, they can't infuse the speech with necessary emotion and emphasis.
Consider this simple exchange:
I'm going to ace this test.
A human reader would understand that the second sentence is meant sarcastically. So would a duplicitous machine like HAL 9000. But today's computers wouldn't get it; a robot would think the guy really was going to ace that test. Andy Aaron, a text-to-speech researcher at IBM's Watson Research Center in New York, gave me another scenario. Imagine that we learn near the end of a book that something that an obscure character said in Chapter 1 had come true. "How is a computer going to understand that—to know that it's got to pause there for dramatic effect?" Aaron asks. "I'm not saying it's impossible," he adds, "but I would say it's very far off to have an automatic system read a book as well as a professional actor. It's not on the horizon. I would say it's many, many years off—there are many hurdles between now and then."
Still, text-to-speech machines have come a long way since the 1970s, when they were first invented. The earliest systems, known as "formant synthesizers," reproduced speech by mimicking the varying resonances of a human voice. (The process is similar to how synths can ape a variety of musical instruments.) In 1978, Texas Instruments released the Speak & Spell, the first mainstream product to rely on this method of synthesis. The machine's voice was distorted and mechanical-sounding, but you could make it out; when it said a word, you could usually recognize it well enough to spell it.
In 1982, Mark Barton and Joseph Katz, two software engineers, used formant synthesis to produce the first commercial program that could make your computer talk. That program, called Software Automatic Mouth, ran on Apple, Atari, and Commodore machines. Apple liked the program so much that it asked Barton and Katz to help build a text-to-speech system into the company's new Macintosh computer. "For the first time ever, I'd like to let Macintosh speak for itself," Steve Jobs crowed at the Mac's unveiling in 1984. And then, to gasps in the crowd, the computer began to talk.