When Amazon's new Kindle debuted a month ago, Jeff Bezos proudly showed off a killer new feature: a robotic voice that can read back any passage from any book, like an automatic audiobook. The company sees the feature as a way for busy readers to catch up on books while driving or making dinner; the publishing industry saw it as a lost opportunity for revenue. The Authors Guild argued that if an e-book could be turned into an audiobook, authors should get an extra fee from each sale. On Friday, Amazon relented, agreeing to let publishers turn off the text-to-speech feature on any e-books published on the Kindle.
There's an interesting legal tussle over whether Amazon's audiobook function really creates a new right that authors might charge for. But anyone who's listened to the Kindle read a book might regard that discussion as wholly beside the point. The Kindle has a pretty awful voice. Imagine Gilbert Gottfried laid up with a tubercular cough. No, that would still be more pleasant than listening to the Kindle, which sounds like a dyslexic robot who spent his formative years in Eastern Europe.
This wasn't a surprise. Modern text-to-speech systems are incredibly complex, and they're improving rapidly. But reading a book with anything near the expressiveness of an actual human voice is an enormously difficult computational task—the pinnacle of speech synthesis research. At the moment, text-to-speech programs are found in much simpler applications—customer-service phone lines and GPS navigators, for example. In these situations, you hear the computer's voice in short bursts, so it's easy to forgive its odd intonations and suspicious speech rhythms. But when listening to long passages, you can't help but compare the computer's voice with a human's—and the computer shrinks in the comparison.
Over the last week, I tried the Kindle's text-to-speech feature on a variety of books, newspapers, and magazines. Not once could I stand listening for more than about a minute. The Kindle pauses at unusual moments in the text, it mis-emphasizes parts of sentences, it can't adjust its intonation when reading quotations, and it has a hell of a time pronouncing proper nouns. To get what I mean, listen to this clip of the Kindle orating a passage from the easiest-to-read book I could think of, Dan Brown's The Da Vinci Code.
Here's the text if you want to follow along:
Only 15 feet away, outside the sealed gate, the mountainous silhouette of his attacker stared through the iron bars. He was broad and tall, with ghost-pale skin and thinning white hair. His irises were pink with dark red pupils. The albino drew a pistol from his coat and aimed the barrel through the bars, directly at the curator. "You should not have run." His accent was not easy to place. "Now tell me where it is."
"I told you already," the curator stammered, kneeling defenseless on the floor of the gallery. "I have no idea what you are talking about!"
"You are lying." The man stared at him, perfectly immobile except for the glint in his ghostly eyes. "You and your brethren possess something that is not yours."
Notice how the Kindle pronounces "mountainous silhouette"—it jams the words together: mountnousilwet. Iron becomes i-ron, curator is guraytor, and idea is i-dee-ay. And when the curator tells the albino that he's got no i-dee-ay what the guy's talking about, he's supposed to be yelling—after all, a pistol has been drawn. But as voiced by the Kindle, the exchange reads more like a pleasant disagreement over correct change.
Why is Amazon's text-to-speech system so bad? Because human speech is extremely varied, too complex and subtle for computers to understand and replicate. Researchers can get computers to read words as they appear on the page, but because machines don't understand what they're reading, they can't infuse the speech with necessary emotion and emphasis.
Consider this simple exchange:
I'm going to ace this test.
A human reader would understand that the second sentence is meant sarcastically. So would a duplicitous machine like HAL 9000. But today's computers wouldn't get it; a robot would think the guy really was going to ace that test. Andy Aaron, a text-to-speech researcher at IBM's Watson Research Center in New York, gave me another scenario. Imagine that we learn near the end of a book that something that an obscure character said in Chapter 1 had come true. "How is a computer going to understand that—to know that it's got to pause there for dramatic effect?" Aaron asks. "I'm not saying it's impossible," he adds, "but I would say it's very far off to have an automatic system read a book as well as a professional actor. It's not on the horizon. I would say it's many, many years off—there are many hurdles between now and then."
Still, text-to-speech machines have come a long way since the 1970s, when they were first invented. The earliest systems, known as "formant synthesizers," reproduced speech by mimicking the varying resonances of a human voice. (The process is similar to how synths can ape a variety of musical instruments.) In 1978, Texas Instruments released the Speak & Spell, the first mainstream product to rely on this method of synthesis. The machine's voice was distorted and mechanical-sounding, but you could make it out; when it said a word, you could usually recognize it well enough to spell it. (Play along with a demo of Speak & Spell here.)
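For readers who want to see the idea in code, here is a minimal Python sketch of formant synthesis. It is not the Speak & Spell's actual method or any shipping product's; it simply passes a buzzing pulse train (a stand-in for the vocal cords) through a chain of resonating filters tuned to rough textbook values for an "ah" vowel. The sample rate, the bandwidths, and every function name are assumptions made for illustration.

```python
import math

SAMPLE_RATE = 16000  # samples per second; an assumed value for this sketch


def resonator(signal, freq_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Two-pole digital resonator that boosts energy near freq_hz,
    playing the role of one resonance (formant) of the vocal tract."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # keeps the gain at 0 Hz equal to 1
    out = []
    y1 = y2 = 0.0  # the two previous output samples
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def pulse_train(pitch_hz, duration_s, sample_rate=SAMPLE_RATE):
    """A crude voice source: one unit pulse per pitch period, silence otherwise."""
    period = int(round(sample_rate / pitch_hz))
    n = int(duration_s * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(formants, pitch_hz=120.0, duration_s=0.5):
    """Run the pulse train through one resonator per formant, then normalize."""
    signal = pulse_train(pitch_hz, duration_s)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]


# Approximate formant frequencies for "ah"; the bandwidths are illustrative guesses.
AH_FORMANTS = [(730, 90), (1090, 110), (2440, 170)]
samples = synthesize_vowel(AH_FORMANTS)
```

Writing `samples` to a sound file should produce a short, buzzy, recognizably vowel-like tone. Changing the formant frequencies changes which vowel it resembles, which is the basic trick behind this family of synthesizers.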
In 1982, Mark Barton and Joseph Katz, two software engineers, used formant synthesis to produce the first commercial program that could make your computer talk. That program, called Software Automatic Mouth, ran on Apple, Atari, and Commodore machines. Apple liked the program so much that it asked Barton and Katz to help build a text-to-speech system into the company's new Macintosh computer. "For the first time ever, I'd like to let Macintosh speak for itself," Steve Jobs crowed at the Mac's unveiling in 1984. And then, to gasps in the crowd, the computer began to talk.
As computers got more powerful, speech researchers found a different way to make machines talk. Rather than having computers synthesize human speech, you could record real people saying a lot of things and then use computers to splice together different parts of the recording to make new words and sentences. This method is known as concatenative speech synthesis, and it's now the dominant method for computerized speech. Under ideal circumstances, it can yield voices that sound eerily human. Consider this clip of IBM's concatenative text-to-speech system, Naxpres, reading a few lines from the Declaration of Independence.
If you listen closely, you can hear a few unusual pronunciations and a slightly unnatural rhythm—the two syllables of the word equal sound like they were spoken by different speakers, and there's a singsongy lilt to the clause "that among these are life." But that's if you listen closely; if you weren't on guard for a computer's voice, you might mistake the speaker for human.
To produce such a system, Aaron and his colleagues begin by recruiting professional voice actors to record a huge database of human speech. This is a difficult job. Actors are asked to read about 10,000 lines, which takes around two weeks. Because their words will be spliced together from different recording sessions, they've got to keep their voices consistent over the two-week session. What's more, many of the lines they're asked to read are nonsense—researchers pick the sentences not for their meaning but in order to get the actors to use many different phonemes, the basic linguistic units of sound. (There are about 40 phonemes in the English language; the word dollar, for example, contains four phonemes—D, AA, L, and ER.) During an interview, IBM's Andy Aaron read out some of the lines that actors are asked to read:
Says the cheeky thug.
There's a wood-burning stove.
Few love working at KGO now.
Did Michelangelo zap you?
When the actors are done, Aaron's software analyzes the recordings and chops up the words into different phonemes. Now the system can begin to convert text to speech: When it's called on to read a new line, it determines which phonemes are in the sentence and then searches its database for the best representations of those sounds. The system then splices the patchwork of sounds into a smooth sentence.
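The three steps described above (look up the phonemes, pick the best recording of each, splice them together) can be sketched in a few lines of Python. This is a toy, not IBM's software: the one-word lexicon, the fake "recordings," the pitch values, and all names are invented for illustration, and the selection rule is a simple greedy pitch match rather than the more elaborate search a real system would likely use.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """One recorded snippet of a single phoneme (hypothetical structure)."""
    phoneme: str
    pitch_hz: float
    samples: list


# Toy pronunciation dictionary; "dollar" uses the phonemes given in the article.
LEXICON = {"dollar": ["D", "AA", "L", "ER"]}


def fake_recording(value, length=8):
    """Stand-in for real audio: a short run of identical sample values."""
    return [value] * length


# Toy database: several candidate recordings per phoneme, at different pitches.
DATABASE = {
    "D": [Unit("D", 110.0, fake_recording(0.1)), Unit("D", 140.0, fake_recording(0.2))],
    "AA": [Unit("AA", 115.0, fake_recording(0.3)), Unit("AA", 180.0, fake_recording(0.4))],
    "L": [Unit("L", 120.0, fake_recording(0.5))],
    "ER": [Unit("ER", 118.0, fake_recording(0.6)), Unit("ER", 90.0, fake_recording(0.7))],
}


def to_phonemes(text):
    """Step 1: determine which phonemes are in the sentence."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(LEXICON[word])
    return phonemes


def select_units(phonemes):
    """Step 2: for each phoneme, greedily pick the candidate whose pitch
    is closest to the previously chosen unit, so the joins are less jarring."""
    chosen = []
    for phoneme in phonemes:
        candidates = DATABASE[phoneme]
        if chosen:
            previous = chosen[-1]
            best = min(candidates, key=lambda u: abs(u.pitch_hz - previous.pitch_hz))
        else:
            best = candidates[0]
        chosen.append(best)
    return chosen


def splice(units, overlap=2):
    """Step 3: join the snippets, crossfading over a few samples at each seam."""
    out = list(units[0].samples)
    for unit in units[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # weight shifts toward the new unit
            out[-overlap + i] = (1 - w) * out[-overlap + i] + w * unit.samples[i]
        out.extend(unit.samples[overlap:])
    return out


units = select_units(to_phonemes("dollar"))
audio = splice(units)
```

Even in this toy, the trade-off is visible: the more candidate recordings each phoneme has, the better the odds of finding one that joins smoothly with its neighbors.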
The main disadvantage of concatenative text-to-speech machines is that they require a great deal of storage space for their phoneme databases. This is fine for customer-service phone lines, which can run off huge computers in a server farm, but mobile systems—like GPS navigators—don't have as much onboard memory. Those systems usually ship with a shrunken database of sound, one that has fewer phoneme recordings. This degrades the quality of the speech; it's why the Kindle or your GPS doesn't sound as human as that computer reading the Declaration of Independence.
Aaron says the next great area of interest for text-to-speech researchers is emotion. IBM has made some rudimentary progress in this field. Recently, Aaron asked his voice actors to read some lines in one of several different intonations: cheerful, dejected, with emphasis, and as if they were asking a question. This gives the system a database of expressive speech. If you want a computer to say something cheerfully (say, "Good news, I've found an aisle seat for you on that flight!"), programmers can wrap the sentence in <goodnews> tags, and the system will know to search for cheerfully spoken phonemes.
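Here is a rough Python sketch of how such markup might be handled. It only shows the bookkeeping: splitting a script into stretches of text, each labeled with the speaking style whose recordings should be searched. The closing-tag syntax, the mapping from tag to style, the function name, and the "Thanks for holding" opener are assumptions for illustration; only the <goodnews> tag itself comes from Aaron's description.

```python
import re

# Matches <tag>...</tag>; the closing-tag form is an assumption of this sketch.
TAG_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

# Hypothetical mapping from markup tag to the expressive database to search.
STYLE_FOR_TAG = {"goodnews": "cheerful"}


def split_by_style(marked_up_text, default_style="neutral"):
    """Return (style, text) pairs in reading order."""
    segments = []
    position = 0
    for match in TAG_PATTERN.finditer(marked_up_text):
        # Untagged text before this tag is read in the default style.
        before = marked_up_text[position:match.start()].strip()
        if before:
            segments.append((default_style, before))
        # Unknown tags fall back to the default style.
        style = STYLE_FOR_TAG.get(match.group(1), default_style)
        segments.append((style, match.group(2).strip()))
        position = match.end()
    rest = marked_up_text[position:].strip()
    if rest:
        segments.append((default_style, rest))
    return segments


script = (
    "Thanks for holding. "
    "<goodnews>Good news, I've found an aisle seat for you on that flight!</goodnews>"
)
segments = split_by_style(script)
```

Note what the code cannot do: a person still has to decide which sentences deserve the tag.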
To see how this might work, listen to IBM's system saying the phrase "These cookies are delicious" in a flat voice.
Now here's the IBM system saying the same thing in a happy voice.
The hard part is for programmers to know which expression the computer should use, and when. The computer, of course, can't decide for itself whether a line should be upbeat. That's the fundamental problem with the Kindle's audiobook function. One day its voice might resemble a human's. But we're still a long way from a computer being able to understand that when an albino points a pistol at you, you're supposed to scream.