Read Me a Story, Mr. Roboto
Why computer voices still don't sound human.
As computers got more powerful, speech researchers found a different way to make machines talk. Rather than having computers synthesize human speech from scratch, you could record real people saying a lot of things and then use computers to splice together different parts of the recordings to make new words and sentences. This method is known as concatenative speech synthesis, and it's now the dominant approach to computerized speech. Under ideal circumstances, it can yield voices that sound eerily human. Consider this clip of IBM's concatenative text-to-speech system, Naxpress, reading a few lines from the Declaration of Independence.
If you listen closely, you can hear a few unusual pronunciations and a slightly unnatural rhythm—the two syllables of the word equal sound like they were spoken by different speakers, and there's a singsongy lilt to the clause "that among these are life." But that's if you listen closely; if you weren't on guard for a computer's voice, you might mistake the speaker for human.
To produce such a system, Aaron and his colleagues begin by recruiting professional voice actors to record a huge database of human speech. This is a difficult job. Actors are asked to read about 10,000 lines, which takes around two weeks. Because their words will be spliced together from different recording sessions, they've got to keep their voices consistent for the entire two weeks. What's more, many of the lines they're asked to read are nonsense—researchers pick the sentences not for their meaning but to get the actors to use many different phonemes, the basic linguistic units of sound. (There are about 40 phonemes in the English language; the word dollar, for example, contains four phonemes—D, AA, L, and ER.) During an interview, IBM's Andy Aaron read out some of the lines that actors are asked to read:
Says the cheeky thug.
There's a wood-burning stove.
Few love working at KGO now.
Did Michelangelo zap you?
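The phoneme breakdown described above—dollar becoming D, AA, L, and ER—amounts to a dictionary lookup. Here's a minimal sketch in Python; the tiny dictionary below is illustrative, not a real pronunciation lexicon, and a production system would fall back on letter-to-sound rules for unknown words.

```python
# Toy phoneme lookup: map words to ARPAbet-style phoneme lists.
# This mini-dictionary is illustrative, not a real pronunciation lexicon.
PHONEME_DICT = {
    "dollar": ["D", "AA", "L", "ER"],
    "stove":  ["S", "T", "OW", "V"],
    "thug":   ["TH", "AH", "G"],
}

def to_phonemes(sentence):
    """Split a sentence into words and look up each word's phonemes."""
    phonemes = []
    for word in sentence.lower().split():
        # "?" marks words the dictionary doesn't know.
        phonemes.extend(PHONEME_DICT.get(word, ["?"]))
    return phonemes

print(to_phonemes("dollar"))  # ['D', 'AA', 'L', 'ER']
```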
When the actors are done, Aaron's software analyzes the recordings and chops up the words into different phonemes. Now the system can begin to convert text to speech: When it's called on to read a new line, it determines which phonemes are in the sentence and then searches its database for the best representations of those sounds. The system then splices the patchwork of sounds into a smooth sentence.
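The pipeline Aaron describes—text to phonemes, a database search for the best recorded unit, then splicing—can be roughly sketched as follows. The database contents, pitch values, and greedy cost function here are all invented for illustration; a real unit-selection system weighs acoustic and join costs far more carefully.

```python
# Rough sketch of concatenative unit selection (illustrative only).
# Each phoneme has several recorded candidates; we greedily pick the one
# whose pitch best matches the previously chosen unit, then "splice".

# Hypothetical database: phoneme -> list of (recording_id, pitch_hz).
DATABASE = {
    "D":  [("d_017", 118.0), ("d_203", 126.5)],
    "AA": [("aa_042", 121.0), ("aa_310", 140.2)],
    "L":  [("l_008", 119.5), ("l_155", 133.0)],
    "ER": [("er_021", 120.3), ("er_099", 138.8)],
}

def select_units(phonemes, database):
    """Greedy unit selection: minimize the pitch jump between adjacent units."""
    chosen = []
    prev_pitch = None
    for ph in phonemes:
        candidates = database[ph]
        if prev_pitch is None:
            unit = candidates[0]
        else:
            # Join cost: absolute pitch difference with the previous unit.
            unit = min(candidates, key=lambda c: abs(c[1] - prev_pitch))
        chosen.append(unit[0])
        prev_pitch = unit[1]
    return chosen

print(select_units(["D", "AA", "L", "ER"], DATABASE))
# ['d_017', 'aa_042', 'l_008', 'er_021']
```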
The main disadvantage of concatenative text-to-speech machines is that they require a great deal of storage space for their phoneme databases. This is fine for customer-service phone lines, which can run off huge computers in a server farm, but mobile systems—like GPS navigators—don't have as much onboard memory. Those systems usually ship with a shrunken database of sound, one that has fewer phoneme recordings. This degrades the quality of the speech; it's why the Kindle or your GPS doesn't sound as human as that computer reading the Declaration of Independence.
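The shrunken databases that ship on mobile devices amount to keeping only a few recordings per phoneme. A minimal sketch of that pruning, with invented recordings and quality scores:

```python
# Toy sketch of shrinking a unit database for a memory-constrained device:
# keep only the top-k candidates per phoneme, ranked by a quality score.
# The recordings and scores below are invented for illustration.
FULL_DATABASE = {
    "AA": [("aa_042", 0.91), ("aa_310", 0.85), ("aa_077", 0.60)],
    "ER": [("er_021", 0.88), ("er_099", 0.72), ("er_140", 0.95)],
}

def prune(database, keep=2):
    """Keep the `keep` highest-scoring recordings for each phoneme."""
    return {
        ph: sorted(units, key=lambda u: u[1], reverse=True)[:keep]
        for ph, units in database.items()
    }

small = prune(FULL_DATABASE, keep=2)
print(small["ER"])  # [('er_140', 0.95), ('er_021', 0.88)]
```

With fewer candidates per phoneme, the selection step has less to choose from, which is why the spliced result sounds less natural.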
Aaron says the next great area of interest for text-to-speech researchers is emotion. IBM has made some rudimentary progress in this field. Recently, Aaron asked his voice actors to read some lines in one of several different intonations—cheerful, dejected, with emphasis, and as if they were asking a question. This gives the system a database of expressive speech. If you want a computer to say something cheerfully—say, "Good news, I've found an aisle seat for you on that flight!"—a programmer can wrap the sentence in <goodnews> tags, and the system will know to search for cheerful phonemes.
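The <goodnews> tags resemble SSML-style speech markup. A minimal sketch of how a system might route tagged spans to a cheerful database and everything else to a neutral one—the tag name comes from the article, but the parsing and style labels here are invented:

```python
import re

# Minimal sketch: split text into (style, span) pairs so the synthesizer
# can search the "cheerful" database for tagged spans and the "neutral"
# database for the rest. Illustrative only.
def split_by_style(text):
    """Return (style, span) pairs, where style is 'cheerful' or 'neutral'."""
    parts = []
    pos = 0
    for m in re.finditer(r"<goodnews>(.*?)</goodnews>", text, re.DOTALL):
        if m.start() > pos:
            parts.append(("neutral", text[pos:m.start()]))
        parts.append(("cheerful", m.group(1)))
        pos = m.end()
    if pos < len(text):
        parts.append(("neutral", text[pos:]))
    return parts

line = "Your total is $240. <goodnews>I've found an aisle seat!</goodnews>"
print(split_by_style(line))
# [('neutral', 'Your total is $240. '), ('cheerful', "I've found an aisle seat!")]
```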
To see how this might work, listen to IBM's system saying the phrase "These cookies are delicious" in a flat voice.
Now here's the IBM system saying the same thing in a happy voice.
The hard part for programmers is knowing when to tell the computer to use which expression. The computer, of course, can't decide for itself whether a line should be upbeat. That's the fundamental problem with the Kindle's audiobook function. One day its voice might resemble a human's. But we're still a long way from a computer being able to understand that when an albino points a pistol at you, you're supposed to scream.
Farhad Manjoo is Slate's technology columnist and the author of True Enough: Learning To Live in a Post-Fact Society. You can email him at email@example.com and follow him on Twitter.
Photograph of Amazon's Kindle 2 from amazon.com.