Future Tense

Why Computers Still Can’t Translate Languages Automatically

We need to teach machines to understand the meaning of words. That’s really hard.

A soldier translator in Afghanistan

Photograph by Ted Aljibe/AFP/Getty Images.

Recently, on the eighth floor of an office building in Arlington, Va., Rachael held her finger down on a Dell Streak touchscreen and asked Aziz whether he knew the village elder. The handheld tablet beeped as if imitating R2-D2 and then said what sounded like, “Aya tai ahili che dev kali musha.” Aziz replied in Pashto, and the Streak said in a monotone: “Yes, I know.” Rachael asked: “Would you introduce me to him?” Aziz failed to understand the machine’s translation (though he does speak English), so she asked again: “Could you introduce me to the village elder?” This time, there was success, after a fashion. Aziz, via the device, replied: “Yes, I can introduce myself to you.” Aziz, who is at most middle-aged and was wearing a sweater vest, was not the village elder.*

The software running on the tablet was the culmination of TransTac, a five-year Defense Advanced Research Projects Agency effort, concluded last year, to create a system for “speech to speech” translation (as opposed to text-based systems). Mari Maeda, the DARPA manager who ran the program, says that, by the end, TransTac achieved about 80 percent accuracy: enough to be interesting, but not enough to be useful. A few dozen users in Iraq and Afghanistan tried it—in addition to Pashto, the program worked for Arabic and Dari—but no one was impressed enough to want to keep it.

This doesn’t mean TransTac was a failure. It set out to do something very hard: getting a computer to listen to a person speak in one language, translate that speech into another language, and pronounce it aloud. The dream of using computers to translate human language goes back to the very early days of computing, when computers still used vacuum tubes. But it has consistently proved elusive.

DARPA is, of course, not the only organization funding research into what computer scientists call “machine translation.” (This includes both speech-to-speech systems like TransTac and systems that translate written texts, a simpler problem in some ways, though the core difficulties are the same in both.) But the agency has played a central role. The Defense Department spent nearly $700 million in a single year on one translation contract (for human interpreters, mostly in Afghanistan), so the more than $80 million it is spending on BOLT, a successor program to TransTac, is a relative bargain if it eventually saves money on translators.

The central question guiding most of these projects is: How can you tell when a translation is any good? Even humans struggle to rank different translations. This makes the challenge of automating evaluations even starker. And if you don’t know or can’t assess how well you’re doing, it’s hard to improve.

For decades, researchers were unable to program computers to produce useful translations. Soldiers relied on phrasebooks with phonetic pronunciations (“VO ist NAWR-den?” is how a 1943 War Department pamphlet told GIs to say “Which way is north?” in German). The “Phraselator,” which the Army started using in 2004, wasn’t much more advanced—it was essentially a computerized phrasebook. But the last few years have seen the widespread adoption of statistical machine translation (SMT), a technique that has vastly improved quality.

Rather than trying to explicitly encode rules for translating from one language to another, the aim of SMT is to get algorithms to infer those rules from existing databases of translated texts. The most plentiful such databases are of texts that are legally required to be translated into many languages, like the proceedings of the European Union, which are translated by humans into the EU’s 23 official languages. If such databases aren’t already available, you have to make them. For TransTac, DARPA did this by recording skits between about 50 American soldiers and Marines and another 50 or so Arabic speakers. The participants play-acted various scenarios like a checkpoint or a house search (albeit in California).

The central challenge of SMT is how to use the information contained in such “parallel corpora” to build models of how each language works on its own, and of how languages interrelate. A model for a given language—say, English—is a way of figuring out how likely a string of words is to be a valid sentence. (“Translation logic green slate,” for instance, is an unlikely string.) SMT programs then work by interrelating the models of each language with one another. Typically on a sentence-by-sentence level, the program translates by finding words in the target language that both make sense together grammatically and are likely to match well with their analogues in the source language.
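To make the mechanics concrete, here is a deliberately tiny Python sketch of those two ingredients: a crude bigram language model built from a toy corpus, and a hand-made word-translation table. Everything in it (the corpus, the probabilities, the pseudo-Spanish source words) is invented for illustration; real systems estimate these numbers from millions of translated sentence pairs.

```python
# A toy illustration of the two ingredients an SMT system combines: a
# "language model" that scores how plausible a target-language string is,
# and a "translation model" that scores how well words match across
# languages. Everything here (the tiny corpus, the probabilities, the
# pseudo-Spanish source words) is invented for illustration; real systems
# estimate these numbers from millions of translated sentence pairs.

from collections import Counter
from itertools import product
import math

# Tiny "target language" corpus used to build a bigram language model.
corpus = "the elder knows the village . the village knows the elder .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def lm_score(words):
    """Log-probability of a word sequence under an add-one-smoothed bigram model."""
    vocab = len(unigrams)
    return sum(math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
               for prev, cur in zip(words, words[1:]))

# A hand-made "translation model": P(target word | source word).
translation_probs = {
    "el": {"the": 1.0},
    "anciano": {"elder": 0.7, "old": 0.3},
    "conoce": {"knows": 0.9, "meets": 0.1},
    "pueblo": {"village": 0.8, "town": 0.2},
}

def translate(source_words):
    """Pick the target sentence that best balances word fidelity and fluency."""
    options = [list(translation_probs[w].items()) for w in source_words]
    best_score, best_words = None, None
    for choice in product(*options):
        words = [w for w, _ in choice]
        score = sum(math.log(p) for _, p in choice) + lm_score(words)
        if best_score is None or score > best_score:
            best_score, best_words = score, words
    return " ".join(best_words)

print(translate(["el", "anciano", "conoce", "el", "pueblo"]))
# -> the elder knows the village
```

Even at this scale, the division of labor is visible: the translation table proposes candidate words, and the language model vetoes combinations that don’t sound like English.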

To learn those correspondences from a parallel corpus, the models have to be able to properly align its sentences. There isn’t necessarily a one-to-one correspondence between sentences in different languages. If you get thrown off by one and systematically misalign the subsequent sentences, you’ll get junk data. Then there’s the question of how to link the words in the source language with words in the target language—words don’t line up one-to-one, either, and word order can vary substantially between languages. But the idea is that, if you throw enough data at the problem, the “noise” of imperfect alignment will diminish in comparison to the signal of correlations between the same idea expressed in different languages.
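As a rough illustration of the alignment step, the sketch below follows the spirit of classic length-based aligners (Gale and Church’s, for example): it assumes corresponding sentences have similar lengths and uses dynamic programming to choose among one-to-one, one-to-two, and two-to-one pairings. The cost function, the per-pairing penalty, and the sample sentences are all simplified inventions; real aligners use a proper statistical length model plus lexical cues.

```python
# A bare-bones sentence aligner in the spirit of length-based methods:
# it assumes corresponding sentences have similar total lengths and uses
# dynamic programming to pick 1-1, 1-2, or 2-1 pairings. The cost function
# and the sample "documents" are simplified inventions.

def align(src, tgt):
    """Return a list of (source-group, target-group) sentence pairings."""
    INF = float("inf")
    n, m = len(src), len(tgt)
    # cost[i][j] = best cost of aligning the first i source and j target sentences
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    def length_cost(s_group, t_group):
        # Penalize mismatched total character lengths.
        return abs(sum(map(len, s_group)) - sum(map(len, t_group)))

    moves = [(1, 1), (1, 2), (2, 1)]  # 1-1, 1-2, 2-1 pairings
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = cost[i][j] + length_cost(src[i:ni], tgt[j:nj]) + 2.0  # small per-pairing penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Trace back the cheapest path into explicit groupings.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return list(reversed(beads))

# Toy example: the second source sentence was split in two in the "translation".
src = ["The elder lives here.", "He knows every family, and he settles their disputes."]
tgt = ["Der Älteste wohnt hier.", "Er kennt jede Familie.", "Er schlichtet ihre Streitigkeiten."]
for s, t in align(src, tgt):
    print(s, "<->", t)
```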

The statistical approach changed the field. However, Bonnie Dorr, program manager for BOLT, says that DARPA is now “very focused on moving beyond statistical models.” The reason is that, as you throw more and more parallel data at your algorithms, you “get diminishing returns. The payoff gets smaller, and you start to plateau with your results even if you increase the volume of training data.”

The first step beyond pure statistics was syntax: parsing sentences to figure out which word plays which grammatical role, then attempting to match verbs to verbs and nouns to nouns. This helps with problems like radically different word orders, and incorporating syntactic information into statistical models does seem to have improved performance. But syntax still hasn’t helped researchers get at the basic question: Is a translation any good?

The best method so far for scoring machine translation programs is a metric devised at IBM. The metric, called BLEU, is not very good, but it is useful because it is consistent. BLEU works by comparing a machine translation of a particular text with a reference translation of the same text, done by a human, and figuring out how “distant” the two are. It does this by computing a composite score based on how many words in the computer translation also appear in the human reference translation, how many two-word phrases match, how many three-word phrases, four-word phrases, and so on. (Long matching phrases are rare to nonexistent.) But as Philipp Koehn, a prominent machine translation researcher, has written, no one knows what BLEU scores mean, and good human translations often do only negligibly better on the BLEU test than machine translations. Koehn gives the example of a sentence translated from Chinese. Which is better: “Israel is in charge of the security at this airport” or “Israeli officials are responsible for airport security”?
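The sketch below implements a stripped-down version of the BLEU idea: count how many one-, two-, three-, and four-word phrases in the machine output also appear in a human reference, and fold the counts into one number. It omits refinements of the real metric (multiple references, proper smoothing), but it is enough to show how little credit an acceptable paraphrase gets when the other acceptable translation is used as the reference.

```python
# A stripped-down version of the BLEU idea: overlap of n-grams (runs of
# 1 to 4 words) between a candidate translation and a human reference,
# combined into one score. Real BLEU adds multiple references and better
# smoothing; treat this as a sketch.

from collections import Counter
import math

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # avoid log(0)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return geo_mean * brevity

ref = "israeli officials are responsible for airport security"
print(simple_bleu("israel is in charge of the security at this airport", ref))
print(simple_bleu("israeli officials are responsible for airport security", ref))
```

Run on Koehn’s example, the paraphrase scores close to zero while the reference trivially scores 1.0, even though a human would call both translations fine.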

What you want to know is whether the translation got the meaning right, not whether it used the same words. So DARPA hopes to create “semantic evaluation metrics” that measure how faithfully meaning was conveyed. One approach, which Dorr says DARPA is already taking, is to have a human compare meanings and determine how many words in a computer’s translation must be changed to match the meaning of a reference translation. But that kind of human intervention is slow and expensive. Such semantic evaluation metrics can give you a sense of whether you’ve made progress over the long run, but they aren’t much use for tweaking the parameters of your model. To do that, you need to be able to capture meaning in an automated way.
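The counting step Dorr describes boils down to an edit distance computed over words rather than characters. The function below is a generic word-level version, close in spirit to the edit-rate metrics used in the field, though in the real procedure the edits are counted against a human-corrected version of the machine output, and producing that correction is the slow, expensive part.

```python
# Word-level edit distance, normalized by reference length: the minimum
# number of word insertions, deletions, and substitutions needed to turn
# the machine output into the reference, divided by the reference length.

def word_edit_rate(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = minimum edits to turn the first i hypothesis words
    # into the first j reference words.
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # delete a hypothesis word
                           dp[i][j - 1] + 1,       # insert a reference word
                           dp[i - 1][j - 1] + sub) # keep or substitute
    return dp[-1][-1] / max(len(ref), 1)

print(word_edit_rate("yes i can introduce myself to you",
                     "yes i can introduce you to him"))
```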

Meaning is, of course, a slippery target, but it is not an all-or-nothing proposition. A program doesn’t have to (and can’t) get at all the layers of meaning in a sentence like “I love you.” It can help just to determine that “love” is not just a verb, but an emotionally valenced word, and that “you” is not just the object of the sentence, but also the beloved. This sort of shallow semantic knowledge isn’t interesting if you’re trying to find out what meaning means in some deeper sense. But it’s enough to be potentially useful. Attaching such signifiers to words or strings of words is known as “semantic tagging.”
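Here is roughly what such shallow tags might look like: each word picks up a handful of labels rather than a full analysis of its meaning. The lexicon and the label names below are invented for illustration; real taggers learn them from annotated resources like the FrameNet data described next.

```python
# "Semantic tagging" in the shallow sense: attach labels (roles, sentiment,
# word class) to words rather than trying to capture full meaning.
# The toy lexicon and its labels are made up for illustration.

lexicon = {
    "i": {"pos": "pronoun", "role": "experiencer"},
    "love": {"pos": "verb", "frame": "emotion", "sentiment": "positive"},
    "you": {"pos": "pronoun", "role": "beloved"},
}

def tag(sentence):
    """Pair each word with whatever shallow tags the lexicon offers."""
    return [(word, lexicon.get(word.lower(), {"pos": "unknown"}))
            for word in sentence.split()]

for word, tags in tag("I love you"):
    print(word, tags)
```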

This sort of tagging has been done manually for some time. FrameNet at the University of California-Berkeley, one of the oldest semantic databases, has been around since 1997—it now has 170,000 manually annotated sentences, like “I’ll get even with you for this!” But 170,000 sentences is a tiny data set compared with the databases of parallel, untagged texts that exist. The goal of current semantic translation efforts is to automatically do this sort of tagging and then use the result as input into statistical models.

Automatic semantic tagging is obviously hard. You have to deal with things like imprecise quantifier scope. Take the sentence “Every man admires some woman.” Now, this has two meanings. The first is that there exists a single woman who is admired by every man. (It tells you precisely when I hit puberty if I say that the first name that comes to mind is Cindy Crawford.) The second is that all men admire at least one woman. But how do you say this in Arabic? Ideally, you aim for a phrase that has the same levels of ambiguity. The point of the semantic approach is that rather than attempt to go straight from English to Arabic (or whatever your target language might be), you attempt to encode the ambiguity itself first. Then, the broader context might help your algorithm choose how to render the phrase in the target language.
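In standard first-order logic the two readings come apart cleanly (the predicate names here are just illustrative):

```latex
% Two readings of "Every man admires some woman."

% Reading 1: one particular woman is admired by every man.
\exists y \, \big( \mathrm{Woman}(y) \wedge \forall x \, ( \mathrm{Man}(x) \rightarrow \mathrm{Admires}(x, y) ) \big)

% Reading 2: each man admires at least one woman, possibly a different one.
\forall x \, \big( \mathrm{Man}(x) \rightarrow \exists y \, ( \mathrm{Woman}(y) \wedge \mathrm{Admires}(x, y) ) \big)
```

The English sentence collapses both readings into a single string; a semantic representation like this keeps them distinct until context, or the target language, forces a choice.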

A team at the University of Colorado, funded by DARPA, has built an open-source semantic tagger called ClearTK. They mention difficulties like dealing with the sentence: “The coach for Manchester United states that his team will win.” In that example, “United States” doesn’t mean what it usually does. Getting a program to recognize this and similar quirks of language is tricky.

The difficulty of knowing whether a translation is good is not just technical: It’s fundamental. The only durable way to judge the faithfulness of a translation is to decide whether its meaning was conveyed. If you have an algorithm that can make that judgment, you’ve solved a very hard problem indeed.

When and if a machine translation system eventually works well, when it “understands meaning,” its workings will be a mystery to its creators, almost as much so as they are to the village elder.

This article arises from Future Tense, a collaboration among Arizona State University, the New America Foundation, and Slate. Future Tense explores the ways emerging technologies affect society, policy, and culture. To read more, visit the Future Tense blog and the Future Tense home page. You can also follow us on Twitter.

Correction, May 11, 2012: This article originally misspelled the first name of a woman helping test TransTac. She is Rachael, not Rachel.