Original URL: https://www.theregister.com/2007/05/15/google_translation/

How Google translates without understanding

Most of the right words, in mostly the right order

By Bill Softky

Posted in Software, 15th May 2007 00:23 GMT

Column After just a couple of years of practice, Google can claim to produce the best computer-generated language translations in the world - in languages their boffin creators don't even understand.

Last summer, Google took top honors at a machine-translation bake-off sponsored by the American agency NIST, besting IBM in English-Arabic and English-Chinese. The crazy part is that no one on the Google team even understands those languages: the automatic-translation engines they constructed triumphed by sheer brute-force statistical extrapolation rather than "understanding".

I spoke with Franz Och, Google's enthusiastic machine-translation guru, about this unusual new approach.

Sixty years of failure

Ever since the Second World War there have been two competing approaches to automatic translation: expert rules vs. statistical deciphering.

Expert-rule buffs have tried to automate the grammar-school approach of diagramming sentences (using modifiers, phrases, and clauses): for example, "I visited (the house next to (the park) )." But like other optimistic software efforts, the exact rules foundered on the ambiguities of real human languages. (Think not? Try explaining this sentence: "Time flies like an arrow, but fruit flies like a banana.")

The competing statistical approach began with cryptography: treat the second language as an unknown code, and use statistical cues to find a mathematical formula to decode it, like the Allies did with Hitler's famous Enigma code. While those early "deciphering" efforts foundered on a lack of computing power, they have been resurrected in the "Statistical Machine Translation" approach used by Google, which eschews strict rules in favor of noticing the statistical correlations between "white house" and "casa blanca." Statistics deals with ambiguity better than rules do, it turns out.

Under Google's hood

The Google approach is a lesson in practical software development: try things and see what sticks. It has just a few major steps:

  1. Google starts with lots and lots of paired-example texts, like formal documents from the United Nations, in which identical content is expertly translated into many different languages. With these documents they can discover that "white house" tends to co-occur with "casa blanca," so that the next time they have to translate a text containing "white house" they will tend to use "casa blanca" in the output. (A toy sketch of this counting follows the list.)
  2. They have even more untranslated text in each language, which lets them make models of "well-formed" sentence fragments (for example, preferring "white house" to "house white"). So the raw output from the first translation step can be further massaged into (statistically) nicer-sounding text.
  3. Their key for improving the system - and winning competitions - is an automated performance metric, which assigns a translation quality number to each translation attempt. More on this fatally weak link below.
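
To make step one concrete, here is a toy Python sketch of that co-occurrence counting - my illustration, not Google's code, with an invented three-sentence corpus:

    from collections import Counter
    from itertools import product

    # Hypothetical sentence-aligned corpus: (English, Spanish) pairs.
    parallel = [
        ("the white house", "la casa blanca"),
        ("the white horse", "el caballo blanco"),
        ("the house", "la casa"),
    ]

    cooccur = Counter()
    for en, es in parallel:
        # Treat every English word as a candidate pairing for every Spanish
        # word in the same sentence; real systems weight and prune the counts.
        for e, s in product(en.split(), es.split()):
            cooccur[(e, s)] += 1

    # "house" pairs with "casa" twice but with "caballo" never - the kind
    # of signal a phrase table is distilled from.
    print(cooccur[("house", "casa")], cooccur[("house", "caballo")])  # 2 0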

This game needs loads of computational horsepower for learning and testing, and a software architecture which lets Google tweak code and parameters to improve upon its previous score. So given these ingredients, Google's machine-translation strategy should be familiar to any software engineer: load the statistics, translate the examples, evaluate the translations, twiddle the system parameters, and repeat.
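
That strategy is simple enough to caricature in a few lines of Python. Everything here is a stand-in - a do-nothing decoder, a toy metric, random parameter twiddling - but the loop structure is the point:

    import random

    def translate(src, params):
        return src                       # stand-in for the real decoder

    def metric(hyp, ref):
        # Stand-in quality score: fraction of reference words present.
        h, r = set(hyp.split()), set(ref.split())
        return len(h & r) / max(len(r), 1)

    def perturb(params):
        return [p + random.gauss(0, 0.1) for p in params]  # twiddle

    def score(params, dev):
        return sum(metric(translate(s, params), r) for s, r in dev) / len(dev)

    def tune(params, dev, rounds=100):
        best = score(params, dev)
        for _ in range(rounds):          # translate, evaluate, twiddle, repeat
            candidate = perturb(params)
            s = score(candidate, dev)
            if s >= best:                # keep whatever raises the score
                params, best = candidate, s
        return params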

What is clearly missing from this approach is any form of "understanding". The machine has no idea that "walk" is an action using "feet," except when its statistics tell it the text strings "walk" and "feet" sometimes show up together. Nor does it know the subtle differences between "to boycott" and "not to attend." Och emphasized that the system does not even represent nouns, verbs, modifiers, or any of the grammatical building blocks we think of as language. In fact, he says, "linguists think our structures are weird" - but he demurred on actually describing them. His machine contains only statistical correlations and relationships, no more or less than "what is in the data." Each word and phrase in the source votes for various phrases in the output, and the final result is a kind of tallying of those myriad votes.
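
In caricature, that vote-tallying might look like this - an invented phrase table and a greedy tally, with none of the scale or subtlety of the real decoder:

    # Invented toy phrase table: source phrase -> {candidate output: vote weight}.
    phrase_votes = {
        "the":         {"la": 0.7, "el": 0.3},
        "white house": {"casa blanca": 0.9, "blanca casa": 0.1},
    }

    def tally(source_phrases):
        out = []
        for sp in source_phrases:
            votes = phrase_votes.get(sp, {})
            if votes:
                out.append(max(votes, key=votes.get))  # highest tally wins
        return out

    print(tally(["the", "white house"]))  # ['la', 'casa blanca']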

Winning at chess, losing at language

This approach is much like computerized chess: make a statistical model of the domain and optimize the hell out of it, ultimately winning by sheer computational horsepower. Like chess (but unlike vision), language is a source of pride, something both complex and uniquely human. For chess, computational optimization worked brilliantly; the best chess-playing computers, like Deep Blue, are better than the best human players. But score-based optimization won't work for language in its current form, even though it does do two really important things right.

The first good thing about statistical machine translation is the statistics. Human brains are statistical-inference engines, and our senses routinely make up for noisy data by interpolating and extrapolating whatever pixels or phonemes we can rely on. Statistical analysis makes better sense of more data than strict rules do, and statistical rules produce more robust outputs. So any ultimate human-quality translation engine must use statistics at its core.

The other good thing is the optimization. As I've argued earlier, the key to understanding and duplicating brain-like behavior lies in optimization, the evolutionary ratchet which lets an accumulation of small, even accidental adjustments slowly converge on a good result. Optimization doesn't need an Einstein, just the right quality metric and an army of engineers.

So Och's team (and their competitors) have the overall structure right: they converted text translation into an engineering problem, and have a software architecture allowing iterative improvement. So they can improve their Black Box - but what's inside it? Och hinted at various trendy algorithms (Discriminative Learning and Expectation Maximization, I'll bet Bayesian Inference too), although our ever-vigilant chaperon from Google Communications wouldn't let him speak in detail. But so what? The optimization architecture lets you swap out this month's algorithm for a better one, so algorithms will change as performance improves.
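
For flavor, here is what one of those algorithms looks like in miniature: a toy Expectation Maximization word-aligner in the style of IBM's Model 1, a classic of the statistical-translation literature. It is an illustration of the technique, not a peek inside Google's box:

    from collections import defaultdict

    # Three invented sentence pairs (English words, Spanish words).
    pairs = [("the house".split(), "la casa".split()),
             ("the".split(), "la".split()),
             ("house".split(), "casa".split())]

    t = defaultdict(lambda: 0.5)         # t[(f, e)]: P(foreign word f | English e)
    for _ in range(10):                  # a few EM rounds
        count, total = defaultdict(float), defaultdict(float)
        for en, fr in pairs:
            for f in fr:                 # E-step: expected alignment counts
                norm = sum(t[(f, e)] for e in en)
                for e in en:
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():  # M-step: re-estimate probabilities
            t[(f, e)] = c / total[e]

    print(round(t[("casa", "house")], 2))  # converges toward 1.0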

Or maybe not. The Achilles' heel of optimization is that everything depends on the performance metric, which in this case clearly misses a lot. That's not a problem for winning contests - the NIST competition used the same "BLEU" (Bilingual Evaluation Understudy) metric as Google practiced on, so Google's dramatic win mostly proved that Google gamed the scoring system better than IBM did. But the worse the metric, the less likely the translations will make sense.

The gist of the problem is that because machines don't yet understand language - that's the original problem, right? - they can't be very good at automatically evaluating language translations either. So researchers have to bootstrap the BLEU score: take a scheme which merely compares the similarity of two same-language documents - a candidate translation against a gold-standard human translation - and verify that, on average, humans prefer reading outputs with high scores.

The BLEUs

But all BLEU really measures is word-by-word similarity: are the same words present in both documents, somewhere? The same word-pairs, triplets, quadruplets? In obviously extreme cases, BLEU works well; it gives a low score if the documents are completely different, and a perfect score if they're identical. But in between, it can produce some very screwy results.
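
Stripped of its brevity penalty and smoothing, the heart of BLEU fits in a few lines. This sketch (mine, not NIST's reference implementation) computes clipped n-gram precision, and shows how a scrambled sentence can ace the word-level test:

    from collections import Counter

    def ngrams(words, n):
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    def ngram_precision(candidate, reference, n):
        # What fraction of the candidate's n-grams appear in the reference,
        # with counts clipped so repeated words can't be milked for credit?
        cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        return overlap / max(sum(cand.values()), 1)

    ref = "the cat sat on the mat"
    print(ngram_precision("the cat sat on the mat", ref, 1))  # 1.0: identical
    print(ngram_precision("mat the on sat cat the", ref, 1))  # 1.0: same words, scrambled
    print(ngram_precision("mat the on sat cat the", ref, 2))  # 0.0: word pairs catch it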

The most obvious problem is that paraphrases and synonyms score zero; to get any credit with BLEU, you need to produce the exact same words as the reference translation has: "Wander" doesn't get partial credit for "stroll," nor "sofa" for "couch."

The complementary problem is that BLEU can give a high similarity score to nonsensical language which contains the right phrases in the wrong order. Consider first this typical, sensible output from a NIST contest:

"Appeared calm when he was taken to the American plane, which will to Miami, Florida"

Now here is a possible garbled output which would get the very same score:

"was being led to the calm as he was would take carry him seemed quite when taken"

The core problem is that word-counting scores like BLEU - the linchpin of the whole machine-translation competition circuit - don't even recognize well-formed language, much less real translated meaning. (A stinging academic critique of BLEU can be found here.)

A classic example of how the word-by-word translation approach fails comes from German, a language which is too "tough" for Och's team to translate yet (although Och himself is a native speaker). German's problem is its relative-to-English-tangled Wordorder; take this example from Mark Twain's essay "The Awful German Language":

"But when he, upon the street, the (in-satin-and-silk-covered-now-very-unconstrained-after-the-newest-fashioned-dressed) government counselor's wife met, etc"

Until computers deal with the actual language structure (the hyphens and parentheses above), they will have no hope of translating even as well as Mark Twain did here.

So why are computers so much worse at language than at chess? Chess has properties that computers like: a well-defined state and well-defined rules for play. Computers do win at chess, like at calculation, because they are so exact and fussy about rules. Language, on the other hand, needs approximation and inference to extract "meaning" (whatever that is) together from text, context, subject matter, tone, expectations, and so on - and the computer needs yet more approximation to produce a translated version of that meaning with all the right interlocking features. Unlike chess, the game of language is played on the human home-turf of multivariate inference and approximation, so we will continue to beat the machines.

But for Google's purposes, perfect translation may not even be necessary. Google succeeded in web-search partly by avoiding the exact search language of AltaVista in favor of a tool which was fast, easy to use, and displayed most of the right results in mostly the right order. Perhaps it will also be enough for Google to machine-translate most of the right words in mostly the right order, leaving to users the much harder task of extracting meaning from them. ®

Bill Softky has written a neat utility for Excel power users called FlowSheet: it turns cryptic formulae like "SUM(A4:A7)/D5" into pretty, intuitive diagrams. It's free, for now. Check it out.
