How Google translates without understanding
Most of the right words, in mostly the right order
Column: After just a couple of years of practice, Google can claim to produce the best computer-generated language translations in the world - in languages their boffin creators don't even understand.
Last summer, Google took top honors at a bake-off between machine-translation engines sponsored by the American agency NIST, besting IBM in English-Arabic and English-Chinese. The crazy part is that no one on the Google team even understands those languages. The automatic-translation engines they constructed triumphed by sheer brute-force statistical extrapolation rather than "understanding".
I spoke with Franz Och, Google's enthusiastic machine-translation guru, about this unusual new approach.
Sixty years of failure
Ever since the Second World War there have been two competing approaches to automatic translation: expert rules vs. statistical deciphering.
Expert-rule buffs have tried to automate the grammar-school approach of diagramming sentences (using modifiers, phrases, and clauses): for example, "I visited (the house next to (the park) )." But like other optimistic software efforts, the exact rules foundered on the ambiguities of real human languages. (Think not? Try explaining this sentence: "Time flies like an arrow, but fruit flies like a banana.")
The competing statistical approach began with cryptography: treat the second language as an unknown code, and use statistical cues to find a mathematical formula to decode it, as the Allies did with Hitler's famous Enigma code. While those early "deciphering" efforts foundered on a lack of computing power, they have been resurrected in the "Statistical Machine Translation" approach used by Google, which eschews strict rules in favor of noticing the statistical correlations between "white house" and "casa blanca." Statistics deals with ambiguity better than rules do, it turns out.
Under Google's hood
The Google approach is a lesson in practical software development: try things and see what sticks. It has just a few major steps:
- Google starts with lots and lots of paired-example texts, like formal documents from the United Nations, in which identical content is expertly translated into many different languages. With these documents they can discover that "white house" tends to co-occur with "casa blanca," so that the next time they have to translate a text containing "white house" they will tend to use "casa blanca" in the output.
- They have even more untranslated text in each language, which lets them make models of "well-formed" sentence fragments (for example, preferring "white house" to "house white"). So the raw output from the first translation step can be further massaged into (statistically) nicer-sounding text.
- Their key for improving the system - and winning competitions - is an automated performance metric, which assigns a translation quality number to each translation attempt. More on this fatally weak link below.
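The first two steps in the list can be sketched in a few lines of Python. This is a toy illustration only, not Google's code: the tiny corpora, the Dice score used to measure co-occurrence, the bigram-count "fluency" score, and every function name here are assumptions made up for the example.

```python
from collections import Counter
from itertools import permutations

# Hypothetical paired-example texts (English / Spanish).
PARALLEL = [
    ("the white house", "la casa blanca"),
    ("the white table", "la mesa blanca"),
    ("the old house", "la casa vieja"),
    ("the old table", "la mesa vieja"),
]
# Hypothetical untranslated text in the output language.
MONOLINGUAL = ["la casa blanca es grande", "una casa blanca", "la mesa blanca"]

def train_pairs(parallel):
    """Count, per sentence pair, which source and target words show up together."""
    src_n, tgt_n, both_n = Counter(), Counter(), Counter()
    for src, tgt in parallel:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_n.update(s_words)
        tgt_n.update(t_words)
        both_n.update((s, t) for s in s_words for t in t_words)
    return src_n, tgt_n, both_n

def best_target(word, src_n, tgt_n, both_n):
    """Pick the target word that co-occurs most strongly with `word` (Dice score)."""
    def dice(t):
        return 2 * both_n[word, t] / (src_n[word] + tgt_n[t])
    return max(tgt_n, key=dice)

def train_bigrams(sentences):
    """Count adjacent word pairs to model 'well-formed' fragments."""
    bigrams = Counter()
    for sentence in sentences:
        words = sentence.split()
        bigrams.update(zip(words, words[1:]))
    return bigrams

def translate(phrase, model, bigrams):
    """Step 1: word-by-word lookup. Step 2: reorder to the most familiar sequence."""
    rough = [best_target(w, *model) for w in phrase.split()]
    def fluency(order):
        return sum(bigrams[pair] for pair in zip(order, order[1:]))
    return " ".join(max(permutations(rough), key=fluency))

model = train_pairs(PARALLEL)
bigrams = train_bigrams(MONOLINGUAL)
print(translate("white house", model, bigrams))
```

The word-by-word lookup alone yields "blanca casa"; the bigram counts are what flip it into the order seen in the untranslated text. Trying every permutation is only workable for very short phrases.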
This game needs loads of computational horsepower for learning and testing, and a software architecture which lets Google tweak code and parameters to improve upon its previous score. So given these ingredients, Google's machine-translation strategy should be familiar to any software engineer: load the statistics, translate the examples, evaluate the translations, twiddle the system parameters, and repeat.
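That translate-evaluate-twiddle loop might look something like the sketch below. Everything in it is a stand-in: the two-sentence development set, the single weight `w` that balances a translation score against a fluency score, the bigram-overlap metric, and the simple grid search are all assumptions for illustration, not the metric or tuning method Google actually uses.

```python
# Hypothetical development set: candidate outputs with
# (translation_score, fluency_score), plus one human reference each.
DEV_SET = [
    {
        "reference": "la casa blanca",
        "candidates": {
            "la blanca casa": (0.9, 0.1),
            "la casa blanca": (0.8, 0.9),
        },
    },
    {
        "reference": "la mesa vieja",
        "candidates": {
            "la vieja mesa": (0.9, 0.2),
            "la mesa vieja": (0.7, 0.8),
        },
    },
]

def bigram_overlap(output, reference):
    """Toy quality number: fraction of reference word pairs found in the output."""
    ref_words, out_words = reference.split(), output.split()
    ref_pairs = list(zip(ref_words, ref_words[1:]))
    out_pairs = set(zip(out_words, out_words[1:]))
    if not ref_pairs:
        return 0.0
    return sum(pair in out_pairs for pair in ref_pairs) / len(ref_pairs)

def decode(candidates, w):
    """Translate: pick the candidate with the best weighted score."""
    def combined(text):
        translation_score, fluency_score = candidates[text]
        return (1 - w) * translation_score + w * fluency_score
    return max(candidates, key=combined)

def evaluate(w):
    """Evaluate: average metric over the whole development set."""
    scores = [
        bigram_overlap(decode(item["candidates"], w), item["reference"])
        for item in DEV_SET
    ]
    return sum(scores) / len(scores)

def tune(steps=10):
    """Twiddle and repeat: keep the first weight that beats the best score so far."""
    best_w, best_score = 0.0, evaluate(0.0)
    for i in range(1, steps + 1):
        w = i / steps
        score = evaluate(w)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

best_w, best_score = tune()
print(best_w, best_score)
```

On this toy data the search settles on a weight of 0.3, the first setting at which both sentences come out in the reference order. Note that the loop can only ever be as good as the number `bigram_overlap` hands back.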
What is clearly missing from this approach is any form of "understanding". The machine has no idea that "walk" is an action using "feet," except when its statistics tell it the text strings "walk" and "feet" sometimes show up together. Nor does it know the subtle differences between "to boycott" and "not to attend." Och emphasized that the system does not even represent nouns, verbs, modifiers, or any of the grammatical building blocks we think of as language. In fact, he says, "linguists think our structures are weird" - but he demurred on actually describing them. His machine contains only statistical correlations and relationships, no more or less than "what is in the data." Each word and phrase in the source votes for various phrases in the output, and the final result is a kind of tallying of those myriad votes.
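Since Och would not describe the real structures, the following is only a guess at what "voting" could look like in its simplest form. The vote table, its weights, and the `tally` function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical vote table: source phrase -> {output phrase: vote weight}.
VOTES = {
    "white": {"blanca": 0.6, "blanco": 0.4},
    "house": {"casa": 0.9, "hogar": 0.1},
    "white house": {"casa blanca": 0.95},
}

def tally(source):
    """Let every contiguous word span in the source vote, then rank the totals."""
    words = source.split()
    spans = [
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    ]
    totals = defaultdict(float)
    for span in spans:
        for output, weight in VOTES.get(span, {}).items():
            totals[output] += weight
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

for output, total in tally("white house"):
    print(output, total)
```

Nothing in the table says that "white" is an adjective or "house" a noun; the ranking falls out of the numbers alone.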
good start but needs more
Google's approach is a good one. Translation is very similar to code breaking, so use similar algorithms.
However, when you already know things about the languages, you can incorporate this knowledge. For example, give it a dictionary and thesaurus, and teach it a little about grammar in each language. Then it can put things in (some sort of) context.
But let's look at it this way. Assuming there is life outside of this planet, and we someday meet them, how do we communicate? Would this approach not be a way to get the very first insights into the way they communicate? Sure, it wouldn't be perfect, but it would help.
It will never be perfect. I do believe that language is based on hard and fast rules, but humans don't like rules. It's like my music composition teacher said, "You've got to know the rules, THEN you can break them". We continually go against the rules with language, make up new words, say things wrong. Computers won't keep up with that, but Google's translator can still do its job: giving you a rough guide to what is said.
Rules, yes, but self-adapting rules, and not rules in the form of what most people would consider as "grammar". Language operates at a much deeper level, as you can see from the fact that good translations hardly ever reproduce the most apparent grammatical structures of the original text.
On the UN producing "expert" translation, I wouldn't count on it. Most UN and EU translations are better than machine translation in degree only, not in essence. They are by and large atrociously overliteral, and have little in common with natural language.
If language is algorithmic at all (and I don't think it is), it can only be so at a degree of complexity that defies reverse engineering along the lines of an electronic translator. Nobody has ever come close to writing a full grammar of any language, and I suspect the very nature of language (total open-ended versatility) is such that no such grammar can exist. This is because meaning is not encapsulated in the words of the speaker but revealed solely in the response of the listener. Words only mean what people take them to mean.
That is the first insurmountable problem for electronic translation. The second is that meaning is distributed across huge expanses of discourse. In the case of spoken language, it is distributed beyond phonetics into prosody, then beyond prosody into gesture. Written language uses a whole panoply of devices to simulate the effects of prosody and even gesture, and I don't see how an algorithmic approach could possibly allow for this.
Time flies vs fruit flies and white house vs casa blanca
These are examples of the ambiguity of language. The first is a case of the same string of letters representing different words (which may or may not have the same pronunciation). I remember a science fiction story I read years ago in which a blueprint of a device was written in Russian. Due to compartmentalization restrictions, the Russian wording was copied into a word list, which someone then translated from Russian to English and passed on to others who could analyze parts of the translated blueprint (no one saw the complete blueprint, just sections of it). Because the words were translated with no context, there were problems, such as the string "lead" turning up both as the metal and as the technical term for a wire (i.e. "lead lead", a wire made of lead). The flies example uses the string "flies" both as a verb to connote movement and to designate a class of insect (modified by "fruit"). Words cannot be translated or interpreted in isolation; they need to be viewed in context so that the proper meaning is assigned to them for the purposes of the translation. Science fiction writer Piers Anthony makes use of this type of wordplay in his Xanth series.
As to white house and casa blanca, there was an incident during WWII in which all the Allied leaders held a secret meeting in Casa Blanca (where they could have been attacked and killed by the Germans had the meeting plans become known). As it happened, a spy reported the plans to the Germans, but due to encoding, decoding, and translation into German, the reference to the meeting being held in Casa Blanca ended up being reported to the spy's controller as being held in [the] White House (i.e. Washington DC, USA).