How Google translates without understanding

Most of the right words, in mostly the right order

  • alert
  • submit to reddit

The Power of One Infographic

Column After just a couple years of practice, Google can claim to produce the best computer-generated language translations in the world - in languages their boffin creators don't even understand.

Last summer, Google took top honors at a bake-off competition sponsored by the American agency NIST between machine-translation engines, besting IBM in English-Arabic and English-Chinese. The crazy part is that no one on the Google team even understands those languages.... the automatic-translation engines they constructed triumphed by sheer brute-force statistical extrapolation rather than "understanding".

I spoke with Franz Och, Google's enthusiastic machine-translation guru, about this unusual new approach.

Sixty years of failure

Ever since the the Second World War there have been two competing approaches to automatic translation: expert rules vs. statistical deciphering.

Expert-rule buffs have tried to automate the grammar-school approach of diagramming sentences (using modifiers, phrases, and clauses): for example, "I visited (the house next to (the park) )." But like other optimistic software efforts, the exact rules foundered on the ambiguities of real human languages. (Think not? Try explaining this sentence: "Time flies like an arrow, but fruit flies like a banana.")

The competing statistical approach began with cryptography: treat the second language as an unknown code, and use statistical cues to find a mathematical formula to decode it, like the Allies did with Hitler's famous Enigma code. While those early "decipering" efforts foundered on a lack of computing power, they have been resurrected in the "Statistical Machine Translation" approach used by Google, which eschews strict rules in favor of noticing the statistical correlations between "white house" and "casa blanca." Statistics deals with ambiguity better than rules do, it turns out.

Under Google's hood

The Google approach is a lesson in practical software development: try things and see what sticks. It has just a few major steps:

  1. Google starts with lots and lots of paired-example texts, like formal documents from the United Nations, in which identical content is expertly translated into many different languages. With these documents they can discover that "white house" tends to co-occur with "casa blanca," so that the next time they have to translate a text containing "white house" they will tend to use "casa blanca" in the output.
  2. They have even more untranslated text in each language, which lets them make models of "well-formed" sentence fragments (for example, preferring "white house" to "house white"). So the raw output from the first translation step can be further massaged into (statistically) nicer-sounding text.
  3. Their key for improving the system - and winning competitions - is an automated performance metric, which assigns a translation quality number to each translation attempt. More on this fatally weak link below.

This game needs loads of computational horsepower for learning and testing, and a software architecture which lets Google tweak code and parameters to improve upon its previous score. So given these ingredients, Google's machine-translation strategy should be familiar to any software engineer: load the statistics, translate the examples, evaluate the translations, twiddle the system parameters, and repeat.

What is clearly missing from this approach is any form of "understanding". The machine has no idea that "walk" is an action using "feet," except when its statistics tell it the text strings "walk" and "feet" sometimes show up together. Nor does it know the subtle differences between "to boycott" and "not to attend." Och emphasized that the system does not even represent nouns, verbs, modifiers, or any of the grammatical building blocks we think of as language. In fact, he says, "linguists think our structures are weird" - but he demurred on actually describing them. His machine contains only statistical correlations and relationships, no more or less than "what is in the data." Each word and phrase in the source votes for various phrases in the output, and the final result is a kind of tallying of those myriad votes.

The Power of One Brief: Top reasons to choose HP BladeSystem

More from The Register

next story
That AMAZING Windows comeback: Wow – 0.5% growth in 2015
Whoooah, my face is going all floppy with the speed
Think Google Glass is creepy? Wait until it READS YOUR MIND
Startup penetrates the mind of Glassholes
Linux turns the crank on code for cars
Got a feel for your automobile
Google policy wonk patronises Brits over EU search biz probe
Downgrading rivals? Whetstone: 'I just don't think it's really true'
Victim of Tor-hidden revenge smut site sues Tor Project developers
But EFF lawyer says deep-web team 'no more liable' than web server makers
Chrome Remote Desktop adds Linux to supported OS list
Drive Debian from the confines of a Chromebook
prev story


Seven Steps to Software Security
Seven practical steps you can begin to take today to secure your applications and prevent the damages a successful cyber-attack can cause.
Top 8 considerations to enable and simplify mobility
In this whitepaper learn how to successfully add mobile capabilities simply and cost effectively.
Eight steps to building an HP BladeSystem
Building your ideal BladeSystem infrastructure solution begins with eight simple steps, outlined in this whitepaper.
Securing Web Applications Made Simple and Scalable
Learn how automated security testing can provide a simple and scalable way to protect your web applications.
Build a Business Case: Developing Custom Apps
In this whitepaper learn how to maximize the value of custom applications by accelerating and simplifying their development.