How Google translates without understanding

Most of the right words, in mostly the right order

  • alert
  • submit to reddit

Build a business case: developing custom apps

Column After just a couple years of practice, Google can claim to produce the best computer-generated language translations in the world - in languages their boffin creators don't even understand.

Last summer, Google took top honors at a bake-off competition sponsored by the American agency NIST between machine-translation engines, besting IBM in English-Arabic and English-Chinese. The crazy part is that no one on the Google team even understands those languages.... the automatic-translation engines they constructed triumphed by sheer brute-force statistical extrapolation rather than "understanding".

I spoke with Franz Och, Google's enthusiastic machine-translation guru, about this unusual new approach.

Sixty years of failure

Ever since the the Second World War there have been two competing approaches to automatic translation: expert rules vs. statistical deciphering.

Expert-rule buffs have tried to automate the grammar-school approach of diagramming sentences (using modifiers, phrases, and clauses): for example, "I visited (the house next to (the park) )." But like other optimistic software efforts, the exact rules foundered on the ambiguities of real human languages. (Think not? Try explaining this sentence: "Time flies like an arrow, but fruit flies like a banana.")

The competing statistical approach began with cryptography: treat the second language as an unknown code, and use statistical cues to find a mathematical formula to decode it, like the Allies did with Hitler's famous Enigma code. While those early "decipering" efforts foundered on a lack of computing power, they have been resurrected in the "Statistical Machine Translation" approach used by Google, which eschews strict rules in favor of noticing the statistical correlations between "white house" and "casa blanca." Statistics deals with ambiguity better than rules do, it turns out.

Under Google's hood

The Google approach is a lesson in practical software development: try things and see what sticks. It has just a few major steps:

  1. Google starts with lots and lots of paired-example texts, like formal documents from the United Nations, in which identical content is expertly translated into many different languages. With these documents they can discover that "white house" tends to co-occur with "casa blanca," so that the next time they have to translate a text containing "white house" they will tend to use "casa blanca" in the output.
  2. They have even more untranslated text in each language, which lets them make models of "well-formed" sentence fragments (for example, preferring "white house" to "house white"). So the raw output from the first translation step can be further massaged into (statistically) nicer-sounding text.
  3. Their key for improving the system - and winning competitions - is an automated performance metric, which assigns a translation quality number to each translation attempt. More on this fatally weak link below.

This game needs loads of computational horsepower for learning and testing, and a software architecture which lets Google tweak code and parameters to improve upon its previous score. So given these ingredients, Google's machine-translation strategy should be familiar to any software engineer: load the statistics, translate the examples, evaluate the translations, twiddle the system parameters, and repeat.

What is clearly missing from this approach is any form of "understanding". The machine has no idea that "walk" is an action using "feet," except when its statistics tell it the text strings "walk" and "feet" sometimes show up together. Nor does it know the subtle differences between "to boycott" and "not to attend." Och emphasized that the system does not even represent nouns, verbs, modifiers, or any of the grammatical building blocks we think of as language. In fact, he says, "linguists think our structures are weird" - but he demurred on actually describing them. His machine contains only statistical correlations and relationships, no more or less than "what is in the data." Each word and phrase in the source votes for various phrases in the output, and the final result is a kind of tallying of those myriad votes.

Bridging the IT gap between rising business demands and ageing tools

More from The Register

next story
NO MORE ALL CAPS and other pleasures of Visual Studio 14
Unpicking a packed preview that breaks down ASP.NET
Cheer up, Nokia fans. It can start making mobes again in 18 months
The real winner of the Nokia sale is *drumroll* ... Nokia
Mozilla fixes CRITICAL security holes in Firefox, urges v31 upgrade
Misc memory hazards 'could be exploited' - and guess what, one's a Javascript vuln
Put down that Oracle database patch: It could cost $23,000 per CPU
On-by-default INMEMORY tech a boon for developers ... as long as they can afford it
Google shows off new Chrome OS look
Athena springs full-grown from Chromium project's head
Apple: We'll unleash OS X Yosemite beta on the MASSES on 24 July
Starting today, regular fanbois will be guinea pigs, it tells Reg
HIDDEN packet sniffer spy tech in MILLIONS of iPhones, iPads – expert
Don't panic though – Apple's backdoor is not wide open to all, guru tells us
prev story


Designing a Defense for Mobile Applications
Learn about the various considerations for defending mobile applications - from the application architecture itself to the myriad testing technologies.
Implementing global e-invoicing with guaranteed legal certainty
Explaining the role local tax compliance plays in successful supply chain management and e-business and how leading global brands are addressing this.
Top 8 considerations to enable and simplify mobility
In this whitepaper learn how to successfully add mobile capabilities simply and cost effectively.
Seven Steps to Software Security
Seven practical steps you can begin to take today to secure your applications and prevent the damages a successful cyber-attack can cause.
Boost IT visibility and business value
How building a great service catalog relieves pressure points and demonstrates the value of IT service management.