Feeds

How Google translates without understanding

Most of the right words, in mostly the right order

  • alert
  • submit to reddit

Build a business case: developing custom apps

Column After just a couple years of practice, Google can claim to produce the best computer-generated language translations in the world - in languages their boffin creators don't even understand.

Last summer, Google took top honors at a bake-off competition sponsored by the American agency NIST between machine-translation engines, besting IBM in English-Arabic and English-Chinese. The crazy part is that no one on the Google team even understands those languages.... the automatic-translation engines they constructed triumphed by sheer brute-force statistical extrapolation rather than "understanding".

I spoke with Franz Och, Google's enthusiastic machine-translation guru, about this unusual new approach.

Sixty years of failure

Ever since the the Second World War there have been two competing approaches to automatic translation: expert rules vs. statistical deciphering.

Expert-rule buffs have tried to automate the grammar-school approach of diagramming sentences (using modifiers, phrases, and clauses): for example, "I visited (the house next to (the park) )." But like other optimistic software efforts, the exact rules foundered on the ambiguities of real human languages. (Think not? Try explaining this sentence: "Time flies like an arrow, but fruit flies like a banana.")

The competing statistical approach began with cryptography: treat the second language as an unknown code, and use statistical cues to find a mathematical formula to decode it, like the Allies did with Hitler's famous Enigma code. While those early "decipering" efforts foundered on a lack of computing power, they have been resurrected in the "Statistical Machine Translation" approach used by Google, which eschews strict rules in favor of noticing the statistical correlations between "white house" and "casa blanca." Statistics deals with ambiguity better than rules do, it turns out.

Under Google's hood

The Google approach is a lesson in practical software development: try things and see what sticks. It has just a few major steps:

  1. Google starts with lots and lots of paired-example texts, like formal documents from the United Nations, in which identical content is expertly translated into many different languages. With these documents they can discover that "white house" tends to co-occur with "casa blanca," so that the next time they have to translate a text containing "white house" they will tend to use "casa blanca" in the output.
  2. They have even more untranslated text in each language, which lets them make models of "well-formed" sentence fragments (for example, preferring "white house" to "house white"). So the raw output from the first translation step can be further massaged into (statistically) nicer-sounding text.
  3. Their key for improving the system - and winning competitions - is an automated performance metric, which assigns a translation quality number to each translation attempt. More on this fatally weak link below.

This game needs loads of computational horsepower for learning and testing, and a software architecture which lets Google tweak code and parameters to improve upon its previous score. So given these ingredients, Google's machine-translation strategy should be familiar to any software engineer: load the statistics, translate the examples, evaluate the translations, twiddle the system parameters, and repeat.

What is clearly missing from this approach is any form of "understanding". The machine has no idea that "walk" is an action using "feet," except when its statistics tell it the text strings "walk" and "feet" sometimes show up together. Nor does it know the subtle differences between "to boycott" and "not to attend." Och emphasized that the system does not even represent nouns, verbs, modifiers, or any of the grammatical building blocks we think of as language. In fact, he says, "linguists think our structures are weird" - but he demurred on actually describing them. His machine contains only statistical correlations and relationships, no more or less than "what is in the data." Each word and phrase in the source votes for various phrases in the output, and the final result is a kind of tallying of those myriad votes.

Boost IT visibility and business value

More from The Register

next story
KDE releases ice-cream coloured Plasma 5 just in time for summer
Melty but refreshing - popular rival to Mint's Cinnamon's still a work in progress
Leaked Windows Phone 8.1 Update specs tease details of Nokia's next mobes
New screen sizes, dual SIMs, voice over LTE, and more
Mozilla keeps its Beard, hopes anti-gay marriage troubles are now over
Plenty on new CEO's todo list – starting with Firefox's slipping grasp
Apple: We'll unleash OS X Yosemite beta on the MASSES on 24 July
Starting today, regular fanbois will be guinea pigs, it tells Reg
Another day, another Firefox: Version 31 is upon us ALREADY
Web devs, Mozilla really wants you to like this one
Secure microkernel that uses maths to be 'bug free' goes open source
Hacker-repelling, drone-protecting code will soon be yours to tweak as you see fit
Cloudy CoreOS Linux distro declares itself production-ready
Lightweight, container-happy Linux gets first Stable release
prev story

Whitepapers

Implementing global e-invoicing with guaranteed legal certainty
Explaining the role local tax compliance plays in successful supply chain management and e-business and how leading global brands are addressing this.
Boost IT visibility and business value
How building a great service catalog relieves pressure points and demonstrates the value of IT service management.
Why and how to choose the right cloud vendor
The benefits of cloud-based storage in your processes. Eliminate onsite, disk-based backup and archiving in favor of cloud-based data protection.
The Essential Guide to IT Transformation
ServiceNow discusses three IT transformations that can help CIO's automate IT services to transform IT and the enterprise.
Maximize storage efficiency across the enterprise
The HP StoreOnce backup solution offers highly flexible, centrally managed, and highly efficient data protection for any enterprise.