Feeds

How Google translates without understanding

Most of the right words, in mostly the right order

  • alert
  • submit to reddit

Combat fraud and increase customer satisfaction

Column After just a couple years of practice, Google can claim to produce the best computer-generated language translations in the world - in languages their boffin creators don't even understand.

Last summer, Google took top honors at a bake-off competition sponsored by the American agency NIST between machine-translation engines, besting IBM in English-Arabic and English-Chinese. The crazy part is that no one on the Google team even understands those languages.... the automatic-translation engines they constructed triumphed by sheer brute-force statistical extrapolation rather than "understanding".

I spoke with Franz Och, Google's enthusiastic machine-translation guru, about this unusual new approach.

Sixty years of failure

Ever since the the Second World War there have been two competing approaches to automatic translation: expert rules vs. statistical deciphering.

Expert-rule buffs have tried to automate the grammar-school approach of diagramming sentences (using modifiers, phrases, and clauses): for example, "I visited (the house next to (the park) )." But like other optimistic software efforts, the exact rules foundered on the ambiguities of real human languages. (Think not? Try explaining this sentence: "Time flies like an arrow, but fruit flies like a banana.")

The competing statistical approach began with cryptography: treat the second language as an unknown code, and use statistical cues to find a mathematical formula to decode it, like the Allies did with Hitler's famous Enigma code. While those early "decipering" efforts foundered on a lack of computing power, they have been resurrected in the "Statistical Machine Translation" approach used by Google, which eschews strict rules in favor of noticing the statistical correlations between "white house" and "casa blanca." Statistics deals with ambiguity better than rules do, it turns out.

Under Google's hood

The Google approach is a lesson in practical software development: try things and see what sticks. It has just a few major steps:

  1. Google starts with lots and lots of paired-example texts, like formal documents from the United Nations, in which identical content is expertly translated into many different languages. With these documents they can discover that "white house" tends to co-occur with "casa blanca," so that the next time they have to translate a text containing "white house" they will tend to use "casa blanca" in the output.
  2. They have even more untranslated text in each language, which lets them make models of "well-formed" sentence fragments (for example, preferring "white house" to "house white"). So the raw output from the first translation step can be further massaged into (statistically) nicer-sounding text.
  3. Their key for improving the system - and winning competitions - is an automated performance metric, which assigns a translation quality number to each translation attempt. More on this fatally weak link below.

This game needs loads of computational horsepower for learning and testing, and a software architecture which lets Google tweak code and parameters to improve upon its previous score. So given these ingredients, Google's machine-translation strategy should be familiar to any software engineer: load the statistics, translate the examples, evaluate the translations, twiddle the system parameters, and repeat.

What is clearly missing from this approach is any form of "understanding". The machine has no idea that "walk" is an action using "feet," except when its statistics tell it the text strings "walk" and "feet" sometimes show up together. Nor does it know the subtle differences between "to boycott" and "not to attend." Och emphasized that the system does not even represent nouns, verbs, modifiers, or any of the grammatical building blocks we think of as language. In fact, he says, "linguists think our structures are weird" - but he demurred on actually describing them. His machine contains only statistical correlations and relationships, no more or less than "what is in the data." Each word and phrase in the source votes for various phrases in the output, and the final result is a kind of tallying of those myriad votes.

Combat fraud and increase customer satisfaction

More from The Register

next story
This time it's 'Personal': new Office 365 sub covers just two devices
Redmond also brings Office into Google's back yard
Batten down the hatches, Ubuntu 14.04 LTS due in TWO DAYS
Admins dab straining server brows in advance of Trusty Tahr's long-term support landing
Microsoft lobs pre-release Windows Phone 8.1 at devs who dare
App makers can load it before anyone else, but if they do they're stuck with it
Half of Twitter's 'active users' are SILENT STALKERS
Nearly 50% have NEVER tweeted a word
Oh no, Joe: WinPhone users already griping over 8.1 mega-update
Hang on. Which bit of Developer Preview don't you understand?
Internet-of-stuff startup dumps NoSQL for ... SQL?
NoSQL taste great at first but lacks proper nutrients, says startup cloud whiz
Windows 8.1, which you probably haven't upgraded to yet, ALREADY OBSOLETE
Pre-Update versions of new Windows version will no longer support patches
Microsoft TIER SMEAR changes app prices whether devs ask or not
Some go up, some go down, Redmond goes silent
Red Hat to ship RHEL 7 release candidate with a taste of container tech
Grab 'near-final' version of next Enterprise Linux next week
prev story

Whitepapers

Designing a defence for mobile apps
In this whitepaper learn the various considerations for defending mobile applications; from the mobile application architecture itself to the myriad testing technologies needed to properly assess mobile applications risk.
3 Big data security analytics techniques
Applying these Big Data security analytics techniques can help you make your business safer by detecting attacks early, before significant damage is done.
Five 3D headsets to be won!
We were so impressed by the Durovis Dive headset we’ve asked the company to give some away to Reg readers.
The benefits of software based PBX
Why you should break free from your proprietary PBX and how to leverage your existing server hardware.
Securing web applications made simple and scalable
In this whitepaper learn how automated security testing can provide a simple and scalable way to protect your web applications.