Google's robot army learns Spanish

La rebelión de las máquinas

If you want to learn another language, you need to spend time in the country, talk to people, get drunk and attempt to order complex drinks, and eventually read that country's great works of literature – unless you're Google, that is.

In a recent paper, three Googlers outlined a new approach to machine-based translation that uses the Chocolate Factory's weapons of choice: masses and masses of data, and neural networks.

The paper, "Exploiting Similarities among Languages for Machine Translation", shows how Google is able to use a small dictionary of pairs of words in two languages to train a network that can infer missing dictionary entries.

"Our method can translate missing word and phrase entries by learning language structures based on large monolingual data and mapping between languages from small bilingual data," they write. "This method makes little assumption about the languages, so it can be used to extend and refine dictionaries and translation tables for any language pairs."

The system works by representing individual words as vectors, then projecting a word's vector from the source language's vector space into the target language's and swapping in the word whose vector lies closest to the projection.
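To make the project-and-swap step concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the two-dimensional vectors, the tiny vocabularies, the matrix `W` (a simple 90-degree rotation standing in for a learned mapping) and the helper names `cosine`, `project` and `translate`. Real word vectors have hundreds of dimensions and are learned from text, not typed in by hand.

```python
import math

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def project(W, x):
    """Apply the matrix W to the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def translate(word, src_vecs, tgt_vecs, W):
    """Project the source word's vector and return the nearest target word."""
    z = project(W, src_vecs[word])
    return max(tgt_vecs, key=lambda t: cosine(z, tgt_vecs[t]))

# Toy English vectors (invented for this sketch).
en = {"cat": [1.0, 0.2], "dog": [0.8, 0.6], "horse": [0.1, 1.0]}

# Toy Spanish vectors: the same layout rotated by 90 degrees,
# mimicking two spaces with similar geometry.
es = {"gato": [-0.2, 1.0], "perro": [-0.6, 0.8], "caballo": [-1.0, 0.1]}

# Hypothetical mapping: rotation by 90 degrees, (x, y) -> (-y, x).
W = [[0.0, -1.0], [1.0, 0.0]]

for word in en:
    print(word, "->", translate(word, en, es, W))
```

In this toy the mapping is handed over ready-made; the interesting part of the research is learning it from a small dictionary.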

It works because, the researchers explain, "all common languages share concepts that are grounded in the real world (such as that cat is an animal smaller than a dog)," and so "there is often a strong similarity between the vector spaces."

Google's technology relies on the Skip-gram or Continuous Bag-of-Words (CBOW) models proposed by Googlers in another, earlier paper, which found that simple arithmetic on word vectors could capture relationships between words. "For example, vector operations 'king' - 'man' + 'woman' results in a vector that is close to 'queen'."
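The king-and-queen arithmetic can be tried out on toy numbers. The sketch below is not taken from either paper: the two-dimensional vectors (roughly "royalty" and "gender" axes), the word list and the helper name `analogy` are all invented here, and the choice to skip the three input words when searching for the nearest match is a common convention that is assumed rather than confirmed by the article.

```python
import math

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Toy vectors: first axis loosely "royalty", second loosely "gender".
vectors = {
    "king":   [0.9, 0.8],
    "queen":  [0.9, -0.8],
    "man":    [0.1, 0.8],
    "woman":  [0.1, -0.8],
    "prince": [0.8, 0.7],
    "apple":  [-0.5, 0.1],
}

def analogy(a, b, c, vecs):
    """Return the word whose vector is closest to vecs[a] - vecs[b] + vecs[c]."""
    target = [x - y + z for x, y, z in zip(vecs[a], vecs[b], vecs[c])]
    # Assumption: the three query words are excluded from the candidates.
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vecs[w]))

print(analogy("king", "man", "woman", vectors))
```

With these hand-picked numbers the nearest remaining word is "queen"; with real learned vectors the result is only close, not exact.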

These models let Google create neural network models that learn high-quality word vectors from vast datasets, and do so in a less compute-intensive way than ever before. This lets the company scale up the model far beyond previous limits.

"Using the DistBelief distributed framework, it should be possible to train the CBOW and Skip-gram models even on corpora with one trillion words, for basically unlimited size of the vocabulary," they wrote at the time. "That is several orders of magnitude larger than the best previously published results for similar models."

Now the team has put these models to work figuring out the relationships between words, and inferring the vector representation of a word's counterpart in another language.

"Thus, if we know the translation of one and four from English to Spanish, we can learn the transformation matrix that can help us to translate even the other numbers to Spanish," they write.
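The one-and-four example can be sketched as a small linear-algebra exercise. The assumptions are loud ones: the vectors are two-dimensional and invented, the Spanish space is constructed as an exact rotation of the English one, and with only two known pairs in two dimensions the matrix is pinned down by solving a 2x2 system exactly. A real system would fit the matrix approximately over many word pairs in hundreds of dimensions. The helper names `learn_matrix` and `translate` are hypothetical.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def inverse_2x2(M):
    """Invert a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("training vectors are not linearly independent")
    return [[d / det, -b / det], [-c / det, a / det]]

def learn_matrix(pairs, src, tgt):
    """Solve W X = Z for W, where the columns of X and Z are the known pairs."""
    (s1, t1), (s2, t2) = pairs
    X = [[src[s1][0], src[s2][0]], [src[s1][1], src[s2][1]]]
    Z = [[tgt[t1][0], tgt[t2][0]], [tgt[t1][1], tgt[t2][1]]]
    return matmul(Z, inverse_2x2(X))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def translate(word, src, tgt, W):
    """Project the source vector with W and pick the nearest target word."""
    z = [sum(w * x for w, x in zip(row, src[word])) for row in W]
    return max(tgt, key=lambda t: cosine(z, tgt[t]))

# Toy vectors (invented): Spanish is English rotated by 90 degrees.
en = {"one": [1.0, 1.0], "two": [2.0, 1.0], "three": [3.0, 1.0],
      "four": [4.0, 1.0], "five": [5.0, 1.0]}
es = {"uno": [-1.0, 1.0], "dos": [-1.0, 2.0], "tres": [-1.0, 3.0],
      "cuatro": [-1.0, 4.0], "cinco": [-1.0, 5.0]}

# The "small dictionary": only one and four are known.
W = learn_matrix([("one", "uno"), ("four", "cuatro")], en, es)

for word in ("two", "three", "five"):
    print(word, "->", translate(word, en, es, W))
```

The matrix recovered from just two pairs is the rotation used to build the toy data, which is why the remaining numbers land on their Spanish counterparts. Real vector spaces are only approximately related this way, so real results are correspondingly approximate.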

The technique also works, with high degrees of accuracy, for languages far more alien to each other, such as English and Czech, and English and Vietnamese.

"In particular, our work can be used to enrich and improve existing dictionaries and phrase tables, which would in turn lead to improvement of the current state-of-the-art machine translation systems," they write. "Clearly, there is still much to be explored."

In other words, get tweaking the CV, translators, because Google's algo-army is coming for you. Comprender? ®
