Do you Word2Vec? Google's neural-network bookworm

Making machines eat our words

Pre-trained models

It’s for this reason that pre-trained models are available. You’ll find models trained on the whole of Google News with a vocabulary of three million words, or on Wikipedia with a vocabulary of 174,000 words. Be warned, though: even the models aren’t small. The Google model I mentioned is more than a gigabyte in size.
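To see why a three-million-word model is so hefty, here is a rough back-of-envelope calculation. The vocabulary size comes from the figure above; the 300 dimensions per word and 4-byte floats are my assumptions, not figures stated here, so treat the result as an estimate of the uncompressed vector table rather than the size of any particular download.

```python
# Back-of-envelope size of an uncompressed word-vector table.
# ASSUMPTIONS (not stated in the article): 300 dimensions per word and
# 32-bit (4-byte) floats. Only the vocabulary size is taken from the text.
VOCAB_SIZE = 3_000_000   # words in the Google News vocabulary
DIMENSIONS = 300         # assumed length of each word vector
BYTES_PER_FLOAT = 4      # assumed 32-bit floating-point values


def embedding_bytes(vocab_size, dimensions, bytes_per_value=BYTES_PER_FLOAT):
    """Raw bytes needed for the vectors alone, ignoring the word strings."""
    return vocab_size * dimensions * bytes_per_value


size = embedding_bytes(VOCAB_SIZE, DIMENSIONS)
print(f"{size / 1e9:.1f} GB")  # 3.6 GB
```

Under those assumptions the vectors alone come to several gigabytes once loaded, which squares with a download of more than a gigabyte if the file is compressed.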

One thing you need to be on your guard against in these datasets is bias. Both the Google and Wikipedia sets will be full of American words, and their meanings and spellings could cause difficulties. A petrol station in the US is not called a petrol station, for example, and let’s not get started on “z”s for “s”s.

Any training set also needs careful cleaning, because misspelled words will be learned along with any gender or racial bias in the text. We’ve seen how artificial intelligence let loose on the internet can go badly wrong in a matter of days, for instance.

So far this is something of a parlour trick – doing maths with words is fun (especially the rock bands example), but possibly not much use. It turns out that Word2Vec might not be much use on its own, but combined with other well-known techniques it can give powerful insights into datasets.
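For anyone who wants to see the "maths with words" trick stripped to its bones, here is a minimal sketch. The vectors are hand-made toys of my own invention (three made-up axes for royalty, maleness and femaleness), and the `analogy` helper is hypothetical; real Word2Vec vectors are learned from text and have hundreds of dimensions, so this only illustrates the arithmetic, not the training.

```python
import math

# Hand-made toy vectors (hypothetical, NOT real Word2Vec output).
# Axes, purely for illustration: royalty, maleness, femaleness.
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def analogy(positive, negative, vectors=VECTORS):
    """Return the word nearest to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Skip the query words themselves, as is usual for this kind of lookup.
    exclude = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy(["king", "woman"], ["man"]))  # queen
```

The answer is simply whichever remaining word points in the direction closest to the result of the sum, which is why the quality of the vectors matters far more than the arithmetic.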

I’ve seen it used with dimensionality reduction to give an insight into job adverts and how they relate to each other, allowing the tracking of industrial sectors across the adverts or examining how job seniority is linked to job titles in the advert. Word2Vec, along with other machine learning techniques, has been used to evaluate a dataset for detecting fake news: one 3MB set of data classifies short statements from PolitiFact.com as pants-fire, false, barely true, half-true, mostly true, or true.

Going further, the technique could help in the detection of rumours on social media, although it may not be able to work for newly fabricated rumours as they would be outside the initial training set, and so not picked up. Rumours are one thing, but what about deliberate attempts to deceive, such as fake reviews created automatically on merchants’ websites? Some work has been done to use Word2Vec to detect these reviews and presumably remove them from the webpage.

Word2Vec can be extended by Doc2Vec, which can take whole sentences or even documents and find similarities between them. University of Dundee researchers have applied it to Parliamentary debates to extract the character attacks (and support) between politicians. It does have limitations of course, and there remains some basic work to be done. “The honourable gentleman is right” and “Is the right honourable gentleman” are detected as similar, but they are not.
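One plausible reason those two phrases look alike is that they contain exactly the same words in a different order. The sketch below is not Doc2Vec; it swaps in a plain bag-of-words count, which ignores order entirely, as a simpler stand-in to show how far word content alone can mislead. The function names are my own.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lower-case the sentence and count its words, discarding their order."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


statement = "The honourable gentleman is right"
question = "Is the right honourable gentleman"

# Both phrases reduce to the same five words once order is thrown away.
same_words = bag_of_words(statement) == bag_of_words(question)
similarity = cosine(bag_of_words(statement), bag_of_words(question))
print(same_words, round(similarity, 6))  # True 1.0
```

Any representation that leans mostly on which words appear, rather than where, will struggle to tell agreement from a form of address here.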

Clearly, Word2Vec is a great tool to have. Just beware. It lacks a grasp of the basic world knowledge needed for artificial intelligence, which might be why it is being used as an input for more advanced neural nets. That might turn out to be genuinely useful. ®




Biting the hand that feeds IT © 1998–2018