Googlers devise DeViSE: A thing-recognising FRANKENBRAIN
Machine-learning tech glues together image eyeballing and text grokking
You'd think guesswork and advanced science would be natural enemies, but not at Google where a crack team of researchers are trying to mate the two together.
In a paper presented on Monday at an artificial-intelligence conference in California, seven Google researchers outlined their image classifier, software that labels pictures by identifying what's in them. It was created by fusing two distinct machine-learning approaches together.
In short, the system can make an educated guess at identifying an unfamiliar picture based on the text labels offered to it. For example, if it was shown a photo of a black Victorian top hat it hadn't seen before, and asked if it was a black Victorian top hat or a black pedal-opened wastepaper bin – both labels it also hadn't heard of before – it could guess correctly because it knows what various other hats and garbage bins look like and knows the relationships between their labels.
The paper, "DeViSE: A Deep Visual-Semantic Embedding Model", describes a tech that strives to combine the eerie image-recognition capabilities of Google's traditional weak-AI systems with the broad semantic-modeling capabilities of its "Skip-gram" text classifiers.
This approach is called "zero-shot learning", and is seen by the Google brain trust (which includes MapReduce-creator Jeff Dean) as one of the best chances of designing systems that can deal with changeable datasets with poor classifications – in other words, the info Google's growing fleet of handheld or wheel-bound electronic eyes are likely to slurp up from the world around them.
"The goals of this work are to develop a vision model that makes semantically relevant predictions even when it makes errors and generalizes to classes outside of its labeled training set," they write.
DeViSE contains two elements: a text classifier that labels text based on its contents, and an object recognizer that studies images.
The text classifier trains a neural language model on 5.7 million documents, comprising 5.4 billion words, slurped from Wikipedia. The approach lets the tech convert the fuzzy world of language into a numeric vector space in which each word is defined by its relationships with others.
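To picture what "defined by its relationships with others" means in practice, here is a minimal Python sketch. It assumes each word is stored as a list of numbers and that closeness is measured by cosine similarity, a common choice for word vectors. The three-number vectors below are invented for illustration; real Skip-gram vectors are learned from text and run to hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, hand-written for this example only.
toy_vectors = {
    "tiger shark": [0.9, 0.8, 0.1],
    "bull shark": [0.8, 0.9, 0.2],
    "nuclear submarine": [0.1, 0.2, 0.9],
}

def most_similar(word, vectors):
    """Return the other word whose vector points in the closest direction."""
    others = [w for w in vectors if w != word]
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))

print(most_similar("tiger shark", toy_vectors))
```

With these made-up numbers the two sharks sit close together and the submarine sits far away, which is the kind of structure the language model is meant to learn on its own.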
The image recognizer, meanwhile, is a "state-of-the-art deep neural network for visual object recognition" that was trained to recognize some 1,000 categories of images.
Armed with these two powerful technologies, the researchers figured out a way to fuse them so that the model could use both approaches when attempting to classify a new image.
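How the fusing works is not spelled out here. One plausible reading, sketched below in Python, is that the image recognizer's output is pushed through a linear map into the word-vector space, and the map is trained so that an image scores higher against its correct label's vector than against the wrong ones. The linear map, the hinge ranking loss, the margin of 0.1 and every name and number in this sketch are assumptions made for illustration, not details confirmed by this article.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project(matrix, features):
    """Map image features into word-vector space with a linear transform."""
    return [dot(row, features) for row in matrix]

def hinge_rank_loss(projected, label_vectors, true_label, margin=0.1):
    """Penalise every wrong label that scores within `margin` of the true one."""
    true_score = dot(label_vectors[true_label], projected)
    loss = 0.0
    for label, vector in label_vectors.items():
        if label == true_label:
            continue
        loss += max(0.0, margin - true_score + dot(vector, projected))
    return loss

# Hypothetical 2-D label vectors and a 2x3 projection matrix.
label_vectors = {"hat": [1.0, 0.0], "bin": [0.0, 1.0]}
projection = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]]
image_features = [0.9, 0.2, 0.5]  # stand-in for the visual network's output

embedded = project(projection, image_features)
print(hinge_rank_loss(embedded, label_vectors, "hat"))  # correct label: no penalty
print(hinge_rank_loss(embedded, label_vectors, "bin"))  # wrong label: positive loss
```

Training would nudge the projection matrix to shrink this loss across many labelled images; that optimisation loop is left out to keep the sketch short.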
This model is marginally more accurate than today's state-of-the-art systems and is inherently more flexible. The researchers hypothesized:
A DeViSE model that was trained on images with labels like "tiger shark", "bull shark", and "blue shark", but never with images labeled simply "shark", would likely have the ability to generalize to this more coarse-grained descriptor because the language model has learned a representation of the general concept of "shark" which is similar to all of the specific sharks. Similarly, if tested on images of highly specific classes which the model happens to have never seen before, for example a photo of an oceanic whitecap shark, and asked whether the correct label is more likely "oceanic whitecap shark" or some other unfamiliar label (say, "nuclear submarine"), our model stands a fighting chance of guessing correctly because the language model ensures that representation of "oceanic whitecap shark" is closer to the representation of sharks the model has seen, while the representation of "nuclear submarine" is closer to those of other sea vessels.
Subsequent experiments detailed in the paper bore out this theory.
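The guess described in the researchers' hypothesis can be sketched in a few lines of Python. The sketch assumes the image has already been mapped into the same space as the word vectors, and that the winning label is simply the nearest one by cosine similarity. All vectors here are invented toy values, not outputs from the real model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_predict(image_embedding, candidate_labels):
    """Pick the candidate label whose word vector lies closest to the image."""
    return max(
        candidate_labels,
        key=lambda label: cosine_similarity(image_embedding, candidate_labels[label]),
    )

# Hypothetical word vectors for two labels never paired with a training image.
unseen_labels = {
    "oceanic whitecap shark": [0.85, 0.85, 0.15],
    "nuclear submarine": [0.1, 0.2, 0.9],
}

# Hypothetical embedding of a new shark photo; it lands near the sharks
# the model has seen because it looks like them.
photo_embedding = [0.8, 0.7, 0.2]

print(zero_shot_predict(photo_embedding, unseen_labels))
```

The point of the design is that no image of the new class is needed: the label's position in word space, learned from text alone, does the work.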
Google believes the system has a broad range of applications in some of the search giant's trickiest problem areas.
"We believe that our model's unusual compatibility with larger, less manicured data sets will prove to be a major strength moving forward," the nine wrote. "Though here we trained on a curated academic image dataset, our model's architecture naturally lends itself to being trained on all available images that can be annotated with any text term contained in the (larger) vocabulary. We believe that training massive "open" image datasets of this form will dramatically improve the quality of visual object categorization systems."
And once Google has honed the tech further, it could be put to work on a multitude of problems: distinguishing between broad categories such as dogs, cats, and lawnmowers, and also between specific entities, such as telling apart a "Honda Civic, Ferrari F355, Tesla Model-S", they note – capabilities that are crucial ingredients for further developments in Google's key business of highly targeted, automated advertising.
As oil is to the plastics industry, data is to Google: it is the fundamental resource on which the company depends, and the more it can refine it, the more money it can make from it. For this reason machine learning and other deep analytical approaches are a priority for Google as the ad-slinger attempts to automate the classification and tagging of an ever-swelling world of digital data. With this system it has devised another approach to let it slurp more cash from the ethereal digital world. ®