Your AI pet project is only as smart as its garbage training set

No one said it'd be easy

AI isn't immune to one of computing's most basic rules – garbage in, garbage out. Train a neural network on flawed data and you'll have one that makes lots of mistakes.

Most neural networks learn to distinguish between things by sampling different groups. This is supervised learning, and it only works if someone labels the data first so that the network knows what it's looking at.
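To see why the labels matter, here is a minimal sketch of supervised learning. It is not a neural network, just a toy nearest-neighbour classifier, and the feature values, labels and function name are invented for illustration. It can only answer "cat" or "dog" because a human attached those labels to the training examples first.

```python
# Toy supervised learning: every training example carries a human-supplied label.
# The feature values and labels are invented for illustration.
labelled_examples = [
    ((1.0, 1.2), "cat"),
    ((0.8, 1.0), "cat"),
    ((4.0, 4.2), "dog"),
    ((4.3, 3.9), "dog"),
]


def predict(features):
    """Return the label of the closest labelled example (1-nearest neighbour)."""
    def squared_distance(example):
        point, _label = example
        return sum((a - b) ** 2 for a, b in zip(point, features))

    _point, label = min(labelled_examples, key=squared_distance)
    return label


print(predict((1.1, 0.9)))  # nearest examples are labelled "cat"
print(predict((4.1, 4.0)))  # nearest examples are labelled "dog"
```

Swap the labels on a couple of training examples and the predictions go wrong in exactly the same way: garbage in, garbage out.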

But how can you find the "right" data to train your AI, and confirm its quality? Well, what you feed your machine might surprise you. Not only are there a variety of off-the-shelf choices but we've now entered an era where real-world data can be replaced with machine-created data for AI and ML.

And if that sounds a tad too artificial, don't worry: there is an expanding ecosystem of humans in the machine-learning feedback loop who keep the machines on track.

Let's start at the beginning, though. If you'd like some off-the-shelf data, you're in luck: there are plenty of labelled data sets out there suited to a range of scenarios.

One of the better-known data sets is ImageNet, both because of its broad applicability to image processing – a common task in deep learning – and its associated annual challenge. It is a collection of millions of image links, each tagged with meaningful concepts called "synsets". You can find satellite pictures courtesy of SpaceNet, which released a labelled set of them on AWS.

If you are training your neural net to recognise handwriting, there's the Modified National Institute of Standards and Technology (MNIST) database of handwritten digits. It comprises a set of 60,000 handwritten digits used for training image processing systems. There are many more data sets covering faces, text, speech and music.

Getting computers to make their own data

If you don't want something off the shelf, you can make your own data – so-called synthetic data. This is where computers fabricate data that is so realistic that it looks like it originated in the real world.

This happens particularly in high-end video games, where images are now realistic enough to "train" with.

In the world of self-driving vehicles, engineers can use data sets such as CityScapes, which labels classes of pixels at varying levels of granularity in 25,000 images across 50 different cities. There's also the CamVid database, which provides 10 minutes of labelled video footage from a moving vehicle.

Is this approach safe – or even accurate? Isn't it the AI equivalent of eating Soylent Green? Not according to researchers from Intel Labs and TU Darmstadt, who collected 25,000 images by simply driving around in a photorealistic open-world computer game. They manually labelled objects in one frame and had their software propagate the labels across many video frames (after all, pavements and pedestrians look very similar from one frame to the next). The techies claim to have experienced a greater rate of accuracy using the game data to train their AI than relying entirely on the real-world stuff from CamVid.

Into the feedback loop

Computers train their AI in loops, repeatedly formulating hypotheses based on input data and then testing to find the result.

Synthesis might work in open-world environments, and for faces and modelled objects, but some prefer to gather that data from real people. Among them is the firm Mighty.AI, which gathers and organises data sets for clients.

Mighty.AI told us last year: "Computers are very fast, accurate and stupid, and humans are brilliant, slow, and inaccurate. So how do you get the best of both, given that they're weak where the other is strong?"


The firm's software sends users tasks such as image segmentation, in which users draw round object borders and label them. It is using this model to label data for self-driving car clients. This input feeds human judgements back into the training set to help the computer refine its own model. "It's best to get results out of this learning loop if you put humans in the middle, because then we can insert our judgements," Mighty.AI told us.

Rival Alegion uses a similar approach to build an algorithm to detect damage on car body panels. It's a system that might find use where cars are being transported.

Crowdsourced workers outline the damage on hundreds of thousands of images of car body panels, and then classify them with jobs sent to Amazon's Mechanical Turk. "What they need are examples of graded pictures, so they have a clear classification taxonomy for mild, moderate and severe damage," vice president Chip Ray says.
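When several workers grade the same picture, their answers have to be reconciled somehow. The article doesn't say how Alegion does this, so the sketch below assumes the simplest option, a majority vote, with ties sent back for another look. The `GRADES` tuple and the `aggregate_votes` function are hypothetical names.

```python
from collections import Counter

# Hypothetical taxonomy, using the three grades quoted above.
GRADES = ("mild", "moderate", "severe")


def aggregate_votes(votes):
    """Return the most common grade among worker votes, or None if undecided."""
    for vote in votes:
        if vote not in GRADES:
            raise ValueError(f"unknown grade: {vote!r}")
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    # A tie for first place means the workers disagree: flag it for review.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


print(aggregate_votes(["mild", "mild", "severe"]))  # clear majority
print(aggregate_votes(["mild", "severe"]))          # tie, needs review
```

Rejecting anything outside the taxonomy is the point of having one: a label the model was never told about is just more noise.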

For some applications, digesting and distilling the training data is far more complex than drawing a box around something.

Dennis Mortensen, CEO and founder of an AI virtual assistant company, spent years training his neural network to read emails and automatically respond to book meetings with people. It required building a complex view of that problem space, which he calls his "universe".

This universe has three entities and a pool of intents. The entities are date and time, location, and people. The intents are more complex, covering the need to reschedule, tell someone you're running late, decide who is optional and mandatory for the meeting, and so on.
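One way to picture that universe is as a small data structure. This is a sketch under our own assumptions, not Mortensen's schema: the class names, field names and the three sample intents are invented to mirror the entities and intents described above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Intent(Enum):
    """A few illustrative intents; the real pool is far larger."""
    RESCHEDULE = "reschedule"
    RUNNING_LATE = "running_late"
    SET_ATTENDANCE = "set_attendance"  # who is optional, who is mandatory


@dataclass
class Person:
    name: str
    mandatory: bool = True


@dataclass
class ParsedEmail:
    """The three entities plus whatever intents were detected."""
    date_time: Optional[str] = None
    location: Optional[str] = None
    people: List[Person] = field(default_factory=list)
    intents: List[Intent] = field(default_factory=list)


example = ParsedEmail(
    date_time="Wednesday 4pm",
    location="Paris office",
    people=[Person("Alice"), Person("Bob", mandatory=False)],
    intents=[Intent.RESCHEDULE],
)
optional_attendees = [p.name for p in example.people if not p.mandatory]
print(optional_attendees)
```

Every labelled email in the training set would, in effect, be one filled-in record of this kind.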

To get these intents, Mortensen had to analyse the language in thousands of emails, producing training data that captured not only misspellings but also linguistic idiosyncrasies. When someone in London mails to ask for a 4pm call but you're in Paris, which time zone are they talking about? If they mail at 11:30 on Tuesday night and ask for a call "tomorrow", do they mean Wednesday or Thursday?
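Both ambiguities are easy to put numbers on. The sketch below assumes a winter date (Tuesday 16 January 2018) and fixed UTC offsets for the two cities, which a real system would replace with a proper time zone database. The variable and function names are ours.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets for a winter date; real code needs a time zone database.
LONDON = timezone(timedelta(hours=0), "London")
PARIS = timezone(timedelta(hours=1), "Paris")


def as_utc(year, month, day, hour, tz):
    """Interpret a wall-clock hour in the given zone and convert it to UTC."""
    return datetime(year, month, day, hour, tzinfo=tz).astimezone(timezone.utc)


# "4pm" read as London time versus Paris time.
four_pm_london = as_utc(2018, 1, 16, 16, LONDON)
four_pm_paris = as_utc(2018, 1, 16, 16, PARIS)
gap = four_pm_london - four_pm_paris
print(gap)  # the two readings are an hour apart

# "Tomorrow" in a mail sent at 11:30pm on Tuesday.
sent_at = datetime(2018, 1, 16, 23, 30, tzinfo=LONDON)
read_at = sent_at + timedelta(hours=9)  # opened on Wednesday morning
tomorrow_for_sender = (sent_at + timedelta(days=1)).date()
tomorrow_for_reader = (read_at + timedelta(days=1)).date()
print(tomorrow_for_sender, tomorrow_for_reader)
```

Neither reading is wrong in the code; the training data has to supply the evidence for which one people actually mean.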

Misunderstand something early in the conversation, and there will be what Mortensen calls a "cascading set of negative consequences".

To prevent this, he went deep, drawing the data from a three-year beta covering tens of thousands of people. "We looked at people scheduling meetings, thousands of times," he says. "There are thousands of edge cases." Each time the team found new parameters, it had to retrain its AI model.

Imbalanced AI

So, you have your data and feedback loops. Next comes massaging your data sets, and how you do this will depend in part on what you're building.

In many cases, you'll want to filter out noise data, such as duplicates and outliers that may be errors. That's natural data cleansing.
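As a rough sketch of that kind of cleansing, the functions below drop exact duplicates and then discard values that sit far from the median. The cut-off of 3.5 median absolute deviations is an arbitrary choice for this example, the readings are made up, and a flagged value may still be genuine, so treat the output as candidates for review.

```python
from statistics import median


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def drop_outliers(values, threshold=3.5):
    """Discard values more than `threshold` median absolute deviations away."""
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    if mad == 0:
        return list(values)  # no spread to measure against
    return [v for v in values if abs(v - centre) / mad <= threshold]


readings = [10.1, 10.1, 9.8, 10.3, 10.0, 97.0]  # invented sample data
cleaned = drop_outliers(drop_duplicates(readings))
print(cleaned)
```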


On the other hand, you may actually want to introduce noise.

The Silicon Valley Not HotDog app was designed to recognise whether something was a hot dog or not – duh. The app's creators produced copies of their hotdog pictures but then distorted, rotated and flipped them. This had two effects. First, it made hotdogs look more like they might when taken from a tilted phone. Second, it helped to reduce an imbalance in their data set.
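The flipping and rotating can be sketched in a few lines. This is not the app's actual pipeline, which the creators built with real image tooling; here a tiny grid of numbers stands in for pixel data and the function names are our own.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in image]


def rotate_90_clockwise(image):
    """Turn the grid a quarter turn clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image):
    """Return the original plus two distorted copies."""
    return [image, flip_horizontal(image), rotate_90_clockwise(image)]


hotdog = [
    [1, 2],
    [3, 4],
]  # stand-in for a real picture
for variant in augment(hotdog):
    print(variant)
```

One source picture becomes three training examples, which is how augmentation helps pad out an under-represented class.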

An imbalanced set with far more pictures in one group than any other can lead a neural network to constantly label new pictures with that popular classification. The network's program assumes that if most pictures in its training set were in one group, then assuming new pictures are in that same group will make it accurate most of the time.
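A bit of arithmetic shows why that lazy strategy is so tempting. The class counts below are invented: 95 pictures of one class and five of the other.

```python
from collections import Counter

# Invented, deliberately lopsided training labels.
labels = ["not_hotdog"] * 95 + ["hotdog"] * 5

# A "model" that always answers with the most common label.
majority_label, majority_count = Counter(labels).most_common(1)[0]
accuracy = majority_count / len(labels)

# How many of the rare class does it ever get right?
hotdogs_found = sum(
    1 for label in labels if label == "hotdog" and majority_label == "hotdog"
)

print(majority_label, accuracy, hotdogs_found)
```

The always-guess-the-majority model scores 95 per cent accuracy while never identifying a single hot dog, which is why headline accuracy alone is a poor guide on imbalanced data.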

You can solve this by creating a more balanced set of classifications, and also by changing your neural network model, weighting its results to give more attention to under-represented classes.
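One common way to do the weighting, sketched here as an assumption and not as anything the app's creators described, is to weight each class in inverse proportion to how often it appears. The counts and the `class_weights` function are illustrative.

```python
from collections import Counter


def class_weights(labels):
    """Weight each class by total / (number of classes * class count)."""
    counts = Counter(labels)
    total = len(labels)
    return {
        label: total / (len(counts) * count)
        for label, count in counts.items()
    }


labels = ["not_hotdog"] * 95 + ["hotdog"] * 5  # invented counts
weights = class_weights(labels)
print(weights)

# After weighting, each class contributes the same total to the loss.
weighted_totals = {
    label: weights[label] * count for label, count in Counter(labels).items()
}
print(weighted_totals)
```

Mistakes on the rare class now cost the network far more than mistakes on the common one, so always guessing the majority stops being a winning strategy.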

Imbalances in your data may lead to bias. One of the first tasks you see when signing up for Mighty.AI is to rate puppies for cuteness. The firm told us it wanted to see what the aggregated results would look like.

"As it turns out, there's a strong gender bias in how women and men rate the cuteness of puppies," Mighty.AI told us. "It's cute, but demonstrative." Left unchecked, bias of this kind can lead to errors that have unexpected real-world impacts.

Avoiding such dangers and producing a training set that will give you an accurate neural model is harder than it looks. Finding, scrubbing, and interpreting training data for deep learning algorithms is an important part of the AI development process that often takes up most of the time before you even get to have fun with neural network code.

"Most people think that TensorFlow is where the party is," concludes Mortensen. "But that's where the party ends." ®

Biting the hand that feeds IT © 1998–2018