Inside the 1TB ImageNet data set used to train the world's AI: Naked kids, drunken frat parties, porno stars, and more

Creators tried to stop us looking through it. We saw a copy anyway. Now it's being scrubbed clean

Shocking contents revealed

Feature ImageNet – a data set used to train AI systems around the world – contains photos of naked children, families on the beach, college parties, porn actresses, and more, scraped from the web to train computers without those individuals' explicit consent.

The library consists of 14 million images, each placed into categories that describe what's pictured in each scene. This pairing of information – images and labels – is used to teach artificially intelligent applications to recognize things and people caught on camera.

The database has been downloaded by boffins, engineers, and academics to train hundreds if not thousands of neural networks to identify stuff in photos – from assault rifles and aprons to magpies and minibuses to zebras and zucchinis, and everything in between.

In 2012, the data set was used to build AlexNet, heralded as a breakthrough development in deep learning since it marked the first time a neural network outperformed traditional computational methods at object recognition in terms of accuracy. AlexNet went on to win the annual ImageNet Large Scale Visual Recognition Challenge.

While ImageNet’s successes are celebrated, its troubling contents have largely been ignored.

Vinay Prabhu, a machine-learning scientist at an AI startup in Silicon Valley, stumbled across some of the data set’s darker and murkier photos by accident.

“I was trying to generate pictures of bicycles using BigGAN,” he told The Register. BigGAN is a generative adversarial network, a machine-learning system that can be taught to craft new unique outputs after studying a set of inputs. In other words, Prabhu hoped to feed BigGAN photos of bicycles from ImageNet so that his computer would generate its own original and never-seen-before pictures of bikes.

Instead, his code conjured strange flesh-colored blobs that resembled blurry, disfigured female bodies. Puzzled, he went back to the training data set, and realized he had accidentally trained his model on bikinis instead.

ImageNet's categories are sorted in alphabetical order and referenced by software in ascending numerical order. In a subset of the database Prabhu was using for his research, bicycles were category 444, and bikinis were 445. A single-digit typo in his code caused his neural network to draw from category 445 – bikinis – rather than the bicycles in 444.
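
For the curious, here is a minimal sketch of how that sort of slip could be caught before any training starts. It is not Prabhu's code: the label strings "bicycle" and "bikini", the CLASS_NAMES table, and both helper functions are assumptions made up for illustration. Only the two index numbers come from the account above. The idea is to look a category up by name, and to check any hard-coded number against the label you expect.

```python
from typing import Dict

# Hypothetical two-entry slice of a class table. Only the numbers 444 and 445
# come from the article; the label strings are assumed for illustration.
CLASS_NAMES: Dict[int, str] = {444: "bicycle", 445: "bikini"}


def resolve_class(name: str, class_names: Dict[int, str]) -> int:
    """Return the index whose label equals `name`, or raise if it is not unique."""
    matches = [idx for idx, label in class_names.items() if label == name]
    if len(matches) != 1:
        raise KeyError(f"expected exactly one class named {name!r}, found {len(matches)}")
    return matches[0]


def check_class(index: int, expected: str, class_names: Dict[int, str]) -> None:
    """Fail loudly if a hard-coded index does not point at the expected label."""
    actual = class_names.get(index)
    if actual != expected:
        raise ValueError(f"class {index} is {actual!r}, not {expected!r}")


# Look the category up by name rather than typing the number by hand.
target = resolve_class("bicycle", CLASS_NAMES)
check_class(target, "bicycle", CLASS_NAMES)
```

With a guard like this in place, calling check_class(445, "bicycle", CLASS_NAMES) raises a ValueError rather than quietly handing the network the wrong pictures.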

“At first I found it amusing, and I decided to look through the data set,” he told us. It didn’t take long before his amusement turned to shock.

“It was clear that these were unethical,” Prabhu said. He saw photos of a naked child’s backside, porn stars, shenanigans at frat parties, plus private and intimate photos of men dressed in women’s underwear. Some of the snaps included watermarks with URLs leading back to smut sites where the photos were lifted from.

“Sometimes the nature of what is pornographic is debatable, but in some cases, the links to the porn websites are included right in the images,” he told us.

Venturing further down the rabbit hole, he investigated other categories in the ImageNet library, and uncovered more eyebrow-raising material, such as photos of women showing their pubic hair. Crucially, it appears no one pictured had given their specific consent to have their images included in a data set analyzed by an untold number of neural networks and eggheads.

“These people have no idea that their faces are in this data set,” Prabhu told The Reg. When he emailed the ImageNet creators to alert them to what he had found, he received no response, he told us.

The ImageNet team refused to give The Register access to the data set when we asked. Instead, a spokesperson told us on behalf of the team that the library was unavailable due to “maintenance reasons.” Indeed, the data has been unavailable for download since January this year, though we asked before and after that month for access, and were denied both times.

Another source within the industry, however, who asked to remain anonymous, allowed us to pore over the full ImageNet library – all 1.31TB of it – and we found a trove of troubling photos.

Swimming trunks and shower caps

There were nude children in the "swimming trunks" and the "shower cap" categories. In one particular case, a naked grimacing child wearing a transparent shower cap is standing in front of a naked man, and the photo is cropped just above the child’s genitals. In the "kimono" class, there’s a woman who lies on a bed with her kimono spread to reveal her legs and pubic hair. Elsewhere, there are snaps of people in various states of undress lounging by the sea on sandy beaches.

ImageNet has millions upon millions of photos. Most of them are pretty harmless and depict things like animals, plants, or random objects. It’s when humans are involved that things get a little dicier.

A large part of the problem is due to how the images were collected. Researchers at Stanford University and Princeton University in the US wrote software that automatically scraped a large portion of them from sites including Google, Bing, and the photo-sharing platform Flickr.

Even though some, or all, of the pictures harvested from image sites may have been covered by permissive licenses, such as Creative Commons, allowing them to be used and distributed freely by the ImageNet team and scientists, it's understood the people pictured did not all agree to have the images fed into various forms of artificial intelligence years later. That, in itself, is a lesson for us all: our data released or shared today may well be used for wildly unexpected purposes tomorrow.

'Mistakes'

These scraped photos were passed to humans slaving away on Amazon Mechanical Turk (MTurk) – a program in which people sign up to perform simple tasks for others for a small wage. These serfs were asked to manually assign the images to specific categories, or draw bounding boxes around objects of interest in the photos.

Even after the images had been eyeballed by these turks, who were instructed to filter out any dodgy material, problematic snaps managed to slip through. The ImageNet team blamed this on human error. More than 160 million pictures were processed by tens of thousands of cyber-freelancers, so some blunders were to be expected, we're told.

“There’s no doubt MTurk workers can make mistakes, and spammers on MTurk – people who do not pay attention to instructions or otherwise submit low-quality work – is always a problem,” the project's spokespeople told The Register earlier this month.

“We have an automatic quality control system in place to filter out spammers and problematic images, but the system is not perfect. Even with multiple safeguards, a problematic image can still slip through, given the scale of the effort. There were over 160 million candidate images and over 50,000 MTurk workers.”

Tom White, a digital artist focused on AI and a lecturer at the Victoria University of Wellington School of Design in New Zealand, who has previously used the data set in his work, told us: “There are loads of inappropriate images online, and so we should expect any automated collection of online photos to include these as well. The publishers of the data sets generally do actively try to remove these, but no matter how hard they try, there comes a point of diminishing returns.

“The data set creators allow some of these 'contaminants' to remain simply because there is little incentive to spend the resources eradicating them all and they have minimal overall effect on the training of machine learning models.”

Copyright laws need to be updated

Scraping photos from public sources to feed data sets like ImageNet is a contentious issue. Even if the snaps were used under permissive licenses, is it reasonable to expect the people who shared them to have foreseen their photos being scrutinized to build powerful object-recognizing neural networks? Do we really have to predict the future before we share any more information online?

Albert Cahn, a lawyer and founder and executive director of the Surveillance Technology Oversight Project – a non-profit activist group based in New York – told us he was “incredibly concerned about the privacy impact of computer vision data scraping.”

“I’m alarmed that millions of individuals’ faces are being used and commercialized without their consent," he told The Register. "I think that the fact that there are children in these databases highlights just how few rules there have been for aggregating these data sets.”

He argued new laws are needed to safeguard people's data as it is harvested and analyzed. “All too often, firms use and sell our biometric data without our consent," he said. "While some of these practices may be legal, they highlight the need for stronger biometric privacy practices that protect against this sort of data harvesting.

"As it becomes ever easier to take ever more of our information and incorporate it into machine learning, we need for the laws to catch up and see that traditional copyright rules don’t go nearly far enough in protecting the public.”

Bias and racism

ImageNet's creators acknowledged they were aware of its problematic content. Not only are there inappropriate photos, some of the labels used to describe them are biased and racist.

A recent project, ImageNet Roulette, created by Kate Crawford, co-founder of the AI Now Institute, a research hub focused on the social implications of AI, and Trevor Paglen, an artist interested in mass surveillance and data collection, revealed that a system trained on ImageNet would label people’s selfies with insults and racial slurs. Crawford and Paglen declined to comment for this story.

“We agree that inappropriate images should not be in the data set,” the ImageNet team told El Reg.

“We are developing and implementing a systematic approach to identify and remove problematic images, because what is considered problematic can evolve with time, be subjective, and depends on local community standards and the intended use; reasonable people might disagree on where to draw the line.”

Essentially, the team is now scrubbing ImageNet of its inappropriate images, and hopes to blur out all the faces in it.

Os Keyes, a PhD student at the University of Washington on the west coast of America, who is studying gender and algorithms, told us: "Face blurring is what we would call 'necessary, but insufficient.' A face is hardly the only way to identify someone." Other ways include clothes, tattoos, and ID badges.

"Identifying people can be done through faces, but it can be done through a lot of other components of how people present in photography, and so it's difficult to imagine researchers being able to actually write systems to comb identifiable features out," Keyes said. "Again, this is a situation where active consent goes a long way towards resolving the problem."

Deleting images introduces other problems, too, Crawford and Paglen previously said. “By erasing them completely, not only is a significant part of the history of AI lost, but researchers are unable to see how the assumptions, labels, and classificatory approaches have been replicated in new systems, or trace the provenance of skews and biases exhibited in working systems.

“This is the problem of inaccessible or disappearing data sets. If they are, or were, being used in systems that play a role in everyday life, it is important to be able to study and understand the worldview they normalize.” ®

Biting the hand that feeds IT © 1998–2019