
AI sucks at stopping online trolls spewing toxic comments

It's easy for hate speech to slip past dumb machines

By Katyanna Quach


New research has shown just how bad AI is at dealing with online trolls.

Such systems struggle to automatically flag nudity and violence, don’t understand text well enough to shoot down fake news and aren’t effective at detecting abusive comments from trolls hiding behind their keyboards.

A group of researchers from Aalto University and the University of Padua found this out when they tested seven state-of-the-art models used to detect hate speech. All of them failed to recognize foul language when subtle changes were made, according to a paper [PDF] on arXiv.

Adversarial examples can be generated automatically by algorithms that misspell certain words, swap characters for numbers, add random spaces between words, or attach innocuous words such as 'love' to sentences.
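To make those tricks concrete, here is a minimal Python sketch of the kinds of perturbations described. The function names and substitution table are our own illustration, not the researchers' code.

import random

# Hypothetical helpers illustrating the perturbations described in the paper;
# names and the leetspeak table are our own, not the authors'.
LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}

def misspell(word: str) -> str:
    """Swap two adjacent characters to introduce a plausible typo."""
    if len(word) < 3:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def leetify(word: str) -> str:
    """Replace letters with visually similar digits."""
    return "".join(LEET.get(c, c) for c in word.lower())

def insert_spaces(word: str) -> str:
    """Break word boundaries so word-level tokenisers miss the term."""
    return " ".join(word)

def append_innocuous(sentence: str, token: str = "love") -> str:
    """Pad the sentence with a harmless word to dilute the toxic signal."""
    return f"{sentence} {token}"

if __name__ == "__main__":
    slur = "idiot"  # stand-in for a genuinely offensive term
    print(misspell(slur))                         # e.g. "iidot"
    print(leetify(slur))                          # "1d10t"
    print(insert_spaces(slur))                    # "i d i o t"
    print(append_innocuous("you are an idiot"))   # "you are an idiot love"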

The models failed to pick up on these adversarial examples, which successfully evaded detection. Such tricks wouldn't fool humans, but machine learning models are easily blindsided. They can't readily adapt to new information beyond what's been spoonfed to them during the training process.

“They perform well only when tested on the same type of data they were trained on. Based on these results, we argue that for successful hate speech detection, model architecture is less important than the type of data and labeling criteria. We further show that all proposed detection techniques are brittle against adversaries who can (automatically) insert typos, change word boundaries or add innocuous words to the original hate speech,” the paper’s abstract states.

Sniffing out toxic language normally boils down to a classification problem: does this sentence contain any swear words or racist and sexist slurs?

Google's Perspective API calculates a score to determine whether text is hateful or not. But by narrowing the task down to a simple classification problem, it can suffer from false positives: sentences that contain offensive language but are harmless in their overall meaning.

Some false positive examples that show how brittle Google's Perspective model is. Image credit: Gröndahl et al.

The researchers were too polite to spell it out, replacing a "common English curse word, marked with 'F' here, but [used] in [its] original form in the actual experiment." You get the idea.
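For readers curious what that scoring step looks like in practice, here is a minimal sketch of querying the Perspective API from Python. It assumes the publicly documented v1alpha1 comments:analyze endpoint and a valid API key; it is an illustration of the scoring idea, not the researchers' test harness.

import requests  # third-party; pip install requests

# Assumed endpoint and request shape for Google's Perspective API.
API_KEY = "YOUR_API_KEY"
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity_score(text: str) -> float:
    """Return Perspective's summary TOXICITY score (0.0 to 1.0) for text."""
    body = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(URL, json=body, timeout=10)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

if __name__ == "__main__":
    print(toxicity_score("you are an idiot"))       # likely scored as toxic
    print(toxicity_score("you are an i d i o t"))   # spacing can shift the score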

“Attack effectiveness varied between models and datasets, but the performance of all seven hate speech classifiers was significantly decreased by most attacks,” according to the researchers.

The weakest models are the ones that inspect sentences word by word, since tiny changes like adding spaces between words slip by unnoticed. The ones that break words down into individual characters do slightly better at recognizing attacks, as the toy example below shows.
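A toy illustration of the difference: break word boundaries with spaces and the word-level features for the offending term vanish, while character n-grams largely survive. This is our own sketch, not the paper's code.

# Toy illustration of why word-level features break under the
# space-insertion attack while character n-grams mostly survive.

def word_tokens(text: str) -> set[str]:
    return set(text.lower().split())

def char_ngrams(text: str, n: int = 3) -> set[str]:
    squashed = text.lower().replace(" ", "")
    return {squashed[i:i + n] for i in range(len(squashed) - n + 1)}

def overlap(a: set, b: set) -> float:
    """Jaccard similarity between two feature sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

original = "you are an idiot"
attacked = "you are an i d i o t"   # word boundaries broken by spaces

print(overlap(word_tokens(original), word_tokens(attacked)))   # low: 'idiot' disappears
print(overlap(char_ngrams(original), char_ngrams(attacked)))   # 1.0: n-grams unchanged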


“A significant difference between word- and character-based models was that the former were all completely broken by at least one attack, whereas the latter were never completely broken,” the team said.

Future research should focus on making models more robust to attacks, the researchers said. Developers should pay closer attention to the training dataset rather than the algorithms themselves, they argued.

“We therefore suggest that future work should focus on the datasets instead of the models. More work is needed to compare the linguistic features indicative of different kinds of hate speech (racism, sexism, personal attacks etc.), and the differences between hateful and merely offensive speech,” the paper concluded. ®

