When you play this song backwards, you can hear Satan. Play it forwards, and it hijacks Siri, Alexa

Speech recognition systems seduced by masked messages

By Thomas Claburn in San Francisco

Posted in Artificial Intelligence, 30th January 2018 07:02 GMT

Computer science boffins affiliated with IBM and universities in China and the United States have devised a way to issue covert commands to voice-based AI software – like Apple Siri, Amazon Alexa, Google Assistant and Microsoft Cortana – by encoding them in popular songs.

They refer to these tweaked tunes, which issue mostly inaudible commands to speech recognition devices within earshot, as CommanderSongs.

In CommanderSong: A Systematic Approach for Practical Adversarial Voice Recognition, a paper distributed through preprint service arXiv, the ten authors involved in the project – Xuejing Yuan, Yuxuan Chen, Yue Zhao, Yunhui Long, Xiaokang Liu, Kai Chen, Shengzhi Zhang, Heqing Huang, Xiaofeng Wang, and Carl A. Gunter – describe their technique for duping deep-learning models used to recognize speech with "adversarial perturbations."

Adversarial attacks are a way to deceive AI systems by altering input data to obtain desired results from a specific system. They've been explored extensively for images. For example, MIT students recently demonstrated that they could trick Google's image recognition system into labeling a turtle as a rifle.
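To give a flavour of how such perturbations are computed, here is a minimal, hypothetical sketch of the classic targeted fast-gradient-sign approach in PyTorch. It is an illustration of the general idea, not the turtle team's exact method; model, image and target_label are placeholders for a pretrained classifier, an input batch and the label the attacker wants the system to output.

```python
import torch
import torch.nn.functional as F

def fgsm_targeted(model, image, target_label, epsilon=0.01):
    # Targeted fast-gradient-sign step: nudge every pixel a small amount in
    # the direction that makes the classifier favour target_label, keeping
    # the change small enough to be hard for a human to spot.
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), target_label)
    loss.backward()
    adversarial = image - epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()
```

Real attacks typically apply many small steps under tighter constraints, but the one-step version captures the trick: follow the model's own gradients to steer its answer while barely touching the input.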

Alterations

Less work has been done with audio and speech recognition. The researchers note that while images make it easy to alter pixels in ways that trip up algorithms without leaving noticeable visual artifacts, it isn't obvious that audio attacks can slip under the radar in the same way, because spoken commands doctored in that fashion typically can't be recognized at all by voice-controlled devices such as Amazon's Echo.

Last year, a different group of clever people proposed what they called DolphinAttack, a way to manipulate software-based voice recognition apps using sound outside the range of human hearing. That technique, however, can be mitigated by technology capable of suppressing ultrasound signals.

The CommanderSong researchers – from the State Key Laboratory of Information Security (SKLOIS), University of Chinese Academy of Sciences, Florida Institute of Technology, University of Illinois at Urbana-Champaign, IBM T. J. Watson Research Center, and Indiana University – say their technique has two differences: it does not rely on any other technology to hide the command, and it cannot be blocked by audio frequency filters.

"Our idea to make a voice command unnoticeable is to integrate it in a song," they explain in their paper. "In this way, when the crafted song is played, the [speech recognition] system will decode and execute the injected command inside, while users are still enjoying the song as usual."

In a phone interview with The Register, Gunter, a computer science professor at the University of Illinois, said that while previous work has shown garbled sounds can trigger voice recognition systems, masking the command in a song is less likely to draw attention because music is so often playing in the background anyway.

"It has a more practical attack vector," he said.

The researchers started with a randomly selected song and command track generated by a text-to-speech engine. They then decoded each audio file using the open-source Kaldi speech-recognition toolkit, and extracted the output of a deep neural network (DNN).

After identifying the specific DNN outputs that represent the desired command, they manipulated the song and command audio using gradient descent, a machine-learning optimization algorithm.
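A rough, hypothetical PyTorch sketch of that optimization step might look like the following, assuming a differentiable stand-in (acoustic_model) for the recognizer's DNN and a target_outputs tensor captured from the command recording. It illustrates the gradient-descent idea rather than reproducing the authors' Kaldi-based code.

```python
import torch
import torch.nn.functional as F

def embed_command(song, target_outputs, acoustic_model,
                  steps=1000, lr=1e-3, budget=0.05):
    # Learn a small perturbation of the song waveform so the acoustic model's
    # frame-by-frame outputs match those produced by the spoken command.
    delta = torch.zeros_like(song, requires_grad=True)
    optimiser = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        optimiser.zero_grad()
        # tanh keeps the added noise within a fixed loudness budget, so to a
        # listener it sounds like mild distortion at worst.
        perturbed = song + budget * torch.tanh(delta)
        loss = F.mse_loss(acoustic_model(perturbed), target_outputs)
        loss.backward()
        optimiser.step()

    return (song + budget * torch.tanh(delta)).detach()
```

The budget term caps how loud the added noise can get, which is why the result registers to a human ear as faint distortion rather than a spoken command.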

Chord cutters

In essence, they used their knowledge of the way the audio would be processed to ensure the speech recognition system would hear the command within the music.

The result is adversarial audio – songs containing a command interpretable by Kaldi code but unlikely to be noticed by a human listener.

The altered audio may be perceptible to a listener, but it's doubtful the added sound would be recognized as anything other than distortion.

"You mistake some of these signals as defects in the media," said Gunter, allowing that some songs masked the command better than others. "Some of the examples, they would make you grimace. Others are more subtle."

The researchers tested a variety of in-song commands delivered directly to Kaldi as audio recordings, such as: "Okay Google, read mail" and "Echo, open the front door." The success rate of these was 100 per cent.

They also tested in-song commands delivered audibly, where environmental noise can hinder recognition, including "Echo, ask Capital One to make a credit card payment" and "Okay Google, call one one zero one one nine one two zero."


As a stand-in for actual devices, the boffins had the Kaldi software listen to songs with embedded commands, played through a JBL Clip 2 portable speaker, TAKSTAR broadcast gear and an ASUS laptop, from a distance of 1.5 metres.

For the open-air tests, success rates varied from 60 per cent to 94 per cent.

Gunter said that to be certain the attack would work with, say, Amazon's Echo, you'd have to reverse-engineer the Alexa speech recognition engine. He said he knows of colleagues working on that.

The researchers suggest that CommanderSongs could prompt voice-recognition devices to execute arbitrary commands delivered over the air without anyone nearby noticing. And they say such attacks could be delivered through radio, TV or media players.

We already have the proof-of-concept for overt commands sent over the airwaves. In time, we may get a covert channel too.

"It's going to take continued work on it to get it to the point where it's less noticeable," said Gunter. ®
