Lend me your ears and AI will play with your brain: Machine voice imitators outsmart us
Brain-scanning boffins find no signals to differentiate real and spoofed speakers
Human brains appear to be better at detecting fake images than fake voices, a distinction that bodes ill for security as voice spoofing technology becomes more effective and more accessible.
In a blog post this week, Adrian Colyer, venture partner with Accel in London, explored a paper presented earlier this year at the 2019 Network and Distributed Systems Security (NDSS) Symposium that examines this unexpected perceptual blind spot.
The paper, The Crux of Voice (In)Security: A Brain Study of Speaker Legitimacy Detection, co-authored by Ajaya Neupane (University of California Riverside), Nitesh Saxena (University of Alabama at Birmingham), Leanne Hirshfield (Syracuse University), and Sarah Elaine Bratt (Syracuse University), examines the neural underpinnings of voice security through functional near-infrared spectroscopy (fNIRS).
Basically, the researchers conducted brain scans of subjects listening to recordings of people's voices, the dulcet tones of Oprah Winfrey and Morgan Freeman among them, and compared the results with scans taken while the subjects listened to computer-generated (morphed) versions of those voices and to human impersonators.
They found that the brain doesn't register a significant difference between real and morphed voices, as it does with impersonators or fake images. A previous study found neural differences in the brains of people viewing fake and real Rembrandt paintings and also in people viewing phishing websites.
That's not to say recorded Oprah and simulated Oprah were of equal quality; they were not. The researchers asked subjects to identify whether the speaker was real or fake. The subjects reported genuine recordings as real 82 per cent of the time, morphed (software-altered) recordings as real 58 per cent of the time, and impersonating speakers as real 32 per cent of the time.
What matters is that when people took a morphed voice to be the real person, there were no neural signals to indicate otherwise. And when the researchers tried using machine learning to distinguish the scan data from real and morphed voices, they could only achieve 53 per cent accuracy – not much better than a coin toss.
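To see why 53 per cent counts as "not much better than a coin toss", it helps to ask how often pure guessing would score at least that well. The sketch below is a minimal illustration, assuming a balanced two-class task where chance is 50 per cent. The function name and the trial counts are hypothetical and are not taken from the paper, which this article does not quote on sample sizes.

```python
from math import comb


def p_at_least_by_guessing(correct: int, trials: int) -> float:
    """Probability of getting `correct` or more right out of `trials`
    when every answer is a fair 50/50 guess (exact binomial tail)."""
    # Count the outcomes with at least `correct` hits, then divide by
    # the total number of equally likely outcomes, 2**trials.
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2**trials


# Hypothetical trial counts, chosen only for illustration.
for trials in (100, 1000):
    correct = round(0.53 * trials)
    p = p_at_least_by_guessing(correct, trials)
    print(f"{correct}/{trials} correct: chance of doing this well by guessing = {p:.3f}")
```

On a hypothetical 100 trials, guessing alone reaches 53 correct fairly often, so that score says little. The same percentage over 1,000 trials would be much harder to get by luck. How far 53 per cent sits from chance therefore depends on how much data lies behind it.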
"It turns out though that voice may be a particularly devastating attack vector because 'deep down', inside our brains, we genuinely can't tell the difference between a real voice and a morphed voice impersonating it," observed Colyer.
The researchers used software called CMU Festvox for voice morphing and they speculate that ongoing advances in voice synthesis (eg, Lyrebird and Google WaveNet) will make morphed voices sound even more credible, thereby making voice-based social engineering attacks more practical.
Potential threat scenarios, they suggest, include publishing fake voice samples on social media, impersonating people in phone conversations or voicemail systems, and man-in-the-middle attacks against voice verification systems.
The boffins warn that their inability to find a neural distinction between how real and morphed voices register in the brain doesn't preclude the possibility that such signals exist.
Colyer considers the possibility that the researchers might not be measuring the right areas of the brain. But he also observes that scans comparing an original speaker's audio sample with a human impersonator's showed statistically significant differences, which indicates the fNIRS scans were at least looking in relevant brain territory.
The researchers argue that raising people's awareness of the risk of voice spoofing may be the most feasible way to deal with the threat at present. Biometric systems, they claim, are vulnerable to voice synthesis attacks, though they allow that future technical solutions may help. ®