Speech Technology Magazine

 

When Dolphins Attack: How Voice Assistants and Speech Recognition Software Can be Fooled

If you own a smart speaker, you know that it can be fun trying to trick Alexa, Siri, or Google into doing or saying something it shouldn’t, like obeying your friend who imitates your voice commands. While such ruses are fun and harmless, the truth is that bad actors are undoubtedly attempting trickery of a more nefarious nature, and voice-controlled systems (VCSs) and speech recognition systems (SRSs) can be fooled via surprisingly simple techniques.
By Erik J. Martin - Posted Feb 23, 2018

When Voice Biometrics Attack

For proof, consider the revelations of a paper published last year by Chinese researchers who successfully commanded VCSs like Siri, Cortana, and Alexa using a technique called “DolphinAttack,” which relies on high frequencies that are inaudible to human ears but still detectable by electronic microphones. DolphinAttack achieves this inaudibility by modulating voice commands onto ultrasonic carriers. With it, the researchers were able to make Siri initiate a FaceTime call on an iPhone, activate a car’s navigation system, switch a smartphone to airplane mode, and more.
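
The core trick is classic amplitude modulation. The sketch below (a minimal illustration, not the researchers’ actual implementation; the `modulate_ultrasonic` helper and the pure-tone stand-in for a voice command are assumptions for demonstration) shifts a baseband “voice” signal onto a 25 kHz carrier, so that essentially all transmitted energy sits above the roughly 20 kHz ceiling of human hearing:

```python
import numpy as np

def modulate_ultrasonic(voice, fs, carrier_hz=25_000, depth=0.8):
    """Amplitude-modulate a baseband 'voice' signal onto an
    ultrasonic carrier, pushing its energy above ~20 kHz."""
    t = np.arange(len(voice)) / fs
    carrier = np.cos(2 * np.pi * carrier_hz * t)
    # Classic AM: carrier scaled by (1 + depth * voice); the voice
    # content ends up in sidebands around the carrier frequency.
    return (1 + depth * voice / np.max(np.abs(voice))) * carrier

fs = 96_000                            # sample rate high enough for ultrasound
t = np.arange(fs) / fs                 # one second of samples
voice = np.sin(2 * np.pi * 1_000 * t)  # 1 kHz tone standing in for a command
tx = modulate_ultrasonic(voice, fs)

# The transmitted energy now sits near 25 kHz, outside human hearing.
spectrum = np.abs(np.fft.rfft(tx))
freqs = np.fft.rfftfreq(len(tx), 1 / fs)
audible = spectrum[freqs < 20_000].sum()
ultrasonic = spectrum[freqs >= 20_000].sum()
```

A microphone’s nonlinearity then demodulates the sidebands back into the audible band, which is why the device “hears” a command that humans cannot.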

In addition, a University of California team published a paper in early 2018 detailing a white-box, iterative optimization-based audio adversarial attack on Mozilla’s DeepSpeech speech-to-text engine. Given any audio waveform, the researchers could produce a perturbed version more than 99.9% similar to the original that nonetheless transcribes as any phrase they chose (at up to 50 characters per second), with a 100% success rate.
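
The shape of such a white-box optimization is easy to sketch. The toy example below substitutes a linear classifier for DeepSpeech and plain cross-entropy for its CTC loss (both simplifications are mine, not the paper’s); it nudges a perturbation `delta` by gradient descent until the model “transcribes” the attacker’s chosen target, while an L2 penalty keeps the perturbation small:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a speech model: a linear classifier mapping a
# 100-sample "waveform" to one of 3 "phrases". (The real attack
# targets DeepSpeech's CTC loss; this only shows the shape of the
# optimization loop.)
W = rng.normal(size=(3, 100))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.normal(size=100)   # benign input waveform
target = 2                 # "phrase" the attacker wants transcribed

delta = np.zeros_like(x)   # the adversarial perturbation
for _ in range(500):
    p = softmax(W @ (x + delta))
    # Gradient of cross-entropy toward `target`, plus an L2 penalty
    # that keeps the perturbation small (i.e., quiet).
    grad = W.T @ (p - np.eye(3)[target]) + 0.01 * delta
    delta -= 0.01 * grad

adv = x + delta   # numerically close to x, yet classified as `target`
```

The key requirement, as Raths notes later in this article, is white-box access: the attacker must be able to compute gradients of the model’s loss with respect to the input.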

These findings raise serious questions: How susceptible are VCSs and SRSs to these threats, and what can be done to safeguard against them?

Do You Need to Worry?

Scott Amyx, managing partner at Amyx Ventures and an internet of things expert, for one, is concerned. “This research brings awareness to vulnerabilities. It’s unclear whether ill-intentioned states, agents, or hackers are exploiting these vulnerabilities on a large scale. But it poses risks—from pranksters who may be able to control devices for purposes like recording people in private situations, to eavesdropping on highly confidential and sensitive materials, to malicious commands that could endanger lives, physical assets, sensitive data, and financial transactions,” says Amyx.

Cornelius Raths, PhD, senior director of data science for Veritone, Inc., an artificial intelligence technology and solutions provider, however, is less perturbed by the success of these published attacks. “The DolphinAttack idea is entirely based on good old-fashioned analog signal processing, which you don’t see too much of nowadays. It requires the transmitter to be extremely close to the VCS device, which makes such an attack a lot less likely to occur. Plus, you need a particular speaker’s voice to activate the device and enable an attack, which would require a lot of effort,” notes Raths.

When it comes to an audio adversarial attack like the one applied to Mozilla’s DeepSpeech, “it’s not a very practical attack,” Raths adds. “The distortion that makes an utterance transcribe as something different is closely tied to that specific utterance. But the variability of regular human speech basically prevents this from happening, as we never utter things the same way twice. For the attack to succeed, you need access to the neural network’s loss function to generate powerful distortions and add them to the utterances.”

Still, adversarial attacks could pose a larger problem in the future, Raths warns, especially if hackers find ways to mount black-box attacks (which require no access to the model’s internals) instead of white-box ones and are able to test a large number of distortions, one or more of which could succeed at controlling a device or piece of software.


Protecting Your Voice-Activated Systems

Fortunately, VCS and SRS companies have defenses against the attacks outlined in the published papers. “To prevent a DolphinAttack, you can change the frequency response curve of the microphone by using a low-pass filter that makes the spectral roll-off of that curve steeper. And you can implement a classifier that can detect features predictive of voice activity at frequencies above 20 kHz in order to detect this kind of attack,” suggests Raths. “To protect an SRS like Mozilla DeepSpeech, you can introduce randomness in the signal artificially by injecting a low amount of noise that will not affect the transcription quality. Or you can route your audio through various engines that are each slightly different, because introducing light variability in your neural network architecture can prevent distortion attacks from being effective.”
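
The first two of Raths’s suggestions can be sketched in a few lines. Below, `ultrasonic_ratio` measures how much spectral energy sits above 20 kHz (a crude stand-in for his proposed classifier feature), and `lowpass` applies a brick-wall filter in the frequency domain (a simplification of the steeper roll-off he suggests; a production design would use a proper FIR or IIR filter). Function names and the pure-tone test signals are illustrative assumptions:

```python
import numpy as np

def ultrasonic_ratio(signal, fs, cutoff_hz=20_000):
    """Fraction of spectral energy above `cutoff_hz` -- a simple
    feature for flagging DolphinAttack-style ultrasonic carriers."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), 1 / fs)
    return power[freqs >= cutoff_hz].sum() / power.sum()

def lowpass(signal, fs, cutoff_hz=20_000):
    """Crude brick-wall low-pass: zero every bin above the cutoff.
    (A real microphone front end would use a designed FIR/IIR filter.)"""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), 1 / fs)
    spec[freqs >= cutoff_hz] = 0
    return np.fft.irfft(spec, n=len(signal))

fs = 96_000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 440 * t)     # in-band content, passes untouched
attack = np.cos(2 * np.pi * 25_000 * t)  # ultrasonic carrier, filtered out
```

A simple threshold on `ultrasonic_ratio` would flag the attack signal while leaving ordinary speech untouched, and the low-pass stage removes the carrier before it ever reaches the recognizer.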

Finally, expect more exploitation of these vulnerabilities in the years ahead. “Hackers will always find ways to attack the system,” Amyx says. “Given that speech-based human-machine interfaces will only become more pervasive, expect the problem to worsen, not necessarily in the same way as reported by these studies but in different and constantly changing ways.”
