When Dolphins Attack: How Voice Assistants and Speech Recognition Software Can be Fooled
If you own a smart speaker, you know it can be fun to trick Alexa, Siri, or Google into doing or saying something it shouldn't, like obeying a friend who imitates your voice commands. Such ruses are mostly fun and harmless, but the truth is that bad actors are undoubtedly attempting trickery of a more nefarious nature, and voice-controlled systems (VCSs) and speech recognition systems (SRSs) can be easily fooled via clever techniques.
When Voice Biometrics Attack
For proof, consider the revelations of a paper published last year by Chinese researchers who successfully commanded VCSs like Siri, Cortana, and Alexa by employing a technique, called "DolphinAttack," that uses high frequencies incomprehensible to human ears but detectable by electronic microphones. They designed DolphinAttack to achieve inaudibility by modulating voice commands onto ultrasonic carriers. The researchers were able to make Siri initiate a FaceTime call on the iPhone, activate a car's navigation system, switch a smartphone to airplane mode, and more.
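The core trick, hiding a baseband voice command inside an ultrasonic carrier via amplitude modulation, can be illustrated in a few lines of numpy. This is a minimal conceptual sketch, not the researchers' implementation: the 25 kHz carrier, 192 kHz sample rate, and the 1 kHz tone standing in for speech are all illustrative assumptions.

```python
import numpy as np

def modulate_on_ultrasonic(command, fs=192_000, carrier_hz=25_000, depth=1.0):
    """Amplitude-modulate a baseband 'voice command' onto an ultrasonic carrier.

    The result carries all its energy near carrier_hz, above human hearing,
    yet nonlinearities in a microphone circuit can demodulate it back to
    an audible-band signal the recognizer then processes.
    """
    t = np.arange(len(command)) / fs
    carrier = np.cos(2 * np.pi * carrier_hz * t)
    # Classic AM: (1 + m * x(t)) * cos(2*pi*fc*t)
    return (1.0 + depth * command) * carrier

# Toy "command": a 1 kHz tone standing in for speech
fs = 192_000
t = np.arange(fs) / fs
command = 0.5 * np.sin(2 * np.pi * 1_000 * t)
am = modulate_on_ultrasonic(command, fs)

# The modulated signal's energy sits at 24-26 kHz, inaudible to humans
spectrum = np.abs(np.fft.rfft(am))
freqs = np.fft.rfftfreq(len(am), 1 / fs)
peak = freqs[np.argmax(spectrum)]
print(f"dominant frequency: {peak:.0f} Hz")  # carrier at ~25000 Hz
```

The key point is that nothing in the modulated waveform falls below 20 kHz, which is why the attack is silent to bystanders but still reaches the device's microphone.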
In addition, a team from the University of California published a paper in early 2018 detailing a white-box, iterative optimization-based audio adversarial attack on Mozilla's DeepSpeech speech-to-text transcription software. The researchers were able to take any audio waveform, reproduce it with 99.9% accuracy, and have it transcribe as any phrase they chose (at a maximum rate of 50 characters per second), with a 100% success rate.
These findings raise serious questions: how susceptible are VCSs and SRSs to these threats, and what can be done to safeguard against them?
Do You Need to Worry?
Scott Amyx, managing partner at Amyx Ventures and an Internet of Things expert, for one, is concerned. "This research brings awareness to vulnerabilities. It's unclear whether ill-intent states and agents or hackers are exploiting these vulnerabilities on a large-scale. But it poses risks—from pranksters who may be able to control devices for purposes like recording people in private situations to eavesdropping of highly confidential and sensitive materials to malicious intent commands that could endanger lives, physical assets, and sensitive data and financial transactions," says Amyx.
Cornelius Raths, PhD, senior director of data science for Veritone, Inc., an artificial intelligence technology and solutions provider, however, is less perturbed by the success of these published attacks. "The DolphinAttack idea is entirely based on good old-fashioned analog signal processing, which you don't see too much of nowadays. It requires the transmitter to be extremely close to the VCS device, which makes such an attack a lot less likely to occur. Plus, you need a particular speaker's voice to activate the device and enable an attack, which would require a lot of effort," notes Raths.
When it comes to an audio adversarial attack like the one applied to Mozilla’s DeepSpeech, “it’s not a very practical attack,” Raths adds. “The distortion that will make the utterance transcribe something different is closely linked to the specific utterance. But the variability of regular human speech basically prevents this from happening, as we never utter things the same way twice. For the attack to succeed, you need to be able to have access to the neural network’s loss function to generate powerful distortions and add those to the utterances made.”
Still, adversarial attacks could pose a larger problem in the future, Raths warns, especially if hackers find ways to mount an attack with a black box instead of a white box and are able to test a large number of distortions—one or more of which could succeed at controlling a device or software.
Protecting Your Voice-Activated Systems
Fortunately, VCS and SRS companies have defenses against the attacks outlined in the published papers. “To prevent a DolphinAttack, you can change the frequency response curve of the microphone by using a low-pass filter that makes the spectral roll-off of that curve steeper. And you can implement a classifier that can detect features predictive of voice activity at frequencies above 20 kHz in order to detect this kind of attack,” suggests Raths. “To protect an SRS like Mozilla DeepSpeech, you can introduce randomness in the signal artificially by injecting a low amount of noise that will not affect the transcription quality. Or you can route your audio through various engines that are each slightly different, because introducing light variability in your neural network architecture can prevent distortion attacks from being effective.”
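The detection side of Raths' suggestion, flagging voice activity above 20 kHz, reduces to a simple spectral feature. The sketch below is a minimal illustration of that idea, not any vendor's actual defense; the sample rate, test tones, and rejection threshold are assumptions chosen for demonstration.

```python
import numpy as np

def ultrasonic_energy_ratio(signal, fs):
    """Fraction of spectral energy above 20 kHz - a simple feature for
    flagging DolphinAttack-style ultrasonic input before recognition."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), 1 / fs)
    return spectrum[freqs > 20_000].sum() / spectrum.sum()

fs = 192_000                       # high sample rate so ultrasound is visible
t = np.arange(fs // 10) / fs       # 100 ms analysis window

# Speech-like energy sits well below 20 kHz; an ultrasonic command does not
speech_like = np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 2_500 * t)
attack_like = np.sin(2 * np.pi * 25_000 * t)

print(ultrasonic_energy_ratio(speech_like, fs))  # ~0.0
print(ultrasonic_energy_ratio(attack_like, fs))  # ~1.0

# A crude policy: reject (or low-pass filter) any input whose ultrasonic
# ratio exceeds a small threshold, e.g. 0.01, before it reaches the SRS.
```

In practice this feature would feed the kind of classifier Raths describes, alongside a steeper low-pass roll-off on the microphone itself.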
Lastly, expect more exploitation of these vulnerabilities in the times ahead. “Hackers will always find ways to attack the system,” Amyx says. “Given that speech-based human machine interfaces will only become more pervasive, expect the problem to worsen, not necessarily in the same way as reported by these studies but in different and constantly changing ways.”