Security Threats to Speech Apps in the Age of Deep Fakes
Look around you—there’s a good chance you’ll find a voice-controlled device nearby. If not, there’s the digital assistant on your phone. You can issue voice commands and get quite a few things done. Get weather information, play music, control home appliances, order groceries—the list of tasks you can accomplish is fast-growing. No doubt, we appreciate the hands-free convenience that these voice interfaces enable for us.
But as voice applications proliferate, we need to talk about their security. As machine learning advances, we are becoming aware of the dangers of “deep fakes,” fake images generated by deep learning algorithms. Synthetic audio will become a similar concern for voice applications.
Securing your voice applications begins with understanding their potential vulnerability to attacks by malicious actors. Several applications use voice biometrics for user authentication (e.g., customer service or mobile banking applications using automatic speaker verification as a login mechanism); these are not the focus of this article. Let me just note in passing that synthetic voices created with advanced machine learning techniques are becoming more sophisticated, and even such voice biometric systems will have to watch out for them going forward.
Many voice applications, however, do not verify the speaker’s identity. That’s by design. If you had to log in each time to check the weather outside, it’d make for a terrible user experience. But this convenience exposes the applications to several attack vectors.
The most basic method of attack is “replay.” User voice commands are recorded and then replayed to control the application. A rudimentary replay attack is easy to detect. But rogue actors may be able to exploit host platform and hardware characteristics (e.g., install malware, bypass device permissions, control the microphone or speaker) to make their intrusions go undetected. For example, they can get hold of users’ voice samples and then replay them from a background service on Android. Replay and its more sophisticated variants are examples of “white box” attack methods, where the attackers know the details of the host environment. Not surprisingly, you’ll see more white box attack attempts on Android devices, because the platform is open source and quite well understood.
In contrast, “black box” attacks are likely to be targeted at all platforms. They try to fool the machine learning models that are used for automatic speech recognition. Humans and algorithms recognize speech differently, and this simple fact presents opportunities for attackers. In one kind of black box attack, the voice input is modified so that it is not understandable to the user, yet the neural network still recognizes the command the attacker intended. The colloquial term for this is “cocaine noodles,” because Google Now misinterpreted that phrase as “OK, Google” and activated the digital assistant. Another adversarial approach generates an audio signal that sounds normal to the user but is transcribed very differently by the neural network model.
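To make the idea of an input that barely changes yet flips the model's decision more concrete, here is a minimal Python sketch. It is a toy, not a real attack: the "wake word detector" is a hypothetical linear model with random weights, the "audio" is random noise, and every function name is invented for illustration. For simplicity the sketch also assumes the attacker can read the model's weights, which a true black box attacker could not; such an attacker would have to estimate the direction of the nudge by querying the model.

```python
import random

def score(weights, samples):
    """Linear detector score: a weighted sum of the audio samples."""
    return sum(w * s for w, s in zip(weights, samples))

def detects_wake_word(weights, samples):
    """The toy detector fires when the score is positive."""
    return score(weights, samples) > 0.0

def sign(value):
    """Return 1, -1, or 0 according to the sign of value."""
    return (value > 0) - (value < 0)

def smallest_flip_epsilon(weights, samples, margin=1e-3):
    """Smallest uniform per-sample shift that lifts the score to `margin`."""
    current = score(weights, samples)
    if current > 0.0:
        return 0.0  # already detected, nothing to do
    return (margin - current) / sum(abs(w) for w in weights)

def adversarial_nudge(weights, samples, epsilon):
    """Move each sample by at most epsilon in the direction that raises the score."""
    return [s + epsilon * sign(w) for w, s in zip(weights, samples)]

# Hypothetical model and input: random weights and a random noise clip.
rng = random.Random(7)
weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Draw noise clips until one does not trigger the detector.
benign = [rng.gauss(0.0, 1.0) for _ in range(1000)]
while detects_wake_word(weights, benign):
    benign = [rng.gauss(0.0, 1.0) for _ in range(1000)]

epsilon = smallest_flip_epsilon(weights, benign)
adversarial = adversarial_nudge(weights, benign, epsilon)
largest_change = max(abs(a - b) for a, b in zip(adversarial, benign))

print(detects_wake_word(weights, benign))       # False by construction
print(detects_wake_word(weights, adversarial))  # True: the score is now about `margin`
print(largest_change)                           # small next to samples of spread 1.0
```

The point of the sketch is only that many tiny, coordinated changes add up to a large change in the model's score. Real speech recognizers are deep networks, and a practical attack also has to survive being played through a speaker and picked up by a microphone, which this toy ignores.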
We’ve only scratched the surface here, but that should give you a flavor of the security threats to voice applications. So what can be done to improve security? Authenticate the speaker automatically using their voice identity (without detracting from the user experience)? That would still be susceptible to replay spoofing and synthetic voices.
Reflect on the attack vectors we discussed, and it seems reasonable that we can safeguard against them if we are sure that the input is actually received from a human speaker. The approach, then, is to identify the source of voice commands: Is it a human or a playback device? Human speakers and digital speakers produce and transmit sound differently, and there are enough markers and clues in the sound frequencies and patterns (e.g., breathing sounds, click sounds when a file is played) to tell them apart. Leveraging machine learning, we can determine whether the voice commands are being issued by a human or not and reject those coming from an electronic source. This is a plausible solution.
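As a rough illustration of the human-versus-playback idea, the Python sketch below learns a single threshold on one acoustic feature and uses it to accept or reject a clip. Everything in it is an assumption made for the example: the signals are synthetic noise rather than speech, playback is modeled (purely for illustration) as a smoothed copy that has lost high-frequency content, and the function names are hypothetical. A production detector would rely on far richer features and a properly trained and evaluated model.

```python
import random

def high_freq_ratio(samples):
    """Energy of sample-to-sample changes divided by total energy.

    A crude proxy for how much high-frequency content a clip carries.
    """
    energy = sum(s * s for s in samples)
    if energy == 0:
        return 0.0
    diff_energy = sum((b - a) ** 2 for a, b in zip(samples, samples[1:]))
    return diff_energy / energy

def simulate_live(rng, n=2000):
    """Hypothetical stand-in for a live speaker: broadband noise."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def simulate_replay(rng, n=2000, window=8):
    """Hypothetical stand-in for playback: noise smoothed by a moving average."""
    raw = simulate_live(rng, n + window - 1)
    return [sum(raw[i:i + window]) / window for i in range(n)]

def fit_threshold(live_clips, replay_clips):
    """Place the threshold midway between the mean feature of each class."""
    live_mean = sum(high_freq_ratio(c) for c in live_clips) / len(live_clips)
    replay_mean = sum(high_freq_ratio(c) for c in replay_clips) / len(replay_clips)
    return (live_mean + replay_mean) / 2.0

def classify(samples, threshold):
    """Label a clip, assuming live clips score above the threshold."""
    return "human" if high_freq_ratio(samples) > threshold else "playback"

# "Train" on labeled synthetic clips, then score two unseen ones.
rng = random.Random(0)
threshold = fit_threshold(
    [simulate_live(rng) for _ in range(20)],
    [simulate_replay(rng) for _ in range(20)],
)
print(classify(simulate_live(rng), threshold))    # human
print(classify(simulate_replay(rng), threshold))  # playback
```

A voice application could run a check of this kind before acting on a command and ignore anything labeled as playback. The single feature here separates the two synthetic classes cleanly only because they were built to differ in exactly that way; real recordings are much harder to tell apart.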
Security for speech and voice applications is still an emerging field. But as smart speakers and voice applications gain greater adoption, expect voice security to mature as well, in order to battle the “deep fakes” of audio.
Kashyap Kompella is the CEO of rpa2ai, a global AI industry analyst firm, and is also a contributing analyst at Real Story Group.