Voice Cloning Using Artificial Intelligence Is a Pandora’s Box


Many artificial intelligence applications wow you during a demo but fail to live up to the hype in actual use. AI-generated voices, or voices cloned by applying AI techniques to a person's existing speech recordings, are decidedly not in that category. They work well in the real world.

Such AI voices (also referred to as synthetic voices) are a notch above simple text-to-speech, as they incorporate realistic speech patterns and don't sound monotonous or robotic. AI voices have multiple use cases: personalizing virtual assistants, automating customer service, narrating audiobooks in the author's voice, creating video game character voices, dubbing TV shows and movies into multiple languages, and so on. AI voices are useful in marketing, too, for creating automated voiceovers for product demos, custom brand voices for companies, and celebrity voice greetings, to name just a few examples. AI voices are also valuable in healthcare and are an important part of assistive and accessibility technologies: voice prosthetics to restore speech for those who've lost their voices due to ailments, assistance in speech therapy, voice interfaces for hands-free control of devices, read-aloud of text, and so on.

Alas, AI voices also illustrate the duality of AI: They can easily be put to sinister use. And that's happening at an alarming pace. For years, we worried about deepfake videos. We fretted that as AI got better, we wouldn't be able to tell real videos from fake AI-generated ones, and that this would amp up propaganda and dial down trust. That scenario is coming at us fast; we already have voice deepfakes. With only a small speech sample, a high-quality, real-sounding synthetic voice can easily be generated using AI. It is not costly to do so, and the tools are easily available. Faked images and videos often offer a few visual clues: imperfect fingers, say, or flouting of the laws of physics. No such luck with voices. Detecting AI-generated ones is a tough problem.

Not surprisingly, scams using AI-cloned voices are on the rise. An elderly couple was duped into believing that their grandson was calling to ask for cash to get him out of trouble (see this issue's Editor's Letter for a similar case). A New York couple was tricked into transferring money to scamsters after an AI-cloned voice heard over the phone led them to believe that a relative had been taken hostage. Such imposter scams are not new, but scamsters are using AI to increase the sophistication of their sophistry.

The U.S. Federal Trade Commission (FTC) has issued warnings urging consumers to be alert and recently ran a challenge soliciting solutions for detecting voice cloning. Through this challenge, the FTC hoped to spur innovative solutions, but it struck a cautious note: If no viable solutions prove forthcoming or feasible, its warning should serve as a heads-up to policymakers about the need to be ready with nontechnical remedies.

Meanwhile, the U.S. Federal Communications Commission (FCC) declared that under the Telephone Consumer Protection Act (TCPA), using AI-generated voices in robocalls is illegal.

While we are discussing outright scams here, there are several other legal risks and questions related to voice cloning. For example, there are potential intellectual property violations and privacy risks attached to unauthorized cloning of musical artists and celebrity voices. The rapid advances in generative AI make finding solutions and mitigation measures that much more urgent.

Another worrying prospect is the potential for election interference. Misinformation campaigns and propaganda using cloned voices of public figures and political leaders are a real possibility. Consider that 2024 is the year of elections. Sixty-four countries, including the U.S.—representing nearly half of humanity—will conduct national elections this year. Viewed through this lens, speech technology is no longer a niche technology but a powerful force in global affairs. The stakes are high, and there is no single silver bullet. Countering voice deepfakes will require a well-thought-out regulatory framework and enforcement guidelines, technology solutions such as digital fingerprinting and audio watermarks from AI developers, and awareness campaigns for the general public.

Kashyap Kompella, CFA, is an industry analyst, author, educator, and advisor. He is founder of the AI advisory outfits RPA2AI Research and AI Profs and a For Humanity Certified AI auditor.
