Safety and Ethical Concerns Loom Large in Voice Cloning

Though it sounds like the stuff of an H.G. Wells story, synthetic speech was first created centuries before the computer age. According to Matinee Multilingual, a U.K. firm that specializes in translation, transcription, voiceovers, subtitling, captioning, and video editing, the first synthetic speech came in 1770 with a rudimentary machine capable of mimicking vowel sounds. Then a few years later, a mechanical speech machine that could make multiple noise types, including full words, was created. Both of these machines were basic and worked by replicating human vocal structures. They were initially developed to study voice function.

Many, many years later, speech machines came into use to help people who were unable to speak, the most famous example of which was the synthetic voice used by renowned theoretical physicist Stephen Hawking. Though certainly helpful for those who used them, these devices sounded robotic and mechanical—not something anyone would mistake for a human voice.

Fast-forward to the past decade or so, and voice technology, particularly with the help of artificial intelligence, has progressed swiftly. AI voices are being used in interactive voice response systems, messaging services, voice assistants, and many other areas. These machine-generated voices have gone from sounding like robots to sounding like humans.

These developments set the stage for Lyrebird (now the AI Research Division of Descript), which in 2017 released a product that could mimic any person’s voice after just one minute of sample audio. Today, similar products can re-create a human voice with just two or three seconds of audio.

The preference for a human-sounding voice is strong: Voicebot research has found that 71.6 percent of consumers prefer human voices over synthetic ones in their interactions with companies.

Ethical Use Concerns

However, that preference for human-sounding voices and the advances in the technology that produces them make ethical considerations of voice cloning more important than ever. And it’s not just actors, politicians, and other famous people who are at risk. Even typical consumers’ voices are subject to unethical use, particularly as generative AI technology becomes more widespread.

Modern speech technology can listen to phone calls and hijack them with fake biometric audio for fraud or manipulation purposes. In a now widely publicized Hong Kong fraud case, an employee transferred $25 million in funds to five bank accounts after a virtual meeting with what turned out to be audio-video deepfakes of senior management.

Chenta Lee, chief architect of threat intelligence at IBM, explained how the hacker pulled it off: He combined large language models (LLMs), speech-to-text, text-to-speech, and voice cloning tactics to dynamically modify the context and content of a live phone conversation. Tactics can be deployed through a number of vectors, such as malware or compromised VoIP services. A three-second audio sample is enough to create a convincing voice clone, and the LLM takes care of the parsing and semantics.

“Everyone needs to be concerned about the technology being used for nefarious purposes,” says Bob Long, president of the Americas at Daon, a biometrics and identity assurance software company. “It’s no longer where you’re looking at just famous people; it’s everyone.”

The long-running scam call from a supposed relative needing money (without a cloned voice) is no longer as effective as it once was. But when a cloned voice is added, those calls are extremely believable, he says.

Additionally, spoofed voices can be used to compromise voice verification systems if very sophisticated technology and additional verification systems aren’t used, according to Long.

Not all uses of voice clones are nefarious, of course. “There are a lot of use cases that are extremely ethical using AI voices,” says Zohaib Ahmed, CEO and cofounder of Resemble AI, who points to entertainment and personalization as well as use in virtual assistants.

Another popular use of the technology is to help people who are losing their voice and those who have already lost the ability to speak, Long adds. “Any technology is a tool before it is a threat.”

Where Ethical Use Begins

For the many companies using professional voice talent for voiceovers and other synthesized speech applications, the integrity of the voice clone has to be the primary concern. The contracts with the voice actors are the most important starting point.

“As long as there is consent for the AI voice to be created, we are totally on board. Our product relies on that consent piece,” Ahmed says. “We don’t care how it’s being used, as long as it’s within the bounds of legal concerns. Robocalls aren’t allowed. But as long as it’s for creative purposes—we have games; we have movie studios and probably videos that are built off this—that’s all very fair.”

As the technology has gotten better and better in just the past few months, PolyAI, a creator of customized voice assistants, has relied less on actual voice actors and more on voice cloning technologies, “both for cost purposes and to reduce the amount of administration,” says Nathan Liu, PolyAI’s head of deployments. “It’s only been in the last 6 to 12 months that [the technology’s] been good enough.”

To ensure the ethical cloning of the voices of the talent with which it works, PolyAI has actors sign contracts giving the company the permission to do so and outlining compensation for the voice actor.

The contracts are very specific, Liu adds. “The voice actors always get paid, and we don’t use their voice for anything other than what we agreed to use it for. They will state that we can use their cloned voice only for a particular client. The amount that we pay them is very transparent. They’ve agreed to it before we use their voice.”

With today’s technology, it’s relatively easy and inexpensive to clone a voice an infinite number of times. To protect actors and the company, PolyAI’s contract to use a cloned voice for multiple clients pays the actor based on usage, with a cap on payout from that contract.

Resemble AI’s contracts with voice actors include revenue sharing. The more the actor’s voice is used, the more that person earns, Ahmed adds. “If it sounds like a celebrity’s voice, and if it was trained off the celebrity’s dataset, then the celebrity needs to be compensated.”

The National Association of Voice Actors has set forth a series of guidelines for how these contracts should be structured. They call for a clear indication of how and where the voice will be used, the medium involved (text-to-speech, interactive voice response, speech-to-speech, etc.), and pay requirements. For example, if an actor’s voice is recorded for narration and then shows up in a commercial, the actor should know exactly how he or she will be paid for the new usage, the guidelines specify.

Those guidelines also include term and exclusivity agreements: “How long will your character be in use? In perpetuity, throughout the known universe, robs you of potential earnings. Make sure you stipulate a specific start and end date for the usage so that you and your agent can renegotiate at the end of the term. Also understand if the contract makes you exclusive to a specific company, preventing you from working for anyone else in that market,” the guidelines also state.

The clients that ultimately use the voices created by companies like PolyAI and Resemble AI have their own reputations to protect, so they wouldn’t use the voices outside of the contracted agreements, Liu says.

Technological Solutions

But not all users of voice cloning technologies follow the same ethical use standards, and so technology has to be brought to bear to protect the owners of the actual voices.

It’s a battle that’s been raging for years.

In 2019, Google released a synthetic speech database designed to stop audio deepfakes.

“Malicious actors may synthesize speech to try to fool voice authentication systems,” the Google News Initiative blog reported at the time. “Perhaps equally concerning, public awareness of deepfakes (audio or video clips generated by deep learning models) can be exploited to manipulate trust in media.”

Also in 2019, Google introduced Translatotron, an AI system that translates speech directly into another language. By 2021, it was clear that deepfake voice manipulation was a serious issue for anyone relying on AI to mimic speech, so Google designed Translatotron 2 to prevent voice spoofing.

The technology used to produce deepfakes has evolved markedly since Google’s Translatotron technology debuted. But the technology designed to protect against these synthesized voices has evolved as well.

There are, for example, technologies designed to detect whether a voice is live, recorded, or cloned, but not all companies use those technologies, Long cautions. And it’s very unlikely that an individual would have such technologies on hand.

To help protect against others stealing and using the voices, Resemble AI’s technology includes audio watermarks to authenticate the company’s files.

“If someone scrubs your data off the internet that is watermarked, the output will also be watermarked. So you can easily detect that the voice isn’t [a particular person], but it was trained off data from this real person,” Ahmed explains.
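Resemble AI hasn’t published the details of its watermarking scheme, but the general principle Ahmed describes can be illustrated with a toy spread-spectrum watermark: a low-amplitude pseudorandom pattern, derived from a secret key, is mixed into the audio and later recovered by correlating the signal against the same keyed pattern. The function names and the strength/threshold values below are invented for illustration and are not Resemble AI’s implementation:

```python
import numpy as np

def embed_watermark(audio: np.ndarray, key: int, strength: float = 0.02) -> np.ndarray:
    """Mix a low-amplitude pseudorandom +/-1 pattern, seeded by `key`,
    into the audio. At this strength the pattern is far below speech level."""
    rng = np.random.default_rng(key)
    mark = rng.choice([-1.0, 1.0], size=audio.shape)
    return audio + strength * mark

def detect_watermark(audio: np.ndarray, key: int, threshold: float = 0.01) -> bool:
    """Correlate the signal against the keyed pattern. Watermarked audio
    correlates near `strength`; unmarked audio correlates near zero."""
    rng = np.random.default_rng(key)
    mark = rng.choice([-1.0, 1.0], size=audio.shape)
    score = float(np.mean(audio * mark))
    return score > threshold
```

Note that detection requires the right key, which is what lets the watermark’s owner prove provenance. Production watermarks are also engineered to survive compression, resampling, and model training; this toy version would not.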

Resemble AI also has deepfake detection technology that Ahmed says can identify whether a voice is real or has been generated by spoofing technologies with 90 percent accuracy.
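Real spoof detectors are trained models, but the intuition behind many of them is that synthesis pipelines leave measurable artifacts in the signal’s spectrum. A deliberately simplistic sketch of that idea (the feature and threshold here are invented for illustration and bear no relation to Resemble AI’s actual detector) might flag audio whose high-frequency energy looks unnaturally suppressed:

```python
import numpy as np

def highband_energy_ratio(audio: np.ndarray) -> float:
    """Fraction of spectral energy above the midpoint frequency.
    Natural wideband audio keeps substantial energy up high; heavily
    low-passed audio (our toy stand-in for a synthesis artifact) does not."""
    spectrum = np.abs(np.fft.rfft(audio)) ** 2
    cutoff = len(spectrum) // 2
    return float(spectrum[cutoff:].sum() / spectrum.sum())

def looks_synthetic(audio: np.ndarray, threshold: float = 0.1) -> bool:
    """Flag audio whose upper band is suspiciously quiet."""
    return highband_energy_ratio(audio) < threshold
```

A single hand-picked feature like this is trivially evaded, which is why deployed systems learn many features from labeled genuine and spoofed samples.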

Other technologies designed to determine whether voices are synthetic are also emerging.

Ning Zhang, an assistant professor of computer science and engineering at the McKelvey School of Engineering at Washington University in St. Louis, has developed a tool called AntiFake, a novel defense mechanism designed to thwart unauthorized speech synthesis before it happens.

Unlike traditional deepfake detection methods, which are used to evaluate and uncover synthetic audio as a post-attack mitigation tool, AntiFake takes a proactive stance. It employs adversarial techniques to prevent the synthesis of deceptive speech by making it more difficult for AI tools to read necessary characteristics from voice recordings. The code is freely available to users.

“AntiFake makes sure that when we put voice data out there, it’s hard for criminals to use that information to synthesize our voices and impersonate us,” Zhang said when introducing the technology in November. “The tool uses a technique of adversarial AI that was originally part of the cybercriminals’ toolbox, but now we’re using it to defend against them. We mess up the recorded audio signal just a little bit, distort or perturb it just enough that it still sounds right to human listeners, but it’s completely different to AI.”
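AntiFake’s actual optimization is more sophisticated, but the core adversarial move Zhang describes can be sketched in a few lines: take a gradient step that nudges the recording’s speaker embedding toward a decoy, while an L-infinity budget keeps the perturbation small enough to be inaudible. The linear “embedding” matrix and the epsilon value below are stand-ins for illustration, not AntiFake’s model:

```python
import numpy as np

def protect_audio(audio: np.ndarray,
                  embed_matrix: np.ndarray,
                  decoy_emb: np.ndarray,
                  eps: float = 0.005) -> np.ndarray:
    """One FGSM-style step: perturb each sample by at most `eps` so the
    (toy, linear) speaker embedding of the audio moves toward a decoy
    embedding, frustrating voice-cloning tools trained on the result."""
    emb = embed_matrix @ audio
    # Gradient of 0.5 * ||emb - decoy||^2 with respect to the audio samples.
    grad = embed_matrix.T @ (emb - decoy_emb)
    # Step against the gradient, clamped to a valid sample range.
    return np.clip(audio - eps * np.sign(grad), -1.0, 1.0)
```

The per-sample cap of eps is what keeps the change imperceptible to human listeners while still shifting what a feature extractor “hears,” which is the trade-off Zhang describes.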

And then there’s ID R&D, a provider of liveness detection and voice biometrics, which directly addresses voice-based threats by combining passive facial liveness and voice anti-spoofing technologies, accurately identifying even advanced AI-based deceptions.

“True to our name, research is a strategic priority at ID R&D, and our products are built upon investigation and discovery of new ways to reduce fraud without burdening users,” said Alexey Khitrov, CEO and cofounder of ID R&D, in a statement.

The technologies to produce synthetic voices and to protect people from the illegal use of their voices will continue to evolve.

Legal protections haven’t caught up with the advances in the technology, Long says. But legal protections, even if they were up to date, would only go so far: Voice compromises can originate in regions outside some legal jurisdictions, and the cloning technology can be used in ways that make it tough or impossible to detect who is behind it. So advanced technology will be the prime deterrent.

“Generative technology will continue to get better and better in the next year,” Ahmed adds. “You are seeing models that can draw images that are hyper-realistic and create voices that are hyper-realistic. The big shift that you will see in the next couple of years is that these models will become hyper-personalized. Imagine hearing an AI agent that understood your history and that can effectively answer questions that you have. There are tons of startups that are working on this. It’s a very ambitious project. This is something that could be used in some very interesting ways.”

But unfortunately this technology could also be used in increasingly nefarious ways, so the technologies to protect against these threats will need to continue to evolve as well, experts agree.

“Rapid advancements in artificial intelligence bring new challenges to securing digital onboarding and authentication, particularly from deepfakes, and we expect this trend to persist,” ID R&D’s Khitrov said in the statement. “Fortunately, we are also working hard to research and develop new ways to leverage AI to its full potential to counter AI-powered fraud.” 

Phillip Britt is a freelance writer based in the Chicago area. He can be reached at spenterprises1@comcast.net.
