Voice Cloning: A Breakthrough with Boundless Potential

Get ready for a set of technologies that could revolutionize human communication.

The most dramatic advancement yet in synthetic speech technology is fast approaching, though many outside (and even inside) the industry might not know it: the ability to transfer the true sound of your voice to automation. This means not only transferring your own voice to a personal assistant, an avatar, or an animated character but, more important, instantly translating your voice into languages you do not speak.

Until now, the usual voices generated by text-to-speech (TTS) software have been based on recordings by professional voice talents in each language. TTS brilliantly tackled the deep complexities of language pronunciation, accent, dialect, and smooth natural audio concatenation, but its speech has always been composed of someone else’s voice, not the user’s.

The newest development can enable your speaking voice to morph into dozens of languages. Surprisingly, there does not yet seem to be a fixed name for the process. Even the companies developing this cluster of related technologies give it multiple names: “voice cloning,” “voice conversion,” “voice localization,” “localized voice dubbing,” and more. This special cloning captures the individualized, unique sound of your voice; when combined with speech translation, voice styling, dubbing, and new realistic avatars, voice cloning will touch every medium of communication on earth.

As the computing power required to create new voice technologies lowers to a more manageable level—together with a significant amount of basic speech code becoming available as open source and a swelling number of experienced developers—it is becoming possible for entrepreneurs and innovative small companies to reproduce, manipulate, and improve computer-generated voices. There are dozens of such vendors, and each developer is seemingly on its own slightly different trajectory.

A Layman’s View of This Breakthrough

Voice Conversion

Resemble AI, led by CEO Zohaib Ahmed and based in the United States, offers not only generic TTS and custom TTS but also a novel and interesting voice conversion feature. A person, let’s say an English speaker, makes a 30- to 40-minute prerecording in her native tongue; that prerecording can then be used to re-create her voice in several languages. When an API sends text in French, German, Spanish, or some other supported language to Resemble AI’s English TTS voice channel, the results return as an amazingly natural conversion from the speaker’s English to that new language, carrying the same accent that would come naturally if she were speaking outside her native tongue. When this writer tried out the technology a year ago, the results in French featured a heavy accent, but in a more recent conversion, the results were more lightly accented and astonishingly natural-sounding, with a surprising retention of voice personality. When used as the last lap of speech-to-speech translation, the multilingualism achieved by Resemble AI could impact entire countries.

Voice Localization

The goal of voice localization is a bit different: to “translate” your voice into another language seamlessly with no accent whatsoever, sounding exactly as if you were a native speaker. The process of voice localization is dual-track: The magic of software dissects the “sound elements” of your original voice and holds them; then these elements are added to a separate track.

A voice localization cloning service works like this: Original audio is generated by a speaker, such as a video voiceover or a sentence spoken by a customer service agent. That original recording is then translated, creating a secondary track using native-speaker audio. This second track can be spoken by a voice actor, a professional interpreter, or neural/AI TTS; the performance and quality of this secondary audio are pivotal to the resulting quality of the voice cloning performance. Then the extracted “sound elements” from the original recording are transferred to the secondary translation voice, thereby retaining the accent and vocal performance of the secondary track, while inserting the identity of the original voice.
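
The dual-track flow described above can be pictured with a toy data model. The sketch below is only an illustration of which attributes come from which track; it processes no real audio, and every name in it (Track, make_secondary_track, transfer_identity) is hypothetical rather than any vendor's API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Track:
    """Toy stand-in for an audio track; no real audio is processed."""
    identity: str      # whose voice the listener recognizes
    language: str      # language being spoken
    performance: str   # source of the accent and delivery
    text: str          # words spoken on the track


def make_secondary_track(target_language, native_voice, translated_text):
    """Secondary track: a native speaker, interpreter, or TTS voice performs the translation."""
    return Track(
        identity=native_voice,
        language=target_language,
        performance=native_voice,
        text=translated_text,
    )


def transfer_identity(original, secondary):
    """Keep the secondary track's language, text, and performance,
    but insert the vocal identity of the original speaker."""
    return replace(secondary, identity=original.identity)


# Hypothetical example: a support agent's English sentence localized to French.
original = Track("support agent", "en", "support agent", "Hello, how can I help?")
secondary = make_secondary_track("fr", "native French TTS", "Bonjour, comment puis-je vous aider ?")
localized = transfer_identity(original, secondary)
print(localized)
```

In this picture the final track inherits everything from the secondary track except the identity, which is why the quality of the secondary performance matters so much to the result.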

Voxygen, based in France, has been known for its smooth, natural TTS in multiple languages and dialects, including groundbreaking solutions for pronouncing “liaisons,” those important slurrings of consonant concatenations that previously only a human French speaker could create. Voxygen is now embarking on a new ambitious project: starting with a single TTS voice, that of, say, a woman named Naomi, and generating her voice into multiple languages. The end goal, then, would be an API that can send translated text and request the voice of Naomi in a dozen languages, basically accent-free. This approach would enable businesses, for example, to use a single custom voice for global commerce in multiple languages. The TTS is delivered via the cloud as well as on-premises containers with high numbers of connections, shortening latency for far-flung locations and easing the path for real-time voice translation. Voxygen develops voice localization code for quicker and easier use in near-real-time consumption, including with custom voices, and is preparing for expansion.

Resemble AI is also steaming full speed ahead in offering dual-track voice localization. Importantly, the company takes a step forward in the ease of utility and delivery, racing toward near-real-time cloning services via API as well as on-premises. Using this approach for real-time voice translation, a conference speaker’s voice will be translatable in the natural flow of the event, and a support agent’s voice can be either massaged to another accent or translated, with the secondary track consisting of TTS followed by voice clone localization.

Broadcast Voice Localization

The use of voice cloning and voice localization in broadcast media translation is just beginning, initially for limited lines or corrections to be added or altered in a film. The cloning technology makes fixing or replacing lines lightning fast, using the same cast voice.

The amount of content within a program that is translated via cloning is growing. Soon series and movies will be cloned either partially or in their entirety. There have been recent splashes of press coverage for the voice cloning work done by Ukraine’s Respeecher, which re-created the voice of actor James Earl Jones, working with clips of his past performances, to produce a clone that gave new life to Darth Vader for the TV series Obi-Wan Kenobi.

Alex Sediuk, cofounder and CEO of Respeecher, explains the process: “Professional voice dubbing requires dozens of specialist actors, and previous automation options involved regular TTS recorded by a single native speaking voice without performance or inflection. It’s really hard to use text-to-speech for [localizing] high-quality content. Text-to-speech is limited by language models and relies on words only, while natural speech—especially film performances—consist of many other sounds that are not words. We sing, we whisper, we cry. Our technology leans toward performance, because that’s something we take from humans, and we believe human [performance] would still be the best option. In this way, the translation voice track to be cloned has the greatest impact combining human and automation, including astonishing voice localization [translation] of songs and emotional performances digitally replacing the vocal apparatus.”

Platform for Content Localization

When working on a budget, basic TTS can be effective in simple dubbing for documentaries or explanatory videos, especially if enough pauses in narration enable the translation language to catch up. But the localization of film and television productions involves significantly more than just words and audio.

Oz Krakowski, chief revenue officer of Deepdub, clarifies: “Dubbing brings a high amount of challenges: script translation, script adaptation to lip sync, localization voice casting, quality of translated audio, physical syncing of the audio to lip movements, and finally the full audio mix. In transferring the voice attributes we must regenerate the same effects on the new voices. And all boils down to cost and other challenges. We developed not only the technology to generate the voices, but also other technologies that help us in the process, including a full-scale web-based platform that allows us to integrate all of the human effort as well as technologies in the localization process, even casting control and multitiered approval process.

“Script adaptors must be highly trained to rewrite when they need to,” Krakowski continues. “From that we create a voice guide. The voice guide will bring the emotions, especially when emotions are extreme. [For lip sync] we have algorithms that find matches, such as aligning the ‘L’ sound in ‘Hello’ with the Spanish translation ‘Hola,’ with the ‘L’ being the anchor point to auto-adjust the audio to the tongue and lips.”
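
The anchor-point idea in Krakowski's example can be sketched in a few lines. This is a simplified illustration, not Deepdub's algorithm: it assumes each word arrives as a list of (sound, start time in milliseconds) pairs, picks the first sound the two words share as the anchor, and shifts the translated audio so the anchors coincide. The timings and function names are made up for the example.

```python
def find_anchor(source, target):
    """Return (sound, source_ms, target_ms) for the first source sound
    that also occurs in the target, or None if nothing is shared."""
    first_target_time = {}
    for sound, start_ms in target:
        first_target_time.setdefault(sound, start_ms)
    for sound, start_ms in source:
        if sound in first_target_time:
            return sound, start_ms, first_target_time[sound]
    return None


def align_to_anchor(source, target):
    """Shift every target timing so the shared anchor sound starts
    at the same moment as it does in the source."""
    anchor = find_anchor(source, target)
    if anchor is None:
        return list(target)  # nothing to align on; leave timings unchanged
    _, source_ms, target_ms = anchor
    offset = source_ms - target_ms
    return [(sound, start_ms + offset) for sound, start_ms in target]


# Hypothetical timings; the Spanish "h" is silent, so "Hola" starts on "o".
hello = [("h", 0), ("e", 80), ("l", 200), ("o", 320)]
hola = [("o", 0), ("l", 120), ("a", 220)]

print(find_anchor(hello, hola))
print(align_to_anchor(hello, hola))
```

With these sample timings the shared "l" becomes the anchor, and the Spanish audio is delayed so that its "l" lands where the on-screen speaker's tongue forms the same sound. A production system would presumably also stretch or compress the audio around the anchor rather than only shifting it.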

Cloning Used for Original Works

For brand-new content, especially animation and games, voice cloning can be used for the original voice track, not only for translation. Says Krakowski of Deepdub, “Voice guides may actually move in the direction of being spoken by the directors or producers themselves. They need to explain everything to an actor; usually directors are almost like actors themselves, but they just don’t have the nice voice that they want for the content. So technology pulls in the sounds of a professional actor, usually from a voice bank, over the track laid by the producer or director or whoever is doing the voicing.”

Finding Lost Voices

Voice cloning may be able to revive voices we thought were lost. Museums will be able to create exhibits where the true voices of our ancestors, our indigenous leaders, and our historic figures read aloud their own writings and correspondence. The holograms featured in the Shoah Foundation’s Dimensions in Testimony now have the option to speak of the Holocaust in dozens of languages, using the real voices of survivors for the translation.

For stroke patients, the VoiceAdapt project in Germany uses innovative speech-sensitive technology to develop a system that detects and adapts to spoken language deficiencies—the typical signs and symptoms of aphasia. Respeecher expands on the medical applications of voice cloning by helping patients who went through larynx removal recover the quality of their own voice and live a more normal life. And Resemble AI also works with a few independent users who submit audio data to build voice models and TTS for people suffering from ALS.

Cloning the sounds of voices is already here, and new services spring up every week. Great improvements are rushing to market, with upcoming APIs for developers. Someday not long from now, voice cloning will extract, translate, then re-create the entire voice performance as well as the sound of that voice. Automatic lip sync will stream with the audio, and videos will auto-adjust to the timing of the translated voice track.

When that day comes, all of this will occur in near real time, as if by magic. Voice cloning represents a potential watershed for the speech technology industry.

Sue Reager specializes in across-language speech communication, applications and context engines. Her innovations are licensed by Cisco Systems, Intel, media and telecoms worldwide.

Who Will Be Affected by Voice Cloning?

A compendium of the technology’s use cases and beneficiaries

  • International employees will hear their CEO’s natural voice speaking to them in their native tongue.
  • Phone support agents can elect to speak with a different accent.
  • Interpreters at conferences can be made to sound like the conference speakers they are translating.
  • Streaming media will be automatically translated and its voices cloned.
  • Journalists will hear live press conferences in their own tongue, in the sound of the original speaker’s voice.
  • Online meetings will be auto-translated with the speakers’ own voices.
  • Sportscasters will be streamed worldwide with their branded sound in various languages.
  • Cartoons, animation, and video games will have their animated voices inserted as clones from voice banks.
  • Teachers in countries with many national languages will be able to be heard in all languages simultaneously.
  • The United Nations will hear the simultaneous interpretation of each speech with the original voice identity.
  • Entire movies will be translated into more than 100 languages, with each character’s voice faithful to the original actor.
  • University professors will be able to lecture worldwide in native tongues.
  • Museum visitors will hear historic written works turned into audio by the original author’s voice.
  • Languages threatened by extinction will be replicated and preserved.
  • Persons with vocal disabilities will gain new hope of retaining the sound of their own voices.
