The Future of Dubbing: Automatic Media Voice Translation

Dubbing, also known as dialogue replacement, means the foreign-language adaptation of a movie or show using voice rather than subtitles. There’s more to the adaptation than that; once the dialogue is replaced, all other sounds—everything from footsteps to explosions, known collectively as Foley effects—must be added artificially, too.

Audiences generally prefer listening to a movie rather than reading text subtitles if the movie is in the viewer’s language because all of the elements sync up: The lips match the audio, no background sounds are lost, the mix of music and effects is perfect, and the viewer understands every word. Software companies have tried to match this perfect synchronization in another language yet have never achieved it: Sometimes the lips do not match, or the new translated audio is longer than the original audio, or when a voice is replaced the background sounds are lost. This is all about to change. Speech technology and artificial intelligence are ready to step in to revolutionize the world of automatic media translation.

The media industry is in enormous flux, and this year is the perfect time to tackle new approaches to generating new audiences. After the pandemic ceased, media audiences on every front dropped drastically. And during the pandemic itself, several booming production companies and related businesses went bankrupt due to forced inactivity.

Until now, good voice-to-voice media translation using automation has not been feasible. Invariably, at least one part of the automation process has been either missing or of unacceptable quality for this purpose. Thus, some 3 billion potential viewers are often ignored by content distributors because of the cost of human translation. A source of frustration for producers and distributors is the “translation black market” that has sprung up because of that lack of translation support for certain languages. Members of this black market create error-ridden subtitles themselves, then pass around their own translations on Telegram and Facebook. It is a persistent threat to content creators and streamers alike, as no payment reaches them.

The role of automatic media translation will be to attract new audiences that do not watch legitimately streamed programs for lack of access to translation, and so to generate new revenue for content creators, producers, and streamers. Netflix, for example, has made major moves to attract global audiences, both by creating new, specifically multiracial content and by subtitling and dubbing translations into other languages. Voice dubbing using human actors, when done correctly, is an extremely expensive proposition, and that expense rules out voice translations for smaller-audience languages such as Greek, Japanese, Korean, Armenian, Finnish, and another 100-plus languages, not to mention native local languages like Setswana or Cherokee. These lesser-used languages open the window for automatic media translation, now possible at low cost with high quality.

Automatic Media Translation

Automatic media translation (especially voice to voice) will explode within the next two years. The APIs required are being developed around the world and will be available from multiple sources. The time to become involved in this is now.

Media translation is an investment that will not fail. At absolute worst, investors will get their money back for any development funds invested, whatever the dollar amount. They will be first to use all the newest AI speech technologies to create shockingly good automatic voice-to-voice translations of streaming video/movie content, as well as of corporate videos, webinars, educational lectures, and conference speeches.

Global audiences are waiting to see and hear all broadcast programming in their own language. Now, finally, all of the pieces (APIs) are available to make automatic media voice translation a viable option that is fulfilling both visually and emotionally in some 200 languages.

Why Now?

AI has changed everything. Your author was a director of professional media translation from the ’70s to the ’90s in Europe and the United States. At that time all media translation involved 100 percent human intervention (human translators, script adaptors, voice actors, vocal directors, studios) and a basketful of money for each language. This year, the final vital pieces of the puzzle are falling into place that will enable good, inexpensive voice-to-voice translation of media content. The formerly missing pieces now available include these:

Automatic Captioning

Automatic captioning (via automatic speech recognition) has almost reached its zenith; the results from some companies are astounding. If an original voice audio narration is clear and separate from music and effects, then 100 percent automatic captioning is now possible, with certain exceptions. Importantly, upgrading captions through human intervention (human review of auto-captions) is affordable and even standard today. A good caption file results in a good translation. There is much competition in automatic subtitling but little or none in automatic revoicing of translations.

Emotive Text-to-Speech (TTS)

The improvements in TTS are phenomenal. Some companies are now concentrating heavily on developing TTS that will transfer the emotional characteristics of an original language recording to a translation recording. But even before that approach is ready, the new AI-powered TTS is so quick to create that a TTS script can be recorded by a voice actor in several separate performances—i.e., happy, sad, laughing, angry—then the appropriate TTS version selected to match the emotions of the original script’s text.
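The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ScriptLine type, the pick_take function, the emotion labels, and the file names are all hypothetical, and the sketch assumes each translated line has already been tagged with the emotion of the original performance.

```python
from dataclasses import dataclass


@dataclass
class ScriptLine:
    """One translated line, tagged with the emotion of the original (hypothetical type)."""
    line_id: str
    emotion: str


def pick_take(line: ScriptLine, takes: dict[str, dict[str, str]],
              fallback: str = "neutral") -> str:
    """Return the audio file whose performance matches the line's emotion.

    `takes` maps line_id -> {emotion label -> audio file}. If no take was
    rendered for the tagged emotion, the fallback performance is used.
    """
    options = takes.get(line.line_id, {})
    if line.emotion in options:
        return options[line.emotion]
    if fallback in options:
        return options[fallback]
    raise KeyError(f"no usable take for line {line.line_id!r}")


# Illustrative data: file names are made up for this sketch.
takes = {
    "scene1_line4": {
        "neutral": "s1l4_neutral.wav",
        "happy": "s1l4_happy.wav",
        "angry": "s1l4_angry.wav",
    },
}

chosen = pick_take(ScriptLine("scene1_line4", "angry"), takes)
missing = pick_take(ScriptLine("scene1_line4", "sad"), takes)
```

Here the "angry" line gets the angry take, while the "sad" line, for which no take exists, falls back to the neutral one. A real system would also need a way to detect the emotion in the original recording, which this sketch takes as given.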

Speed Adjustment

Translations are generally 15 percent to 20 percent longer than the original, so the difference in audio recording length between an original and a TTS translation has always been a major impediment for media professionals. Until recently, human intervention was required to adapt a script so that the translated recording could align sentence by sentence with the length of the original, as well as with onscreen lip movements. Dubbing voice artists recorded the dialogue in synchronization by watching the movie, listening to the original audio in one ear and their own voice in the other; then studios had to remix the entire audio track to include music and effects. Over the decades there have been numerous attempts to adjust the audio track’s speed to fit the longer translation timings, resulting in boringly slow audio or Alvin and the Chipmunks high-speed chirping. Now, finally, companies like VideoLocalize have developed excellent code that combines a balanced adjustment of both audio speed and image speed, thereby changing the original video image to fit the timing of the new translated TTS in a pleasant and natural manner, undetectable by general audiences.
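To make the idea of a balanced adjustment concrete, here is a minimal arithmetic sketch in Python. It is not VideoLocalize's method, which the article does not describe; it simply assumes the mismatch is split evenly (geometrically) between speeding up the translated audio and stretching the video. The function name and the returned tuple are illustrative.

```python
import math


def balanced_retiming(original_s: float, translated_s: float) -> tuple[float, float, float]:
    """Split a duration mismatch evenly between audio and video.

    Returns (audio_speed, video_stretch, final_duration_s), where
    audio_speed   > 1 means the translated audio is played faster,
    video_stretch > 1 means the video segment is slowed to last longer.
    Both durations end up equal to final_duration_s.
    """
    if original_s <= 0 or translated_s <= 0:
        raise ValueError("durations must be positive")
    ratio = translated_s / original_s      # e.g. about 1.15 to 1.20 for many translations
    audio_speed = math.sqrt(ratio)         # audio absorbs half of the mismatch
    video_stretch = ratio / audio_speed    # video absorbs the other half
    final_duration_s = original_s * video_stretch
    return audio_speed, video_stretch, final_duration_s


# A 10-second sentence whose translation runs 12.1 seconds (21 percent longer).
audio_speed, video_stretch, final_s = balanced_retiming(10.0, 12.1)
```

In this example each side changes by about 10 percent, and both tracks end at roughly 11 seconds, instead of the audio alone being sped up by 21 percent. A production system would presumably also cap each factor and preserve pitch when changing audio speed; those details are left out of this sketch.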

Automatic Lip Sync

With this feature, the image of the onscreen character’s lips adjusts to the new language, avoiding the Japanese-horror-film syndrome of past decades in which the movements of the onscreen actor’s lips did not match the sound of the translation voice. This software needs improvement to follow the lips of actors in profile, but when they’re facing straight on, the results are remarkable.

Voice Cloning

With the addition of voice cloning that revamps TTS translated audio to sound like another voice, the sound of the original actor can be siphoned from the source voice track and converted into other languages. In upcoming years, of course, there will be many discussions on the ethics and legalities related to this automatic conversion of a translated voice to sound like the original movie actor, and specialized voice banks will be created specifically for use in automatic media translations.

Sue Reager specializes in cross-language speech communication, applications, and context engines. Her innovations are licensed by Cisco Systems, Intel, and telecoms worldwide. For the prior 20-plus years, Reager was responsible for translating television shows, movies, cartoons, documentaries, and corporate videos in Europe, Africa, South America, and the United States.