2024 State of AI in the Speech Technology Industry: AI Is Revolutionizing Translation, Dubbing, and Subtitling

Bridging the world’s diverse languages and aiding hearing-impaired and speech-impaired individuals through revolutionary technology is no small feat. But artificial intelligence (AI) has become the great equalizer, simplifying communications and making spoken content more comprehensible, accessible, and user-friendly.

Speech translation, which has witnessed notable advancements thanks to AI, now offers real-time capabilities and greater precision across multiple languages, demonstrating an enhanced understanding of contextual nuances and cultural variations.

In dubbing for film and other media, AI can automatically synchronize audio with lip movements, while sophisticated algorithms match the tones and styles of actors’ voices. And AI-driven captioning has significantly elevated the quality and accuracy of subtitles, improving recognition of colloquial speech, slang, and contextual cues.

The landscape for such technologies has undergone a revolutionary transformation with AI, resulting in heightened accuracy, efficiency, and accessibility. Deep learning algorithms have outpaced conventional methods, attaining levels of accuracy that approach human proficiency.

Moreover, AI-powered translation tools have been integrated into a myriad of applications, marking a significant stride toward enhanced linguistic versatility in the digital realm.

“Advances in real-time translation integrated into smartphones make translation more accessible than ever before. Projects such as No Language Left Behind show that efforts are being made to include more languages and dialects, especially low-resource languages, which enhances inclusivity,” says Akash Raj Komarlu, cofounder and chief technology officer of Whispp.

Xuan Zhang, machine learning engineer with Monsters Aliens Robots Zombies (MARZ), an AI-enabled visual effects studio, is wowed by AI-based machine translation systems today.

“This enables wider distribution and accessibility, allowing creators to reach global audiences without requiring dedicated translation teams,” he says. “Also, AI-powered automatic speech recognition tools are transcribing spoken words with impressive accuracy and speed, significantly reducing the time and effort required for content production.”

Concurrently, AI is making subtitling more accurate, efficient, and accessible, while providing contextual information.

“These tools can also adapt subtitles to different screen sizes and adjust the font size and color for optimal readability,” Jaebok Kim, head of research and development at ReadSpeaker, explains. “The development of AI-powered subtitling solutions has made it easier for individuals and businesses to access multilingual content, breaking down language barriers and promoting inclusivity.”

Thanks to improved algorithms, AI now can produce more accurate video-synchronized subtitles and captions almost instantaneously—a feature particularly valuable for live broadcasts and online streaming.

“The models have also gotten better in understanding context and idiomatic expressions. This allows customization in subtitles, such as adjusting speed and style, thereby improving readability. Overall, it has greatly improved accessibility for the deaf and the hearing-impaired community,” Komarlu adds.

Meanwhile, AI is thoroughly modernizing dubbing by automating the process of translating and adapting audio for different languages.

“AI-powered dubbing tools can analyze the original audio and generate natural-sounding translations. This has led to the development of solutions that are faster, more efficient, and more affordable than traditional methods,” Kim notes.

Maurice Kroon, founder and CEO of Vox AI, is equally impressed with how effortlessly AI can automate lip-syncing in dubbing.

“With voice matching, AI algorithms match the tone and style of the original actor’s voice, making the dubbing more authentic and engaging,” he says.

2023 Highlights

Tracing this path to progress over the past year reveals several key events, breakthroughs, and important players.

“Translation was hugely advanced by AI in 2023 thanks to real-time multilingual translation in video conferences improving as well as enhanced contextual translation. AI models began to show a much better understanding of slang and context, providing more accurate translations,” says Damian Edwards, commercial manager at Omnie AI.

Crucial AI translation moments in 2023 included the following:

  • The debut of Meta’s Universal Speech Translator, a prototype for real-time translation and voice cloning across multiple languages. In dynamic live conversations, the AI seamlessly translated each speaker’s words into other languages while simultaneously replicating voice style and achieving lip-syncing.
  • The introduction of Google Translate’s Local Context feature, which considers the local context of phrases and expressions. The result is translations that adapt to regional dialects, slang, and cultural references.
  • Microsoft Translator’s rollout of its “Custom Speech Models” platform, allowing users to train AI models with their own data, encompassing domain-specific terminology and industry jargon.

Komarlu points to other translation high points from last year.

“Spotify’s voice translation feature now translates podcasts into additional languages, in the voice of the podcaster,” he says. “Also, the open-source movement is going strong with more models being open-sourced every day and fine-tuning for specific tasks and creation of new translation products.”

Dubbing tech made great leaps in 2023, too, as evidenced by synthetic voices that can retain cultural and emotional nuances across multiple languages. Edwards credits Flawless AI, Papercup, Descript, Veed, and Eleven Labs in particular for the improvement of AI-powered dubbing and lip-syncing technology last year.

In 2023, AI-enabled real-time subtitling became far more accurate, and tools for automatic subtitle generation and editing improved, making live broadcasts and events more accessible.

“Companies such as Amara were key in these advancements, as were AI-driven platforms integrated within YouTube and other video hosting services,” Edwards says.

In recent months, more services, including Facebook Live’s Automatic Video Captioning and Microsoft Azure Speech Services’ Live Captioning, have harnessed sophisticated speech recognition and AI models. This enables real-time captions with remarkable accuracy during live presentations, lectures, and conferences.

As further proof of AI’s reach and transformative capabilities, streaming platforms have expanded their closed captioning, subtitling, and audio description tools since 2022. For example, Netflix’s “All the Light We Cannot See” miniseries features some of the most detailed audio narration in the platform’s history. The AI analyzes on-screen elements like characters, actions, and settings, tailoring descriptions to individual preferences and providing a more immersive experience.

Embracing AI-Powered Translation, Dubbing, and Subtitling

The global AI translation market is expected to grow at a compound annual growth rate (CAGR) of 16.8 percent through 2026, when it will reach $12.4 billion, according to Verified Market Reports. Business Research Insights expects the global captioning and subtitling solutions market to reach $476.9 million by 2028, up from $282 million in 2021, a CAGR of 7.7 percent. Additionally, the AI dubbing market, valued at $2.5 billion three years ago, is forecast to grow by 7.8 percent by 2028, with the entertainment and e-learning sectors benefiting significantly.
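For readers who want to check growth figures like these, a CAGR is simply the constant yearly rate that carries a starting value to an ending value. The short Python sketch below recomputes the rate implied by the captioning and subtitling forecast. The function names are illustrative, and treating 2021 to 2028 as seven compounding periods is an assumption, since the forecast's exact methodology is not stated here. Under that assumption the result comes out at roughly 7.8 percent, close to the reported 7.7 percent.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Return the compound annual growth rate as a fraction (0.08 means 8 percent)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: int) -> float:
    """Grow start_value at a constant annual rate for the given number of years."""
    return start_value * (1 + rate) ** years


# Captioning and subtitling figures quoted above, in millions of dollars.
# Treating 2021 to 2028 as seven compounding periods is an assumption.
captioning_rate = cagr(282.0, 476.9, 7)
print(f"Implied CAGR: {captioning_rate:.1%}")
```

The same `project` helper can run the calculation in the other direction, turning a starting market size and a quoted annual rate into a projected figure for a later year.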

“In particular, translation and voice cloning are expected to grow massively due to their increased use within healthcare, media, and education and because of the increased demand for accessibility requirements and more assistive technologies,” Edwards says.

The embrace of these tools and technology advances has been quicker than expected, many experts agree, although improvements are still needed.

“AI translation tools have become indispensable in global communication, and adoption is widespread. But context-specific translations remain a challenge, especially for less common languages,” Kroon says. “AI’s role in dubbing is growing, particularly in automating lip-syncing and voice matching; however, the difficulty lies in preserving the original emotion and cultural nuances in the dubbed version, something that AI is still catching up to. AI has significantly sped up the subtitling process, as well, but the adoption faces hurdles in terms of accuracy.”

And there are other kinks in need of smoothing.

“Although AI is now prevalent in online translation services and software applications as well as communication tools, the obstacles relate to accuracy and cultural context,” Edwards points out.

Kroon echoes those concerns.

“Accurately capturing context and cultural nuances remains a tall order for AI translators. There’s also a gap in translation quality for less widely spoken languages that need bridging,” he says.

Kate Kalcevich, head of accessibility innovation at Fable, notes that problems remain when it comes to conversations that don’t take place online.

“In a noisy restaurant, my smartphone can’t effectively filter out background noises and speech from other tables so that I can see a written transcript of conversations at my table,” she says. “From a business perspective, we have to better navigate the needs of people with disabilities to access AI for accommodation purposes and the needs of others for privacy.”

Dubbing drawbacks include matching the actor’s emotional tone.

“That’s why there’s still a preference today for human actors,” Edwards cautions.

Additionally, lip-syncing in multiple languages remains difficult.

“Ensuring dubbed content respects and aligns with the target culture is critical,” Kroon says.

Although AI is being increasingly used in subtitling on streaming platforms and social media, similar challenges with dialects, slang, and cultural sensitivity remain. “It’s difficult for AI to translate humor and cultural references,” Edwards points out.

Above all, AI-powered tools need to achieve higher accuracy and speed before they can fully replace humans, Kim believes. But he also points to the ethical acquisition of large speaker pools as a critical issue.

“Multilingual AI-dubbing technology requires a huge amount of speech data from a large population of speakers. Many players in the segment may feel tempted to acquire them in a faster or easier manner, which may compromise ethical ways.”

What Lies Ahead

Notwithstanding these deficiencies, plenty of pros expect continued rapid progress and implementation of AI-powered translation, dubbing, and subtitling technology.

“Adoption will increase as businesses and the consumption of video content become more global, which increases the need for accessibility,” Edwards predicts. “The focus will be on creating more context-aware translation models with better emotion and cultural subtleties, as well as AI models that can capture the tone and context along with the speed of words in real-time subtitling, which will be used even more in live events and social media platforms.”

Pawel Budzianowski, head of machine learning at PolyAI, is convinced these technologies will grow rapidly, particularly because most call center traffic will be powered by generative models that require instant voice generation.

“Models will truly understand the context of the generation leading to adaptation across a wider range of markets,” he states. “I also envision most marketing campaigns and the film industry starting to use voice dubbing and translation as a default option for localization in the next 12 months.”

AI should become increasingly adept at understanding context and cultural nuances in the near future, Kroon maintains.

“Dubbing improvements could include more accurate voice matching and emotion replication,” he says. “In addition, we should see advancements in real-time subtitling capabilities and improved accuracy. The overall theme will be more capable and faster. And translation tools could integrate more seamlessly into real-time communication platforms.”

Expect AI-driven resources to further facilitate communication across borders and cultures, breaking down language barriers and promoting global understanding, Kim says.

“These tools will also be used to make foreign films and TV shows more accessible to audiences around the world, including people with hearing impairments, allowing them to experience the full content of videos,” Kim continues.

Kalcevich finds the latter prediction most encouraging. “I’m most excited for AI to bring live transcripts to in-person conversations and live entertainment,” she says. “Whether displayed on a smartphone, smart glasses, projector, or even digital contact lenses, AI-based live transcriptions can break down barriers for people with hearing disabilities.” 

Erik J. Martin is a Chicago area-based freelance writer and public relations expert whose articles have been featured in AARP The Magazine, Reader’s Digest, The Costco Connection, and other publications. He often writes on topics related to real estate, business, technology, healthcare, insurance, and entertainment. He also publishes several blogs, including martinspiration.com and cineversegroup.com.
