2024 State of AI in the Speech Technology Industry: AI Is Enabling Audiovisual Enhancements
With astounding speed, artificial intelligence (AI) has permeated many businesses and industries, but its encroachment into video and voice creation has been particularly remarkable, providing ample opportunities for players in this field to succeed in more efficient, economical, and impressive ways.
Video editors powered by AI today can efficiently analyze and process footage, identify key moments, and automatically generate highlights. AI also plays a crucial role in intricate editing tasks such as color grading and transitions. Additionally, AI-driven tools empower creators to produce distinctive and captivating visual effects.
AI has reshaped the creation of voices as well, ushering in an era of realistic and expressive synthetic speech that is now used widely for customer service chatbots, voice-overs, virtual assistants, and audiobooks. Deep learning algorithms, trained on vast datasets of human speech, can now generate voices that are virtually indistinguishable from their human counterparts. This has expanded possibilities across domains like voice-over work, animation, entertainment, and personalized communication.
Make no mistake: AI has thoroughly transformed the way we create and access voice and video content, offering numerous benefits in efficiency, accessibility, and global reach and paving the way for a more inclusive and multilingual media landscape.
“AI synthesizers can generate natural-sounding voices that accurately represent the original language or translate it into another. This opens up possibilities for things like multilingual voice-overs, dubbing, and audio descriptions, enhancing accessibility and user engagement,” says Xuan Zhang, machine learning engineer with Monsters Aliens Robots Zombies (MARZ), an AI-enabled visual effects studio. “Ponder, too, how AI algorithms are being developed to generate video content directly from text descriptions or scripts. This technology is still in its early stages, but it holds potential for automating simple animations, explainer videos, and personalized video greetings.”
Maurice Kroon, CEO and founder of Vox AI, is especially wowed by how much more realistic and human-sounding AI-created voices have become, “not just in terms of speech clarity but also in inflection, emotion, and tone variability.”
In 2024, AI-driven tools can enable customization of voice attributes, including pitch, tone, speed, and accent, helping create more diverse and inclusive voice content used by, for instance, virtual assistants and smart speakers. State-of-the-art systems such as Microsoft’s VALL-E and Coqui’s XTTS and YourTTS can produce hyper-realistic speech that’s virtually indistinguishable from human speech.
“Today’s market offers increasingly advanced voices, enabling us to design virtual assistants that converse in natural language with a very human voice for maximum user experience. A year ago, high-quality voices were costly; today, they are not only affordable but also allow easy cloning of your own human voice,” explains Katya Lainé, cofounder and CEO of TALKR.ai.
By adjusting pitch, accent, style, and characteristics such as timbre, new humanlike voices can now be designed according to one’s needs or preferences, says Akash Raj Komarlu, cofounder and chief technology officer of Whispp.
“And text-to-video generation is getting more popular with applications such as realistic face-swapping, lip-syncing, and content creation,” he continues.
On the video creation front, there are more powerful AI tools available as well.
Damian Edwards, commercial manager for Omnie AI, agrees. “AI algorithms can now edit videos, which massively reduces the manual effort required, by applying effects or trimming down videos to select the most relevant content. Artificial intelligence also allows the creation of generative media, meaning that realistic video content can now be altered or amalgamated. What’s more, improving video quality, restoring dated videos, and colorizing black-and-white footage is now possible through AI,” Edwards says.
As we move into the new year, experts agree that the previous one was a watershed for AI-powered voice and video creation. Last year began with a bang when VALL-E introduced language modeling for text-to-speech (TTS) synthesis; now, given as little as a three-second sample of a speaker’s voice, VALL-E can generate personalized speech from text while preserving the speaker’s emotion and acoustic environment.
“AI-powered TTS technology has reached a new level of realism, with voices that are indistinguishable from human counterparts,” Jaebok Kim, head of research and development at ReadSpeaker, says. “Generative TTS based on diffusion and generative adversarial network models can generate voices with a wide range of accents, emotions, and personalities.”
“We saw huge improvements in emotional intelligence in 2023, with AI systems able to better capture and replicate human emotions in personalized voice creation for media and voice assistants. Companies like Descript and tools such as Murf AI were key in these advancements,” Edwards adds.
Thanks to prime players like Adobe, DeepBrain AI, Frame.io, and Synthesia, AI was increasingly used in film production, broadcast news, special effects, and scriptwriting last year. Weta Digital and Industrial Light & Magic, meanwhile, unveiled AI-driven visual effects capable of creating hyper-realistic environments and characters, significantly elevating the quality of visual storytelling.
In the past year, actor Val Kilmer, who lost his natural voice due to throat cancer, was able to reprise his iconic role in the film Top Gun: Maverick using AI-generated speech developed by Sonantic. ReadSpeaker applied its AI-driven custom branded voice technology to clone the voice of actor Giancarlo Esposito for Sonos Voice Control. “Will Smith Eating Spaghetti” went viral; it was created with ModelScope’s text-to-video model and demonstrated the humor and absurdity achievable with text-to-video generation. Waymark’s short film The Frost, created using OpenAI’s DALL-E 2 image generation technology, showcased the potential of AI to create visually compelling narratives without human intervention. And Amazon’s Audible and Google’s Play Books put in motion the ability to create audiobooks narrated by AI voices.
Expect the global AI video generator market to balloon in the near future. Grand View Research valued it at $472.9 million in 2022 and projected a compound annual growth rate (CAGR) of 19.7 percent through 2030.
“These numbers are driven by the increasing adoption of video within marketing strategies that now use AI-powered video tools,” Edwards says. “In 2022, the average online video viewing time was approximately 100 minutes, showing just why there is such growth in the usage of AI within video production and the significance of AI in enhancing the quality, efficiency, and personalization of video content within marketing and media production.”
Meanwhile, Market.us expects the market for AI voice generators to reach $4.9 billion by 2032, expanding at a CAGR of 15.4 percent.
“The adoption of voice AI, especially for customer service and virtual assistants, is booming,” Kroon notes.
Roadblocks to Clear
However, the path to progress in AI-driven voice and video creation requires navigating past several bumps in the road. Near- and long-term challenges identified by Zhang include the following:
- Technology limitations around accuracy, bias, and lack of emotional expression.
- Cost and accessibility, as advanced AI tools can be expensive, especially for smaller players and individual creators.
- Ethical considerations regarding data privacy, potential misuse of deepfakes, and the impact on voice actor employment.
- The ongoing need for human touch and creative control.
“For video creation and animation specifically, some challenges that need to be addressed are quality and authenticity,” cautions Kevin He, founder and CEO of DeepMotion AI. “Currently, all AI-generated animation content requires human touch for a fully polished result.”
Regarding data privacy and security, many companies are leveraging publicly available data, whether it’s copyrighted or not, which is a sticky wicket that can trigger lawsuits. Ethical concerns related to voice cloning, consent, and deepfakes persist as well, especially relating to famous or deceased people.
In addition, getting AI systems to accurately recognize and reproduce a wide range of accents and dialects remains a hurdle.
“AI voice and video generation should also ensure representation across a diverse range of accents, languages, and dialects to avoid bias,” Komarlu advises.
He also urges businesses to make sure their technology has security by design to address potential misuse of voice cloning.
“This is primarily a data play. We need to get access to the people’s voices to train our systems,” Kroon says. “Given that voice data is highly personal, ensuring user privacy and data security is critical. People are still worried that their information is being used as training data and exposed to other people later.”
There are computational resource limitations to fret about, too, “as running these advanced AI algorithms can be a challenge for smaller organizations,” he maintains. “The more cost-efficient we can make this, the less chance of a monopoly there will be and the more creativity we will see from smaller developers in the space.”
Kim adds that AI-fueled video creation tools “also need to improve accuracy and efficiency to be considered a mainstream alternative to manual editing.”
Forecasting the Future
As the market for AI solutions expands, there will be a heightened need for personalized user experiences and streamlined content creation and consumption.
“That’s why, especially as AI continues to disrupt in areas like content creation, we can expect a larger investment in AI research and startups in the coming years,” Komarlu says. “I also anticipate that new product categories will be created and personalized advertising and content will be produced, and enhanced translation and voice/video creation will make educational content more accessible to different audiences.”
Edwards also envisions big things in the not-too-distant future.
“The markets for voice and video creation are both going to expand dramatically due to the rising popularity of AI-generated voices in areas such as customer service and entertainment and the increased popularity of video content in fields like education and marketing,” he says. “Anticipate large technological innovations in video and voice creation as companies work to create more realistic and emotive AI videos and voices.”
The biggest growth is likely to come from industries not currently using voice AI, such as small businesses, restaurants, and call centers, Kroon says.
“In voice creation, I expect to see even more lifelike and expressive AI voices, potentially even personalized voice clones. Voice AI could also extend into areas like mental health for sentiment analysis,” he adds. “For video creation, AI might bring advancements in automated editing and perhaps deeper exploration into ethical uses of deepfake technologies. Video AI might find new roles in personalized advertising and content, too.”
Kim agrees, predicting more AI-customized videos on platforms like YouTube as well as a boom in AI-generated documentaries, while “AI-powered voices will be used in a wider range of applications, including e-learning.”
Count, too, on AI-created video continuing to make inroads in augmented reality and virtual reality, “offering a wider range of immersive experiences, the likes of which we are just now scratching the surface,” He says. “Also, there will continue to be a surge in startups exploring novel solutions to bring to market.”
Erik J. Martin is a Chicago area-based freelance writer and public relations expert whose articles have been featured in AARP The Magazine, Reader’s Digest, The Costco Connection, and other publications. He often writes on topics related to real estate, business, technology, healthcare, insurance, and entertainment. He also publishes several blogs, including martinspiration.com and cineversegroup.com.