The 2022 State of Speech Engines


Last year was a whirlwind for speech engine technology. The industry saw plenty of innovation and market growth. Yet impediments remain, including the continuing COVID-19 pandemic and tech limitations that can lead to user frustration.

“The prevailing theme of speech technology engines to date is innovation,” says Sejal Amin, chief technology officer of Khoros. “Over the past decade, speech technology has seen the advent of many new social media platforms, and advanced technological capabilities like artificial intelligence and natural language processing have increased its versatility and scale.”

Rutuja Ubale, a research engineer at ETS AI Research Labs, says Google, Amazon, IBM, and Microsoft continue to dominate the space and keep improving their APIs for speech-to-text, automated speech recognition (ASR), text-to-speech (TTS), dialogue management, and natural language understanding (NLU) for chatbots, translation, and more.

“These APIs are being increasingly used by several companies and especially startups in their early development stages to design speech-based apps to meet different user needs when they don’t have the resources to build in-house technology,” Ubale notes.

According to Daniel Ziv, vice president of speech and text analytics at Verint, speech engine evolution is accelerating as consumers become accustomed to speech as a natural interface thanks to voice interfaces like Alexa and Siri.

“Investments in speech engine technology and data collection to help tune and optimize these engines are taking place at some of the biggest companies in the world as well as within the startup community. It’s a hot market, with innovation growing rapidly and new use cases being forged around voice, data, sentiment, and intent,” Ziv says.

Voice assistants in mobile apps seem to be the hottest trend right now—a force that has permeated daily life for virtually everyone.

“Far-field ASR has expanded voice assistants’ functionality for smart TVs and smart displays,” Ubale says. “I’m also particularly excited about the expansion of speech capabilities to the fields of education and health care.”

Among other developments in voice in 2021 were monetization, voice shopping, and new voice-enabled devices.

“We’ve seen voice assistants expand across industries, with more companies realizing the benefits of voice AI technology and seeking omnichannel experiences for their customers. Brands are also starting to consider important aspects for their voice assistants, such as ethics, gender, accents, and cultural biases,” explains Michael Zagorsek, chief operating officer of SoundHound.

Hamid Nawab, co-founder and chief scientist of Yobe, is particularly impressed by the progress speech engines have made in language understanding, achieving above 90 percent accuracy in noise-free environments.

“They are incredibly effective and robust thanks in large part to the work done in natural language processing,” Nawab says.

Year in Review

Last year saw a number of prominent developments:

  • Microsoft bought Nuance Communications.
  • Meta (Facebook) introduced the Generative Spoken Language Model (GSLM), which can learn speech representations from audio without labels or text, allowing speech technology to be more inclusive of languages, improve capabilities with rarer languages, and capture nuances in speech that don’t translate to text.
  • Meta AI also released a large open-source dataset, Multilingual LibriSpeech, consisting of 50,000 hours of speech data for eight languages that can be used to train independent or combined ASR models.
  • Apple launched on-device speech recognition for Siri for simple navigational tasks.
  • Google launched project LaMDA (Language Model for Dialogue Applications).
  • New data was added to the Common Voice dataset that anyone can use to train speech applications.
  • A multilingual version of wav2vec2 was released, called XLSR (cross-lingual speech representations), which is trained in 128 languages.
  • Vosk API released lightweight ASR models for 20 languages that are compatible with their API for real-time speech recognition.

Not surprisingly, the industry is poised for rapid growth. IDC predicts that the worldwide conversational AI software market will expand from $2.2 billion in 2020 to $7.9 billion in 2025 at a compound annual growth rate of 28.8 percent.
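Forecasts like this one pair endpoint figures with a compound annual growth rate, and the two can be checked against each other. As a quick sketch (the dollar figures are IDC's; the helper function here is ours for illustration):

```python
# Sanity-check IDC's forecast: $2.2B (2020) growing to $7.9B (2025).
def implied_cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by two endpoint values."""
    return (end / start) ** (1 / years) - 1

cagr = implied_cagr(2.2, 7.9, 5)
print(f"Implied CAGR: {cagr:.1%}")
```

The endpoints imply a rate of roughly 29 percent; the small gap from IDC's stated 28.8 percent comes from rounding in the published dollar figures ($2.2B compounded at 28.8 percent for five years lands at about $7.8B).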

“Growth in this market continues to be driven by increases in conversational AI, speech-to-text, text-to-speech, machine translation, and stand-alone natural language processing (NLP) software that are being used to create conversational AI solutions and provide conversational capabilities to other types of enterprise software,” Ziv says.

Consider, too, that the TTS market is projected to increase from $1.94 billion in 2020 to $5.61 billion by 2028.

In the field of conversational AI, many are excited about the development of end-to-end spoken language understanding (SLU) systems.

“While previous efforts have aimed at eliminating the need for ASR and going directly from raw speech to intent and slot identification, newer efforts are aimed at incorporating dialogue history to improve understanding in human-machine conversations,” Ubale says. “While several companies are already working on deploying ASR on-device, recent research from Amazon on fusing ASR and natural language understanding for on-device SLU is also exciting.”

Amin believes the most significant progress is the increasing adoption of multilingual modes as companies advance conversational customer service.

“Turning to multilingual voice assistants leads to greater accessibility and brand reach, allowing for exposure to audiences across new and potentially previously unaccessed markets. Customers are more apt to stay true to a brand that knows their demographics,” Amin says.

Another noteworthy development last year was the extension of core sequence modeling to other domains.

“Researchers showed that the technology underlying current language models can be applied to solve a broad range of reinforcement learning problems,” Phil Steitz, chief technology officer of Nextiva, explains. “We also saw big steps forward in accessibility and ease of implementation across multiple AI/machine learning domains. Open-source frameworks, models, and components have significantly reduced the barrier to entry for teams implementing contemporary AI solutions.”

Effectively filtering background noise and understanding users in noisy environments remains a major difficulty in this space.

“Noise disrupts the speech patterns that are being picked up by the microphone. The ability to remove noise can open the door for interacting with the voice assistant in a variety of environments, such as cars, on the street, or in areas with a lot of background noise,” Zagorsek says.

Nawab calls this the “cocktail party problem.”

“Despite robust natural language understanding capabilities, machine learning has not yet been able to solve this problem, particularly for noisy real-world environments. This is a bottleneck for speech-to-text, conversational AI platforms, and voice assistants,” he says.

Jörg Scherer, director of user experience at Elektrobit, says integrating AI technology has improved recognition performance to acceptable levels.

“However, clear understanding of intent remains a challenge. Therefore, more information associated with context needs to be taken into account, such as preferences of the user, location, and dialogue history, to generate speech dialogue answers by reasoning,” Scherer suggests.

Managing diversity in speech and controlling bias is another sticky wicket that will need attention.

“Current ASR models are now extremely good at clear, slow speech, but they need to get better at picking up different dialects and specialized vocabularies,” Steitz says.

Another challenge is balancing branded TTS experiences with authentic human interactions while ensuring that each message is differentiated for each customer segment.

And then there are growing concerns around voice data and privacy. “Organizations need to operationalize voice data effectively but safeguard against misuse without infringing on user and customer privacy,” Ziv believes.

A Look Ahead

Despite the challenges, forecasts call for exciting developments ahead.

“I foresee AI-driven speech technology increasingly being developed for the betterment of society, specifically for healthcare and education domains,” Ubale says. “Now, most off-the-shelf capabilities provide very limited speech information. But in the future, engineers and scientists will invest more in building underlying capabilities to address specific challenges for users trying to learn a new language, reskilling or upskilling for professional development, and overcoming specific medical needs.”

Zagorsek envisions four categories where innovation will be robust: proactive voice assistants, emotion detection, expanded multilingual and accented language capabilities, and increased monetization.

“In the near future, we will see voice assistants taking a proactive role and providing greater usefulness by collecting information about the context and situation and then taking the initiative to make helpful suggestions and take actions,” Zagorsek says.

Amin predicts that empathy will become even more important as customers continue to seek out human interactions. “Empathy is possible with advanced technologies, like asynchronous messaging and voice of the customer, that better meet customers where they are and provide a white-glove experience.”

Ziv is equally enthusiastic. “I foresee the rise and continued momentum of real-time speech applications, such as real-time agent assist,” he says. “I also see the emergence of immersive human-to-machine speech that interfaces with virtual reality and voice where, for example, keyboards and texting via fingers disappear.” 

Erik J. Martin is a Chicago area-based freelance writer and public relations expert whose articles have been featured in AARP The Magazine, Reader’s Digest, The Costco Connection, and other publications. He often writes on topics related to real estate, business, technology, healthcare, insurance, and entertainment. He also publishes several blogs, including martinspiration.com and cineversegroup.com.
