The State of Speech Engines

The transformation that speech engines have experienced over the past few years has been nothing short of phenomenal, driven in no small part by advances in artificial intelligence and by gains in accuracy, performance, and scale.

Innovations and progress continued in 2020, despite a global pandemic that presented significant challenges to the industry. Now, speech engines and the technologies that drive them appear poised to capitalize on an evolving business and consumer landscape that increasingly prefers and depends on speech.

Abhinav Misra, an associate research scientist in speech and natural language processing at ETS, says the performance of speech recognition engines has improved dramatically since the advent of deep neural networks (DNNs), with more organizations and consumers reaping the benefits.

“As a greater number of people increasingly use voice assistants, businesses are collecting more data to feed these AI algorithms and further improve speech engine performance,” he says.

Judith Bishop, senior director of AI specialists at Appen, agrees.

“Automatic speech recognition engines have become significantly more powerful and versatile. The major engines are now able to recognize a much broader range of voices across age groups, including children’s voices, which are significantly different from adults’,” she says.

Bishop adds that COVID-19 helped spur recent innovation. “The pandemic has put an emphasis on speech recognition in noisy conditions, with noise suppression algorithms to cope with background noise becoming a significant focus over the last year. As a result, speech no longer has to be directed quite as loudly and clearly to devices at close range, and speech can now be detected and understood even in some situations where the speaker is addressing the device indirectly, such as while preparing a meal. These advances further underscore how speech engines are more closely approaching the natural conditions of human communication.”

With vendors like Amazon, Google, and IBM now providing speech services from the cloud, speech engines have also evolved from fixed-grammar models with limited vocabularies to more flexible, open systems.

“Cloud speech services today are more affordable than traditional fixed-grammar models. And the adoption of advanced speech recognition is steadily increasing in contact center environments; more customers can speak to a virtual contact center agent using natural language, resulting in an improved experience,” notes Santosh Kulkarni, vice president of product at Inference Solutions, which was recently acquired by Five9.
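The shift Kulkarni describes can be sketched in miniature. The snippet below is purely illustrative and not any vendor's API: it contrasts a traditional fixed-grammar IVR, which accepts only utterances from a closed phrase list, with a natural-language matcher that scores free-form speech against intents. The phrases, intents, and keywords are all invented for the example.

```python
# Hypothetical closed grammar for a traditional IVR: only these exact
# phrases are recognized.
FIXED_GRAMMAR = {"check balance", "pay bill", "speak to agent"}

# Hypothetical intents for a natural-language system, each with trigger words.
INTENT_KEYWORDS = {
    "check_balance": {"balance", "much", "account", "owe"},
    "pay_bill": {"pay", "bill", "payment"},
    "escalate": {"agent", "person", "human", "representative"},
}

def fixed_grammar_match(utterance):
    """Accept only exact phrases from the closed grammar."""
    u = utterance.lower().strip()
    return u if u in FIXED_GRAMMAR else None

def natural_language_match(utterance):
    """Pick the intent whose keywords overlap the utterance most."""
    words = set(utterance.lower().split())
    scores = {intent: len(words & kws) for intent, kws in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

caller = "how much do I owe on my account"
print(fixed_grammar_match(caller))     # None: the phrase is not in the grammar
print(natural_language_match(caller))  # check_balance
```

The same free-form sentence that a fixed grammar rejects outright is routed to the right intent by the open system, which is why callers experience natural-language virtual agents as far less brittle.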

Nuance Communications, one of the leaders in this space, can testify to how the pace of progress has quickened—particularly in healthcare.

“Deep learning technology has rapidly transformed the way computers perform speech recognition. It has enabled us to build speech recognizers for very challenging applications, such as transcribing conversations between doctors and patients,” says Felix Weninger, principal research scientist at Nuance.

Many enterprises are also taking advantage of cutting-edge speech tech’s ability to augment customer conversations.

“We are seeing a shift away from brittle, command-based interactive voice response systems to natural interactive virtual agents that can handle multistep customer inquiries entirely using software,” says Evan Macmillan, CEO of Gridspace. “We are also seeing closed-loop speech systems that learn from past conversations and make agents more helpful and productive. The amount of real-time speech audio being handled by these speech systems continues to go up as well, opening up the possibility of even more sophisticated voice interfaces and delivery models.”

Year in Review

Several key developments unfolded in 2020 that impacted speech engines and related technologies.

“In 2020, we saw an increased acceptance of cloud-based speech engine solutions. Organizations that were dependent on on-premises models were forced to rethink their approaches and investments,” says Daniel Ziv, vice president of speech and text analytics at Verint Systems. “Leveraging speech analytics in the cloud allows organizations to get up and running quickly while providing an elastic and secure usage model that offers an attractive subscription financial model as well.”

The past year also saw the accelerated adoption of speech engines in response to COVID-19 and the need for contactless service.

“2020 brought a new urgency to increasing voice-assisted quick-service restaurant drive-thru efficiency,” Bishop says.

The emergence of end-to-end speech recognition engines, thanks to more aggressive research from Google, Facebook, Microsoft, and others, captured ample attention, too.

“Employing a single deep neural network to directly convert an audio signal to a letter is very enticing, as it removes many complexities,” Misra notes.
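The complexity reduction Misra points to shows up most clearly at decoding time. In a CTC-style end-to-end recognizer, the network emits one symbol per audio frame, and letters are recovered simply by collapsing repeated symbols and dropping a special "blank" token. The per-frame labels below are hand-made for illustration, not real model output.

```python
BLANK = "-"  # the CTC blank token, emitted between and within letters

def ctc_greedy_decode(frame_labels):
    """Collapse consecutive duplicate symbols, then remove blanks."""
    out = []
    prev = None
    for sym in frame_labels:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

# Per-frame argmax symbols for a hypothetical utterance of "hello":
frames = ["h", "h", "-", "e", "-", "l", "l", "-", "l", "o", "o"]
print(ctc_greedy_decode(frames))  # hello
```

Note how the blank between the two "l" runs lets the decoder keep both letters; without it, the repeated "l" would collapse into one. Handling such details inside a single network is what removes the separate acoustic, pronunciation, and language-model components of older pipelines.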

More businesses harnessed conversational AI and launched intelligent virtual agents (IVAs) in 2020.

“Today, using the latest code-free IVA development platforms, companies can build IVAs powered by the same natural language processing technology as consumer smart speakers in minutes and deploy them in their customer contact centers within days or weeks,” Kulkarni says.

Branded text-to-speech voices became more popular over the past 12 months as the coronavirus pandemic accelerated the boom in AI-backed voice assistants and conversational systems.

“To help separate themselves from the competition, brands also began experimenting with different text-to-speech speaking styles, including emotive voice in place of the robotic voice commonly found in voice assistants today,” says Niclas Bergstrom, chief technology officer at ReadSpeaker.

Technology providers improved the ease of code integration in their software development kits (SDKs). Amazon, for example, introduced dual-lingual modes and more translation modes in its Alexa SDK.

Interest in voice cloning ramped up too. “Voice cloning allows developers to extract specific characteristics of a target voice, such as tone, and apply them to waveforms of a different speech,” Bergstrom explains.
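Bergstrom's description can be illustrated with a deliberately simplified sketch. Real voice cloning operates on learned speaker embeddings, but the underlying idea of measuring a characteristic of a target voice and re-mapping another speaker's audio to match can be shown with a pitch contour: the values below are invented, and the mean-and-spread transfer is a toy stand-in for what production systems do.

```python
import statistics

def transfer_pitch(source_contour, target_contour):
    """Shift and scale the source pitch track to match the target's
    mean and spread (a toy stand-in for voice-characteristic transfer)."""
    s_mean, s_sd = statistics.mean(source_contour), statistics.pstdev(source_contour)
    t_mean, t_sd = statistics.mean(target_contour), statistics.pstdev(target_contour)
    scale = t_sd / s_sd if s_sd else 1.0
    return [t_mean + (f - s_mean) * scale for f in source_contour]

source = [200.0, 210.0, 190.0, 205.0]  # Hz, invented source-speaker pitch
target = [120.0, 130.0, 110.0, 125.0]  # Hz, invented target-speaker pitch
converted = transfer_pitch(source, target)
print(converted)  # contour now centered on the target speaker's mean pitch
```

The converted contour keeps the shape of the source speaker's delivery while taking on the target voice's register, which is the essence of what Bergstrom describes at a much higher level of sophistication.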

A Look Ahead

The future looks bright to many speech engine experts, although several challenges will need to be overcome.

“Many capabilities will be widely implemented to help organizations better support the work-from-home model while continuing to provide a positive customer experience,” Ziv predicts. “These capabilities include leveraging speech analytics insights to optimize the effectiveness of self-serve channels to provide exceptional service at a lower cost.”

Analytics can help, for example, identify the reasons customers call and ways to continuously improve customer engagement.

Volker Springer, senior expert at Elektrobit, foresees better dialogue context tracking ahead.

“Systems will better understand the semantics of a sentence and more accurately match it to the user’s environment, which will minimize listener fatigue. And systems will allow more complex sentences and intent,” he says.

Expect conversational agents to assist, augment, and automate more voice interactions over the next year, insists Macmillan.

“We could easily go from 2 percent to 50 percent of voice inquiries being handled by conversational speech technologies for some large healthcare and financial services players,” he says.

Scott Stephenson, cofounder and CEO of Deepgram, anticipates more dollars being allocated toward voice-enabled experiences for both agents and customers this year.

“At the same time, software providers will aggressively fund speech-related product developments to break through the noise and try to become the next big player in the customer experience technology space,” Stephenson adds.

Paralinguistic voice interfaces that better measure what users say and how they say it will also improve, according to Bergstrom.

“This will be important in combating the other innovation we’ll start to see in the coming years, which is a bigger focus on emotional text-to-speech,” he says. “The quality of voice is already there, but voice providers will need to make emotional voice offerings a priority so they can provide a better customer experience.” 

Erik J. Martin is a Chicago area-based freelance writer and public relations expert whose articles have been featured in AARP The Magazine, Reader’s Digest, The Costco Connection, and other publications. He often writes on topics related to real estate, business, technology, healthcare, insurance, and entertainment. He also publishes several blogs, including martinspiration.com and cineversegroup.com.