The 2023 State of Artificial Intelligence

It’s hardly a surprise that 2022 was a pivotal year for speech technology, especially considering the continued explosion of artificial intelligence in this space.

“AI has had a significant impact on speech technology to date. Advancements in AI, particularly in the areas of deep learning and natural language processing (NLP), have led to the development of highly accurate speech recognition systems,” says Mady Mantha, chief technology officer and cofounder of Happypillar, a provider of mental health support for families. “Large-scale language models coupled with infrastructure that’s allowed for the collection and processing of large amounts of data has advanced the accuracy of speech-to-text and text-to-speech systems in the past few years. These systems can now transcribe spoken language into text with a high degree of accuracy and can also be increasingly used for tasks such as voice-controlled virtual assistants, automated call centers, and speech-to-text dictation software.”

Year in Review

The massive surge of data from smartphones and handheld devices has heightened expectations, believes Akash Raj, chief technology officer and cofounder of Whispp, creators of speech technology for those with voice disorders or stutters.

“With sophisticated techniques developed using AI-based principles, it is now practically possible to handle such volumes of data to derive solutions for trivial and nontrivial problems pertaining to different modalities,” Raj says. “New-era AI technologies, boosted with apt hardware support, are handling multiple languages and dialects, dynamic background noise, different emotional states, and much more, for typical and atypical speakers. It has further enabled the integration of speech with other domains, such as healthcare, video analytics, language processing, robotics, and many more.”

AI advancements have even helped speech reclaim its importance as the prime mode of communication, Raj insists.

Amitha Pulijala, vice president of product, AI, video, and platform services at Vonage, agrees that AI’s influence on speech continues to expand.

“Speech recognition AI today uses NLP more effectively to better understand customers and respond conversationally,” Pulijala notes. “The technology powers everything from interactive voice response to virtual assistants and chatbots to call transcription. Speech recognition can also be used as a security measure to verify customer identity and approve access to account information and self-service options.”

But Neil Sahota, CEO of ACSI Labs and AI adviser to the United Nations, says that while we’ve taken a few important steps, much work remains.

“In using speech technology, we have trained ourselves to speak more slowly and really enunciate. That has really improved the effectiveness. However, we need to move toward precision speech recognition, much like we’re leveraging AI to move into precision medicine. Each person has a unique accent and way of pronouncing words. This is beyond regional dialects and local slang. Until machines understand how each person talks, we’ve still got work to do,” he cautions.

Along the same lines, 2022 saw more generative applications for large language models (LLMs), including PaLM, LaMDA, Megatron-Turing NLG, Chinchilla, and ChatGPT.

The many innovations on this front include these:

The open-source movement gained momentum, with OpenAI’s release of Whisper, an automatic speech recognition (ASR) system that transcribes in multiple languages and translates those languages into English.
Adoption of speech technology grew with the proliferation of cloud-based and open-source ASR tools such as Deepgram, Google ASR, Kaldi, and Mozilla DeepSpeech.
On-device inference for speech and voice cloning gained traction.
Advances were made in diffusion-based speech systems, vocoding, and speech synthesis.
Voice cloning became more popular.
Self-supervised learning became the standard for speech technology tasks, including speech-to-text, speaker recognition, language recognition, and text-to-speech.
Most commercial speech recognition systems were built on transformer-based models, which learn context and derive meaning by tracking the relationships between elements of sequential data.
Generative AI emerged and spread to the mainstream.
Multimodal speech emotion recognition improved.

Roughly 90 percent of organizations are already using, currently implementing, or prioritizing AI, up from just 10 percent in 2017, Sahota says.

Roger Zimmerman, chief of research and development at 3Play Media, a captioning and transcription company, says ASR in particular has been widely embraced in a number of areas.

“Clearly, the use of interactive search, question-answering, and command-and-control applications via Siri and Alexa is widespread. This is not surprising, given the constrained tasks performed by these applications: They are speaker-dependent, very domain-specific, and do not require precise word-for-word accuracy,” he says. “In addition, user tolerance for error is probably quite high in these applications.”

However, while industries such as finance and healthcare have been quick to adopt AI, others, like manufacturing, retail, and government, have been slower because they lack the expertise and skilled workers needed to develop and deploy AI solutions, Mantha points out. “Also, the cost and complexity of implementing AI have contributed to this trend. Plus, there’s a lot of both hype and fear around AI and a lack of trust and understanding among the general public. This makes it difficult for companies to adopt it.”

A Look Ahead

Indeed, speech vendors will need to clear several hurdles before further progress can be made.

“Precision speech remains a problem. Understanding how people uniquely pronounce words overcomes a lot of existing challenges, including a person speaking in their non-native language,” Sahota says. “Raw speech technology has been deployed quite effectively, but in terms of facilitating better teamwork, collaboration, and performance, companies still struggle with recognizing AI capabilities, like psychographics and neurolinguistics, that would enable real-time coaching and communication.”

Other issues remain unresolved as well, according to Mantha.

“AI and speech models are often trained on large amounts of data, but they still have issues generalizing to new and unseen data. Separately, there is a growing concern about the privacy and security of data used to train and operate AI and speech models,” she notes. “AI and speech technology can perpetuate and even amplify biases present in the data used to train them. And then there’s the problem of adapting to different accents and dialects, as AI and speech recognition systems are often trained on a specific accent or dialect.”

Improving end-to-end speech-to-intent recognition is also crucial.

“This is in contrast to traditional two-phase models for recognizing intent, which involves first converting the speech to text and then analyzing the text for intent,” says Jithendra Vepa, chief science officer of Observe.AI.
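The contrast Vepa draws can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the function names, the keyword table, and the use of UTF-8 bytes as a stand-in for audio are all hypothetical and are not drawn from Observe.AI or any real product. A production system would put trained ASR and intent models where the toy functions sit.

```python
from typing import Callable

# Hypothetical keyword table standing in for a trained text-intent model.
INTENT_KEYWORDS = {
    "check_balance": ("balance",),
    "cancel_order": ("cancel",),
}


def toy_transcribe(audio: bytes) -> str:
    """Pretend ASR step: the 'audio' here is just UTF-8 bytes of the words."""
    return audio.decode("utf-8").lower()


def toy_text_to_intent(text: str) -> str:
    """Pretend NLU step: return the first intent whose keyword appears."""
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return "unknown"


def two_phase_intent(
    audio: bytes,
    transcribe: Callable[[bytes], str] = toy_transcribe,
    classify: Callable[[str], str] = toy_text_to_intent,
) -> str:
    """Traditional pipeline: speech -> text, then text -> intent."""
    text = transcribe(audio)  # phase 1: any transcription error is passed on
    return classify(text)     # phase 2: intent is inferred from the text only


def end_to_end_intent(audio: bytes, model: Callable[[bytes], str]) -> str:
    """End-to-end approach: one model maps audio straight to an intent."""
    return model(audio)  # no intermediate transcript is produced
```

In the two-phase version, the intent step sees only the transcript, so a misrecognized word in phase one carries through to phase two. The end-to-end version has no intermediate transcript, so the model maps audio to intent in a single step.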

Another sticky wicket is the increasing need to fast-track speech technology for monitoring health and timely diagnosis of pathologies.

“Speech signals have been utilized as one of the modes to determine the physiological wellness of a person. Healthcare experts have acclaimed that speech plays a vital role in the early diagnosis of certain motor neuronic dysfunction, along with a few other cardiac and lung ailments. Upcoming AI technologies need to be developed with the sophistication to delineate these subtle characteristics in a person’s speech,” Raj says.

Breakthroughs in speech tech are no longer the stuff of science fiction. Robust innovation is already here, and more is on the way soon, the pros expect.

“The speech technology market will soon grow much faster and larger than forecasted because people can’t fully envision all the opportunities to introduce speech tech,” Sahota says. “Consider the Iron Man movies and how Tony Stark was able to have a conversation with his AI system to develop his suit of armor. Most businesses believe we’re years away from being able to do this, but the technology actually already exists today, with ChatGPT being a great example.”

Sahota has another attention-getting prognostication: Text will soon become a thing of the past. “Ninety percent of communication will soon be done through voice, like dictated emails and messages,” he predicts.

Also, count on generative AI and models like ChatGPT to disrupt content generation.

“Traditional chatbot technology using sequential pipelines and hand-crafted dialogue flows will be replaced by frameworks that use large language models,” Vepa says.

In 2023, training LLMs on more data should yield improvements in accuracy and usable NLP applications, such as topic modeling, sentiment analysis, and other analytical tasks, Zimmerman says.

“Continued work on end-to-end ASR systems may increase their accuracy enough to make them competitive with traditional hybrid-model systems, even on languages with lots of available supervised training data,” he says. “In the short to medium term, these will probably be used to increase the accuracy of ASR in lower-resource computing environments. And if the training methods that have been developed for LLMs…can be successfully applied to deep neural network acoustic models, this could certainly be another vector to improving ASR accuracy or reducing ASR errors in challenging acoustic environments.”

Lastly, Pulijala anticipates the technology will evolve to address greater privacy and security needs and more unique use cases.

“Delivering an AI customer experience is within reach for all businesses, but the use cases will vary from one company to another,” she says.

Erik J. Martin is a Chicago area-based freelance writer and public relations expert whose articles have been featured in AARP The Magazine, Reader’s Digest, The Costco Connection, and other publications. He often writes on topics related to real estate, business, technology, healthcare, insurance, and entertainment. He also publishes several blogs, including martinspiration.com and cineversegroup.com.
