Large Language Models Are Suddenly All the Talk in Speech Technology
A large language model (LLM) is a type of generative artificial intelligence (AI) that is trained on massive datasets of text and code. Generative AI, like other forms of AI, learns patterns from past data. The difference is that generative AI creates brand-new content—text, images, audio (including music), even computer code—based on that training, rather than simply identifying and categorizing data as other forms of AI do.
ChatGPT is the most well-known example, but competition from other LLMs is rapidly pushing speech technology vendors to plan to merge LLM capabilities into their current offerings. So let’s take a deep breath and consider some important aspects of LLMs before you dive into the pool.
A straightforward interactive voice response system that uses natural language understanding (NLU) provides a good test case. A portion of the application attempts to answer frequently asked questions. This implementation calls a Q&A service maintained by the business; the answers are 100 percent predictable and easily managed by the business as conditions change. And this business is subject to government regulation, so accuracy is doubly important.
Imagine if the application, with an integrated LLM, attempted to answer many other questions that are not in the maintained Q&A service. LLMs are not always accurate. They can make mistakes, especially when addressing factual topics. Over time, and with more data for training, they can become more accurate, but this means additional cost and resources. A significant number of calls will have to be recorded and transcribed as training data. And then quality control will be required, just as it is for live calls to agents. Perhaps LLMs could have a positive return on investment for the largest contact centers over time, but small and midsize businesses will not be able to capitalize on LLM integrations anytime soon, and it seems a stretch for large organizations at present.
In the short term, the massive computing power required by LLMs will keep the cost of high-volume business applications high. As new frameworks are delivered that reduce computing power requirements, pricing will fall. More than a few LLM speech/chat offerings currently being teased to the market are in beta mode, which means vendors are using real-world deployments as a training ground. Pricing will likely be artificially low until the solutions are released in full production mode, so if you are considering running any of these beta offerings, keep that in mind.
Other uses of voice technology are more likely to benefit from LLMs sooner. Current voice assistants immediately come to mind. All the market leaders could benefit greatly from LLM capabilities. “I can’t help you with that, but I’ve found this on the web, check it out” hardly qualifies as a voice “assistant.” Usefulness will be a key performance indicator, one that should drive gains in market share.
The ability to summarize existing text is another ready use case for LLMs—speeding up the search for relevant legal content in LexisNexis, for example, or medical journals for medical professionals. An accurate summary eliminates the need to read through long sections of text to determine if it relates to the question at hand. Current market-specific search vendors are likely to be considering moving to LLMs.
Another fertile area is translation and interpretation, where LLMs have become the vehicle of choice, and the results have been nothing short of amazing. With the ability of users to rate interpretations and provide alternative outputs, LLMs can be rapidly trained to improve without extensive resources and cost.
Speech technologies improve at astonishing rates when powered by deep neural networks, and LLMs are no different. Segmentation of training and expected output based on speech signals such as emotion, health, age, and accents, powered by real-time deep neural networks, can bring speech recognition up to human-expert-level results. LLMs can be segmented to respond appropriately given the additional information that speech signals can provide.
As always, watch for regulatory changes, given the potential for bias and for inaccuracies that LLMs can lull people into believing.
Kevin Brown is a customer experience architect at Cognizant with more than 25 years of experience designing and delivering speech-enabled solutions. He can be reached at email@example.com.