February 2, 2021
By Quinn Agen, vice president of business development at Omilia
Industry Voices

What Brings Virtual Conversations to Life?

Speech recognition technology has been woven into our daily lives; so much so that we barely notice it. Every time you interact with a virtual assistant or a smart speaker, you are using automated speech recognition technology.

Commentators estimate that spending on automated speech and voice recognition technology will grow to $26.8 billion by 2025. This growth is driven in part by the increased use of smart payment systems and strong consumer demand for smart devices. It seems that Alexa and Siri are not going to dominate the virtual assistant market for long.

Some automated speech recognition solutions have more potential than others. Solutions with 95 percent accuracy can leverage the added capability to parse and understand the relationship between words and denote the underlying meaning of an utterance, a sentence, or a combination of words. This is called natural language processing (NLP).

NLP is revolutionizing how we interact with chatbots, and it does not operate in a vacuum. It requires high, consistent accuracy from the speech recognition technology, and the two must work hand in hand. The role of speech recognition is to identify words, while NLP makes sense of what is said. Speech technology has had considerable room for improvement over the last decade, and many companies have struggled to increase their speech recognition accuracy, which has limited their use of NLP. When speech recognition accuracy is high, the potential of NLP is endless.

Prepare for Complexity

To start, the solution must effectively distinguish the customer's voice from environmental and other noise, such as a car honking, breathing, and background conversations that do not involve the customer. To respond to and maintain a conversation with the customer, the solution must navigate complex interactions and continue to accurately process speech.

Let's take the example of a fast-food drive-thru to demonstrate different levels of complexity for NLP.

Simple command: "May I have a large number 2 with a diet Coke?"

Medium command: "May I have a large number 2 with a diet Coke? No pickles on the burger."

Complex command: "May I have a large number 2 with a diet Coke? No pickles on the burger. I only want a little bit of ice in the diet Coke, and I would like to substitute french fries with onion rings."

Using artificial intelligence to edit sandwich options and make substitutions involves a level of complexity that can only be handled by sophisticated, highly accurate speech recognition combined with effective NLP.
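To make the three levels concrete, here is a minimal Python sketch of how a transcribed order might be turned into structured data. It is illustrative only: the `Order` fields, the `parse_order` function, and the keyword patterns are assumptions made for this example. A production NLU model would typically be trained on data rather than built from hand-written rules.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Order:
    """Hypothetical structured form of one drive-thru order."""
    size: Optional[str] = None
    meal_number: Optional[int] = None
    drink: Optional[str] = None
    removed: List[str] = field(default_factory=list)              # e.g. "no pickles"
    substitutions: Dict[str, str] = field(default_factory=dict)   # old item -> new item
    drink_notes: List[str] = field(default_factory=list)          # e.g. "light ice"


def parse_order(utterance: str) -> Order:
    """Extract order details from already-transcribed text using simple patterns."""
    text = utterance.lower()
    order = Order()

    size = re.search(r"\b(small|medium|large)\b", text)
    if size:
        order.size = size.group(1)

    meal = re.search(r"\bnumber (\d+)\b", text)
    if meal:
        order.meal_number = int(meal.group(1))

    # The first "with a ..." phrase is treated as the drink.
    drink = re.search(r"\bwith an? ([a-z ]+?)(?=[?.,]|$)", text)
    if drink:
        order.drink = drink.group(1).strip()

    # "no pickles" -> remove pickles
    order.removed = re.findall(r"\bno (\w+)", text)

    # "substitute french fries with onion rings"
    swap = re.search(r"\bsubstitute ([a-z ]+?) with ([a-z ]+?)(?=[?.,]|$)", text)
    if swap:
        order.substitutions[swap.group(1)] = swap.group(2)

    if re.search(r"\ba little bit of ice\b", text):
        order.drink_notes.append("light ice")

    return order
```

Passing the simple command fills only the size, meal number, and drink. The complex command also populates `removed`, `substitutions`, and `drink_notes`, which shows how each extra clause adds another slot the system has to get right.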

Create the Conversation Context

An added layer of complexity beyond accurately transcribing speech to text is taking into account the context of the conversation. When developing natural language understanding (NLU) models, creating context rules (properties that have value within the overall meaning of the text) is important. So is developing dialogue logic (that is, the reasoning and actions behind how a human or system drives a dialogue toward an objective, like fulfilling a food order in the drive-thru or making a payment on the phone). Context rules allow for multiple context streams within a single dialogue with a customer. NLU's purpose is to identify meaning and understand the customer's intent. Context carries a lot of weight in understanding intent.

Dialogue logic allows the AI to recognize that when customers say they want "5 number 3s," 5 refers to the number of orders and 3 represents the meal option. Even though they are both numeric values, the AI has been trained to assign meanings specific to this context.

Customer: "May I have 5 number 3s, 4 with a Sprite, and 1 with a Coke?"

Response: "So, you would like 5 number 3s and would like 4 of these meals to come with a Sprite and 1 of these meals to come with a Coke? Would that be all?"

In this case, both the speech recognition engine and NLP were trained to respond to and understand the customer in the context of a fast-food drive-thru.
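As a rough illustration of a context rule, the Python sketch below assigns different meanings to the numbers in the customer's request based on their position in the phrase. The function name `parse_counted_order`, the returned keys, and the patterns are all assumptions made for this example, not a description of any particular vendor's NLU engine.

```python
import re
from typing import Dict, Optional


def parse_counted_order(utterance: str) -> Optional[Dict[str, object]]:
    """Interpret 'N number Ms' as N orders of meal M, then read per-drink counts."""
    text = utterance.lower()

    # Context rule: a number before "number" is a quantity,
    # and a number after "number" is a meal option.
    head = re.search(r"\b(\d+) number (\d+)s?\b", text)
    if head is None:
        return None
    quantity = int(head.group(1))
    meal_number = int(head.group(2))

    # Context rule: after the meal is named, "<count> with a <drink>"
    # splits the order by drink.
    rest = text[head.end():]
    drinks = {name: int(count)
              for count, name in re.findall(r"\b(\d+) with an? (\w+)", rest)}

    return {
        "quantity": quantity,
        "meal_number": meal_number,
        "drinks": drinks,
        # The drink counts should add up to the number of meals ordered.
        "consistent": sum(drinks.values()) == quantity,
    }
```

For the request above, the sketch reads 5 as the quantity, 3 as the meal option, and 4 and 1 as drink counts. The `consistent` flag gives the dialogue logic a simple cue for when to ask the customer to clarify.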

Keeping the Conversation Going

It is important to note that most conversations do not end after transcribing a single customer request. If a customer has a follow-up request, the speech recognition engine must have dialogue management to keep the conversation going. Dialogue management allows the technology to have a human-like conversation with a beginning, middle, and end by using memory from previous interactions.

Customer command: "May I have a large number 2 with a diet Coke?"

Response: "So, you would like a large-sized meal number 2 with a diet Coke? Would that be all?"

Customer: "Yes, that is all."

Response: "OK, the total for your order is $10.75. Please proceed to the next window."
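The exchange above can be sketched as a small state machine that remembers what was ordered earlier in the conversation. This is a simplified Python illustration under stated assumptions: the `DialogueManager` class, its state names, the intent labels (`add_item`, `confirm`), and the price table are hypothetical, and the intents are assumed to come from an upstream NLU component.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DialogueManager:
    """Hypothetical dialogue manager that keeps the order in memory across turns."""
    prices: Dict[str, float]                       # assumed price table: item -> price
    state: str = "ordering"                        # ordering -> confirming -> done
    items: List[str] = field(default_factory=list)

    def handle(self, intent: str, item: str = "") -> str:
        # The customer adds something to the order (first request or a follow-up).
        if intent == "add_item" and self.state in ("ordering", "confirming"):
            self.items.append(item)
            self.state = "confirming"
            return f"So, you would like {item}? Would that be all?"

        # The customer confirms; use the remembered items to compute the total.
        if intent == "confirm" and self.state == "confirming":
            total = sum(self.prices[name] for name in self.items)
            self.state = "done"
            return (f"OK, the total for your order is ${total:.2f}. "
                    "Please proceed to the next window.")

        # Anything else: ask the customer to repeat without losing the order.
        return "Sorry, could you repeat that?"
```

The key design point is that the total is computed from `items`, which was filled in during an earlier turn. Without that memory, the system could transcribe "Yes, that is all" perfectly and still have no idea what the customer was confirming.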

Sophisticated AI can be adapted to function seamlessly across almost any industry. However, many organizations struggle to integrate different capabilities of speech recognition technology into one cohesive solution. The three most important capabilities of an effective speech recognition solution are as follows:

The ability to customize speech recognition models to their business;
Accurate voice activity detection that only responds to the customer's voice (and ignores other noise); and
The ability to work with NLU to communicate and adapt the conversation in real time.

Organizations that tailor their speech recognition and NLP capabilities to customer needs can greatly benefit from this innovation, saving money, reducing customer service time, and improving the overall customer experience.
