The Challenges and Evolution of Modern Spoken Dialogue Systems

Spoken dialogue systems have existed for decades, and the technology has matured over that time. In the early 1990s, speech technology vendors designed spoken dialogue systems that could recognize thousands of words based on a fixed set of keywords, and the speech recognition accuracy and vocabulary of the systems have continued to evolve.

More recently, companies such as Google, IBM, and Amazon have started offering speech services from the cloud. This approach makes spoken dialogue systems far more affordable and has led to an explosion in their use. The cloud-based services use deep learning models for speech recognition and natural language understanding, enabling users to build conversational dialogue systems rather than relying on fixed keywords. Initially, the technology was primarily used to power personal home assistants such as Google Home and Amazon Alexa. Given the popularity of these devices, the technology is now being introduced into contact center environments to provide consumers the same conversational experience when interacting with virtual customer service agents.

There are some caveats to using these modern, cloud-based speech services in the contact center. Let’s explore some of the challenges and how the technology can be improved to solve them.

Open vs. Closed Grammar

Traditionally, spoken dialogue systems relied on grammars to perform speech recognition. A speech recognition grammar is a set of rules and constraints that define what a speech recognition engine recognizes as meaningful input. The Speech Recognition Grammar Specification (SRGS), for example, is a World Wide Web Consortium (W3C) standard for grammar specification that has been widely adopted by companies offering spoken dialogue systems. Most dialogue systems used in contact centers today are based on the SRGS specification.

In contrast, recent cloud-based spoken dialogue systems do not rely on grammar specifications. This makes them extremely flexible, enabling them to answer open questions. Consumers can interact with contact center virtual agents using natural language, which results in an improved customer experience.

However, many contact center applications must capture information that has a fixed pattern, such as vehicle registration numbers or Social Security numbers. In such cases, the ability to define rules as part of the recognition request would improve speech recognition accuracy. Some cloud-based vendors have tried to remove this limitation. Google introduced phrase hints and custom entities, or rules, allowing users to pass predefined tokens and phrases to the speech processor to improve recognition accuracy. This is a step forward, but a better approach would be to define a common standard, such as SRGS, that is endorsed by all cloud vendors. This would make spoken dialogue systems not only more versatile, but interoperable.
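To illustrate why a fixed pattern helps, here is a minimal Python sketch that filters a recognizer's N-best list against a nine-digit pattern. This is post-filtering on the application side, not a grammar passed in the recognition request, and it does not use any vendor's API. The names `pick_hypothesis`, `normalize`, and the `(transcript, confidence)` tuple format are assumptions made for this example.

```python
import re

# Assumed pattern: a Social Security number spoken as nine digits.
SSN_PATTERN = re.compile(r"\d{9}")

WORD_TO_DIGIT = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9",
}

def normalize(transcript):
    """Convert spoken digit words to a digit string.

    Returns None if any token is not a digit word or a numeral.
    """
    digits = []
    for token in transcript.lower().split():
        if token in WORD_TO_DIGIT:
            digits.append(WORD_TO_DIGIT[token])
        elif token.isdigit():
            digits.append(token)
        else:
            return None
    return "".join(digits)

def pick_hypothesis(n_best):
    """Return the highest-confidence hypothesis that fits the pattern.

    n_best is a list of (transcript, confidence) tuples (hypothetical format).
    Returns (digits, confidence), or None if nothing matches.
    """
    for transcript, confidence in sorted(n_best, key=lambda h: h[1], reverse=True):
        digits = normalize(transcript)
        if digits is not None and SSN_PATTERN.fullmatch(digits):
            return digits, confidence
    return None
```

With this filter, a top hypothesis that mishears "four" as "for" is rejected in favor of a lower-ranked hypothesis that fits the pattern. A grammar supplied with the recognition request would go further by steering the recognizer itself rather than correcting its output afterward.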

Response Time

Response time is critical to ensuring that interactions with virtual agents feel natural. Cloud-based systems can encounter some delay in fetching the recognition results and responding to customer queries. Minimizing that delay is fundamental to any conversational speech recognition system.

Speech recognition protocols such as MRCP are used to precisely tune speech timeout values to improve responsiveness. For example, a FAQ bot can be tuned to detect the end of a question, whereas a bot expecting a credit card number can be fine-tuned to process the input stream with timeout parameters accounting for a pause between the sets of four digits.
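The idea of per-prompt timeout tuning can be sketched in a few lines of Python. The sketch below loosely mirrors the distinction MRCP draws between a speech-complete and a speech-incomplete timeout, but it is not an MRCP client. The `TimeoutProfile` class, the function name, and all millisecond values are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class TimeoutProfile:
    complete_ms: int    # silence allowed once the input already looks complete
    incomplete_ms: int  # silence allowed while the input still looks partial

# Illustrative values only; real deployments tune these per prompt.
FAQ_PROFILE = TimeoutProfile(complete_ms=800, incomplete_ms=1500)
CARD_PROFILE = TimeoutProfile(complete_ms=500, incomplete_ms=3000)

def should_end_utterance(silence_ms, looks_complete, profile):
    """Decide whether trailing silence is long enough to stop listening.

    looks_complete is supplied by the caller, for example True once
    16 card digits have been heard.
    """
    limit = profile.complete_ms if looks_complete else profile.incomplete_ms
    return silence_ms >= limit
```

Under these assumed values, a one-second pause after the first four card digits keeps the credit card bot listening, while the same pause after a complete question ends the FAQ bot's turn.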

Recent updates to cloud speech services include support for controlling basic timeout values. These systems need to be extended further to support fine-tuning of advanced speech timeout parameters, which would improve overall system responsiveness. Furthermore, the ability to change these timeout values dynamically within a session would make cloud-based speech services extremely flexible and natural.

Support for Mixed Modes

Contact center applications have relied on dual tone multi-frequency (DTMF) to capture user input reliably. As we move telephony interactions to natural conversational interfaces, it becomes important for cloud-based services to support DTMF recognition. Ideally, these services should support a mixed recognition mode whereby the input recognition stream consists of integrated audio and DTMF.

This would be especially useful in scenarios with noisy input channels, where the system is unable to capture a long digit string accurately via the voice channel. Callers should be able to enter the digit string using the keypad and continue interacting with the virtual agent via the voice channel.
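A mixed recognition mode could look something like the following Python sketch, in which speech results and keypad presses arrive as events on one timeline and are merged into a digit string plus a list of utterances. The `InputEvent` structure and the `merge_inputs` function are hypothetical and do not correspond to any existing cloud service.

```python
from dataclasses import dataclass

@dataclass
class InputEvent:
    timestamp_ms: int
    source: str  # "speech" or "dtmf" (assumed labels)
    value: str   # transcript text, or a single keypad character

def merge_inputs(events):
    """Combine speech and DTMF events from one call into a single result.

    Returns (digits, utterances): the keypad digits in the order pressed,
    and the spoken utterances in the order heard.
    """
    ordered = sorted(events, key=lambda e: e.timestamp_ms)
    digits = "".join(
        e.value for e in ordered if e.source == "dtmf" and e.value.isdigit()
    )
    utterances = [e.value for e in ordered if e.source == "speech"]
    return digits, utterances
```

For example, a caller could say "my account number is," key in the digits, and then say "check my balance," and the application would receive both the digit string and the spoken requests from a single recognition stream.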

Smart Speakers and Answering Machine Detection

Contact centers often operate in a blended mode, in which live and virtual agents handle inbound and outbound calls. A key feature on which virtual agents rely when making outbound calls is answering machine detection (AMD).

As smart speaker adoption increases, contact centers will need to support these channels by publishing their apps to the vendor ecosystems of these devices. New protocols must be defined to handle such interactions, particularly when a virtual agent is trying to connect with a smart speaker that is busy or inactive. Events like those used in AMD protocols need to be defined in the context of smart speakers so that the speakers can support blended contact centers.

Support for Multiple Intents

Spoken dialogue systems are based on the concept of detecting intents or slots. An intent is a specific activity that needs to be actioned, such as booking a flight or ordering coffee. All the major natural language understanding services offer single-intent detection. But contact centers are likely to deal with customer queries that include multiple intents within the same utterance. The ability to detect multiple intents must be addressed to move toward truly natural language interfaces.
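One simple way to picture the difference: single-intent detection returns only the top-scoring intent, whereas multi-intent detection returns every intent that clears a confidence threshold. The Python sketch below assumes the understanding service exposes an independent score per intent, which is an assumption and not a description of any vendor's output. The function name, intent names, and threshold are illustrative.

```python
def detect_intents(scores, threshold=0.5):
    """Return every intent whose score meets the threshold, highest first.

    scores maps intent name to an independent confidence in [0, 1]
    (assumed format). Single-intent detection would instead return
    only max(scores, key=scores.get).
    """
    selected = [(intent, s) for intent, s in scores.items() if s >= threshold]
    selected.sort(key=lambda pair: pair[1], reverse=True)
    return [intent for intent, _ in selected]
```

For an utterance such as "book me a flight and rent a car," a scorer that rates both `book_flight` and `rent_car` highly would yield both intents, while a single-intent system would act on only one of them.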

As conversational AI technology matures, its adoption within contact centers will increase, and virtual agents will be able to perform complex operations alongside live agents. Customers will expect to interact with agents using smart speakers, smart watches, and other smart gadgets. However, creating the smart contact center of the future will depend on the ability of cloud speech vendors to solve the above-mentioned challenges.

Santosh Kulkarni is vice president of products at Inference Solutions and has more than 20 years of experience in technology management and commercialization. He has consulted with major international banks on peer-to-peer payment systems, mentored startups in the video analytics space, and led research programs at Australia's largest ICT research center, NICTA (now Data61). He started his career at Telstra, where he led projects in distributed video streaming and multimedia communications.
