Speech Technologies Are at a Tipping Point


The use of speech in applications ranging from interactive voice response (IVR) to desktop dictation to smartphone voice search has grown steadily and significantly over the past couple of decades. And AVIOS has been a part of that journey, chronicling the paths and pitfalls of these challenging and exciting technologies. AVIOS surveys the technological horizon and examines the future trends for speech and natural language, and the present trajectory of the underlying A.I. technologies is the reason that AVIOS has refined its focus. Our upcoming annual conference—at the end of January 2017, titled “Conversational Interaction”—will explore those trends.

The rapid convergence of two technologies in particular has brought our industry to a tipping point.

Parallel to the development of speech-only interaction (IVR) is the evolution of text-based and touch-based styles of interaction, which have become ubiquitous on the desktop and the smartphone in the form of chat windows. Lately, these types of text interactions have evolved into what are commonly called “chatbots” (“textbot” seems more accurate to me). We have all received tech support via a “chat window” on a web page. Granted, it all started as a real human helping us on the other side of that window, but the interaction felt normal in the way that texting has come to feel normal. When developers began automating this new interaction paradigm, it was well understood that typing had a much lower error rate than speech recognition. So it is not surprising that early efforts to automate a chat window style of interaction were built upon natural language processing/natural language understanding (NLP/NLU) text analytics and state machines (similar to what is used for IVR applications) to manage the interaction flow.

Initially, what these systems did most reliably was classify human intent into predefined subcategories and then transfer the user interface experience to an existing page that provided additional detailed information about that subcategory. In fact, most virtual agent–based chat windows today do precisely that kind of simple category detection followed by a redirection to more detailed information.
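To make the "classify, then redirect" pattern concrete, here is a minimal sketch in Python. It is only an illustration: the category names, keyword lists, and page paths are all hypothetical, and simple keyword overlap stands in for whatever text analytics a real virtual agent would use.

```python
import re

# Hypothetical intent categories and the existing pages they redirect to.
INTENT_PAGES = {
    "billing": "/help/billing",
    "password_reset": "/help/password-reset",
    "shipping": "/help/shipping",
}

# Hypothetical keyword lists; a stand-in for a real text-analytics model.
INTENT_KEYWORDS = {
    "billing": {"bill", "invoice", "charge", "refund"},
    "password_reset": {"password", "login", "locked"},
    "shipping": {"shipping", "delivery", "package", "tracking"},
}

# Where the user lands when no category matches (also hypothetical).
FALLBACK_PAGE = "/help"


def classify_intent(message):
    """Return the category whose keywords overlap the message most, or None."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    best_intent, best_score = None, 0
    for intent, keywords in INTENT_KEYWORDS.items():
        score = len(words & keywords)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent


def route(message):
    """Map a typed message to the page for its detected category."""
    return INTENT_PAGES.get(classify_intent(message), FALLBACK_PAGE)


print(route("I was charged twice on my last bill"))
print(route("hello there"))
```

Note that nothing here resembles a conversation: the system detects one category and hands the user off to a page.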

Concurrent with the chatbot evolution, speech-based interactions continued to develop on the telephone. Because of the early limitations of speech recognition, these systems focused heavily on extracting details in small chunks. Talking to our bank IVR, we could say “checking” or “savings” to direct the system. Later these systems supported short, well-formed directives such as “transfer $400 to checking.” But if you said “move $400 out of my savings into my other account,” it would most likely fail because neither the speech recognition nor the NLU was robust enough to handle utterances that open-ended. (One vague utterance opens the door to a vast number of potential utterances that the system must anticipate.) The handcrafted grammars and NLU analytics at that time were not up to the task.
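The brittleness of a handcrafted grammar is easy to demonstrate. The sketch below is a hypothetical, single-rule grammar written as a Python regular expression; it is not taken from any actual IVR product, and real grammars of the period were larger, but the failure mode is the same.

```python
import re

# Hypothetical handcrafted rule: "transfer $<amount> to <account>".
TRANSFER_GRAMMAR = re.compile(
    r"^transfer \$(?P<amount>\d+(?:\.\d{2})?) to (?P<account>checking|savings)$"
)


def parse_transfer(utterance):
    """Return the transfer details if the utterance fits the rule, else None."""
    match = TRANSFER_GRAMMAR.match(utterance.strip().lower())
    if match is None:
        return None
    return {"amount": float(match.group("amount")), "to": match.group("account")}


# The well-formed directive is covered by the rule.
print(parse_transfer("transfer $400 to checking"))

# The open-ended phrasing falls outside the rule and is rejected.
print(parse_transfer("move $400 out of my savings into my other account"))
```

Covering the second utterance would mean anticipating "move," "out of," "into," and "my other account," along with every variation on each, which is exactly the combinatorial burden described above.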

But powerful advances have emerged over the past decade. We would not have predicted that by now the average person could do near-dictation-quality speech recognition on a cell phone or with the built-in microphone on an inexpensive laptop from five feet away. Speech recognition is still far from being as good as a human, but it is good enough to do conversational transcription over less-than-ideal audio channels. While NLP/NLU has not made such dramatic advances, it has become good enough to do the needed analytics at conversational speed. One clear sign that NLU intent analysis is improving is that it’s available from multiple vendors as a RESTful microservice.

The major convergence of these technologies, along with multimodal fusion, has resulted in a natural synergy that gives us a seamless touch-talk-type multimodal interaction. Instead of being led through a dialogue, users want to be part of a richer conversation. They want to be part of a natural interaction, not simply micromanage an app. Rich interaction does not need to be a long, chatty conversation. It just needs to be aware:

Human: I’m leaving work at two today.

Computer: I’ll send a note to your team. Should I set your home thermostat for 2 p.m. arrival?

Human: Sure, thanks.

Computer: Okay, later.

Human: Oh, let Megan know, too.

Computer: Sure, I’ll text your wife you’re heading home.

Visit us at “Conversational Interaction” in January and meet the people who are creating this future. 

Emmett Coin is the founder of ejTalk, where he researches and creates engines for human-computer conversation systems. Coin has been working in speech technology since the 1960s at MIT with Dennis Klatt, where he investigated early ASR and TTS. 
