Speech Technology Is Not Yet Conversational Technology

Speech technology has become ubiquitous; everyone is suddenly speaking to machines. Voice is now a thing. Devices like Amazon Echo and Google Home are existing technology, not gadgets of the future.

Automatic speech recognition (ASR) is good—even astonishingly good if you’re in the sweet spot of the training space. Text-to-speech (TTS) is not unpleasant. Yet something is missing.

Where is the conversation? Why can’t we converse with machines?

Conversational technology remains at the same level as interactive voice response (IVR) design from two decades ago. Granted, some narrow improvements, like “intent recognition,” have emerged to ease the task of understanding what an utterance means. But the overarching conversational design is still a flowchart labyrinth: IF heard X, THEN say Y, and JUMP to a new position (state) in the “conversation” flowchart.
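
To make the flowchart pattern concrete, here is a minimal sketch in Python. The states, intents, and replies (a pizza order) are hypothetical and invented purely for illustration; they are not taken from any real skill or platform API.

```python
# Hypothetical flowchart: in each state, a recognized intent maps to
# (what to say, which state to jump to). All names here are illustrative.
FLOW = {
    "start": {
        "order_pizza": ("What size would you like?", "ask_size"),
        "hours": ("We are open from 11 to 9.", "start"),
    },
    "ask_size": {
        "small": ("One small pizza, coming up.", "done"),
        "large": ("One large pizza, coming up.", "done"),
    },
}

FALLBACK = "Sorry, I did not understand."


def step(state, intent):
    """IF heard `intent` in `state`, THEN return the reply and JUMP to the next state."""
    # Anything the designer did not anticipate stays in the same state
    # and produces the fallback reply.
    return FLOW.get(state, {}).get(intent, (FALLBACK, state))
```

Even in this toy, asking about opening hours in the middle of an order falls through to the fallback, because that branch was never drawn in the "ask_size" state. Every such twist has to be added by hand.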

For the most part, the voice interactions we have today involve a single human utterance: a command or query, after which the human gets a simple response—lights turn on, Rachmaninoff is played, or a question is answered (“there are 3.26156 light years in a parsec”). If you need to do more or ask a follow-up, you start with a fresh slate. You must reformulate a self-contained, complete, unambiguous utterance. (Note: Some skills/actions execute multiple turns; I’ll address that later.)

My daughter, a professional psychologist in D.C., related to me a suggestion from one of her industry newsletters describing how to get into artificial intelligence (AI). She knew I’d chuckle. It quoted Mark Cuban explaining that anybody can get into AI these days: just write a script for Alexa. Amusing, but it made me wonder: If people with their finger on the pulse of high-tech venture capital think that an IVR script is tantamount to AI, then what does the rest of the public think it is?

Those who have created skills/actions with multiple turns know it is just coding a flowchart with all of the potential twists and turns that could possibly happen. It is tedious and fraught with a combinatorial explosion of details. At best you will anticipate a subset of those details (experience helps here). It must be tested with a range of people who do surprising (human) things that the coder never dreamed of.

This is why out of thousands of multi-turn skills/actions, most of them (90 percent?) are awful to the point of being unusable. And of the remainder, most of those are bad enough to be abandoned after a few encounters. The remaining few fit into a category we could characterize as usable. The point is that simple command or query applications are relatively easy to create. But conversations (multi-turn applications) remain difficult and expensive and require deep expertise.

The AVIOS community and I believe that the industry lacks tools to support true conversation. In my talks I often ask the audience to imagine the conversation between Albert Einstein and a 3-year-old. No one doubts what would happen. Al and the kid would find a topic where their experience overlapped. Their conversation would wander around the intersection of their expertise (e.g., Einstein might say “Venn I was a child…”). Occasionally the conversation would step outside their common experience (curved spacetime, or bubblegum-flavored yogurt), and one of the conversational partners would offer to expand the envelope of that common knowledge (“like a ball on a stretched sheet” or “it is fun to squeeze out of a tube”).

Even though their worldviews are vastly different in scope, they can both effortlessly use their generic (meta-) conversation engine. Neither had to study or make complex plans for how to respond. It followed a meta-level pattern. Yet their “specific” conversation is unique in the universe. How does that happen?

Often in science fiction people converse with machines. Once one excludes the extraordinary machines (such as 2001: A Space Odyssey’s HAL, Ex Machina’s Ava, Star Wars’ C-3PO, Isaac Asimov’s R. Daneel Olivaw, Lost in Space’s B-9, etc.), I like to imagine the sorts of human-machine exchanges we could have using today’s technology.

I think the fictional robot that comes closest to approximating such exchanges is Robby the Robot from the classic 1956 film Forbidden Planet. I recommend the film in general (check out Forbidden Planet on IMDb for Robby’s dialogue), but there are five or six interactions with Robby that are noteworthy, and in particular a couple involving extended conversations to complete a task. Both of the tasks involve manufacturing something. The two tasks are in very different domains. Both have missing constraints as well as presumable constraints. There is a structure to his conversation at a meta-level.

In the world of skills/actions of today, these tasks would be crafted in completely isolated, elaborate silos. Today’s tools do not scale for building conversations. A pragmatic example is a sales job. Two positions are open: one in the mattress department and one in the shoe department. Other than the details of the products (pillow top, leather soles, queen-size, slip-on, etc.), how much training do you need to have a conversation with the customer?

As a salesperson all you really need is a spreadsheet with columns for product names, parameter names, price, and so on. All those things can be slotted into a standard, meta-customer interaction. The goal is the same: match the features to the customer desires and convince them to buy.

In an object-oriented programming paradigm, the developer would select an existing “salesperson” base class. Then the developer would create a new “shoe salesperson” class that would inherit all of the common behaviors. This new class would add nuances required for selling shoes.

In fact, in this thought experiment we could imagine that an “apparel salesperson” base class existed (of course inheriting behaviors from the “salesperson” class), and that the developer would choose that one on which to build their “shoe salesperson.”
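
The thought experiment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, methods, and catalog layout below are hypothetical, and no such base classes exist in today's skill/action toolkits.

```python
class Salesperson:
    """Generic, meta-level sales conversation: match features to customer desires."""

    def __init__(self, catalog):
        # catalog plays the role of the "spreadsheet":
        # product name -> {parameter name: value}
        self.catalog = catalog

    def greet(self):
        return "Hello, how can I help you today?"

    def recommend(self, desires):
        """Return the products whose parameters match every stated desire."""
        return sorted(
            name
            for name, params in self.catalog.items()
            if all(params.get(key) == value for key, value in desires.items())
        )


class ApparelSalesperson(Salesperson):
    """Adds behavior common to anything that is worn."""

    def ask_size(self):
        return "What size do you wear?"


class ShoeSalesperson(ApparelSalesperson):
    """Adds only the nuances of selling shoes; everything else is inherited."""

    def greet(self):
        return super().greet() + " Looking for shoes?"


# Hypothetical product data, for illustration only.
shoes = {
    "loafer": {"style": "slip-on", "sole": "leather"},
    "sneaker": {"style": "lace-up", "sole": "rubber"},
}
clerk = ShoeSalesperson(shoes)
matches = clerk.recommend({"style": "slip-on"})
```

In this sketch a mattress salesperson would be another small subclass of Salesperson with a different catalog. The matching logic in recommend would not need to be rewritten.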

A scheme like this fixes another major problem: multiple personality disorder. Alexa sounds like Alexa no matter what she talks about. But every developer imbues the application with a different personality or manner (unique presumptions and responses hard-coded into the flowchart). The human is forced to be a bit of a psychiatrist and recognize which personality they are speaking with.

Some say that neural nets will learn how to converse from transcribed conversations. This is not the place to detail the very tall hurdles that stand in the way of this approach. But at AVIOS conferences we do dig into details like these when we talk about practical, working technology with the leaders at the cutting edge. Visit www.botsandassistantsconf.com and join us!

Emmett Coin is the founder of ejTalk, where he researches and creates engines for human-computer conversation systems. Coin has been working in speech technology since the 1960s at MIT with Dennis Klatt, where he investigated early ASR and TTS.
