
Speech is NOT Dialog

CM - Conversation Management
Conversation Management puts the emphasis on the mechanics of conversing as opposed to just satisfying the dialog goal. Of course, the goal is important, but a conversation that goes “with the grain” will be judged more acceptable by a human. Conversation is about following the grain. ASR - Automatic Speech Recognition and TTS - Text To Speech
ASR - Automatic Speech Recognition and TTS - Text To Speech
All of us are familiar with ASR and TTS. ASR detects and extracts the words embedded in an utterance. It does this by using some formal expectation (n-grams, grammars, etc.) and selecting the best fit between those expectations and the sounds in the utterance. ASR does not know what those words mean. TTS accepts an utterance as text and attempts to generate an acoustic (spoken) version. As an aside, TTS does try to understand some of the meaning of the words in order to resolve issues of pronunciation: “Last week I read a book to learn how to perfect the way I will read the word perfect.” To do a good job, the TTS must distinguish verb/noun “perfect” and past/future “read” issues.
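To make that pronunciation point concrete, here is a minimal sketch in TypeScript of how a front end might choose between homograph pronunciations once a part-of-speech tagger has labeled the sentence. The tag names, the tiny lexicon, and the pronounce function are invented for illustration and are not drawn from any particular TTS engine.

    // Hypothetical homograph lexicon: pronunciation depends on part of speech / tense.
    type Tag = "VERB_PAST" | "VERB_PRESENT" | "NOUN" | "ADJ" | "OTHER";

    const homographs: Record<string, Partial<Record<Tag, string>>> = {
      read:    { VERB_PAST: "red",      VERB_PRESENT: "reed" },
      perfect: { ADJ:       "PER-fect", VERB_PRESENT: "per-FECT" },
    };

    // Choose a pronunciation for each token, falling back to a default lookup.
    function pronounce(tokens: { word: string; tag: Tag }[]): string[] {
      return tokens.map(({ word, tag }) => {
        const entry = homographs[word.toLowerCase()];
        return entry?.[tag] ?? `<default:${word}>`; // a real system would use a full lexicon here
      });
    }

    // "Last week I read a book ..." — the tagger marks this "read" as past tense.
    console.log(pronounce([{ word: "read", tag: "VERB_PAST" }])); // ["red"]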
NLU - Natural Language Understanding
Natural Language Understanding is another term commonly used. NLU attempts to structure a text sentence in a way such that specific elements can be referenced logically and directly (e.g., the adjective modifying the noun that is the subject of the sentence). Perhaps you have seen a circa 1950s high-school English text with a section on “sentence diagramming.” You can think of NLU as a program that accepts a text sentence and outputs a description of how to draw the diagram, along with additional information about the categories that the specific words on each branch relate to.
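As a rough illustration (not the output format of any particular NLU toolkit), that “diagram” can be thought of as a small tree of typed nodes that a program can walk to answer a question like “which adjective modifies the noun that is the subject?”:

    // A toy parse structure, purely illustrative of the kind of object an NLU step might emit.
    interface ParseNode {
      role: "SUBJECT" | "VERB" | "OBJECT" | "MODIFIER";
      word: string;
      children: ParseNode[];
    }

    // "The red book fell."
    const parse: ParseNode = {
      role: "VERB", word: "fell", children: [
        { role: "SUBJECT", word: "book", children: [
          { role: "MODIFIER", word: "red", children: [] },
        ]},
      ],
    };

    // Reference the adjective modifying the noun that is the subject of the sentence.
    const subject = parse.children.find(c => c.role === "SUBJECT");
    const adjective = subject?.children.find(c => c.role === "MODIFIER")?.word;
    console.log(adjective); // "red"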
SA - Synthetic Agent
A Synthetic Agent is an agent that exhibits a degree of autonomy while still having a clear purpose.
Is there a difference between speech recognition and conversation management? The recognizer hears what was said, and then the computer just does something and responds, right? Actually, there are big differences between the problem of deciphering the words contained in an utterance and the problem of carrying on a conversation. ASR is primarily a physics-bound task, and many methodologies from the general field of signal processing have been brought to bear on it. It is an acoustical pattern-matching problem, not too different from a sonar-based ship-identification task. CM is a mental-modeling task. Given a record of the prior exchange of utterances, what should I say next? What types of utterances from my conversational partner are most likely in response to what I will say next?
Can't CM just be part of ASR?
Why can’t CM just evolve as a smooth extension of ASR? For one thing, they are fundamentally different kinds of things. ASR operates in the domain of one utterance, while CM is the realization of one specific chain of utterances out of a large pool of potential chains. Much as DNA defines which amino acids are assembled linearly, beads-on-a-string, that subsequently fold into the incredibly complex 3D objects we call proteins, utterances strung together fold into conversations. If you can bear one more analogy: ASR is to CM as standing is to walking. The emergent level of conversation is related to and relies on the initial level of utterances, but it is an entirely different kind of thing.
De-construct then generate
Another fundamental difference is that ASR is a deductive technology and CM is generative. ASR decomposes an utterance against an expectation.
  • It attacks a segment of sound by reducing it to minimal elements of energy at specific times and frequencies.
  • It assembles the smallest pieces into somewhat larger pieces (phonemes) and then into syllables or words.
  • It finds the assembly that best fits a given segment of sound.
  • And, not insignificantly, it benefits from the kinds of expectations gleaned from the conversation level.
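A toy sketch of that best-fit step, with made-up scores standing in for the real signal-processing front end: each candidate assembly of words gets an acoustic score plus an expectation score from the conversation level, and the recognizer keeps the best combination. The candidates, scores, and weighting below are all invented for illustration.

    // Candidate word assemblies for one segment of sound, with invented scores.
    interface Hypothesis {
      words: string[];
      acousticScore: number;    // how well the phoneme/word assembly matches the audio
      expectationScore: number; // how plausible it is given conversation-level expectations
    }

    const candidates: Hypothesis[] = [
      { words: ["recognize", "speech"],         acousticScore: 0.71, expectationScore: 0.90 },
      { words: ["wreck", "a", "nice", "beach"], acousticScore: 0.74, expectationScore: 0.15 },
    ];

    // Pick the assembly that best fits, letting expectations from the dialog weigh in.
    function bestFit(hyps: Hypothesis[], expectationWeight = 0.5): Hypothesis {
      return hyps.reduce((best, h) =>
        h.acousticScore + expectationWeight * h.expectationScore >
        best.acousticScore + expectationWeight * best.expectationScore ? h : best);
    }

    console.log(bestFit(candidates).words.join(" ")); // "recognize speech"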
CM predicts a future state.
  • It relies on a history of the conversation up to the present and anticipates potential future moves.
  • The more accurate its predictions, the better the conversation.
  • It should know when to lead AND when to follow.
  • A conversation merges the goals of two minds.
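Here is a deliberately simplified sketch of that predictive stance, using dialog-act labels and hand-written probabilities rather than anything learned from real conversations: given the move the system is about to make, the CM anticipates the partner's most likely responses.

    // Invented transition table: given my next move, how is the partner likely to respond?
    type Move = "GREET" | "ASK_SLOT" | "CONFIRM" | "ANSWER" | "CORRECT" | "CLOSE";

    const likelyReplies: Record<Move, [Move, number][]> = {
      GREET:    [["GREET", 0.8], ["ASK_SLOT", 0.2]],
      ASK_SLOT: [["ANSWER", 0.7], ["CORRECT", 0.2], ["ASK_SLOT", 0.1]],
      CONFIRM:  [["ANSWER", 0.6], ["CORRECT", 0.4]],
      ANSWER:   [["CONFIRM", 0.5], ["ASK_SLOT", 0.5]],
      CORRECT:  [["CONFIRM", 0.9], ["ASK_SLOT", 0.1]],
      CLOSE:    [["CLOSE", 1.0]],
    };

    // Anticipate the partner's most likely responses to the move I am about to make.
    function anticipate(myNextMove: Move, topN = 2): Move[] {
      return likelyReplies[myNextMove]
        .slice()                          // don't mutate the table
        .sort((a, b) => b[1] - a[1])
        .slice(0, topN)
        .map(([move]) => move);
    }

    // If I ask for a slot value, I should be ready for an answer or a correction.
    console.log(anticipate("ASK_SLOT")); // ["ANSWER", "CORRECT"]

The point is not the table itself but that the CM reasons over the conversation's history and its likely future, something a flat, procedural dialog script never represents explicitly.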
VoiceXML and SALT meet CM
Exactly how do VoiceXML and SALT meet the problems of CM? And how do they support speech technology and bridge the gap between ASR and CM? They succeed at encapsulating the hardware, ASR, TTS, telephony, and other platform issues, and they bode well for the prospect of more portable voice applications. While they don’t supply any functionality at the level of CM, they do provide conventional programmatic control as well as a starting place to experiment with and prototype some CM support. These CM features can be provided via separate, encapsulated code that is accessed through JavaScript conventions. The languages themselves are very procedural and do not hide the conversation details; they are very flat, and everything can be controlled anywhere and at any time. In fact, they encourage and/or require tinkering with even the most basic conversational moves. Many platform vendors will continue to incorporate their particular flavors and require developers to generate slightly different, specific versions for each platform.
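As a rough illustration of that separation (the names and structure below are assumptions, not anything defined by VoiceXML or SALT), the conversational decisions can live in one encapsulated module that a page's procedural script simply calls, instead of scattering retry and confirmation logic through every form:

    // A small, self-contained CM helper that per-form scripting could call through
    // ordinary ECMAScript conventions. Names and behavior are illustrative only.
    interface DialogEvent { type: "filled" | "noinput" | "nomatch"; field: string; value?: string }
    interface CmDecision  { prompt: string; reprompt: boolean }

    function nextMove(ev: DialogEvent, attempt: number): CmDecision {
      if (ev.type === "filled") {
        return { prompt: `Got it: ${ev.value}.`, reprompt: false };
      }
      if (ev.type === "noinput") {
        // escalate gently instead of repeating the identical prompt
        return { prompt: attempt < 2 ? "Sorry, I didn't hear you." : `Please say the ${ev.field}.`, reprompt: true };
      }
      return { prompt: `I didn't catch that. What ${ev.field} would you like?`, reprompt: true };
    }

    // A form handler just delegates the conversational move:
    console.log(nextMove({ type: "noinput", field: "departure city" }, 2).prompt);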
The future
Any development language that gets bigger by getting broader but not deeper will stifle the development of higher-order CM behaviors. These styles of languages have a growing number of low-level features, each with a large number of options. Nature approaches complexity by using layers and hierarchy. In order for these languages to become truly complex they will need to lose details, not add them. Three scenarios for ASR/CM in the near future:

A higher-level representation will be necessary: a system that delegates some of the generic minutiae and universal strategies of conversation. In the beginning it will automate simple yet very human behaviors such as back-channel confirmation, greeting and departure banter, or not-recognized gambits. Later it will represent parameterized conversations that are built on base-level templates. For instance, domains that discuss information about books and about magazines might both be based on a simpler domain that represents the conversational commonalities of printed-word publications. The elements of editorial staff, readership, and publication schedule would be layered on that base domain and result in a domain about magazine information that would inherit its ability to talk about general literary content. The domain for novels might add other elements to the base, such as character summary, setting, or chapters. Not only will this make complex conversations easier to build, but they will also be consistent as the conversation moves through different domains.

There will be more autonomy for the SA. This will begin with subtle variation of generated speech. These variations will be constructed to introduce novelty and to allow the SA to use conversational techniques such as conversational ellipsis. Ellipsis refers to the elimination of elements that are understood at that point in the conversation, and so it improves the conversation’s bandwidth. For example, suppose you were scheduling several time slots for a conference room using an SA. On the first reservation you might hear, “What time do you want to schedule Meeting Room A for today?” On the second, “What time do you want Room A?” And on the third, “What time for Room A?” An SA based on a CM will be able to manage that behavior without all the bother of numerous tests and branches in a procedural representation.

A CM that learns conversations may become practical. Today most ASR engines learn phonemes and words by listening to a human-annotated set of natural human utterances; it has been a long time since anyone has written a program to recognize the vowel “ah.” This may also be the most effective way to capture the natural characteristics of real conversation. Humans would annotate a large corpus of natural conversations between humans, offline analysis would discover the patterns and compute the probabilities, and the CM could then predict, statistically, the most likely conversational move that a real human would have made.
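Returning to the conference-room example, here is a minimal sketch, with entirely invented prompt templates, of how a CM might shorten its own prompts as shared context accumulates instead of hard-coding three prompts in procedural branches:

    // Invented prompt templates, ordered from fully explicit to maximally elliptical.
    const prompts = [
      (room: string) => `What time do you want to schedule ${room} for today?`,
      (room: string) => `What time do you want ${room}?`,
      (room: string) => `What time for ${room}?`,
    ];

    // Pick a prompt based on how much context the two parties already share:
    // the more often this slot has been discussed, the more the CM can elide.
    function askForTime(room: string, timesAlreadyDiscussed: number): string {
      const level = Math.min(timesAlreadyDiscussed, prompts.length - 1);
      return prompts[level](room);
    }

    console.log(askForTime("Meeting Room A", 0)); // first reservation: fully explicit
    console.log(askForTime("Room A", 2));         // third reservation: elliptical

The procedural alternative is an explicit test and branch for every level of shared context, which is exactly the bother a CM should absorb.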
Emmett Coin is the founder and CEO of ejTalk. He can be reached at emmett@ejtalk.com.