Read Between the Words

Voice interfaces with true conversational speech capabilities will drive the widespread adoption of voice in mass markets. They make interactions with systems easier, faster, and more efficient, and that combination makes those interactions feel normal. They also remove the need to learn a new human-machine interface (HMI) when talking to a new device or machine. With true conversational speech as the standard, humans don't need to learn machine or robot language, and all systems become immediately accessible by human norms.

One of the challenges that consumers face with existing voice interfaces is that they require significant learning on the part of the user. Users should be able to ask for what they want directly in a normal fashion. When they are uncertain of their needs, they should be able to engage the system in a productive, cooperative dialogue, as they would with another human being. Instead, users today are forced to simplify requests to match the easiest set of instructions, and there is virtually no option for dialogue.

The potential of conversational speech technologies, particularly in multimodal and embedded systems, holds great promise for the industry and the consumer. But what exactly does conversational speech mean? How do we define the standard we are all shooting for in advancing the state of speech technology?

A system's ability to engage in conversational speech is determined by four elements: casual speech, noise tolerance, intent determination and hypothesis building, and cooperative conversation.

Casual Speech

Casual Speech is the typical day-to-day language of humans. Having the capacity of casual speech means an application supports the use of typical, day-to-day language, with all the allowances for how humans normally speak, in whatever conditions they find themselves.

An underlying assumption of casual speech is that no human arrives at the same HMI situation in the same way twice. Variables such as stress, distraction, vernacular, vocabulary, and serendipity are infinitely varied. The design model is a passenger in a car: if it is reasonable to expect that passenger to understand a request to change the satellite radio channel, spoken in typical conversational style under typical highway conditions, then the machine should understand it too. A conversational speech interface must support casual speech and have the following features:

Free-form utterance. This means that word order is not important; specialized jargon and slang words are understood; and the application tolerates verbalized pauses, such as ums, ahs, and ehs. The application should recognize a command such as "You know…um, that Squizz Channel…ah, switch it there" as well as the more common English verb-before-noun form of the command, "Change the channel to Squizz."

Compound requests. If the speaker has a request with multiple variables, the application should support it. For instance, if the query is "What is the forecast for Boston this weekend?" the application should search for the result of weather in a city equal to Boston for a time equal to weekend.

Inference from incomplete commands. In an example such as "Route? Shinjuku?" where the user actually intended to say "Route calculation to Shinjuku from my present location," the application should be able to infer a complete command. If the amount of information delivered is inadequate, the application should, in the spirit of cooperation, ask for additional information.

Recognition of different expressions. The application should recognize common alternatives for nouns and verbs, reflecting different usage patterns by age, socioeconomic group, and ethnicity, as well as the whim of the user. The application should support expressions where word order is not important or anticipated. For example, the application should know that all of the following mean the same thing: "Go to my house," "Go home," "Where's my apartment from here?" "How do I get to my crib?" and "Show me the way back."

Starts, stops, restarts, stutters, and run-on sentences. Applications need to understand the intention of the speaker, including instances when the speaker is only generally correct. In an example such as "Well, I wanta …Mexi... no…steak restaurant please, I'm hungry," the application must understand the intention despite the stutters, restarts, pauses, and superfluous information. A passenger in the seat next to the driver would understand clearly what the speaker's intention was, or ask a clarifying question, and so should the application.
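
To make the free-form utterance requirement concrete, here is a minimal sketch in Python of order-independent keyword spotting. It works on already-transcribed text, which a real recognizer would not have the luxury of, and the ACTIONS and CHANNELS vocabularies and the parse_channel_request function are hypothetical names invented for this illustration.

```python
import re

# Hypothetical vocabularies, chosen only for this illustration.
ACTIONS = {"switch": "change_channel", "change": "change_channel"}
CHANNELS = {"squizz"}


def parse_channel_request(utterance):
    """Return (action, channel) regardless of word order, or None if either is missing."""
    # Lowercase and keep alphabetic tokens only, so pauses written as
    # punctuation or ellipses simply fall away.
    words = re.findall(r"[a-z']+", utterance.lower())
    # Fillers such as "um" and "ah" match neither vocabulary and are ignored.
    action = next((ACTIONS[w] for w in words if w in ACTIONS), None)
    channel = next((w for w in words if w in CHANNELS), None)
    if action and channel:
        return action, channel
    return None


casual = parse_channel_request("You know…um, that Squizz Channel…ah, switch it there")
formal = parse_channel_request("Change the channel to Squizz")
print(casual == formal)
```

Because the function looks for an action word and a channel name anywhere in the utterance, the casual and the verb-before-noun forms resolve to the same request. When either piece is missing it returns None, which is the point at which a cooperative application would ask for more information.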

Noise Tolerance

Noise tolerance is the ability to filter out environmental distractions. Intent determination, or meaning, is derived largely from context. Words that have no meaning in a given context should be filtered out. Additionally, applications should not be confused by environmental and non-human noise.

A useful way to demonstrate this notion is with the following command: "Nagoya please Nagoya play Nagoya the Nagoya Squizz Nagoya," where Squizz is the name of a satellite radio channel and Nagoya is a randomly selected, out-of-context word standing in both for words that have no meaning to the application and for sounds that represent random human noise.

This is a useful test because it illustrates real-world use where there is potential for multiple things going on, such as other human conversations within range of the microphone, non-human environmental noise, and out-of-vocabulary words introduced by speaker ambiguity or malapropisms. In any event, the application should successfully switch to the requested station, The Squizz.
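
At the text level, the Nagoya test can be sketched as a simple vocabulary filter. This is only an illustration under stated assumptions: the RADIO_VOCABULARY set and the filter_out_of_context function are hypothetical, and a real application would be filtering acoustic hypotheses rather than clean words.

```python
# Hypothetical in-context vocabulary for a satellite radio domain.
RADIO_VOCABULARY = {"please", "play", "the", "squizz", "channel", "switch", "change", "to"}


def filter_out_of_context(utterance, vocabulary=RADIO_VOCABULARY):
    """Keep only the words that carry meaning in the current context."""
    return [word for word in utterance.lower().split() if word in vocabulary]


kept = filter_out_of_context("Nagoya please Nagoya play Nagoya the Nagoya Squizz Nagoya")
print(kept)  # ['please', 'play', 'the', 'squizz']
```

Every occurrence of the out-of-context word drops out, leaving a request the application can act on.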

Conversational speech applications should use humans as their benchmarks for performance in noisy environments. If the noise level is such that a driver at 65 m.p.h. with windows cracked is understood by his passenger 92 percent of the time, then that is a useful performance goal for conversational applications.

Intent Determination and Hypothesis Building

Intent determination and hypothesis building is the ability to infer meaning from context. Context makes human conversations more efficient by establishing topic domains of possible meanings. In the example "What about traffic?" the meaning of the question depends on which domain was just communicated. If the speaker had been querying the application's media player about 1970s rock and roll bands and then said, "What about traffic?" the utterance would have one meaning. If the speaker had just been asking about Michael Douglas movies available at Amazon.com, there would be a second meaning. And if the speaker had just been asking the navigation application for directions to O'Hare Airport, there would be yet a third meaning.

In practical terms for speech technologists, the solution to this issue lies in well-defined context domains: collections of related functions, each with its own relevant vocabulary. Commands for playing songs or for retrieving stock and company information are good examples. Context domains have not only particular vocabularies but also groupings of words which, when evaluated together, disambiguate one domain from the next.

In human conversation, when out-of-context and noise words are taken into account, expressions for changing a satellite radio channel are almost limitless. But when an established context domain is searching for relevant combinations, as in the Nagoya example, the application has enough information to differentiate between a media player request and a satellite radio channel request while also disregarding the irrelevant words.

The context domains should have the self-awareness to score their own confidences as to what was said and should develop hypotheses about what was intended in an environment filled with ambient noise, speaker ambiguity, and accents. The design model is a hard-of-hearing person at a noisy party. By understanding context, possibilities, what has historically been done, and what was just done, the hard-of-hearing person can establish intent from meager phonetic clues.
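
One way to picture domains scoring their own confidence is the sketch below. The scoring rule (the fraction of spoken words a domain recognizes) is an assumption made for illustration, not a description of any particular product, and the ContextDomain class, the DOMAINS list, and both vocabularies are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextDomain:
    name: str
    vocabulary: frozenset

    def score(self, words):
        """Fraction of the utterance's words that this domain recognizes."""
        if not words:
            return 0.0
        return sum(word in self.vocabulary for word in words) / len(words)


# Hypothetical domains with deliberately overlapping vocabularies.
DOMAINS = [
    ContextDomain("satellite_radio", frozenset({"please", "play", "the", "squizz", "channel"})),
    ContextDomain("media_player", frozenset({"please", "play", "the", "song", "album"})),
]


def best_domain(utterance, domains):
    """Return (score, name) for the domain most confident about the utterance."""
    words = utterance.lower().split()
    return max((domain.score(words), domain.name) for domain in domains)


score, name = best_domain("Nagoya please Nagoya play Nagoya the Nagoya Squizz Nagoya", DOMAINS)
print(name, round(score, 2))
```

Both domains recognize "please," "play," and "the," but only the satellite radio domain also recognizes "Squizz," so the grouping of words, rather than any single word, tips the decision. A low top score would be a signal to form a weaker hypothesis and ask the user.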

The conversational speech application should know whether a partial request is a plausible fit for the most recent context domain under discussion, understanding that the user's intent is most likely built on recently established context. This should be implemented as a context stack, which keeps track of the most recent topics of conversation.

It is important to note, however, that the context stack should not constrain the user—all domains that the application controls must still be accessible, allowing the user to switch topics at any time without application confusion. In addition to having complete criteria information for a given domain of discourse, a requirement of conversational speech is to search across all criteria information for all domains simultaneously, which allows rapid selection of a domain for unambiguous data.
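
A context stack of this kind can be sketched in a few lines of Python. The ContextStack class, the MEANINGS table, and the wording of each reading of "traffic" are hypothetical and exist only to illustrate the idea; note that the stack orders the candidate domains and never removes one from consideration.

```python
class ContextStack:
    """Most-recently-used ordering of domains; the top is the latest topic."""

    def __init__(self):
        self._domains = []

    def push(self, domain):
        # Move the domain to the top if it is already on the stack.
        if domain in self._domains:
            self._domains.remove(domain)
        self._domains.append(domain)

    def recent_first(self):
        return list(reversed(self._domains))


# Hypothetical per-domain readings of the question "What about traffic?"
MEANINGS = {
    "media_player": "play music by the band Traffic",
    "movie_store": "look up the movie Traffic",
    "navigation": "report road traffic on the route",
}


def interpret_traffic(stack):
    """Resolve the ambiguous question against the most recent matching domain."""
    for domain in stack.recent_first():
        if domain in MEANINGS:
            return MEANINGS[domain]
    # No recent context to lean on, so cooperate by asking.
    return "ask the user which meaning was intended"


stack = ContextStack()
stack.push("navigation")
stack.push("media_player")
print(interpret_traffic(stack))
```

After a navigation request followed by a media player request, the question resolves to the band. Pushing "navigation" again would move it back to the top and change the answer, which is how the user switches topics without confusing the application.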

Context domains should also contain associated metadata for each criterion (for example, the application knows that the Space Needle is in Seattle). The context domain then automatically determines other information required to successfully complete the user's request. Because all criteria are available to the context domain, order of input is not important.
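
The Space Needle example might look like the following sketch, in which metadata attached to one criterion fills in another that the user never stated. The LANDMARK_METADATA table and the complete_request function are hypothetical, and the criteria are held in a dictionary so that the order of input plays no role.

```python
# Hypothetical metadata table; a real domain would draw on a far larger source.
LANDMARK_METADATA = {"space needle": {"city": "Seattle"}}


def complete_request(criteria):
    """Fill in criteria the user left unstated, using metadata about stated ones."""
    completed = dict(criteria)
    landmark = completed.get("landmark", "").lower()
    for key, value in LANDMARK_METADATA.get(landmark, {}).items():
        # Never overwrite something the user said explicitly.
        completed.setdefault(key, value)
    return completed


request = complete_request({"landmark": "Space Needle"})
print(request["city"])  # Seattle
```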

Cooperative Conversation

Cooperative Conversation, a term trademarked by VoiceBox, is intended to describe the linguistic principle that all participants in a conversation are contributing toward a mutual understanding. The Cooperative Principle was first described by Paul Grice in his 1975 work, "Logic and Conversation," and refers to how people interact with each other and their normal behavior in conversation. The Cooperative Principle, simply put, states that each participant contributes to the exchange for the benefit of the exchange, and those exchanges have an accepted purpose or direction.

In conversational speech, participants should adhere to Grice's Cooperative Principle. That is, the user and the machine work together to produce a successful outcome to the conversation.

The design model of cooperative conversation consists of the components below. (With these parts, an HMI can create a holistic voice user experience (VUE) that delivers on the promise of successful conversational speech. The end result is a perceived successful conversation and attained purpose with a machine, regardless of the number of steps it took to achieve that end.)

Shared Knowledge and Assumptions. There is a shared short-term knowledge bank of what the user has said, what she has done, and where she is. There is also long-term shared knowledge, which is user-centric rather than session-based, such as the user's preferences, the history of what she has done and said, and where she has been. In a conversational application, both the application and the user share an understanding about the topic, conversation history, word usage, jargon, and tone (formal, humorous, terse, etc.).

Conversation types. Cooperativeness depends on the type of conversation one is having: for instance, whether the conversation is based on a request, on learning, or on a spontaneous exchange. By categorizing and developing conceptual models for each type of exchange, user expectations and domain capabilities consistently match up.

Hypothesis building and intent determination. As mentioned earlier, the application determines a hypothesis for the intent of the speaker, but the application's conviction of that hypothesis can and should vary. Just as in human-to-human interaction, where one can weigh knowledge shared between participants to help gain an understanding, so can well-implemented applications weigh knowledge of context and user history to get an idea whether a given conclusion is correct.

Adaptive responses. As in a human conversation, the response of the application should be based on the application's confidence that its conclusion is correct. If it is certain, or there are few or no choices, then it should just execute the request. If it is less than certain and the choices are many, it should ask. The degree of certainty about the hypothesis (how well the utterance was recognized, along with how well it fits into the domains the application serves) determines how the application responds.

Course Correction. The ability to make corrections without interrupting the flow of conversation deserves special emphasis. How well a conversational interface deals with the daily reality of misunderstandings is a key factor in its ease of use. Misunderstandings are a normal part of human conversations, but in a cooperative setting, they should serve to propel the conversation forward instead of derailing it. A person in the passenger seat may make the same recognition mistake that a speech application might make, and could reply with something like, "What kind of restaurant did you say?" A well-implemented conversational application will conform to those everyday qualifiers as well.
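
The adaptive response rule described above, act when certain and ask when not, can be sketched as follows. The choose_response function is hypothetical, the 0.8 threshold is an arbitrary value picked for illustration, and treating "few choices" as exactly one candidate is a simplifying assumption.

```python
def choose_response(confidence, candidates, execute_threshold=0.8):
    """Decide whether to act on a hypothesis or ask a clarifying question.

    The threshold and the one-candidate rule are illustrative assumptions.
    """
    if not candidates:
        return "ask: What would you like to do?"
    # Certain enough, or nothing to choose between: just do it.
    if confidence >= execute_threshold or len(candidates) == 1:
        return f"execute: {candidates[0]}"
    # Less than certain with several options: ask, as a passenger would.
    return "ask: Did you mean " + " or ".join(candidates) + "?"


print(choose_response(0.4, ["steak restaurant", "Mexican restaurant"]))
```

With low confidence and two plausible readings of the restaurant request, the sketch returns a clarifying question rather than guessing, which keeps a misunderstanding from derailing the exchange.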


Applications claiming conversational speech capabilities should be adept at handling casual speech and highly tolerant of extraneous noise, using powerful intent determination and hypothesis-building algorithms. Together, those elements result in cooperative conversations.

Cooperative conversations are also based on shared knowledge. Long-term shared knowledge can be used to personalize and enhance the VUE, while short-term knowledge helps to keep the conversation moving forward. Degrees of certainty affect the application's responses, and how well misunderstandings are handled by the application will determine how human the experience is.

The more human the experience, the higher the degree of satisfaction and success for users interacting with the conversational speech application. In the end, that is what speech technologists shoot for.

Tom Freeman is cofounder and senior vice president of marketing at VoiceBox Technologies Inc., a provider of conversational voice search technologies for telematics, digital home, mobile phone, and voice-over-IP solutions.

