June 30, 2003
By Walter Rolandi, Founder, The Voice User Interface Company, LLC
The Human Factor

When You Don't Know When You Don't Know

A Good Intention

During a break at a recent speech technology conference, a group of attendees was discussing the importance of learnability in application design. One participant recommended a particular method for classifying and handling recognition results.

The scheme divided user utterances into three basic categories: high confidence matches; low-to-medium confidence matches; and “no-match” or out-of-grammar (OOG) utterances. For the higher confidence results, the application acted as if it knew what the user had said and proceeded accordingly. For the low-to-medium confidence results, the application asked follow-up questions designed to confirm what was “heard.” When the recognizer returned a confidence of “zero,” however, the application would inform the user that the utterance was not in the grammar.
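
The three-way scheme described above can be sketched in a few lines. This is only an illustration of the design as the participant described it: the `HIGH_CONFIDENCE` cutoff of 0.80, the `Action` names, and the `classify` function are assumptions made for the example, and a real recognizer's confidence scale and API will differ.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"                 # proceed as if the utterance was understood
    CONFIRM = "confirm"               # ask a follow-up question to verify
    REPORT_OOG = "report_out_of_grammar"  # tell the user the utterance is not allowed


# Assumed cutoff for a "high confidence" match; not taken from any real platform.
HIGH_CONFIDENCE = 0.80


def classify(cf: float) -> Action:
    """Map a confidence factor in [0.0, 1.0] to one of the scheme's three actions."""
    if not 0.0 <= cf <= 1.0:
        raise ValueError(f"confidence factor out of range: {cf}")
    if cf == 0.0:
        # The scheme treats a zero score as proof the utterance is out of grammar.
        return Action.REPORT_OOG
    if cf >= HIGH_CONFIDENCE:
        return Action.ACCEPT
    return Action.CONFIRM
```

Under these assumed settings, `classify(0.91)` proceeds without confirmation, `classify(0.52)` triggers a confirmation question, and `classify(0.0)` tells the user the utterance was not in the grammar.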

The participant believed that this method helped users learn what they could and could not say when using the application. The idea was to tell the user whenever he or she made an utterance that the recognizer could not understand and thereby teach the user not to say it again.

This would be a pretty good design idea were it not for the unfortunate fact that it is predicated on a nonexistent ability: it is not currently possible to determine that an utterance is definitively and absolutely out-of-grammar.

Certainty

The problem, as is often the case with speech applications, revolves around “certainty.” When a user says something, the recognizer attempts to match what is said to entries in its active grammar. Most recognizers associate some degree of certainty with every match. Typically, this degree of certainty is represented by a certainty (or confidence) factor (CF) that is assigned to the match. While not actual probabilities, CFs often look like probabilities: they are numbers that range between zero and 1.0. A CF of .91 would be a highly confident match; a CF of .52 would be a low-to-medium match. Usually, a CF under .40 is low enough that assuming the match is “good” would be wrong at least as often as it would be right.

But this fellow went wrong when he naively assumed that a CF of zero means that there is definitely no match. This is not the case. Due to normal ASR error and extraneous environmental factors, recognizers are simply not accurate enough to determine that any particular utterance is definitely not in a grammar. In fact, I have seen instances where the active grammar contained precisely two entries, “yes” and “no,” and a user said “yes.” Yet, due to background noise, the recognizer returned a CF of zero.

In such a case, what the application “heard” was not in the grammar. But what the user said most certainly was. As far as the user was concerned, he or she answered “yes” to a “yes/no” question — a perfectly reasonable thing to do. Imagine the user’s dismay if the application were to respond by saying that what the user said was not allowed in the grammar!
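
To make the contrast concrete, the sketch below sets two hypothetical no-match prompts side by side for the yes/no case. The first asserts that the user said something disallowed, which the recognizer cannot actually know. The second only admits that the application failed to understand and restates the options. The function names and prompt wording are assumptions for illustration; the second prompt is one possible alternative, not a remedy the column itself prescribes.

```python
from typing import Sequence


def blaming_no_match_prompt() -> str:
    """Prompt that treats a zero score as proof of an out-of-grammar utterance."""
    return "Sorry, that is not something you can say here."


def neutral_no_match_prompt(grammar: Sequence[str]) -> str:
    """Prompt that claims only that the application did not understand."""
    if not grammar:
        raise ValueError("grammar must contain at least one entry")
    options = " or ".join(grammar)
    return f"Sorry, I didn't catch that. Please say {options}."


# The yes/no scenario: the user said "yes", but noise drove the score to zero.
print(blaming_no_match_prompt())
print(neutral_no_match_prompt(["yes", "no"]))
```

A user who has just answered “yes” hears the first prompt as a false accusation, while the second remains true whether the utterance was out of grammar or simply misrecognized.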

Greater Implications

While it may seem obvious to most, not knowing when you know something can tremendously handicap the intelligence of an application. Knowing when we know is an ability that we take for granted in ourselves.

Imagine an English-only speaker being asked a question by someone speaking in French. Not understanding even a syllable of French, the English speaker would likely reply, “I’m sorry. Please speak English.” There is no need for him to exhaustively search his memory for possible matches or to devise a list of best guess interpretations. The English speaker instantly knows that what the other person said was not English, i.e., that the speaker’s utterance was not in the listener’s “grammar.”

While it may seem inconsequential, having a recognizer capable of accurately determining whether or not an utterance is in its grammar would be a significant step toward more intelligent voice user interfaces. Knowing when you know something suggests a greater, general self-awareness, and the self-aware can often inspire confidence.

When applications can justifiably appear to possess a greater degree of confidence in their actions, they will inspire a greater degree of trust in their users.

Dr. Walter Rolandi is the founder and owner of The Voice User Interface Company in Columbia, S.C. Dr. Rolandi provides consultative services in the design, development and evaluation of telephony-based voice user interfaces (VUI) and evaluates ASR, TTS and conversational dialog technologies. He can be reached at wrolandi@wrolandi.com.
