When You Don't Know When You Don't Know
A Good Intention
During a break at a recent speech technology conference, a group of attendees were discussing the importance of learnability in their application designs. One participant advocated a particular method for classifying and dealing with recognition results, arguing that it made applications easier to learn.
The scheme divided user utterances into three basic categories: high confidence matches, low-to-medium confidence matches, and no-match or out-of-grammar (OOG) utterances. For the higher confidence results, the application acted as if it knew what the user had said and proceeded accordingly. For the low-to-medium confidence results, the application asked follow-up questions designed to confirm what was heard. When the recognizer returned a confidence of zero, however, the application would inform the user that the utterance was not in the grammar.
The participant believed that this method helped users to learn what they can and cannot say when using the application. The idea was to tell the user when he or she makes an utterance that the recognizer cannot understand and thereby teach the user not to say the utterance again.
This would be a pretty good design idea were it not for the unfortunate fact that it is predicated on a nonexistent ability: it is not currently possible to determine that an utterance is definitively and absolutely out-of-grammar.
The problem, as is often the case with speech applications, revolves around certainty. When a user says something, the recognizer attempts to match what is said to entries in its active grammar. Most recognizers associate some degree of certainty with every match. Typically, this degree of certainty is represented by a certainty (or confidence) factor (CF), which is assigned to the match. While not actual probabilities, CFs often look like probabilities: they are numbers that range between zero and 1.0. A CF of .91 would be a highly confident match; a CF of .52 would be a low-to-medium match. Usually a CF under .40 indicates so little certainty that assuming the match was good would be wrong at least as often as it would be right.
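The three categories described above can be sketched in a few lines of Python. This is only an illustration: the names `Band` and `classify_confidence` are hypothetical, and the 0.80 cutoff for a high confidence match is an assumption, since the text gives only .91 and .52 as examples. The .40 cutoff follows the rule of thumb above.

```python
from enum import Enum


class Band(Enum):
    HIGH = "high"                    # act on the match
    LOW_TO_MEDIUM = "low_to_medium"  # confirm with a follow-up question
    UNRELIABLE = "unreliable"        # too uncertain to trust the match


# Assumed cutoffs for illustration; real recognizers and applications
# tune these per grammar and per deployment.
HIGH_CUTOFF = 0.80
LOW_CUTOFF = 0.40


def classify_confidence(cf: float) -> Band:
    """Map a confidence factor in [0.0, 1.0] to one of three bands."""
    if not 0.0 <= cf <= 1.0:
        raise ValueError(f"confidence factor out of range: {cf}")
    if cf >= HIGH_CUTOFF:
        return Band.HIGH
    if cf >= LOW_CUTOFF:
        return Band.LOW_TO_MEDIUM
    return Band.UNRELIABLE
```

Note that the lowest band is deliberately named for what the recognizer knows (the match is unreliable) rather than for what the user did (said something out of grammar).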
But where this fellow went wrong was in naively assuming that a CF of zero means that there is definitely no match. This is simply not the case. Due to normal ASR error and extraneous environmental factors, recognizers are not accurate enough to determine that any particular utterance is definitely not in a grammar. In fact, I have seen instances where the active grammar contained precisely two entries - "yes" and "no" - and a user said "yes." Yet, due to background noise, the recognizer returned a CF of zero.
In such a case, what the application heard was not in the grammar. But what the user said most certainly was. As far as the user was concerned, he or she answered yes to a yes/no question, a perfectly reasonable thing to do. Imagine the user's dismay if the application were to respond by saying that what the user said was not allowed in the grammar!
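One way to avoid that outcome is to word the lowest-confidence response as a failure to hear rather than as a ruling on what the user said. The sketch below is an assumption-laden illustration, not a prescribed design: the function name `choose_response`, the prompt wording, and the 0.80 and 0.40 cutoffs are all hypothetical.

```python
from typing import Optional

# Assumed cutoffs for illustration only.
CONFIRM_BELOW = 0.80
REJECT_BELOW = 0.40


def choose_response(hypothesis: Optional[str], cf: float, question: str) -> str:
    """Pick the application's next prompt from a recognition result."""
    if hypothesis is None or cf < REJECT_BELOW:
        # A zero or very low CF means the recognizer failed to match.
        # It does not prove the utterance was out of grammar, so the
        # prompt makes no claim about what the user is allowed to say.
        return f"Sorry, I didn't catch that. {question}"
    if cf < CONFIRM_BELOW:
        # Low-to-medium confidence: confirm what was heard.
        return f"I think you said {hypothesis}. Is that right?"
    # High confidence: proceed as if the match is correct.
    return f"OK, {hypothesis}."
```

With the yes/no example above, a user who says "yes" over background noise and receives a CF of zero is simply asked the question again, instead of being told that "yes" is not an acceptable answer.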
While it may seem obvious to most, not knowing when you know something can tremendously handicap the intelligence of an application. This is an ability that we take for granted in ourselves.
Imagine an English-only speaker being asked a question by someone speaking in French. Not understanding even a syllable of French, the English speaker would likely reply, "I'm sorry. Please speak English." There is no need to exhaustively search his memory for possible matches or to devise a list of best-guess interpretations. The English speaker instantly knows that what the other person said was not English, i.e., that the speaker's utterance was not in the listener's grammar.
While it may seem inconsequential, having a recognizer capable of accurately determining whether or not an utterance is in its grammar would be a significant step toward more intelligent voice user interfaces. Knowing when you know something suggests a greater, general self-awareness, and the self-aware can often inspire confidence.
When applications can justifiably appear to possess a greater degree of confidence in their actions, they will inspire a greater degree of trust in their users.
Dr. Walter Rolandi is the founder and owner of The Voice User Interface Company in Columbia, S.C. Dr. Rolandi provides consultative services in the design, development and evaluation of telephony based voice user interfaces (VUI) and evaluates ASR, TTS and conversational dialog technologies. He can be reached at firstname.lastname@example.org