The Cumulative Effect of Recognition Failures
"Adapt or perish, now as ever, is nature's inexorable imperative."
-H. G. Wells
The Good News
Everyone in the industry is well aware that current speech technologies are undeniably impressive. Speech recognition accuracy rates have been very high for some time, and there have been dramatic improvements in ASR robustness (the ability to recognize utterances under unfavorable conditions) during the last few years. The news might be exclusively good if only a high speech recognition accuracy rate were sufficient to ensure pleasant and efficient man-machine dialogs.
The Fly in the Ointment
When recognition goes well, turns are taken quickly and the user rapidly advances towards his task completion. Unfortunately, even with recognition accuracy rates approaching 100 percent, it is still possible for just a few recognition failure events to totally destroy an otherwise pleasant and productive user experience. This is particularly true of IVR applications that are repeatedly and frequently called.
98 Percent Accuracy…
How can this be? How can only a handful of recognition failures frustrate a user and ruin a user experience? Let's take an example.
Let's say that an administrative assistant at a hospital is obliged to call a particular insurance industry IVR 10 times a day to verify admission eligibility. On average, a call requires 10 conversational turns to obtain the desired information. That is to say, the caller is prompted to speak 10 times over the course of a session. Thus, on an average day, the caller will speak 100 utterances while interacting with the IVR.
Assuming the industry-touted recognition rate of 98 percent, the user would experience only two recognition failures a day. That is 98 recognition successes versus two recognition failures! How could this be so bad?
When you consider the numbers alone, there wouldn't seem to be a problem. But what if the two measly errors that the user experienced every day were always the same errors? In other words, what if the only failures the user ever experienced were at the same juncture or state in the IVR?
Molehill to Mountain
This would have two effects. First, it would greatly exaggerate the perception of recognition failures. Recall that the user must complete 10 turns to obtain the information. Put another way, he must pass through 10 dialog states. If the two daily errors always occurred in the same dialog state, and each of the 10 daily calls passed through that state once, two out of 10 visits to that state would result in failure. Thus, instead of perceiving two failures per 100 utterances, the user may perceive two failures per 10 visits to that state, or two failures per 10 calls, leaving an overall (if unjust) impression of a 20 percent failure rate.
The second effect is far more damaging, however. The fact that the failure always occurs at the same juncture is a constant reminder, and an ever-escalating annoyance, that the application is stupid because it cannot learn from its mistakes.
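The arithmetic behind this first effect can be sketched in a few lines of Python. This is only an illustration of the hypothetical scenario above: the variable names are invented for the example, and it assumes that each call passes through the troublesome dialog state exactly once.

```python
from fractions import Fraction

# Figures taken from the hospital example; all names here are illustrative.
CALLS_PER_DAY = 10
TURNS_PER_CALL = 10
ACCURACY = Fraction(98, 100)

utterances_per_day = CALLS_PER_DAY * TURNS_PER_CALL      # 100 utterances
failures_per_day = utterances_per_day * (1 - ACCURACY)   # 2 failures

# Measured over every utterance, the failure rate looks negligible.
overall_failure_rate = failures_per_day / utterances_per_day

# Assumption: every failure lands in the same dialog state, and each call
# visits that state exactly once, so the state is visited once per call.
visits_to_bad_state = CALLS_PER_DAY
perceived_failure_rate = failures_per_day / visits_to_bad_state

print(f"Overall failure rate:   {float(overall_failure_rate):.0%}")
print(f"Perceived failure rate: {float(perceived_failure_rate):.0%}")
```

The same two failures account for 2 percent of all utterances but 20 percent of the visits to the one state where they occur, which is the rate the caller is likely to remember.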
Imagine having an assistant who places all of your calls, and suppose you call about 100 individuals per week. What if, every time you asked the assistant to dial a particular, perhaps unusual name, the assistant failed to understand, forcing you to repeat the name at least once? Even though this might occur only once for every 100 names, most people would grow impatient with such an assistant. They would do so because they would expect the assistant to be able to learn.
Even though it seems "unfair," users adopt similar expectations for speech applications. Yet few industry players seem to appreciate that even when recognition rates approach 100 percent, recognition failures can nevertheless have a strong and detrimental cumulative effect.
Speech recognizers are, in fact, highly reliable, but as I have said before, they provide the application ears, not brains. And even though VUI design best practices are rapidly proliferating, additional innovation in dialog modeling will be needed to overcome the cumulative effects of recognition failure and better address user expectations.
VUI designers, speech vendors, platform providers and others should now concentrate on making VUIs more human-like in their ability to adapt and self-improve. Greater acceptance of speech recognition applications is likely to require it.
Dr. Walter Rolandi is the founder and owner of The Voice User Interface Company in Columbia, S.C. Dr. Rolandi provides consultative services in the design, development and evaluation of telephony-based voice user interfaces (VUI) and evaluates ASR, TTS and conversational dialog technologies. He can be reached at firstname.lastname@example.org.