The Art and Science of Error Handling
Each call to an interactive voice response (IVR) system is a conversation that, like human-to-human conversations, can easily be disrupted when one party misreads the cues and talks out of turn. While humans can usually recover after interrupting one another in the course of a normal dialogue, a single turn-taking error can completely derail a conversation with an automated system. These errors can send the caller down the wrong path, trigger a transfer to an agent (which defeats the purpose of having an IVR in the first place), or, worse yet, cause the system to freeze up and drop the call entirely.
Nandini Stocker, a senior voice interaction designer at Google, has been dealing with these kinds of scenarios for the past 16 years. Her prior experience has included speech system work at Flare Design, Adecco, Convergys, Spanlink Communications, Intervoice, Gravelroad, TuVox, and MCI.
Stocker recently spoke with Speech Technology's Senior News Editor Leonard Klie and shared ways for voice user interface designers to prevent, diagnose, and repair systems plagued by turn-taking issues.
Speech Technology: How can interruptions derail a normal IVR interaction?
Nandini Stocker: The term that we use in the industry is turn-taking issues. In normal human-to-human conversations, there are certain signals we give each other to either yield or take our turn speaking. The simplest example is a question: There is a certain syntax and intonation we use to indicate that it's a question and the other person can almost hear the question mark.
With a speech application, there is usually a series of questions and answers, and there is this concept of a barge-in, where the person can interrupt the system and answer the question at any time. Where things start to fall apart is when responses happen at the wrong time and the system cannot anticipate them, so it doesn't respond the way a human might.
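In VoiceXML-based IVR platforms, barge-in is typically a per-prompt setting, with separate handlers for speech that isn't understood. The interview doesn't name a platform, so the following is only an illustrative sketch; the field name and grammar URI are hypothetical:

```xml
<!-- Hypothetical VoiceXML fragment: barge-in enabled on a question -->
<form id="get_account">
  <field name="account_number">
    <!-- bargein="true": the caller may interrupt this prompt
         and answer at any point during playback -->
    <prompt bargein="true">
      Please say or enter your account number.
    </prompt>
    <grammar src="account.grxml" type="application/srgs+xml"/>
    <!-- Speech detected but not recognized: re-prompt so the
         caller knows what happened and how to get back on track -->
    <nomatch>
      Sorry, I didn't catch that. Please say your account number again.
    </nomatch>
  </field>
</form>
```

Setting `bargein="false"` on a prompt would instead force the caller to wait, which is one blunt way platforms avoid mid-prompt interruptions at the cost of a less natural conversation.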
In face-to-face conversations, interruptions happen all the time, but we almost always know what to do and we can usually recover quite easily. Once the conversation moves to the phone, you've already lost the advantages of some of the more effective visual cues, like gesticulation or facial expressions. On top of that, mobile connections introduce delays that make it even more difficult to communicate.
If there's even a slight delay in a phone conversation, you could start talking over one another, and it creates a lot of problems for the speech application.
So what does this do to the typical interaction?
A variety of things can happen inside a speech application itself. The least egregious is probably that the caller hears that he wasn't recognized and gets a re-prompt. It's the least egregious because at least the caller knows what happened and how to get back on track.
Far worse are things like a feedback loop, where the caller thinks it's his turn to talk and so he starts talking, then he hears the system start to respond, and he stops talking because he thinks the system is going to ask him something else. Meanwhile, barge-in was triggered and an error was hit, and the caller doesn't know what's going on so he starts again.
Worse yet, the caller answers the next question without ever even hearing it.
Where do these interruptions occur? Is there a particular point in the conversation where they happen more often?
There are a couple of issues here. The simplest—and one that most people understand but don't always know what to do about—is noise. There could be some noise that the system tries to compensate for but it has interrupted the question, and the caller doesn't even know he was being asked a question.
Another issue is timing. In human conversations, there can be awkward pauses or thoughtful pauses. In a speech application, there could be processing delays or a data dip into a back-end database, and the caller has no idea that this kind of thing is going on. He might think it's his turn to talk when that's not necessarily the case; instead, there's some kind of processing going on before we get to the next question.
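VoiceXML has a mechanism aimed at exactly this gap: fetch audio, a hold message played during a slow back-end lookup so the caller knows it is not yet their turn to speak. A minimal sketch, assuming a VoiceXML platform; the URL and audio filename are placeholders:

```xml
<!-- Hypothetical fragment: signal a back-end "data dip" to the caller -->
<subdialog name="balance_lookup"
           src="http://example.com/ivr/balance"
           fetchaudio="please_wait.wav">
  <!-- If the fetch takes longer than the platform's configured
       delay, please_wait.wav plays ("One moment while I look
       that up..."), filling the pause the caller might otherwise
       interpret as their turn to talk -->
</subdialog>
```

The point is not the specific attribute but the design principle Stocker describes: an unexplained pause invites the caller to speak, so processing time should be audibly accounted for.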
But the really big problem here is the prompt structure. Speech applications are trying to solve complex business challenges, and they are trying to get some kind of a response from the caller. But with the bulk of the applications out there, there is a bit of an imbalance. The designers are trying to pack too much into a single question. It's quite common to see a system that will ask the caller for a piece of information, pause, and then give some additional instruction. The pause is just awkward enough that the caller starts talking, and it overlaps with the instruction, so he stops, and then the whole cycle of turn-taking issues starts.
Then there's the problem of presenting a long list of options without an appropriate pause for the caller to jump in with his answer.
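One common remedy for the overloaded-prompt pattern is to keep the initial prompt to a single question and move the extra instruction into escalating no-input handlers, so it plays only for callers who actually hesitate. A sketch along those lines, again assuming a VoiceXML platform, with illustrative wording and a hypothetical grammar:

```xml
<!-- Hypothetical fragment: one question up front; instructions
     deferred to noinput handlers instead of an awkward pause -->
<field name="claim_type">
  <prompt bargein="true">Is this claim for auto, home, or life?</prompt>
  <grammar src="claim_type.grxml" type="application/srgs+xml"/>
  <!-- First silence: a short nudge, not the full instruction -->
  <noinput count="1">You can say auto, home, or life.</noinput>
  <!-- Continued silence: only now add the fuller guidance -->
  <noinput count="2">
    For example, say auto if the claim involves a car accident.
  </noinput>
</field>
```

This keeps the turn boundary unambiguous: the question ends, the caller's turn begins, and additional guidance arrives only when the caller's silence shows it is needed.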