A Multiple Approach Is Best
The usability of a voice user interface (VUI) is typically determined in a lab under controlled conditions. However, the controlled lab environment is very different from the in-use, in-situation environment. Because of this limitation, new methods are being employed to perform VUI usability experiments in use and in situation.
VUIs rely on user input through speech. The acoustic speech signal varies a great deal with the user’s accent, native language, and gender; the size and shape of the vocal tract; speaking rate and pitch; background and signal noise; and other factors. Because user input to VUIs is so variable, in-use, in-situation evaluation is imperative. Two of the most common evaluation practices for VUIs are Wizard of Oz testing and controlled-environment testing. These common methods have flaws and may not capture the variability of user input well enough to predict how a VUI will perform in actual use.
Wizard of Oz (WOZ) testing is a form of usability testing that happens early in the design process. Such predevelopment usability testing affords early evaluation of both the call flow and the verbiage used in prompting. In WOZ, a trained "wizard" assumes the role of the system and reads prompts according to how the call flow is designed. WOZ helps the designer tailor the call flow and prompt verbiage to fit the user’s needs. It permits the designer to see which prompts or questions are hard for the user to understand; it allows the speech scientist to see what synonyms should appear in the speech recognition grammars; and it gives insight into average call duration. Identifying all these things before development can result in huge cost savings: prompts may not need to be rerecorded, grammars are likely to require less tuning, and the design is nailed down before any development has started. Fewer enhancements mean fewer releases.
WOZ testing is commonly viewed as a must for designing a speech application and, in the end, can be a big money-saver. The WOZ tests serve as input into the design of the VUI, which is evaluated later using a controlled-environment approach.
Most VUIs undergo an extensive controlled-environment evaluation when development has been completed. This process is very similar to summative evaluation in GUIs. In a typical controlled-environment experiment, participants are recruited based upon a specified demographic, like age, gender, or appropriate background experience. Participants come to the lab or office and are given specific scenarios to follow. Scenarios range from very simple to detailed and complex. Their interactions with the VUI during the execution of the scenarios are observed and recorded. Typically, video is collected and the participants fill out a post-experiment survey.
However, there is a potential problem with controlled evaluations. The controlled lab environment is very different from the actual use environment. This is a question of the experiment’s ecological validity: the extent to which the context of a study matches the context of actual use. Ideally, the results of any experiment will have a high degree of external validity, or an ability to hold across different experimental settings, procedures, and participants.
In VUI experiments, ecological validity is crucial to external validity because in-use, in-situation application use is so variable. VUI users keep finding novel environments and situations in which to use speech applications, complicating a researcher’s ability to properly emulate the in-use, in-situation environment. This creates an experiment-to-reality gap that inevitably decreases ecological and external validity. One approach to minimizing the experiment-to-reality gap is to conduct in-use, in-situation usability studies, commonly done using surveys, call recordings, and call logs. However, none of these tools on its own gives the evaluator a complete picture of VUI usability.
Surveys are issued to users after a call has been completed. In some instances, a voice-enabled survey is used. In others, a call agent interviews the caller and fills out the survey. Surveys can be a good source for obtaining the caller’s opinions, but they suffer from a number of problems.
One of the most significant problems with surveys is that some callers don’t complete the call. Another problem is that some callers choose not to fill them out. Add to that the fact that VUI designers often design the voice survey to precisely model the written survey. Voice-based surveys should be designed and implemented as VUIs, not as the simple translation of the written survey to voice form. To date, no empirical research has been done on this subject, but several VUI designers have reported this as a problem.
In general, surveys are good for learning about the caller’s opinions, but they don’t reveal what actually occurred during a call. This information is generally collected using call recordings. Most VUI systems record caller interactions. Listening to call recordings can reveal information about actual calls that surveys can never capture. Most surveys don’t ask the caller, "What did you say to the system that the system did not recognize?" Even if surveys did ask this question, most users can’t recall what they actually said unless the exchange was frustrating. However, this information can be obtained from call recordings.
Call recordings have several shortcomings as well. They cannot tell the designer what the caller was trying to do, how the caller felt, or why the caller did what he or she did. Analyzing them also requires a massive amount of effort, especially in a large call center, where there could be thousands of calls per day.
A proper sampling of 100 calls could be used to offset the enormous load of listening to every call, but determining which calls to listen to is a complicated task. Large call volumes, by contrast, can be analyzed easily using call logs.
Today most IVR platforms come with extensive call logging capabilities. Every call that enters the VUI generates one or more entries in the call log. Call logs provide the VUI designer with valuable in-use, in-situation data. Surveys and call recordings typically result in qualitative data. Call log data is typically quantitative, providing an average call length, time on hold, abandon rate, call volume, and hints about what callers are looking for and whether they are finding it. Data mining techniques are well suited for processing the massive amount of call log data generated by a call center.
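As a rough illustration of the quantitative measures listed above, the following Python sketch computes call volume, average call length, abandon rate, and the most common abandon point from a handful of call log records. The record layout (CallRecord and its fields) and the sample values are hypothetical and chosen only for this example; each IVR platform defines its own log format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CallRecord:
    # Hypothetical log fields; real IVR logs will differ.
    call_id: str
    duration_s: float  # total call length in seconds
    abandoned: bool    # True if the caller hung up before finishing a task
    exit_state: str    # name of the dialogue state where the call ended


def summarize(calls):
    """Return basic quantitative measures for a list of CallRecord objects."""
    if not calls:
        return {"call_volume": 0, "avg_duration_s": 0.0,
                "abandon_rate": 0.0, "top_abandon_state": None}
    abandoned = [c for c in calls if c.abandoned]
    # Count how often each dialogue state was the point of abandonment.
    exit_counts = Counter(c.exit_state for c in abandoned)
    top = exit_counts.most_common(1)
    return {
        "call_volume": len(calls),
        "avg_duration_s": sum(c.duration_s for c in calls) / len(calls),
        "abandon_rate": len(abandoned) / len(calls),
        "top_abandon_state": top[0][0] if top else None,
    }


# Invented sample data, for illustration only.
SAMPLE_CALLS = [
    CallRecord("c1", 120.0, False, "goodbye"),
    CallRecord("c2", 45.0, True, "account_number"),
    CallRecord("c3", 60.0, True, "account_number"),
    CallRecord("c4", 175.0, False, "goodbye"),
]

SUMMARY = summarize(SAMPLE_CALLS)
print(SUMMARY)
```

A summary like this hints at where callers give up, but, as with any call log analysis, it says nothing about why.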
There are several off-the-shelf tools that can be used to analyze call log files, by makers like ClickFox, IQ Services, Enterprise Integration Group (EIG), and others. These tools provide a plethora of analysis features, including the ability to analyze call logs in near-real time. Some tools provide a dashboard-like interface that shows calls entering the IVR and where they leave. ClickFox produces a call graph that illustrates dialogue exchange and system traversal. IQ Services monitors incoming calls and emails the speech scientist when certain thresholds—such as when 10 percent of the callers within an hour abandon the system at the same point—have been met. Although these tools provide valuable information for in-use, in-situation analysis, call logs don’t reveal the entire picture.
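The kind of threshold check described above can be sketched in a few lines of Python. This is not how any of the named products work internally; it is a minimal illustration that assumes each call is logged as an (end_time, abandoned, exit_state) tuple, a layout invented for this example, and it only reports the offending dialogue states rather than sending an email.

```python
from collections import Counter
from datetime import datetime, timedelta


def abandon_hotspots(calls, window_end, window=timedelta(hours=1), threshold=0.10):
    """Return states where abandons reach `threshold` of all calls in the window.

    `calls` is a list of (end_time, abandoned, exit_state) tuples, a
    hypothetical layout chosen for this sketch.
    """
    window_start = window_end - window
    recent = [c for c in calls if window_start < c[0] <= window_end]
    if not recent:
        return {}
    # Count abandons per dialogue state among calls inside the window.
    abandons = Counter(state for _, abandoned, state in recent if abandoned)
    return {
        state: count / len(recent)
        for state, count in abandons.items()
        if count / len(recent) >= threshold
    }


# Invented sample data: 20 calls in the last hour, plus older calls
# that fall outside the window and must be ignored.
NOW = datetime(2024, 1, 1, 12, 0)
CALLS = (
    [(NOW - timedelta(minutes=5), True, "account_number")] * 3
    + [(NOW - timedelta(minutes=10), True, "main_menu")] * 1
    + [(NOW - timedelta(minutes=20), False, "goodbye")] * 16
    + [(NOW - timedelta(hours=3), True, "account_number")] * 50
)

HOTSPOTS = abandon_hotspots(CALLS, NOW)
print(HOTSPOTS)
```

In this sample, 3 of the 20 calls in the window were abandoned at the hypothetical account_number state, which crosses the 10 percent threshold, while the single abandon at main_menu does not.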
Call logs identify where callers have difficulty in the VUI, providing a quantitative analysis based upon in-use, in-situation interactions. But this is only part of the picture. Call logs don’t provide a context for interpreting the results: things that look bad in a log might be good, and vice versa. They can help the speech scientist identify where the problems are, but not what the problems are. To perform effective in-use, in-situation evaluation of a VUI, all three items (surveys, recordings, and call logs) must be used.
To perform in-use, in-situation evaluation of a VUI, it is necessary to use all of the tools in your toolbox: surveys, recordings, and call logs. The process begins with the call logs. When a potential problem has been identified by the call log analysis tool, the location of the problem in the VUI is revealed as well. Using this information, the call recordings at the point of the problem can be used to analyze what actually occurred in the VUI. If additional information regarding the problem is needed, then callers can be surveyed or interviewed. This approach will minimize the massive effort required to analyze all call recordings, make better use of surveys, and use call log data to obtain quantitative analysis. Using all the tools in the toolbox in support of each other addresses the weaknesses of each tool.
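The three-step process described above (logs first, then recordings, then surveys) can be sketched as a small Python function. Everything here is an assumption made for illustration: the dictionary keys, the recording paths, and the choice of "most abandons" as the signal of a problem. A real deployment would plug in whatever its call log analysis tool reports.

```python
from collections import Counter


def triage(call_log, recordings, max_recordings=5):
    """Pick the worst dialogue state from the log, then the recordings to review.

    call_log:   list of dicts with hypothetical keys "call_id", "abandoned",
                and "exit_state".
    recordings: dict mapping call_id to a recording path (also hypothetical).
    """
    abandons = Counter(r["exit_state"] for r in call_log if r["abandoned"])
    if not abandons:
        return None  # the log shows no abandoned calls to investigate
    problem_state, _ = abandons.most_common(1)[0]
    affected = [r["call_id"] for r in call_log
                if r["abandoned"] and r["exit_state"] == problem_state]
    # Only listen to recordings of calls that failed at the problem state.
    to_review = [recordings[c] for c in affected if c in recordings][:max_recordings]
    return {
        "problem_state": problem_state,     # step 1: where (from the call log)
        "recordings_to_review": to_review,  # step 2: what happened (recordings)
        "callers_to_survey": affected,      # step 3: why (survey or interview)
    }


# Invented sample data, for illustration only.
LOG = [
    {"call_id": "c1", "abandoned": True, "exit_state": "account_number"},
    {"call_id": "c2", "abandoned": False, "exit_state": "goodbye"},
    {"call_id": "c3", "abandoned": True, "exit_state": "account_number"},
    {"call_id": "c4", "abandoned": True, "exit_state": "main_menu"},
]
RECORDINGS = {"c1": "rec/c1.wav", "c4": "rec/c4.wav"}

PLAN = triage(LOG, RECORDINGS)
print(PLAN)
```

The point of the sketch is the narrowing: out of four calls and two recordings, only one recording needs to be heard and only two callers are candidates for a follow-up survey.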
Additionally, call recordings can still undergo random evaluation. Surveys can still be issued to callers who complete certain tasks. These methods can still be used in isolation to spot-check the VUI in use, in situation. However, the integration of all three methods will optimize VUI design and potentially minimize costs by targeting problem areas identified in use, in situation. In our research, we could not find a comprehensive tool that integrates all three testing items, but the outlook for such a platform is optimistic.
Juan Gilbert, PhD, is an associate professor in the Computer Science and Software Engineering Department at Auburn University in Alabama, where he directs the Human-Centered Computing Lab. Kristie Voss is a VUI designer at Convergys Corp.