The What, Why and How of Usability Testing
As the cartoon illustrates, users become frustrated when speech applications don't work. Usability testing minimizes this frustration by detecting and resolving many speech application problems before the application reaches users.
What do usability tests measure?
Developers use two types of metrics (measurements) in usability testing—performance and preference:
- Performance metrics—Measure the caller's success or failure at performing specific tasks. A performance success indicator is the target value for a metric; if the measured performance fails to achieve the success indicator, then the application is not yet ready for release. Table 1 shows some example performance metrics and success indicators.
Table 1: Caller Tasks, Performance Metrics and Success Indicators
| Caller Task | Performance Metric | Success Indicator |
| --- | --- | --- |
| The caller speaks a command | Word error rate | Less than 3% |
| The caller understands a prompt | The caller performs an appropriate action after hearing the prompt | Greater than 97% |
| The caller completes a specific transaction | The caller successfully completes a specific transaction | Greater than 93% |
Performance metrics are objective. The indicator (value) for each of these metrics can be measured by counting tasks and timing how long users take to perform them. During testing, qualification test engineers ask callers to perform specific tasks, and the time it takes to perform each task is recorded in a log file.
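Because performance metrics are counts and times, the bookkeeping behind them can be sketched in a few lines. The task names, outcomes, and timings below are hypothetical, invented purely for illustration:

```python
# Hypothetical task log: (task_name, succeeded, seconds_taken) records.
# In practice these values would be extracted from the application's log file.
task_log = [
    ("check balance", True, 22.4),
    ("check balance", True, 19.1),
    ("transfer funds", False, 47.8),
    ("transfer funds", True, 31.0),
]

def completion_rate(log):
    """Fraction of attempted tasks the callers completed successfully."""
    return sum(1 for _, ok, _ in log if ok) / len(log)

def average_time(log):
    """Mean seconds per task attempt."""
    return sum(t for _, _, t in log) / len(log)

rate = completion_rate(task_log)
print(f"completion rate: {rate:.0%}, meets 93% target: {rate > 0.93}")
```

Comparing the computed rate against the success indicator in Table 1 tells the developer whether another round of refinement is needed.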
- Preference metrics—Measure the caller’s likes and dislikes. Preference metrics are subjective. These metrics are measured by asking callers questions after using the application. Table 2 illustrates some example preference metrics and the corresponding success indicators.
Table 2: Preference Metrics and Success Indicators
| Metric | Success Indicator |
| --- | --- |
| On a scale from 1 to 10, rate the help facility | The average caller score is greater than 7 |
| On a scale from 1 to 10, rate the understandability of the synthesized voice | The average caller score is greater than 7 |
| Would you use this voice-enabled application again? | Over 75% of the callers respond by saying "yes" |
| What would you be willing to pay to use this voice-enabled application? | Over 75% of the callers indicate that they are willing to pay $1.00 or more per use |
The last question is really asking how much this service is worth to the caller. Answers to this question should be considered when setting a price for the service.
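Checking preference data against the success indicators in Table 2 is equally mechanical. A minimal sketch, using made-up questionnaire responses:

```python
# Hypothetical post-test questionnaire results, invented for illustration.
help_scores = [8, 9, 6, 7, 10, 8]                       # 1-10 ratings of the help facility
would_pay = ["$1.50", "$0.50", "$2.00", "$1.00", "$1.00"]  # answers to the pricing question

# Success indicator: average caller score greater than 7.
avg = sum(help_scores) / len(help_scores)

# Success indicator: over 75% of callers willing to pay $1.00 or more per use.
payers = sum(1 for p in would_pay if float(p.strip("$")) >= 1.00) / len(would_pay)

print(f"help facility average: {avg:.1f} (target > 7)")
print(f"willing to pay $1.00+: {payers:.0%} (target > 75%)")
```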
Many testers also ask open-ended questions such as:
- What did you like the best about this voice-enabled application? (Do not change these features.)
- What did you like the least about this voice-enabled application? (Consider changing these features.)
- What new features would you like to have added? (Consider adding these features in this or a later release.)
- What features do you think you will never use? (Consider deleting these features.)
- Do you have any other comments and suggestions? (Pay attention to these responses. Callers frequently suggest very useful ideas.)
Why is usability testing important?
Performance and preference metrics are important during and after the development of speech applications for the following reasons:
- Set common expectations—Both the developer and the customer use metrics to agree on what a successful application does. For example, consider the metric:
A user can perform a specific number of functions per hour.
For the developer, this means: engineer the application to meet this goal. For the enterprise customer, this means: allocate enough human resources to perform the required number of functions in the available time.
- Measure improvement—Developers use metrics as yardsticks to measure how well the application performs. Deriving values for performance metrics through testing not only indicates whether the application has improved or regressed, but also quantifies the amount of change. The developer optimizes the application to meet or surpass the metric success indicators; these indicators determine when iterative testing and fine-tuning is complete.
- Enable comparison—Similar applications from different vendors can be compared using similar metrics. For example, if users consistently perform more functions per hour with Application A than with Application B, which supports similar functions, then Application A can be said to be "faster" than Application B.
- Create world-class user interfaces—Users are the final judges of a speech application. If they have trouble learning or using the speech application, then the application will not be effective. Only by testing can developers refine and improve an application to make it "world class".
How do developers conduct usability tests?
Developers insert logging commands throughout the application to capture the time and name of each interesting event, such as when the application presents a prompt and whether the user responds successfully, fails to respond, or responds inappropriately. The VoiceXML browser records these events in a log file. A report generator summarizes the events and calculates the indicator for each performance metric, so the developer can quickly determine which performance metrics are satisfied and which portions of the application need refinement. Most VoiceXML system development environments support this logging function and one or more report generators.
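A report generator of this kind can be very small. The log format below is hypothetical (each VoiceXML platform defines its own log layout), but the idea is the same: replay the recorded events and tally successes:

```python
# Hypothetical event log: "<seconds> <event> <prompt_name> [<outcome>]".
# Real VoiceXML browsers use their own, richer log formats.
log_lines = [
    "12.0 prompt main_menu",
    "14.5 response main_menu ok",
    "20.0 prompt main_menu",
    "26.1 response main_menu noinput",
    "30.2 prompt main_menu",
    "33.0 response main_menu ok",
]

events = [line.split() for line in log_lines]
responses = [e for e in events if e[1] == "response"]

# Fraction of prompts that drew an appropriate caller action.
success = sum(1 for e in responses if e[3] == "ok") / len(responses)
print(f"prompt understood: {success:.0%} (target > 97%)")
```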
Preference testing is accomplished by collecting preference scores from users after they test the system. Developers collect this data by interviewing users immediately after they use the application and asking them to score the various preference criteria. This can be labor intensive. Alternatively, users are asked to enter preference scores onto a paper questionnaire, a Web page, or a verbal VoiceXML form. VocaLabs, www.vocalabs.com, provides a service that conducts usability tests and collects preference data via visual Web pages.
How much usability testing is enough? Jakob Nielsen, http://www.useit.com/alertbox/20000319.html, suggests that elaborate usability tests are a waste of resources. The best results come from testing no more than 5 users and running as many small tests as you can afford. The first test identifies several problems. As Nielsen says, "The difference between zero and even a little bit of data is astounding." The second and third tests reveal additional usability problems, but as more and more users are tested, you learn less and less because you keep seeing the same problems again and again.
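Nielsen's cited article models the share of usability problems found by n testers as 1 − (1 − L)^n, where L ≈ 0.31 is the average proportion of problems a single tester uncovers. The diminishing returns are easy to see:

```python
# Nielsen's rule-of-thumb model: proportion of usability problems
# found by n testers, with L = 0.31 (the average share of problems
# a single tester uncovers, per the cited article).
L = 0.31

def problems_found(n):
    return 1 - (1 - L) ** n

for n in (1, 3, 5, 10, 15):
    print(f"{n:2d} users -> {problems_found(n):.0%} of problems found")
```

Five users already uncover roughly 85% of the problems, which is why many small tests beat one elaborate one.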
I don’t want to deal with frustrated users like General Knox in the cartoon. A little usability testing goes a long way toward keeping users happy.
Dr. Jim Larson works for Intel and chairs the W3C Voice Browser Working Group. His new book, VoiceXML: Introduction to Building Speech Applications, has just been published by Prentice Hall.