
Ten Criteria for Measuring Effective Voice User Interfaces

A Toolkit of Metrics for Evaluating VUIs
Investors use standard metrics such as stock price and projected revenue per share to choose investment opportunities. Likewise, consumers use standard metrics such as floor space, number of bedrooms, or number of bathrooms when purchasing houses. This paper presents a toolkit containing some specific metrics for evaluating voice user interfaces (VUIs). The speech industry should use criteria from this toolkit to:

  • Judge the most efficient of several VUIs for the same application from competing vendors.
  • Determine whether a change to a VUI is worthwhile by comparing metrics from before the change to metrics after the change.
  • Avoid misunderstandings about the meaning of frequently used criteria such as "ease of use" and "completion rate" by carefully defining the criteria.  

The toolkit contains 10 metrics for evaluating VUIs, categorized into two classes: subjective and objective.

Contributing Authors


Jonathan Bloom, ScanSoft
Juan E. Gilbert, Auburn University
Tom Houwing, VoiceObjects
Susan Hura, Intervoice
Sunil Issar, Convergys Corporation
Lizanne Kaiser, Genesys Telecommunications Laboratories
James A. Larson, Intel (organizer and editor)
David Leppik, Vocal Laboratories
Stephen Mailey, Voice Partners
Amir Mané, Voice Advantage
Frances McTernan, Nortel
Michael McTear, University of Ulster
Steve Pollock, TuVox
Phil Shinn, Genesys Telecommunications Laboratories
Lisa Stifelman, Tellme Networks
Dale-Marie Wilson, Auburn University

Subjective Metrics
Caller opinions matter! If a VUI presents a poor experience, callers will not use it. If callers have a good experience, they will be more likely to use the VUI again and again, and be more "forgiving" if they experience problems in the future [1]. It is very important that user interface experts use post-call surveys and questionnaires to collect callers’ subjective opinions about a VUI. While we often think of subjective metrics as being "fuzzy," they become objective data when you survey a sufficiently large, properly chosen sample of callers.

A Likert scale [2] is often used in questionnaires and surveys. For each item on the questionnaire, respondents specify their level of agreement with a statement such as "The voice was understandable." Callers respond using a five-point Likert scale:

  1. Strongly disagree
  2. Disagree
  3. Neither agree nor disagree
  4. Agree
  5. Strongly agree

This five-point Likert scale may be used to solicit caller input on subjective criteria [3] such as the following:

  1. Caller satisfaction
  2. Ease of use
  3. Quality of audio output
  4. Perceived first-call resolution rate

The mean score of a large number of subjects represents callers’ subjective evaluation of the VUI.
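The mean-score calculation can be sketched in a few lines of Python. This is a minimal illustration, not part of the toolkit: the response data, the criterion names, and the `mean_scores` helper are all hypothetical.

```python
from statistics import mean

# Hypothetical survey responses: one dict per caller, keyed by criterion,
# with values on the five-point scale (1 = strongly disagree, 5 = strongly agree).
responses = [
    {"caller_satisfaction": 4, "ease_of_use": 5, "voice_intelligibility": 4},
    {"caller_satisfaction": 3, "ease_of_use": 4, "voice_intelligibility": 5},
    {"caller_satisfaction": 5, "ease_of_use": 4, "voice_intelligibility": 3},
]

def mean_scores(responses):
    """Return the mean Likert score per criterion, skipping missing answers."""
    by_criterion = {}
    for response in responses:
        for criterion, score in response.items():
            if not 1 <= score <= 5:
                raise ValueError(f"score out of range: {score}")
            by_criterion.setdefault(criterion, []).append(score)
    return {criterion: mean(scores) for criterion, scores in by_criterion.items()}

print(mean_scores(responses))
```

Three responses are far too few to support any conclusion; a real survey would need a much larger, properly chosen sample.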

Three Over-arching Attributes

To gauge the health of a voice user interface, VUI designers consider three over-arching attributes:

  • Effectiveness: A measure of whether callers can complete their tasks.
  • Efficiency: A measure of the amount of time and effort required to complete tasks.
  • Caller Satisfaction: A measure of how callers perceive the quality of their interaction with the VUI.

These attributes are intertwined and a change in any one of these variables can potentially affect the other two.  Adding copious instructions at every prompt may increase the effectiveness of a VUI, but at the expense of efficiency and satisfaction.  If the option to transfer to a live agent is offered at every dialog state, callers will be more satisfied, but the VUI becomes less effective and potentially less efficient.  Organizations deploying speech applications need to recognize these tradeoffs and prioritize the goals of the application at the very beginning of a new speech project.

Both efficiency and effectiveness can be measured objectively [8], but caller satisfaction is purely subjective and based upon the opinions of callers.

1. Caller Satisfaction
Caller satisfaction measures the degree to which the VUI meets the caller’s expectations.  This metric is widely used, but requires some interpretation. Satisfaction does not correlate perfectly with task completion because it is relative to caller expectations: mediocre service may be expected from Yugo, but not from Mercedes-Benz.

Criteria | Definition | Calculation
Caller satisfaction | The degree to which the VUI meets callers’ expectations | Callers rate the VUI using a Likert scale with the statement: "My expectations were satisfied during this call"
 
2. Ease of Use
Ease of use measures callers’ perceptions of using the application with little or no training. Ease of use depends on many factors including navigation, intelligibility, effectiveness, and error recovery. Related criteria include intuitiveness of choices, ability of the IVR to satisfy callers’ needs, and providing help when needed. 
 
Criteria | Definition | Calculation
Ease of use | Callers’ perceptions of using the application | Callers rate the application using a Likert scale with the statement: "The application is easy to use"
 
3. Quality of Output
System audio can be divided into speech and non-speech elements. Speech covers the quality of the spoken system output, including synthesized speech and pre-recorded audio.
 
Criteria | Definition | Calculation
Voice intelligibility | Callers’ subjective ratings of voice intelligibility | Callers rate the voice using a Likert scale with the statement: "The voice was understandable"
Voice quality | Callers’ subjective ratings of voice quality | Callers rate the voice using a Likert scale with the statement: "The voice sounded good"
 
Non-speech audio includes earcons, which are sounds that convey a message (e.g., a ticking clock indicating that the computer is busy or a doorbell when a message has arrived), and audio logos, which are music or sound effects used for branding (e.g., the "bong" heard when first connected to AT&T or the four tones for Intel Inside heard in commercials about Intel chips).
 
Criteria | Definition | Calculation
Earcon recognition | Callers’ recognition of the message (semantics) associated with the earcon | Callers rate the earcon using a Likert scale with a statement such as: "I understood what the non-verbal sounds (sound effects) signified"
Audio logo | Callers’ recognition of the brand associated with the audio logo | Callers are able to recognize the brand associated with the audio logo
 
4. Perceived First-Call Resolution Rate
This subjective criterion measures whether callers accomplish their goals on the first call, including both the interaction with the VUI and any interaction with a human agent. This criterion is also known as first customer service resolution, first-time final, and once-and-done. According to Nederlof and Anton, this metric is the most predictive criterion for positive customer satisfaction [4].
 
Criteria | Definition | Calculation
Perceived first-call resolution rate | Perceived successful completion rate on the first call, including both VUI and possible interaction with a human agent | Callers answer yes or no to the question: "Did you accomplish your goal?"
 

Twice Is Not Better Than Once

It is critical for human agents to be able to access information previously provided to the automated system. Both callers and live agents waste time when the caller must provide the same information twice. This impacts agent costs, frustrates callers, and may decrease adoption of the automated system.

Conversely, the VUI should be able to provide callers with information about the live agents, including the estimated waiting time until a live agent is available. This suggests another metric:  the caller should be able to access information about the availability of a live agent.

Objective Metrics
Objective metrics are measurements of time or activity that do not involve subjective judgments by the callers. To capture objective data, VUI developers must: (1) insert logging instructions at strategic points in the dialog code, which (2) record the times of specific activities to a log file, and (3) summarize the recorded times using a scoring program, which aggregates and calculates scores for a variety of objective metrics. Objective criteria include:

5. Time-to-task
6. Task rate
7. Task completion time
8. Correct transfer rate
9. Abandonment rate
10. Containment rate
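
The three logging steps described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the event names (`call_start`, `task_start`, and so on), the CSV log layout, and the `log_event` and `summarize` helpers are hypothetical, and a real deployment would use whatever logging facility its dialog platform provides.

```python
import csv
import io
import time

def log_event(writer, call_id, event, task="", timestamp=None):
    """Step 1 and 2: record one dialog event and its time in the log."""
    stamp = time.time() if timestamp is None else timestamp
    writer.writerow([call_id, event, task, stamp])

def summarize(log_text):
    """Step 3: a trivial scoring pass that counts each event type."""
    counts = {}
    for call_id, event, task, timestamp in csv.reader(io.StringIO(log_text)):
        counts[event] = counts.get(event, 0) + 1
    return counts

# An in-memory buffer stands in for the log file.
log = io.StringIO()
writer = csv.writer(log)
log_event(writer, "c1", "call_start", timestamp=0.0)
log_event(writer, "c1", "task_start", "balance", timestamp=12.5)
log_event(writer, "c1", "task_end", "balance", timestamp=40.0)
log_event(writer, "c1", "call_end", timestamp=45.0)

print(summarize(log.getvalue()))
```

A fuller scoring program would read the same records and derive the timing and rate metrics defined in the sections that follow.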

5.  Time-to-Task
When customers call an airline, bank, or other business, they generally have a task in mind or a problem to solve. Time-to-task measures the amount of time it takes for a caller to begin the task they called about. Lengthy instructions, references to a Web site, untargeted marketing messages, or other irrelevant information at the top of a call delay callers from their tasks.
 
Criteria | Definition | Calculation
Time-to-task | Time from when the call is answered until the caller starts performing the desired task | The time elapsed from the beginning of the call until the first prompt or relevant information is presented to the caller
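
As a rough sketch of the time-to-task calculation, the Python below assumes a chronologically ordered list of logged events with hypothetical `call_start` and `task_start` markers; the `time_to_task` helper and the sample timings are illustrative only.

```python
from statistics import median

# Hypothetical events in chronological order:
# (call_id, event, seconds since the log began).
events = [
    ("c1", "call_start", 0.0), ("c1", "task_start", 12.0),
    ("c2", "call_start", 5.0), ("c2", "task_start", 35.0),
    ("c3", "call_start", 9.0),  # caller hung up before reaching a task
]

def time_to_task(events):
    """Return seconds from call start to the first task start, per call."""
    starts, first_task = {}, {}
    for call_id, event, t in events:
        if event == "call_start":
            starts[call_id] = t
        elif event == "task_start" and call_id not in first_task:
            first_task[call_id] = t
    return {c: first_task[c] - starts[c] for c in first_task if c in starts}

durations = time_to_task(events)
print(durations)                   # {'c1': 12.0, 'c2': 30.0}
print(median(durations.values()))  # 21.0
```

Calls that never reach a task, such as `c3` here, are left out of the timing and are better captured by the abandonment rate.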
 

Other Important VUI Criteria

The workshop attendees felt that the following criteria are important, but these criteria do not lend themselves easily to a metric that can measure their successful application:

  • Cognitive Load: The mental effort required to use the VUI should not exceed the mental capabilities of callers.
  • Branding: The VUI should build the brand value of a company in the mind of the caller.
  • Perceived Affordance: How effectively the system communicates to callers how it may be used, that is, what functionality the system offers and what callers must say to invoke that functionality.
  • Error Handling: The successful application of strategies for recovering from problems that occur during human-computer interactions.

6. Task Rate
Callers typically rate their automated experience highly if they are able to accomplish their task.  An automated application can be divided into conceptual tasks (e.g., authentication, account balance, forms request, payment locations), and each task can be flagged with a start-point and an end-point.  Task rate comprises two related measurements:
  1. Task Initiation Rate (TIR): Appropriate for evaluating all types of tasks. For informational tasks, TIR is the more suitable measurement because there is a clear start-point but not necessarily a well-defined end-point. For instance, if callers request a summary of insurance benefits and hang up or opt out before hearing the full summary, it is uncertain whether they received the information they needed.
  2. Task Completion Rate (TCR, also known as transaction completion rate): Appropriate for transactional tasks, which have clearly defined end-points (e.g., changing an address, transferring funds).

Criteria | Definition | Calculation
Task Initiation Rate (TIR) | Percentage of calls that trigger a specific task start-point | Number of times a specific task start-point is triggered divided by the number of calls
Task Completion Rate (TCR) | Percentage of calls that trigger a specific task end-point | Number of times a specific task end-point is triggered divided by the number of calls where this task was initiated
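
The two calculations can be sketched in Python as follows. The data layout and the `task_rates` helper are hypothetical, and the sketch assumes each call triggers a given task marker at most once, so that counting calls and counting triggers give the same result.

```python
def task_rates(calls, task):
    """Return (TIR, TCR) for one task.

    `calls` maps a call id to the set of (task, point) markers it triggered,
    where point is "start" or "end".
    """
    initiated = [c for c, marks in calls.items() if (task, "start") in marks]
    completed = [c for c in initiated if (task, "end") in calls[c]]
    tir = len(initiated) / len(calls) if calls else 0.0
    tcr = len(completed) / len(initiated) if initiated else 0.0
    return tir, tcr

# Hypothetical markers collected from four calls.
calls = {
    "c1": {("transfer_funds", "start"), ("transfer_funds", "end")},
    "c2": {("transfer_funds", "start")},
    "c3": {("balance", "start")},
    "c4": set(),
}
tir, tcr = task_rates(calls, "transfer_funds")
print(f"TIR = {tir:.0%}, TCR = {tcr:.0%}")  # TIR = 50%, TCR = 50%
```

Note that the two rates use different denominators: TIR divides by all calls, while TCR divides only by the calls in which the task was initiated.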
 
7.  Task Completion Time
Task completion time (also known as transaction duration) measures the time a caller takes to complete a specific task. Generally, a shorter task completion time is desirable for both the caller and the service provider. Gupta and Gilbert have recommended two target task completion times [5]:
  1. Maximum Task Completion Time: The maximum acceptable duration for a task for a specific application. The time taken for callers to complete the task can be compared against this metric.
  2. Expected Task Completion Time: The time taken by expert callers to complete a task for a specific application. This can be used as the basis for comparison with the time taken by all callers of the voice user interface. Over time, this metric can be adjusted to reflect the task completion time of typical callers using the interface.
The task completion time is highly correlated with TIR minus TCR.
 
Criteria | Definition | Calculation
Task completion time | Time to complete a specific task | Time between the start of a specific task and the end of the same task
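
A minimal Python sketch of this calculation, including a comparison against a maximum task completion time, might look like the following. The event layout, the `completion_times` helper, and the 120-second limit are assumptions chosen for illustration, not values from the toolkit.

```python
def completion_times(events, task):
    """Return per-call seconds between a task's start and end markers."""
    start, durations = {}, {}
    for call_id, name, point, t in events:
        if name != task:
            continue
        if point == "start":
            start[call_id] = t
        elif point == "end" and call_id in start:
            durations[call_id] = t - start.pop(call_id)
    return durations

# Hypothetical events: (call_id, task, point, seconds into the call).
events = [
    ("c1", "change_address", "start", 10.0), ("c1", "change_address", "end", 70.0),
    ("c2", "change_address", "start", 8.0), ("c2", "change_address", "end", 133.0),
    ("c3", "change_address", "start", 15.0),  # never finished
]
MAX_SECONDS = 120.0  # assumed maximum task completion time for this task

times = completion_times(events, "change_address")
over_limit = [c for c, d in times.items() if d > MAX_SECONDS]
print(times)       # {'c1': 60.0, 'c2': 125.0}
print(over_limit)  # ['c2']
```

Only completed tasks yield a duration; the unfinished task in `c3` would instead lower the task completion rate.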
 
8. Correct Transfer Rate
Callers may be redirected from an automated system to a live agent if either (1) the caller cannot proceed with the automated dialog or (2) the caller requests to be transferred. If a call is misrouted, the agent must redirect it, delaying the caller’s task and increasing costs.
 
Criteria | Definition | Calculation
Correct transfer rate | Percentage of routed calls that are transferred to the correct party | Number of correctly routed calls divided by the total number of routed calls
 
9.  Abandonment Rate
Abandonment rate has traditionally been used in call centers to determine the percentage of callers who hang up while waiting in queue to speak with an agent. Similarly, this measurement can be applied to VUIs; namely, the percentage of callers who hang up before carrying out a task in an automated system [6]. Abandonment rates will be higher in situations where there are frequently misdialed calls or where the introduction asks callers for information they currently may not have (e.g., account number), resulting in callers hanging up to find that information before calling back.  To make this metric precise, it should be associated with a specific task rather than the entire telephone call.
 
Criteria | Definition | Calculation
Abandonment rate | Percentage of callers who hang up before carrying out a specific task in an automated system | Number of callers who hang up before completing a specific task divided by the total number of callers beginning the task
 
10.  Containment Rate
Since a common objective of voice applications is to reduce call center costs, automation success is frequently assessed in terms of containment: the percentage of calls not transferred to human agents.  However, concentrating on containment rates may result in an application design that blocks or hides the exit, so callers cannot access a human agent easily.  This can have disastrous consequences because callers quickly learn alternative ways of escaping from the automated system and transferring to a human agent (e.g., pressing keys randomly, "playing possum" until the system transfers them). It also hurts customer satisfaction, because callers then spend valuable minutes venting their displeasure to the human agent.
 
Based on data from 60 studies conducted by Vocal Laboratories, Inc., difficulty reaching an agent accounted for 61 percent of the variance in caller satisfaction levels and 49 percent in first-call resolution rates [7].  Companies that made it harder to reach a human saw much lower first-call completion rates and more repeat calls as well as frustrated callers. While a high containment rate is desirable, this goal should not prohibit callers with complex and difficult requests from connecting to a human agent.
 
Containment is often gauged using the reverse measurement of opt-out rate (i.e., the percentage of calls that transfer to an agent). Some VUI specialists count callers, not calls, treating multiple calls from a single person as a single event. This takes into consideration situations in which the caller gets lost and redials in order to reset, or hangs up to find additional information before redialing.
 
Criteria | Definition | Calculation
Containment rate | Percentage of calls not transferred to human agents | Calls completed within the IVR divided by the total number of calls
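
The difference between counting calls and counting callers can be sketched in Python as follows. The record layout and both helper functions are hypothetical. The sketch also assumes that every call not transferred counts as contained, which folds hang-ups into containment, and that a caller counts as contained only if none of their calls reached an agent.

```python
def containment_rate(calls):
    """Fraction of calls that were not transferred to a human agent."""
    if not calls:
        return 0.0
    return sum(not c["transferred"] for c in calls) / len(calls)

def caller_level_containment(calls):
    """Treat all calls from one caller as a single event; the caller counts
    as contained only if none of their calls was transferred."""
    by_caller = {}
    for c in calls:
        by_caller[c["caller"]] = by_caller.get(c["caller"], False) or c["transferred"]
    if not by_caller:
        return 0.0
    return sum(not t for t in by_caller.values()) / len(by_caller)

# Hypothetical call records.
calls = [
    {"caller": "A", "transferred": False},
    {"caller": "A", "transferred": True},   # redialed, then reached an agent
    {"caller": "B", "transferred": False},
    {"caller": "C", "transferred": False},
]
rate = containment_rate(calls)
print(rate)      # 0.75
print(1 - rate)  # opt-out rate: 0.25
print(caller_level_containment(calls))
```

In this sample the call-level rate is higher than the caller-level rate, because caller A's first, abandoned call counts as contained when calls are tallied individually.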
 
This report focuses on 10 widely used, measurable VUI metrics. The authors have agreed upon the definitions and calculations presented in this toolkit.  Apply the criteria from this toolkit to measure your VUIs, both to see if changes actually improve your VUIs and to compare your VUI with similar VUIs.  By using the same terminology and performing the same calculations, the speech industry can reduce the number of misunderstandings and misrepresentations.
 
References:

[1] Norman, Donald A. 2004.  Emotional Design: Why We Love (Or Hate) Everyday Things. New York, NY: Basic Books.
[2] The Likert scale was named for Rensis Likert, who invented the scale in 1932.
[3] A Likert scale may also be used to solicit non-subjective criteria from callers, such as "Did you complete the task?"
[4] Nederlof, Ad and Jon Anton.  2002.  Customer Obsession: Your Roadmap to Profitable CRM.  Santa Maria, CA: The Anton Press, pp. 186-189.
[5] Gupta, Priyanka and Juan Gilbert. "Usability Metrics for Spoken Language Systems." International Journal of Speech Technology.
[6] Abandonment rates can measure either hang-ups before a task has been initiated or hang-ups within a given task.
[7] Leppik, Peter. 2005.  "Does forcing callers to use self-service work?"  http://www.vocalabs.com/resources/newsletter/newsletter22.html
[8] For call routers, effectiveness is measured according to whether the VUI routes to the appropriate destination and efficiency is the speed with which the caller reaches the appropriate destination.
