Speech Technology Magazine

 

How Good Is Good Enough?

Setting metrics for measuring the success of speech applications goes beyond recognition rates
By Susan L. Hura - Posted Jan 30, 2007

A persistent question for those of us who design and implement speech-enabled IVR applications is when an application is performing well enough. Tuning and other analyses allow us to measure application performance and to make improvements. But we're often left asking: how good is good enough?

Many people see objective measures, such as recognition rate, in-grammar/out-of-grammar rates, call containment, time on task, error rate, and first-call resolution, as the answer to what's good enough. In a simplistic sense, this is true: a 92 percent recognition rate is better than an 87 percent recognition rate, for example. But is it a smart decision to try to make this improvement?

If the speech project team establishes baselines on objective measures in advance, and then sets goals based on these baselines, it should be clear whether a goal has been met. However, in the real world of budgets and deadlines, the picture is often not so clear. Baseline measures are not always available, and when they are, they can be difficult to interpret holistically. (What does it mean when some measures improve and others drop after a so-called improvement?) Moreover, when we're working from incomplete baseline data, it becomes difficult to determine how much improvement we could reasonably expect. It can also be tough to figure out how to measure a goal, especially the kind of goal often expressed by non-technical project sponsors, such as better customer service or spending less money in our call centers.

A common thread among all these concerns is that objective measurements alone do not provide all the data needed to make crucial project decisions. Meeting any performance goal necessarily requires an investment of time and resources. The underlying issue is whether reaching that goal will be worth the time and effort required. While 92 percent is indeed better than 87 percent, one still has to decide if it makes sense to try to get there.

There is a rich source of data available to any organization willing to seek it. User opinion data is cheap and easily accessible, and it offers valuable insight for making smart business decisions. Yet most organizations deploying speech applications proceed without tapping into this data. Most do a reasonable job of recording and analyzing user behavior, but the value of opinion data often goes unrecognized because it's "just opinion." I used to be this sort of data snob, believing that only hard, numerical data from observable, quantifiable, repeatable events was worthy of consideration. I now realize that subjective opinion data from the end users of technology can provide the missing factor in the analysis of hard data.

By matching user opinion data with objective measures like recognition rate, it becomes much simpler to determine what's good enough, and therefore how to set goals for improvement. If an 87 percent recognition rate is accompanied by poor opinion scores for recognition performance, then it is easy to justify expending the resources to improve recognition rates. But if opinion scores indicate that most users are satisfied with that 87 percent rate, then investing time and resources to improve recognition rate may not be worth it. Opinion data also provides a means to measure squishy goals like better customer satisfaction, because you can define opinion measures that directly relate to such goals.
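For readers who like to see the logic spelled out, the pairing of opinion scores with an objective measure might be sketched roughly as follows. This is only an illustration, not an established method: the function name `recommend`, the 1-to-5 rating scale, the satisfaction cutoff of 4.0, and the minimum of 30 responses are all assumptions chosen for the example.

```python
from statistics import mean


def recommend(opinion_scores, satisfied_at=4.0, min_responses=30):
    """Suggest whether to invest in improving a measure such as recognition rate.

    opinion_scores: user ratings of recognition performance, assumed to be
        on a 1 (poor) to 5 (excellent) scale.
    satisfied_at: hypothetical mean rating at which users count as satisfied.
    min_responses: hypothetical minimum sample size before drawing conclusions.
    """
    # Too few opinions to say anything meaningful either way.
    if len(opinion_scores) < min_responses:
        return "collect more data"
    # Users are satisfied with the current rate: improvement may not pay off.
    if mean(opinion_scores) >= satisfied_at:
        return "hold"
    # Poor opinion scores justify spending resources on improvement.
    return "invest"
```

With these assumed thresholds, an application at 87 percent recognition whose users rate recognition poorly would come back as "invest," while the same 87 percent paired with high ratings would come back as "hold." The real cutoffs would have to come from a properly designed survey and the project's own goals.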

In all of this it's important to remember that opinion data must be collected as rigorously as any other data. Opinions are most meaningful if they are collected from representative users immediately following an interaction with the application under realistic conditions. And as with any user testing, having an adequate sample size is vital.

Perhaps more important, though, is using an appropriate metric for collecting opinions. While anyone can dream up a few questions and produce a survey, the resulting data would be of dubious value. In the social sciences, there is an established method for constructing surveys that produce meaningful, predictive data, often with few test participants. There is not yet a commonly accepted opinion metric for speech, much to our detriment. A rigorously designed user opinion survey would provide many benefits to the speech industry as well as to our clients.


Dr. Susan L. Hura is the founder of consultancy SpeechUsability and a member of the board of directors of AVIOS (the Applied Voice Input Output Society). She can be reached at susan.hura@speechusability.com.

 
