John Moffly, Voice Architects
What did Voice Architects discover about the applications created during the Speech Solutions CHALLENGE?
John Moffly The results of the AlphaCheck on the apps validated the premise of the CHALLENGE: using today's tools, you can actually build a fully featured speech application prototype in a matter of hours. This simply was not possible two years ago, and we can thank the software and tools vendors, primarily the participants, who have made this possible.
What are some reasons that speech applications need to be tested?
JM Speech applications, like any other software project, need to be tested to validate that they function as designed. Traditional software testing includes unit and system code tests, latency to the back end, stress tests, and others. Unfortunately, speech has an additional requirement. A perfectly implemented speech application may not actually function all that well, simply because caller behavior is different from what was anticipated. Interface testing allows the VUI designer to true up anticipated caller behavior and interface performance with the reality of what actual callers say.
Why is it necessary to do systematic usability testing?
JM For the same reason data from actual callers should be used in interface testing, you should also be systematic about the process, which means using a common set of metrics on a statistically significant caller population. By far, the toughest part of coming up with our testing program, AlphaCheck, was defining a common set of metrics. The metrics had to be easy enough for a call center manager to understand, yet useful for VUI designers to use for tuning. They must be generated from data produced by any speech engine and telephony platform, at both the application and state level. Finally, they should be simple to generate and calculate. We were looking for something like ASA, whose definition everyone knows and which applies across all call centers regardless of size or vertical. The result was our set of six Key Usability Indicators, or KUIs.
Can you describe the testing process for the seven applications?
JM AlphaCheck is a simple process. Historical call data from recognition logs is imported into our LogSense analytics tool, which does several things: it creates a template of the application, where we can configure things like recognition states and invalid exit states for reporting; it generates KUIs at the state and application level; and it provides us with a tuning dashboard that facilitates subsequent tuning activities such as off-line recognition, audio review, and error aggregation. For a full AlphaCheck, VA then analyzes quantitative results, performs qualitative call analysis, and generates a laundry list of hotspots, which we call our QuickFix Plan.
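The log-import and state-level aggregation step described here can be sketched in a few lines. This is a hypothetical illustration only: LogSense's internals and log format aren't described in the interview, so the column names and error categories below are assumptions.

```python
import csv
from collections import defaultdict

def load_events(path):
    """Read recognition-log events from a CSV file.

    Assumed columns (hypothetical): call_id, state, result.
    """
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def state_summary(events):
    """Aggregate per-state attempt and error counts, the raw material
    for state-level KUI reporting.

    Treats no-match and no-input results as recognition errors
    (an assumption, not a documented LogSense rule).
    """
    summary = defaultdict(lambda: {"attempts": 0, "errors": 0})
    for e in events:
        s = summary[e["state"]]
        s["attempts"] += 1
        if e["result"] in ("no-match", "no-input"):
            s["errors"] += 1
    return dict(summary)
```

From a summary like this, the same counts can be rolled up to the application level, which is what allows one tool to report KUIs at both levels.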
Was it difficult to have statistical data prepared in less than 24 hours for the audience at SpeechTEK? How long would this process normally take?
JM A full AlphaCheck takes about four days to perform, but the KUI generation process is relatively quick. The greatest challenge was the collection and manipulation of large data files, accessed from our laptop perched on the wobbly tables in Starbucks on 9th Avenue, through their equally wobbly wireless service. Other than that, it was a snap!
What did the results show? What can you interpret from the data?
JM The KUIs that probably have the most relevance are the Average Errors per State (AES) and Raw Recognition Rate (R3). At 0.35, the AES is roughly 50 percent higher than what we see as an average for deployed applications (0.23). The Raw Recognition Rate was 64 percent, versus Voice Architects' benchmark of 77 percent. These numbers seem quite solid at one level; that is, it's remarkable that you could build an application in six hours that could return a recognition result roughly three out of five times. Of course, not all of these recognition results were correct. Low scores in Confirm Percent Yes (CPY) indicate a large number of mis-recognitions, or false accepts.
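The two metrics discussed here can be computed directly from per-attempt log records. The definitions below are assumptions inferred from their names (the interview doesn't give formulas): AES as errors per state visit, and R3 as the share of attempts that return any recognition result.

```python
def compute_kuis(events):
    """Compute assumed-definition AES and R3 from recognition-log events.

    Each event is a dict like {"state": str, "result": str}, where
    result is "match", "no-match", or "no-input". Treating each event
    as one state visit is itself an assumption.
    """
    attempts = len(events)
    errors = sum(1 for e in events if e["result"] in ("no-match", "no-input"))
    returned = sum(1 for e in events if e["result"] == "match")
    return {
        "AES": errors / attempts,   # average errors per state visit
        "R3": returned / attempts,  # raw recognition rate
    }
```

With definitions like these, the figures quoted in the interview read as: about 0.35 errors per state visit against a 0.23 deployed-application average, and a result returned on 64 percent of attempts against a 77 percent benchmark.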
What are the strengths and weaknesses of each application based on your test results?
JM Perhaps the most interesting correlation was between the Application Error Rate (AER) and the Average States per Call (ASPC). The correlation may be a result of dialog design strategies: structured, stepwise dialogs produced fewer errors, while more concise strategies, because they tackled harder recognition tasks, failed more frequently. However, the correlation also might be a result of another obvious point: apps with low usability have higher error rates and more hang-ups, and therefore shorter calls. In any case, the data clearly indicates that effort spent on grammars and prompts (as opposed to call flow and persona) will improve recognition, reduce errors, and improve usability.
Explain the difference between testing and tuning.
JM Voice Architects views testing as the assessment of interface performance: understanding the dynamics and establishing performance goals. Tuning is the process that optimizes the interface, essentially eliminating errors. So, generally, you would test before and after you tune. Tuning is now recognized by the speech industry as a critical part of the VUI lifecycle. At Voice Architects we tell call center clients that if they train their agents, they should tune their speech interface. The value proposition is identical.