Peter Leppik, CEO, VocaLabs
What did VocaLabs discover about the applications created during the Speech Solutions CHALLENGE?
Peter Leppik There were seven different teams competing in the CHALLENGE, and each took a different approach to creating an application for scheduling a car repair appointment. We evaluated each application with an average of about 500 different people, each of whom was asked to schedule an appointment for a particular time, date, car model, and problem. Perhaps the biggest surprise was how well all seven teams did, given the severe time limits. Remember, the teams were not told what the application was until Monday morning. By 7:30 Monday evening, we were hammering all the applications with live callers and posting the results to our Web site in real time. There wasn't much room for the teams to miss their mark. The other thing we discovered was that the design decisions the teams made had a huge impact on application performance and usability. The longest average call length among the teams was almost twice the shortest. The team with the highest satisfaction score did more than twice as well as the team with the lowest score. The team with the highest single-call completion rate was 38 points better than the team with the lowest. These are dollars-and-cents issues.
What are some reasons that speech applications need to be tested?
PL Speech recognition applications need to be tested for three things: usability, accuracy, and capacity. Usability is the most critical of these, but often gets the least attention. Let's face it, no matter how well your recognition engine performs, if callers don't understand how to use your application, they'll simply refuse to use it. At that point, all the cost savings you hoped to achieve go down the drain. It is vital to evaluate the application before it starts taking customer calls, and to evaluate it with a large number of different people. We generally don't do a study with fewer than 500 different people, and with larger applications, it may be necessary to use several thousand. You need all these different callers to ensure that the results of the usability study are statistically meaningful, and that you've tested against a broad cross-section of your expected customers. Accuracy means ensuring that the recognition engine is performing up to par, and that it has been properly tuned. The challenge is getting enough call data before going live to tune the recognition performance, but a large-scale usability study can go a long way toward providing this data. Modern speech recognition engines are amazingly sophisticated, but no matter how good the engine or application designer, there will always be out-of-grammar responses and other problems which don't appear until the application is evaluated against a large number of calls. Capacity tests ensure that the application is capable of handling the load it is expected to have in a production environment. Even the largest usability study won't provide enough calls to simulate multiple trunks of traffic, but capacity testing can easily be automated, and there are several companies which specialize in it.
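As a rough illustration of why the number of callers matters, the sketch below estimates the 95 percent margin of error on a measured rate such as call completion. It is not part of VocaLabs' methodology: it assumes a simple random sample and the normal approximation for a proportion, and the function name and the worst-case 50 percent rate are assumptions chosen for illustration.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error for a measured proportion.

    Assumes a simple random sample and the normal approximation;
    proportion=0.5 is the worst case (widest interval).
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# Compare a two-dozen-person evaluation with larger studies.
for n in (24, 500, 2000):
    points = margin_of_error(n) * 100
    print(f"{n:>5} callers: +/- {points:.1f} percentage points")
```

Under these assumptions, two dozen participants leave an uncertainty of roughly 20 percentage points on any measured rate, while 500 participants narrow it to a little over 4 points.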
Why is it necessary to do systematic usability testing?
PL Building a speech application is one thing. Getting your customers to use it, or even prefer it, is something else entirely. Unless they use it, though, there's no business case for speech, and no point in building it. In our research, we've found that if a speech application is properly designed and tested, customers will actually prefer it to a live operator. Very few applications are that good, but systematic testing and refinement will get you there. For example, we had a client whose speech system was successfully handling 85 percent of its incoming calls. That sounds pretty good, but they used the usability data we gathered to boost it to 93 percent. That means the share of calls handled by live people fell from 15 percent to 7 percent, a drop of roughly half. We find that many new speech applications have considerable room for improvement, because there was never any large-scale usability testing. At best, there may have been small-scale evaluations with one or two dozen participants, but nothing which would uncover all the little reasons people can't or won't use the automated application.
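The arithmetic behind the 85 to 93 percent example can be checked with a short sketch. The function name is hypothetical, and the calculation assumes that total call volume stays the same and that every call not handled by the speech system goes to a live person.

```python
def live_call_reduction(before_rate, after_rate):
    """Fractional drop in calls reaching live agents.

    Rates are the share of calls the speech system handles (0 to 1).
    Assumes total call volume is unchanged.
    """
    live_before = 1.0 - before_rate
    live_after = 1.0 - after_rate
    return (live_before - live_after) / live_before

reduction = live_call_reduction(0.85, 0.93)
print(f"Live-agent calls fall by about {reduction * 100:.0f} percent")
```

An eight-point gain in automation looks modest on its own, but measured against the 15 percent of calls that were reaching live people, it is a reduction of slightly more than half.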
Can you describe the testing process for the seven applications?
PL We have a group of about 45,000 people we've recruited for testing customer service operations. For each of the seven CHALLENGE applications, we had about 500 people participate, using a different group for each application. Each participant was asked to call one of the seven applications and attempt to schedule a repair for a given make and model of car, a particular problem, and a certain date and time. We recorded every call, and then asked each participant to fill out a survey after the call. The call recordings were matched with the survey answers, and the data were posted on our Web site in real time.
Was it difficult to have statistical data prepared in less than 24 hours for the audience at SpeechTEK? How long would this process normally take?
PL VocaLabs always provides study data in real time, but we normally have about a two-week window for completing a study. In this case, we had to run seven studies simultaneously, and collect enough data by Tuesday afternoon to present at our tutorial session. This was far and away the most ambitious project we've ever attempted, but we were confident that it was within our capabilities. I'm pleased that by Tuesday afternoon, less than 24 hours after we began our studies, we had about half the total amount of data we planned to collect. This was enough to keep us on track with our planned Wednesday evening completion, and enough to provide some meaningful insights into the seven applications.
What did the results show?
PL This is probably the first opportunity ever to have seven different teams approach the same speech application, making seven different sets of design decisions, with the chance to compare the results head-to-head. We learned a number of interesting things, and looking at the data could keep us busy for months. One thing we noticed was that some design choices are more important than others. For example, callers seem to prefer those applications which are easy to navigate and which minimize errors. This is more important than, for example, having a shorter overall call length. In fact, the team which had far and away the highest satisfaction score also had nearly the longest average call. In practical terms, unless you have a lot of expert callers - and most applications don't - the design bias should be toward being helpful rather than efficient. We also learned that you have to be very careful when designing a strong persona. One of the applications used the word "darn," which offended a small number of callers when they misheard it as something stronger. This is the kind of problem which can have a very negative effect on a company's image, but which can be very hard to discover without doing large-scale usability studies.
What can you interpret from the data?
PL We'll be looking at the data we gathered for a while yet, but there are some interesting questions we'd like to explore: What are the most effective approaches for error recovery? How does the persona design help the overall usability? What strategies worked for boosting satisfaction, perceived usability, and the caller's opinions of the system? What things frustrated callers the most, causing them to give up without calling back or trying again?
What are the strengths and weaknesses of each application based on your test results?
PL Each of the seven applications made different design choices, and scored differently in our studies. To try to make some sense of it all, we boiled our data down into 10 different numerical scores, ranging from call statistics like average call time and the average number of calls participants made; to cost-related data like the percentage of callers able to complete the task on the first call; to perception data like the percentage of callers who were Very Satisfied with the call, and the percentage of callers who felt the application was "Helpful" or "Friendly." No team scored a clean sweep of our 10 statistics, though some did better than others. For example, one application had the highest satisfaction score and the fewest average calls per participant, but nearly the longest average call time. Depending on the design goals of the application, that team either did very well or not so well. Another team had the highest score for single-call completion, but scored very poorly for the percentage of callers who thought the application was "Helpful." So, even though callers were demonstrably able to finish their calls, they didn't seem to like the way the application helped them through the process. When building a real application, the next step would be to take this data and decide where the tradeoffs need to be made. Is the customer willing to trade a longer average call time in order to have more satisfied customers? How important is it for callers to think the application is "friendly," as compared to "efficient"?