Getting the Devil out of the Details

When Convergys deployed a service contract application IVR for a major American automotive company, the vendor faced a big unknown: Would the speech recognition engine properly process callers’ vehicle identification numbers (VINs)? VINs are alphanumeric, and alphanumeric strings by their nature stretch the capabilities of most speech recognition engines. And while usability testing can certainly determine whether the IVR as a whole is navigable, it was only through rigorous tuning cycles while the application was in pilot that the vendor would be able to optimize recognition.

Tuning—during which, as the saying goes, calls are monitored for quality assurance—has gained recognition in recent years as a crucial step in the development of a speech IVR. It essentially involves updating grammar items, adding pronunciations to the lexicon, and changing prompts where there’s user confusion. But its popularity might be self-defeating. One designer wondered whether tuning was becoming too much of a buzzword within the speech technology industry, considered a quick fix to improve recognition accuracy. However, such a reductive way of approaching tuning undersells its importance and, in many cases, its difficulty.

Dave Pelland, director of Intervoice’s Design Collective, says that when he first arrived at the company, he struggled to understand what exactly the tuning group did. Pelland’s background is in speech science, and his previous employer, he believes, had the misperception that a tuning cycle was merely a speech recognition exercise. "They were always trying to figure out which knob or dial on the speech recognizer to modify in order to improve recognition," Pelland says.

Simply put, if usability testing is the yin, then tuning is the yang. The two provide different but complementary sets of data for determining the efficacy of an IVR. The intimacy of usability testing allows the tester and the subject a one-on-one interaction that reveals the overall customer experience. But because usability tests rely on participants culled by a recruiting firm instead of actual customers, realism might be lacking.

"In my experience, usability is extremely important," says Judi Halperin, a Speech Engineer from Avaya. "But it’s most important when you’re testing new functionalities or a new way of approaching something, as opposed to something that’s really tried and true. Tuning is much more key in many cases because in usability you have fewer callers and a sterile lab environment."

One problem in usability is experimenter demand, in which people respond based on how they expect the experimenter wants them to respond. Such behavioral alterations sully the reliability of the collected data. In tuning, the data is much cleaner because it consists of recorded conversations of actual customers in actual situations.

"You get to hear people screaming at their wives, screaming at their kids, dropping dishes in the sink, cursing," says Melanie Polkosky, a human factors psychologist specializing in VUI design. "Oh, there’s a lot of cursing. It’s people really living their lives and not focusing on the IVR and not listening to these prompts like it’s the most important thing in the world."

Susan Hura, founder and principal at SpeechUsability, recalls a usability test in which people were required to give account numbers. Because the test panel didn’t consist of actual customers, the testers provided hypothetical account numbers.

"When you get to the real world, you find out maybe it’s not as easy as you thought for people to come up with that account number," Hura says. "You might find out that it’s something [customers] don’t have memorized. You might need to give them a little bit longer time to give a response. You might need to reprompt: If you don’t have your account number, you can also use your date of birth." And this modification, Hura points out, is a big deal. If real customers can’t even get past the log-in, there’s no way they’re going to get anything resembling customer service.

Because tuning cycles occur once the IVR system is in pilot and deal with thousands of calls, the sample is large enough to yield statistically meaningful results, and those numbers can ultimately justify needed changes to the system. By contrast, usability testing involves fewer than 100 people, so it’s riskier to alter a grammar or tinker with recognition accuracy based on its results.
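
The effect of sample size can be made concrete with a confidence interval. The sketch below is an illustration only, not a method attributed to anyone quoted here: it assumes a 95 percent Wilson score interval (the function name wilson_interval is made up for this example) and compares the same 20 percent error rate seen in 10 test calls and in 1,000 pilot calls.

```python
import math

def wilson_interval(errors: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an observed error rate (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = errors / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# Same observed error rate (20%), very different sample sizes.
small = wilson_interval(2, 10)       # usability-test scale
large = wilson_interval(200, 1000)   # tuning-cycle scale

for label, (low, high) in (("2 of 10", small), ("200 of 1,000", large)):
    print(f"{label}: plausible error rate {low:.1%} to {high:.1%}")
```

With 10 calls, the plausible error rate runs from roughly 6 percent to 51 percent; with 1,000 calls it narrows to roughly 18 percent to 23 percent. Only the larger sample pins the problem down well enough to justify a grammar change.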

"Usually following usability, you’re not going to make many changes to your grammars," Hura says. "You don’t have the data to do it. Two out of 10 [errors] isn’t enough to make a change to a grammar. But 200 out of 1,000? Yeah, I’ll make a change based on that."

But tuning has its limitations as well. "To me, it’s the customer satisfaction piece," says Kristie Goss, a senior VUI designer from Convergys. Gathering customer satisfaction data in a tuning cycle is typically limited to optional surveys taken at the end of the IVR interaction, and usually only 5 percent of callers are given the opportunity. Systems are rated on a scale of one to five, but it’s never clear why a certain rating was given.

Hura warns that interpreting caller intent during tuning is tricky. "It’s really important not to overstate your findings," she says. For instance, if someone calls a cable company and hangs up, is that a bad call, a good call, or a neutral call? "It all depends what the caller had in mind," she says. "It could be a good call if the person was calling in to get a balance. It could be a neutral call because for all you know, they hung up the call that minute because the FedEx guy was at the door. It could be a terrible call when they hang up because they can’t stand IVRs. So in tuning, a big problem is trying to interpret what’s going on in a caller’s mind."

"There’s nothing more valuable than sitting down with the user and asking what they liked, didn’t like, and what they’d change," Goss agrees.

Plus, many ratings are skewed. If customers bother to take a survey, it’s either because they were incredibly impressed or incredibly unimpressed with the IVR. "You never see anything in the middle," Goss says. "Also, some clients only want to survey a contained call"—that is, a call that doesn’t transfer to an agent. To counteract this, Goss prompts the caller to rate the system right before a transfer.

Ultimately, usability testing and tuning cycles "should go hand-in-hand," Hura says. "They fill in the gaps."

Polkosky sees the two as nearly synonymous. "They are forms of the same thing," she says, "and both evaluate a user interface to determine whether usability problems exist and determine appropriate remediation actions."

The Process
The basic data collection process of a tuning cycle is relatively straightforward. Speech engineers lead the tuning cycle; they develop the plan, analyze the data, and generate the final report—all with significant input from the IVR designer. Once the IVR is deployed, 1,000 to 5,000 calls are recorded and transcribed.

Simple as the process sounds, tuning, according to Pelland, is "doing the fine detail work." A router application that simply forwards customers to various departments in an enterprise is easily tuned; more complicated applications with a variety of pathways can be brutal.

"Obviously with less functionality we don’t need as much data because callers are all going down the same path, as opposed to a very complex app where you might not get a customer going down all the different paths," Avaya’s Halperin says. A transcript of each call compares what the recognizer heard versus what the caller actually said, giving the tuner enough information to gauge recognition accuracy and get a sense of caller intent.

For instance, the recognizer may claim that 90 percent of the time it recognizes a yes response when, in fact, the caller is simply choking. "You really can’t know that sort of detail until you’ve transcribed the data and taken a look at what’s going on," Halperin says.

In gathering the data, a tuner should have a minimum of 100 utterances per task to consider. Caller interaction is broken into modules, which the tuner analyzes. If a tuner expects a caller to request milk during a certain phase and 75 percent of callers asked for coffee, then the tuner knows that either coffee needs to be added to the grammar, or that the prompt needs to be changed so callers realize that no coffee may be had at that particular stage in the IVR. It’s this detailed analysis that makes the tuning phase so important. "You get to see the numbers, the data behind the transaction," Goss says, "whereas in usability testing, it’s strictly from a caller experience perspective."
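
The milk-and-coffee check can be sketched in a few lines. This is a minimal illustration under stated assumptions: transcriptions arrive as plain strings, the grammar is a simple set of phrases, and the name summarize_task is hypothetical. Only the 100-utterance minimum comes from the text above.

```python
from collections import Counter

MIN_UTTERANCES = 100  # minimum utterances per task cited above

def summarize_task(transcripts: list[str], grammar: set[str]) -> tuple[bool, dict[str, float]]:
    """Return (enough_data, share of each out-of-grammar phrase) for one task."""
    counts = Counter(t.strip().lower() for t in transcripts)
    total = sum(counts.values())
    out_of_grammar = {
        phrase: n / total
        for phrase, n in counts.items()
        if phrase not in grammar
    }
    return total >= MIN_UTTERANCES, out_of_grammar

# Hypothetical task: the designer expected "milk", most callers said "coffee".
transcripts = ["coffee"] * 75 + ["milk"] * 25
enough, oog = summarize_task(transcripts, grammar={"milk"})

print(enough)  # True: exactly 100 utterances collected
print(oog)     # {'coffee': 0.75}
```

A result like this tells the tuner there is enough data to act on, and that the fix is either a grammar addition or a clearer prompt.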

Goss uses a Web-based tool called Open Speech Insight from ScanSoft (now part of Nuance Communications) to tune her systems. Open Speech Insight looks at each module of the IVR independently for problems and generates a call flow showing the percentage of callers who uttered each option. There might, for instance, be out-of-grammar utterances, in which the caller says something the system isn’t programmed to recognize. Other problems include false accepts (when the system mistakenly believes it recognizes what the caller says) and false rejects (when the caller utters a valid response that, for whatever reason, the system doesn’t acknowledge).
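
These categories fall out of a comparison between the human transcription and the recognizer's result. The sketch below is not how Open Speech Insight works internally; it is a hypothetical illustration in which each utterance carries what the caller said and what the recognizer returned (None for a rejection), and any wrong acceptance is counted as a false accept.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Utterance:
    said: str                  # human transcription of the recording
    recognized: Optional[str]  # recognizer result, or None if it rejected

def classify(u: Utterance, grammar: set[str]) -> str:
    """Label one utterance using the transcription as ground truth."""
    in_grammar = u.said in grammar
    if u.recognized is None:
        # The recognizer gave up: wrong only if the caller said something valid.
        return "false_reject" if in_grammar else "correct_reject"
    if in_grammar and u.recognized == u.said:
        return "correct_accept"
    # Accepted an out-of-grammar utterance, or returned the wrong item.
    return "false_accept"

grammar = {"yes", "no"}
calls = [
    Utterance("yes", "yes"),
    Utterance("no", None),         # valid answer the recognizer missed
    Utterance("[cough]", "yes"),   # noise taken for a yes
    Utterance("what was that", None),
]
tally = Counter(classify(u, grammar) for u in calls)
print(dict(tally))
```

The third call is the coughing caller again: the recognizer's own log shows a confident yes, and only the transcription reveals the false accept.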

Because performance metrics vary from application to application, there aren’t any constant benchmarks. Metrics such as customer satisfaction, abandonment, and containment rate—in which a call never transfers out of an IVR—are typically agreed on prior to the project’s development. The tuning cycle is the speech team’s final chance to meet those set metrics or pay a penalty to the enterprise customer.
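
As a rough illustration of how such metrics might be checked against agreed targets, the sketch below computes two of them from per-call outcomes. The outcome labels, target values, and function names are all assumptions made for this example. Containment follows the definition above (the call never transfers out of the IVR), so hang-ups count as contained; abandonment is taken here to mean the caller hung up before finishing.

```python
from collections import Counter

def ivr_rates(outcomes: list[str]) -> dict[str, float]:
    """Compute containment and abandonment rates from per-call outcome labels."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no calls to analyze")
    counts = Counter(outcomes)
    return {
        "containment": (total - counts["transferred"]) / total,
        "abandonment": counts["hung_up"] / total,
    }

def meets_targets(rates: dict[str, float], targets: dict[str, float]) -> bool:
    """Containment must reach its floor; abandonment must stay under its ceiling."""
    return (rates["containment"] >= targets["containment"]
            and rates["abandonment"] <= targets["abandonment"])

# Hypothetical contractual targets and a hypothetical 1,000-call pilot.
TARGETS = {"containment": 0.75, "abandonment": 0.15}
calls = ["completed"] * 700 + ["hung_up"] * 100 + ["transferred"] * 200

rates = ivr_rates(calls)
print(rates, meets_targets(rates, TARGETS))
```

Because a hang-up can be a good, neutral, or bad call, numbers like these show where to look rather than what callers were thinking.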

Most problems can be fixed by clarifying a prompt or by either adding or subtracting items in a grammar. And major problems should be uncovered during usability, not tuning. "They’re mostly tweaks we’re making," says Jenny Burr, manager of Intervoice’s tuning group. "Changing a prompt to Just give me your account number instead of Tell me your account number is very minor, but it can provide a lot of impact."

When to Test
In general, it’s best to schedule two tuning cycles: a full-blown five- to six-week cycle with 5,000 callers in the first go-around, and a second cycle to ensure that recommendations made and changes implemented after the first cycle have the desired effect.

"Depending on the caller base and company, I’d say a quarterly tuning checkup is a good idea," Halperin adds.

Banking applications, for instance, are generally straightforward. Callers know what they want, and two tuning cycles are typically adequate to make recommendations regarding the system’s functionality. "The callers might call in once a day, once a month, to check rates or make transfers," Burr says. "But if it’s something where the application might be a little more complex and callers don’t call in that often, like a mortgage application where callers call in once a year, we might need to do a follow-up tuning to make sure callers understand how to use the system."

Even systems that might seem typical during one phase might change like a werewolf at a moment’s notice. Halperin recalls an IVR she tuned for a major bank. "And they neglected to tell us that because tax season was coming in three months, they were putting additional messaging and they would have callers calling for new reasons," she says. Thus, the system’s functionality was markedly different from what was originally planned. "I’d recommend another tuning effort as soon as that functionality shift changes," Halperin says. For instance, if a Boston company expands to Los Angeles, it might be a good idea to capture new utterances to accommodate the larger Hispanic population.

Changes in the business landscape can also affect the success of an IVR. Pelland recalls a telecommunications firm that suddenly had to deal with a flood of calls inquiring about a new product its competitor had just released. "It’s like people walking into Burger King and asking for a Big Mac," Pelland laughs.

Enterprises occasionally make the mistake of forgoing tuning cycles even after the system is deployed. Doing so is like scrapping your car’s maintenance schedule. "What’s happened in the past—and this is a problem with existing IVR systems that are so vilified—is people put them in place five years ago and shut the closet door," Polkosky says. Occasionally, vendors get roped into refurbishing a clunker IVR. So in such cases, what should a tuner do? (One person’s response: "Try to get myself off the project.")

Difficulties
For the clearest diagnostic, Polkosky pulls information at the prompt level: prompts with the most errors, prompts that incited the most hang ups, and other statistics demonstrating how callers used the system. She recalls one IVR that didn’t have a particularly efficient or detailed data-capturing mechanism. "We got the best data we could manage, but I won’t say it was easy to get," Polkosky says. "And we had to organize it ourselves, and there were a lot of holes in it."

Even cutting-edge applications can instigate an adventure. When Goss designed the service contract application for an automotive company, she anticipated that the VINs would strain the speech recognition engine and need to be strenuously tuned. "The bottom line is we have to build the grammar for everybody," Goss says. "When we start doing the tuning analysis and listening to recorded calls, we might see that 95 percent of [callers] give the VIN in a certain format. That’s where you tune the grammar, where 95 percent of the callers are more accurately recognized, and you let the 5 percent [that aren’t] transfer to an agent."

While tuning a VUI to accept VINs is a relatively new area for Convergys, the vendor has a great deal of experience optimizing recognition on hard-to-capture areas. One of Convergys’s largest clients is the U.S. Postal Service. Most Postal Service systems need to recognize names and addresses, which historically are difficult to capture. Convergys has a variety of strategies to solve the problem and increase recognition rates. Sometimes, the solution is as simple as offering the option to spell the name.

Another possibility is building a name grammar around another piece of identifying information, such as, in the case of the automotive industry, a service contract number. A grammar can be constructed dynamically around that single bit of information. So instead of stuffing a grammar with millions of surnames, a grammar can be designed to correspond with a single identifying number.
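
The idea of a grammar built on the fly can be sketched as follows. Nothing here reflects Convergys's actual implementation: the record store, contract numbers, and function names are hypothetical, and a plain set of strings stands in for what a production system would express in a real grammar format.

```python
# Hypothetical back-end records: contract number -> names on that contract.
RECORDS = {
    "SC-1001": ["Maria Alvarez", "Luis Alvarez"],
    "SC-1002": ["Pat O'Neill"],
}

def build_name_grammar(contract_number: str, records: dict[str, list[str]]) -> set[str]:
    """Build a tiny name grammar limited to the people on one contract."""
    return {name.lower() for name in records.get(contract_number, [])}

def matches(heard: str, grammar: set[str]) -> bool:
    """Check a recognized name against the per-call grammar."""
    return heard.strip().lower() in grammar

# The caller gives the contract number first; only then is the name grammar built.
grammar = build_name_grammar("SC-1001", RECORDS)
print(len(grammar))                       # 2 candidates instead of millions
print(matches("maria alvarez", grammar))  # True
print(matches("Pat O'Neill", grammar))    # False: name is on a different contract
```

The recognizer now has to tell apart a handful of names rather than every surname in the country, which is where the accuracy gain comes from.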

One of the hardest systems to tune is a statistical language model (SLM). The speech recognition engine for an SLM assesses the probability of a word sequence to determine meaning. Because callers have more leeway to say whatever they want, their speech is less constrained than it would be in a less exotic IVR. While a non-SLM could entertain yes or no responses, SLM designers have to consider incorporating uh huh or sure. Thus, a tuning cycle on a non-SLM needs around 100 utterances per task, whereas an SLM requires anywhere from 30,000 to 50,000 utterances.
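
To give a feel for what assessing the probability of a word sequence means, here is a toy bigram model with add-one smoothing. It is a deliberately simplified sketch, not the model any vendor mentioned here uses, and the training sentences are invented. A production SLM would be trained on the tens of thousands of transcribed utterances described above.

```python
import math
from collections import Counter

def train_bigram(sentences: list[str]):
    """Count word pairs, with <s> and </s> marking sentence boundaries."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for s in sentences:
        words = ["<s>"] + s.lower().split() + ["</s>"]
        vocab.update(words[1:])
        contexts.update(words[:-1])             # every word that has a successor
        bigrams.update(zip(words, words[1:]))
    return contexts, bigrams, len(vocab)

def log_prob(sentence: str, model) -> float:
    """Log probability of a word sequence under the add-one-smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(words, words[1:])
    )

# Invented training utterances for a "How can I help you today?" prompt.
model = train_bigram(["i want to pay my bill", "pay my bill", "check my balance"])

likely = log_prob("pay my bill", model)
unlikely = log_prob("bill my pay", model)
print(likely > unlikely)  # True: the familiar word order scores higher
```

Every response callers actually give has to show up in the training data before the model can score it well, which is one reason the utterance counts for an SLM are so much larger.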

"We don’t pitch these everywhere, you can imagine," Pelland says. "Most of the customers I’ve been involved in where we sold these have highly segmented skill sets on their agents and have a lot of functionality they want in their IVR."

Think of the SLM as a nightmarish filing cabinet brimming with manila folders. In each folder are possible responses to a prompt like How can I help you today? A tuner sifts through all 50,000 utterances recorded during tuning, trashes the irrelevant responses—like responses that weren’t recognized because the caller was yelling at his son to take the fork out of the socket—and tries to determine where the remaining utterances should be filed.

"Some are tricky," Pelland says. If the caller says, I’d like to pay my bills and transfer funds, he’s essentially requesting two separate things and the tuner needs to figure out in which file that particular utterance needs to go. Tuning an SLM over a non-SLM, Burr estimates, is roughly triple the effort.

A Holistic Approach
When customers purchase a speech recognition system, they don’t always know what they’re getting into. Often they expect too much. "Out there in the industry, a lot of people will throw around numbers," Halperin says. For instance, customers might believe they’ll get 98 percent accuracy from an out-of-the-box IVR when certain qualifications—such as 98 percent accuracy with native speakers only—aren’t clarified. "Customers take it and run without really understanding what it is they’re looking at," Halperin adds. A tuning cycle can ultimately help the customer understand the system more thoroughly. And in understanding that system, better decisions will be made in how to integrate it into the overall business strategy.