Speech Technology Magazine

 

Usability Scorecard

By Edwin Margulies - Posted Nov 23, 2004

This is the third in a series of “interactive columns” in which our readership can participate by auditing self-service systems across nine industry categories. The idea is to highlight companies that score in the upper quartile of the Sterling Audits Usability Index. The index is a standard methodology for rating the overall efficacy of both Web sites and Voice Response Systems. In this issue, we concentrate on voice response systems in the transportation industry. Amtrak came out on top based on the surveys submitted.

Amtrak’s “Julie” Scores Highest on Usability Index
Amtrak is formally known as the National Railroad Passenger Corporation. The company’s passenger rail system spans 22,000 miles and covers 46 states. There are also two Canadian terminals in Montreal and Vancouver. Major cities are covered either directly or via thruway connections (buses). You can even charter a private train for meetings or special events.

The company has been carrying passengers since 1971. Last year, more than 24 million passengers used Amtrak, with daily ridership at about 66,000. As of the fiscal year ending September 2003, the company booked over $2 billion in revenue but operated at a loss of well over half that amount. In its most recent financial statement this past February, the auditors at KPMG said that Amtrak historically operates at a substantial loss and depends heavily on Federal subsidies to stay afloat. Despite this, the company has a decidedly above-average voice response system.

We chose the transportation segment randomly for this issue and tabulated its scores using the Sterling Audits Usability Index. The top metrics are: Navigation, Content, Usability, Interactivity and Credibility. We figured the upper quartile (Amtrak is the placeholder for the upper quartile here for simplicity’s sake), median and lower quartile scores for each of the top metrics individually. Then we added these for a grand total, which determines the top performer. We do this to ensure that the top overall performer can be recognized even if it is not the best in any single category.

Based on all of the surveys Speech Technology Magazine readers and our own researchers submitted, Amtrak came out on top. (See the usability scorecard sidebar). It is important to note that I did not participate in establishing the scores for Amtrak. What I do here is add my own interpretation and observations after having audited the system myself — after the scores were tabulated.

Navigation
Amtrak's Navigation score is 16.43 out of 20, which establishes a wide 7.04 point spread from the lower quartile in the transportation segment. The median for this segment was 12.28. I found some room for improvement after using the system myself. The system has a “persona” and her name is “Julie.”

The Amtrak voice response system uses a directed-dialog approach with automatic speech recognition. There's a mix of standard and custom vocabularies, including city pairs, states, and credit card/transactional grammars. There is no “hidden” natural language capability: you get errors if you try to provide multiple tokens in the same sentence (not counting multi-digit input on credit cards, that is).

The "Hello" message runs a total of 12 seconds owing to an extraneous Web site advertisement (see the Content section). The actual main menu after that is just over 12 seconds, which makes for about 25 seconds before a new user is fully briefed on what to do in the first turn. The main menu (spoken input) choices are: a) status; b) schedules; and c) reservations - flowing from general to specific in subsequent menus.

As an alternative to saying "status," "schedules," or "reservations" in the main menu, you can use equivalent touchtone numbers based on the order of the directed speech dialog choices. That is, you can use a touchtone "one" for "status," a "two" for "schedules," and so on.

The system also accepts a touchtone “zero” for operator assistance at any point. You can also get an agent by saying “agent” or “operator.” There is no attempt to “hide the zero.” You can also use a touchtone "one" and "two" for yes and no, respectively. You can use a touchtone "*" for help or alternately, you can speak the word "help” to get instructions. “Help” is either a repeated menu or a short explanation followed by a repeated menu.
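As an illustration only, the touchtone fallbacks described above amount to a simple key-to-meaning lookup. The table and function names below are hypothetical, not Amtrak's implementation, and real systems scope such maps per dialog state (for example, "1" and "2" also mean yes and no in confirmation turns):

```python
# Hypothetical lookup table for the touchtone fallbacks described above.
# Names and structure are invented for illustration.

MAIN_MENU_DTMF = {
    "1": "status",
    "2": "schedules",
    "3": "reservations",
    "0": "operator",   # operator revert, accepted at any point
    "*": "help",       # usually a repeat of the previous menu
}

def normalize(token):
    """Map a DTMF key to its spoken equivalent; pass speech through unchanged."""
    return MAIN_MENU_DTMF.get(token, token)
```

With a table like this, the downstream dialog logic only ever sees spoken-vocabulary tokens, which is one way to keep touchtone and speech input consistent.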

In 1987, Genesis Electronics, an early maker of voice messaging systems, established the use of "*" as a repeat key. Today, the “*” or “star” key is generally accepted as a standard navigation key for repeating the previous menu. Here, its use is prompted as "help" - but since most help in this system is a repeat, that's a valid use of the key.

Unfortunately, "stop" and "previous” - two logical choices for standard navigation words - don’t seem to be in the vocabulary. "Go back” seems to work sometimes, but that could be false acceptance. There is no strikeout limit on the number of times you can say “help” or “repeat” in the same dialog turn. This means if you say "help" continually at the same turn, you'll stay there forever. There should be a global "n times" parameter to trap this - at the very least an operator revert for anyone who repeats more than three or four times in the same turn.
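The missing safeguard is easy to sketch. Below is a minimal, hypothetical version of the global "n times" trap suggested above; all names are invented for illustration:

```python
# Hypothetical per-turn "strikeout" counter for repeated help/repeat
# requests, with an operator revert after a global limit is reached.

MAX_REPEATS = 3  # the global "n times" parameter the column recommends

def handle_turn(utterance, state):
    """Return the next action for one dialog turn, tracking repeats in state."""
    if utterance in ("help", "repeat"):
        state["repeats"] = state.get("repeats", 0) + 1
        if state["repeats"] >= MAX_REPEATS:
            return "transfer_to_operator"  # trap the endless help loop
        return "replay_menu"
    state["repeats"] = 0  # any substantive input resets the counter
    return "process_input"
```

A caller who says "help" three times in the same turn would be routed to an operator instead of looping forever.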

Despite these observations, the Navigation of the Amtrak system is certainly better than most. You can deduce from this that many systems need a navigation overhaul.

Content
Amtrak's Content score is 16.67 out of 20 — an 8.04 point spread from the lower quartile in the transportation segment. The median for this segment was 13.08. What’s readily apparent is the depth of automation in the Amtrak system. All of the most common functions are fully automated — from obtaining a schedule, to booking a reservation, to paying by credit card. This depth of automation is revealed assuming you are able to complete each task without a time-out or error-out to the operator.

It was not obvious that the system offered alternate access to content through other media such as mail or fax. This would be helpful for the computer-less out there who may want a schedule or reservation confirmation faxed or mailed.

Words and their meaning were right on with the Amtrak system. The volume of words was just right - the scripting neither turgid nor terse. I didn't feel compelled to ask for an explanation of anything.

The only content not needed, as mentioned in the navigation section, was the “lower fares on the Web site” disclaimer in the greeting. The brevity of choices after the greeting is a saving grace in this system because the extraneous "lower fares" message tears away at short-term memory. My gut is the lawyers threw that in there despite the objections of the designers.

Sadly, the disclaimer message could easily have been inserted before credit card confirmation or before you go to an agent. Putting it up front makes for an average opening in what could have been an excellent one.

Usability
The Amtrak system took a dip in the Usability department compared to the median performers. This is the only area where Amtrak scored lower than the median. Amtrak rated a 13.75 on a scale of 20, whereas the lower quartile was 12.38 and the median 14.65. This is the tightest scoring cluster of any of the five top metrics — with only 2.27 points between the top performing quartile and the lower quartile. Other “upper quartile” transportation companies scored higher than Amtrak here, but Amtrak’s total score edged them out.

What strikes you right off the bat is the Spanish language prompt at the end of the main menu (if you don’t skip over it). From an intelligibility standpoint (part of usability), the Spanish prompt is much lower in volume than "Julie" (Julieta?), and full of static. With all the work these folks obviously put into the persona of this system - this is a real letdown (more on this in the section on "Credibility").

There was no shoptalk, so that helps usability, and callers are able to barge-in or override the prompts so they are not forced to listen. This is especially important for so-called "power users," because they establish a preferred task sequence from frequent use and just blow past the prompts. It's not a good idea to get in the way of this. The designers of this system were smart enough to leave touchtone compatibility in the system to allow for this - power users typically use touchtones for speed.

I mentioned before that the greeting and main menu takes about 25 seconds. If you eliminate the extraneous message we talked about, it would be even shorter. I'm all for that, but in this case, there are 56 words in these 25 seconds, which is a little too fast for most folks. Usability test subjects have proven that task completion can be increased and repeats can be decreased if speech pacing is between 50 and 60 words per minute.
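For reference, the pacing implied by those figures is straightforward arithmetic; the 56-word and 25-second counts are the column's own:

```python
# Words-per-minute implied by the column's figures:
# 56 words delivered in 25 seconds.
words, seconds = 56, 25
wpm = words / seconds * 60  # about 134 words per minute
```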

"Julie" has trouble with single-digit input in some places, for example, when you are selecting departure or arrival hours. I observed this when saying “4” instead of "4 o’clock" or "4 PM," in which case I got an error message. Ditto "Six" instead of "Six o’clock." Strangely, “10” works solo and the system simply asks if that’s AM or PM.  Multi-digit, multi-token input behaves better, but I did have to repeat my input several times on the credit card part.

The number of steps to final resolution or final task completion (in this case a complete reservation paid by credit card) figures big in Usability. From start to finish, there are about 35 steps to book a reservation and pay for it using this system. This number may sound shocking, but I can tell you that most of the dialog turns are necessary. But by my reckoning, 25 percent are not necessary. The system uses a lot of explicit versus implicit confirmations. This means that most numbers are repeated back followed by a confirming yes/no query. This gets tedious quickly and is one of the reasons why Amtrak scored lower than it could have in the Usability department.
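To make the explicit-versus-implicit distinction concrete, here is a hypothetical sketch; the prompt wording is invented, not Amtrak's actual scripting:

```python
# Hypothetical contrast between explicit and implicit confirmation styles.
# The prompt wording is invented for illustration.

def explicit_confirm(slot, value):
    # Reads the value back and demands a yes/no answer: one extra turn.
    return f"I think you said {value} for the {slot}. Is that correct?"

def implicit_confirm(value, next_prompt):
    # Echoes the value inside the next question, saving a dialog turn;
    # the caller speaks up only if the echo is wrong.
    return f"{value}. {next_prompt}"
```

Swapping explicit confirmations for implicit ones on low-risk slots is one common way to trim a long reservation dialog without losing accuracy on the slots that matter, like card numbers.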

Turn-taking indicators in this system are pretty obvious - mostly dialog pauses and syntax are used. But considering the obvious voice talent coaching that went into this design, prosody could have played a larger role. Fortunately, the application is consistent throughout, which is always a comfort to users. The system also uses “earcons” (plunka plunka plunka) but seemingly to buy time while fetching host data — not as a turn-taking indicator.

Interactivity
Amtrak’s voice response system scores higher than other performance quartiles in the Interactivity area, but there is still room for significant improvement. The system scored 14.07 out of 20 here, which distances the lower quartile score of 7.82 by 6.25 points. The median was 10.57. Still, it’s the second weakest area for Amtrak in the overall Sterling Audits Usability Index.

From an Interactivity standpoint, there is some good news here. First, the system is able to drive virtually every task to a successful conclusion, even after errors. That means the error recovery routines are fairly robust. Also, host response time on most functions seems pretty quick - between one and two seconds.

Second, error management for this system - except for the "I'm sorry" crutch - see the Credibility section - is pretty good. When I input a bad credit card number for a reservation, I heard: "Let’s try again with a different card...." This is a blameless and fairly sophisticated way of dealing with a bad card. And instead of just opting out to an operator, the caller has a chance to rescue the transaction by simply trying another credit card. After 30-odd dialog turns, I'd be happy to give it one more go just on principle... Modern systems use the Luhn Mod 10 checksum algorithm on card numbers before doing an on-line verification. This saves on transaction costs and speeds the verification process. I don't know if that's what Amtrak is doing - but it's likely.
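For readers unfamiliar with it, the Luhn mod 10 check is a simple digit-doubling checksum. Here is a minimal sketch of the standard algorithm - nothing specific to Amtrak's system, which may or may not use it:

```python
# Standard Luhn (mod 10) checksum: catches most single-digit typos and
# adjacent transpositions in a card number before any on-line verification.

def luhn_valid(card_number):
    digits = [int(d) for d in card_number if d.isdigit()]
    total = 0
    # Walk from the rightmost digit; double every second digit,
    # subtracting 9 whenever the doubled value exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

The well-known test number 4111 1111 1111 1111 passes the check, while altering any single digit makes it fail, which is why the check is cheap insurance against mis-keyed input.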

Third, the dialog style is direct with dominant use of the active voice. You'll hear imperative active voice phrases like: “Please enter your account number,” versus "Your account number is needed." The system does not filibuster in its instructions either, so that helps with interactivity.

Timeouts for starting an entry are pretty flexible. For example, on the first pass, the system waits for several seconds and then says a few words about how an approximate time is "OK" when it's looking for a departure or arrival time. This is very well thought-out — that is, to shut up and wait for the entry until it's obvious you're not going to get one. All too often, systems "step on" the turn just when a caller is getting ready to speak.

The system is not quite as "smart" in dealing with timeouts between multiple typed or spoken digits. Here, it'll wait two to three seconds and say, "I don’t understand." It rankles most users when a machine says it doesn't understand when they’re not done with input. Personally, it just seems weird to me when a machine says it doesn’t understand silence. A prompt like "Do you need more time?" or "Let's try that again" is preferable.

In general, the treatment of silence goes like this: a) first pass: "You can also say agent;" b) second pass: "Sorry I didn’t hear you... say ‘help’ or ‘agent;’" c) third pass: "Sorry I still didn’t hear you;" and finally d) last pass: "Sorry I’m having trouble understanding you… (opts out)."
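That escalation ladder can be sketched as a small table-driven routine; the structure below is hypothetical and the prompts paraphrase the column:

```python
# Sketch of the four-step no-input escalation described above.
# Prompt text paraphrases the column; the structure itself is hypothetical.

SILENCE_PROMPTS = [
    "You can also say agent.",                           # first pass
    "Sorry, I didn't hear you. Say 'help' or 'agent'.",  # second pass
    "Sorry, I still didn't hear you.",                   # third pass
    "Sorry, I'm having trouble understanding you.",      # last pass
]

def on_silence(timeout_count):
    """Return (prompt, opt_out) for the nth consecutive no-input timeout."""
    idx = min(timeout_count, len(SILENCE_PROMPTS)) - 1
    opt_out = timeout_count >= len(SILENCE_PROMPTS)  # revert to an operator
    return SILENCE_PROMPTS[idx], opt_out
```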

The system allows for “three strikes and you’re out” (opt out) on garbage. Three "*" key presses also get you an operator. Strangely, saying "repeat" three times in a row does not invoke the same action.

Credibility
Amtrak scored higher in Credibility than in any other top metric — a 17.93 out of 20. The lower quartile score in transportation was 14.05 and the median was 16.39.

Most respondents characterize Amtrak as a "friendly-sounding machine." You can tell that the designers are shooting for personification of the system - and it does seem kind of like a person without being too forced. But there are major clues that you're talking to a machine, which is just as well. There's no sense building a completely anthropomorphic interface when there are dead giveaways that it's just a machine.

As an example, the Spanish prompt at the end of the main menu is obviously not done with the same voice talent, and as mentioned before, the volume is lower and the recording is scratchy. Certainly, a person would not be talking to you and then all of a sudden play a tape of another speaker in the middle of the dialog. This is a real letdown, even though most people who use a persona-equipped system are just playing along.

I've got usability test subjects on video actually saying "please" and "thank you" to Julie. Several crack a smile when Julie first introduces herself, which is an indication that they'll probably play along. (It's not often callers crack a smile when they first speak to a human - so most folks are completely aware of what's going on even with a well-designed persona.)

The apologetic tone of the system speaks to Credibility. I counsel designers against being overly apologetic in these systems because it seems insincere - and it seems more so each time a mistake is made. Julie is "over the top" in her apologies. In fact, in some places the talent was coached to actually "stammer" and insert "ummms" as discourse markers. Some of the chattiness is fine. For example, Julie uses "OK" and “got it” to indicate a successful turn. She also says: "Great, I’ll be able to help you…" and "Let’s get started." The confirming phrase "I think you asked for… is that correct?" is used appropriately as well.

But the chatty discourse loses its credibility and "realness" when you have to repeat and you hear the same prosodic cues and discourse markers as in the original sentence. For example: "…aaand what’s the arrival city?” (note the stretch on “and”) and "Beeefore I pull up schedule information, will you be needing price information?” (note the stretch on "before"). Chatty discourse like hanging on the "be" in "before" works all right on the first pass, but when you say “repeat” and Julie says the same phrase with the same prosodic stretch, it sounds unnatural. My suggestion is that the prompts used for repeats should differ from the first pass.

To be fair, the Amtrak system scored very high in Credibility, with clear diction, consistency, and a friendly persona. But there's always room for improvement and that's what usability benchmarking and usability testing can do for you.

We urge you to participate in this column by doing your own surveys of systems you regularly use. See the “Usability Scorecard” sidebar for instructions.



Usability Scorecard - Transportation Segment
Based on all of the surveys completed, Amtrak was tops in all categories with the exception of Usability.

NAVIGATION Sample Question: Main Menu Length
We ask survey respondents to answer many navigation questions. One of the more telling ones deals with how long the main menu is after the greeting. We ask our researchers to “estimate the length of main menu after the ‘hello’ greeting.” With Amtrak, there is an extraneous “try our Web site” message, which most people count as part of the greeting. Afterwards, the three main menu choices, alternate language and operator prompt amount to about 12 seconds.

CONTENT Sample Question: Depth of Automation
Depth of Automation is one of many important indices in the Content part of the Sterling Audits Usability Index. Here, we ask our researchers to tell us if the system allowed them to complete a variety of tasks without going to an operator. Respondents answer on a scale of 1-to-10, one being: “I couldn’t really do much with the system at all” to 10 being: “The system allowed me to complete all the transactions I needed without an operator.” Amtrak scores a “10” here. You can get schedules, make a reservation and even pay for it — all without human assistance.

USABILITY Sample Question: Speech Pacing
Speech Pacing is a big Usability metric. In our surveys, we ask researchers to rate the speed or pace of the prompts in the target system. This is done on a scale of 1-to-10, one being: “way too slow” and ten being: “way too fast.” Amtrak scores an “eight” here. For example, the greeting and main menu take 25 seconds. There are 56 words in these 25 seconds which is a little too fast for most folks.

INTERACTIVITY Sample Question: Fall-Back to Touchtone
Fall-Back to Touchtone figures in to the Interactivity part of the Sterling Audits Usability Index. For speech-enabled systems, respondents are asked if they were able to use touchtones if the speech recognition didn’t seem to work. Amtrak does well here and allows DTMF (touchtone) fallback with consistency.

CREDIBILITY Sample Question: Apologies and Blame
Information on how systems deal with Apologies and Blame is accounted for in the Credibility part of the Sterling Audits Usability Index. Here, we are trying to get the users’ views of whether or not the system seemed overly apologetic when mistakes happened, or if it blamed the caller for all the mistakes. This is tabulated on a scale of 1-to-10, one being: “the machine was overly apologetic” and 10 being: “the machine blamed me for everything.” Amtrak’s “Julie” persona is a bit fawning — scoring a “four” on the scale.

How You Can Participate in this Column
Just log on to the research portal at http://www.sterlingaudits.com/research.html. Sign up as one of our researchers. When you input the company name of the voice response system you wish to survey, put “STM” behind the name. The syntax: “ABC Incorporated STM.” This will allow us to distinguish surveys submitted by the readership from those of the regular research staff. Submit a few of the companies you do business with as projects. Once approved, you’ll get a notice to go ahead with the survey the next time you log on. You get a $10 stipend for your trouble. The 10 bucks is nothing, but you’ll be part of supporting research we can all take advantage of. There are over 200 in-depth questions, so be prepared to spend an hour on the first one. An on-line dictionary explains “shop talk.”


Edwin Margulies is co-founder of Sterling Audits, a firm dedicated to quality improvements in customer service automation and contact centers. The company specializes in benchmarking the usability of self-service systems. As EVP and chief of research, Margulies is responsible for research projects including the Web Site Usability Almanac 2004 and the Voice Response Usability Almanac 2004. He is also on the board of directors of AVIOS (Applied Voice Input/Output Society), where he participates on the marketing and conferencing committees. He can be reached at 702-341-0314 or ed@sterlingaudits.com.
