The Elements of Style

When a magician fans out his deck of cards face down and asks an audience member to pick a card, any card, he’s not actually allowing her to do so. Through a sequence of carefully worded prompts, the magician guides her to the card she will ultimately select. If she selects the card the magician expects, the trick is a success and everybody is delighted. If not, and the magician gets flustered, the situation becomes awkward.

A voice user interface (VUI) is similar in that its success hinges on its ability to guide the caller step by step, eliciting responses that will enable the system to help solve a problem.

While there’s certainly consensus on such a fundamental principle, conflict emerges at the skeletal level of VUI design, which is composed of three components that should work in sync with one another. First are the grammars, which encompass all the possible things an end user might say. Second are the prompts, which the system says to provoke a caller response. Third is dialogue flow, which is the way the entire structure is organized.

Grammatical Corrections
"I hate your stupid VUIs!" the end user screams as she hurls her iPhone at her VUI designer husband’s head. "They never understand what I say!" To save the marriage, the husband stuffs more grammars into his next VUI design, thus allowing the system to comprehend a greater variety of end-user responses. Where a menu option allows end users to check account balances, the embattled designer simply adapts the system to also accept account balance, balances, balance, account, and alimony. The problem with this idea, however, is simply statistical. The more items added to a grammar, the easier it is for recognition to go wrong by getting confused for some other homophonic option.

Susan Hura, founder of SpeechUsability and a board member of the Applied Voice Input/Output Society (AVIOS), defines the programming of grammars as a balancing act: adding enough that the system recognizes common phrases, but not so much that recognition as a whole suffers. And, according to Melanie Polkosky, a human factors psychologist specializing in VUI design, grammar creation becomes overcomplicated when the people at the helm become wary of errors. Yet a good design should have a good error recovery system.

Dave Pelland, director of the Intervoice Design Collaborative, cites an example in which two people with similar names are listed in the directory of the same company, making ambiguity inevitable. In such a case, the designer can add dialogue to distinguish between the two:
End user: Call Tim Knight.
IVR: Was that Kim Knight?
End user: No.
IVR: Then how about Tim Knight?
End user: Yes.
IVR: Dialing…

Because IVRs lack a human’s full range of comprehension skills, designers must find ways to limit what users say. When the IVR asks whether the desired contact is Kim Knight, it’s already limiting the end user’s response. According to Pelland, before a speech application is deployed, the designers collect utterances to see what people might say at each point in the dialogue. Most grammars contain between 10 and 20 phrases.
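The Kim Knight/Tim Knight exchange can be sketched as a short confirmation loop. This is only an illustration, not Intervoice's implementation: the function name `disambiguate`, the `ask` callback standing in for a telephony layer, and the small set of accepted "yes" words are all assumptions made for the example.

```python
# Hypothetical set of replies treated as confirmation.
YES_WORDS = {"yes", "yeah", "yep", "correct"}


def disambiguate(candidates, ask):
    """Offer each candidate name in turn until the caller confirms one.

    candidates: names ordered from most to least likely (for example,
                a recognizer's ranked guesses; assumed here).
    ask:        callable that plays a prompt and returns the caller's
                reply as text (a stand-in for a real speech layer).
    Returns the confirmed name, or None if every candidate is rejected.
    """
    for index, name in enumerate(candidates):
        # The first guess and later guesses are worded differently,
        # mirroring the dialogue in the article.
        if index == 0:
            prompt = f"Was that {name}?"
        else:
            prompt = f"Then how about {name}?"
        reply = ask(prompt).strip(" .!?").lower()
        if reply in YES_WORDS:
            return name
    return None
```

Because each question invites only a yes or a no, the grammar active at that point stays tiny, which is the limiting effect Pelland describes.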

Juan Gilbert, an associate professor in the Computer Science and Software Engineering Department at Auburn University who teaches a class on VUI design, emphasizes the importance of this tuning cycle. "A grammar is a living thing," he says. "You don’t design a grammar and then deploy it and then it’s done. Some unpredictable response might end up being common."

If a pattern of vocabulary recognition errors emerges in which a segment of the population misinterprets a prompt and responds in an unrecognizable way, it’s time to hone the system. Sometimes that means simply adding more grammars.
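One way to picture that tuning step is as a tally of logged utterances the grammar failed to cover. The sketch below is an assumption-laden illustration: the function name `tuning_candidates`, the plain-text log, and the `min_count` threshold are all hypothetical, and a real deployment would work from recognizer logs rather than clean strings.

```python
from collections import Counter


def tuning_candidates(logged_utterances, grammar, min_count=3):
    """Return out-of-grammar utterances that recur often enough to review.

    logged_utterances: transcribed caller responses at one dialogue point.
    grammar:           set of lowercase phrases the system already accepts.
    min_count:         hypothetical threshold for "common enough to matter".
    """
    normalized = [u.strip().lower() for u in logged_utterances]
    misses = Counter(u for u in normalized if u not in grammar)
    # most_common() sorts by frequency, so the worst offenders come first.
    return [(u, n) for u, n in misses.most_common() if n >= min_count]
```

A phrase that clears the threshold is a candidate for the grammar, though a designer might instead reword the prompt that provoked it.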

Communicating in a Marriage
That brings up the familiar issue of bloating a system. For grammars to function at their unambiguous best, they need to be closely linked with the prompts. Designers coming from software backgrounds tend to segregate the two, developing grammars independently, as if they were strings of code. This is the wrong mentality, according to Gilbert.

"Your prompt and your grammars have to be tightly coupled, like a husband and wife pair," he says. "Think of them as a team, as a partnership." And a good partnership between prompt and grammar constrains what people say to something predictable, something that can be programmed into the system’s expectations.

Anyone who has discovered lost pieces of their children’s Lego sets by stepping on them knows that the bricks consist of different shapes and sizes, some more painful than others. Each piece has a specific function. The part shaped like a wing, despite its astonishing efficiency at lacerating a foot, would be superfluous adorning a Lego car. Likewise, it’s important that the spoken components of a VUI come together in a certain way. "All conversations with VUIs are made out of building blocks, which you see over and over again," says Robby Kilgore, creative director of professional services at Nuance Communications. "We all have the same toolbox of so many kinds of building blocks. How you use those and how you fit them together is the finesse-y part."

For instance, when asking for a phone number, an IVR might say either Please tell me your phone number, starting with the area code or Can I get your phone number, starting with the area code? The latter causes some VUI designers to bristle. What if, they ask, the end user simply says Yes?

"But people don’t do that," Kilgore argues. "There might be one person on the planet who will say Yes, you can, and nothing further. But 99.9 percent of the people will just tell you their phone number. There are certain things that are natural to the way people talk."

Of course, in cases that might elicit a broader response, it’s best to keep questions from being too open-ended. Open-ended questions invite a variety of unpredictable answers, disrupting the system’s comprehension and the serenity of the end user. Consequently, many designers include examples of possible responses.

A prompt that has been successfully used at Intervoice goes as follows:
IVR: <Jingle> XYZ Company <beat> Para continuar en español, marque el número ocho.
In a few words, tell me what I can help you with today. <beat> You can say anything from I want to pay my bill to my screen is black.

The first block of speech contains a brief greeting, followed by an opt-out for Spanish. The next block emphasizes that the end user must use "a few words." This is followed by examples worded in ways that make it clear they’re examples and not part of the menu proper.

Modeling a proper response is particularly important, even for something as simple as a date of birth, for which end users might say June 15, 1984 or 6/15/1984 or 6/15/84. Even explaining how to say a birthday can be verbose: Tell me the name of the month, the date, then the year. To simplify, Hura suggests: Tell me the date, like June 15, 1984.
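The variety of date forms is easy to see once they are written down. The following sketch assumes the recognizer hands back text in one of the three forms mentioned above; the function name `parse_birth_date` and the list of formats are illustrative, and month names are parsed according to the running locale (English is assumed).

```python
from datetime import date, datetime

# The three written forms cited in the article, as strptime patterns.
DATE_FORMATS = ("%B %d, %Y", "%m/%d/%Y", "%m/%d/%y")


def parse_birth_date(text):
    """Try each known format; return a date, or None if none match."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # Not this format; try the next one.
    return None
```

Two-digit years are inherently ambiguous for birth dates (Python's `%y` maps 69 through 99 to the 1900s and 00 through 68 to the 2000s), which is one more reason to model the full four-digit form in the prompt.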

Yet, unlike Intervoice’s prompt, Hura prefers to keep her model limited to one. "What doesn’t work is if you give people multiple examples," she says. "[Some users] think those are the only options they’ve got. Obviously, not every caller is going to believe that, but enough people do believe that it’s a menu." Instead of toggling with the syntax, the most direct remedy, for Hura, is to give people only a single example.

Additionally, excessive instruction might irritate callers. While it’s important for end users to stay oriented within an IVR, there’s also the danger of making the system sound pedantic. Most VUI designers realize the best way to handle instruction is to distribute it throughout the script, presenting it only when callers are likely to use it. For instance, if the IVR asks for an account number and the caller doesn’t give a valid response, the system could react by saying, To move on, I need your account number. You can find your account number in the blue box just under your name on your account statement.
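Distributing instruction this way amounts to an escalating reprompt: a short question first, with the detailed guidance held back until the caller stumbles. In the sketch below, the opening prompt, the eight-digit validity check, the three-attempt limit, and every function name are assumptions for illustration; only the second prompt comes from the example above.

```python
ACCOUNT_PROMPTS = [
    # Assumed opening prompt; kept short on purpose.
    "What's your account number?",
    # Detailed instruction, played only after an invalid response.
    "To move on, I need your account number. You can find your account "
    "number in the blue box just under your name on your account statement.",
]


def looks_like_account_number(reply):
    """Hypothetical check: exactly eight digits, spaces ignored."""
    digits = reply.replace(" ", "")
    return digits.isdigit() and len(digits) == 8


def collect(ask, is_valid, prompts, max_attempts=3):
    """Ask until a valid reply arrives, escalating through the prompts.

    Once the list of prompts is exhausted, the last (most detailed)
    prompt is repeated. Returns the valid reply, or None on giving up.
    """
    for attempt in range(max_attempts):
        prompt = prompts[min(attempt, len(prompts) - 1)]
        reply = ask(prompt)
        if is_valid(reply):
            return reply
    return None
```

A caller who answers correctly the first time never hears the longer instruction, which keeps the system from sounding pedantic.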

It does little good to have, at the beginning of the call, a prompt that chirps, Remember, at any point in this call, you can say help, repeat, main menu, or goodbye, because at any point during the call, end users are more likely to forget and simply hang up.

A distinction needs to be drawn, however, between providing instruction and offering a help feature that, if a caller voices a need for it, switches to a prompt that provides more detailed direction in navigating the system. "There are people who will still say you have to allow callers to say help everywhere and respond to it with a so-called helpful prompt," says Phillip Hunter, vice president of the Voice Interaction Group at SpeechCycle. "There’s another camp that says help doesn’t get used, and when it does, it gets used poorly."

Hura clarifies: "From my experience in usability testing, when people say Help, what they really want is a person. When you offer them help and give them anything other than a transfer to a human being, they feel fooled."

Ever So Formal
The storm over whether or not to program a personality into an IVR has calmed, with most VUI designers settling for a happy medium.

Different demographics have different expectations, so the question of formality needs to be assessed on a case-by-case basis. "Words have a lot of power," Nuance’s Kilgore says. And formality, he adds, exists on a spectrum rather than at two separate poles. Even at its most refined, a system’s formality shouldn’t be distracting. I’ll look up your account now is polite, yet natural. By contrast, I will now look up your account plunks the adverb in an auditorily awkward place and expands the contraction, as if the designer were writing a Victorian novel and not a VUI prompt.

One way to keep a system from sounding too stilted is through idioms that don’t feel idiomatic, ones that have become so embedded in the language that they often go unnoticed. "I’m often asking our designers to keep an eye out for phrases that are naturalistic," Kilgore says.

For example, while a prompt that says OK, you’re good to go, might be too hokey, one that says You’re scheduled for an appointment on Thursday at 2 p.m. Does that still work for you? is less noticeable. Others include What’s a good day for you? or You’re all set. All are phrases that have become commonplace in everyday conversation, even conversation with human customer care representatives.

Of course, an IVR isn’t a human and there’s still debate about whether a system’s language should include the first person. "Almost everybody now is used to hearing an I’m sorry apology if the system can’t recognize you," Intervoice’s Pelland says. "The question is really how much do you put that in all of the prompting?"

According to Gilbert, it depends on the end user. Teenagers who spent their formative years raised by a PlayStation are used to personas and would naturally be more accepting of an IVR that had a personality. By contrast, if "you have another group of people who are not technology-savvy, you want your system to come across as this dumb computer," he says.

Even so, the programming of the first person is a nuanced practice. "First person is a lot trickier than people make it out to be," Hura says. For instance, when the system needs to take responsibility for what’s happening (say it misunderstands the caller), it might say Sorry, I didn’t get that. We need to get your account number. But there’s a distinction between the first person singular and the first person plural. In the first sentence, the machine misunderstood the caller, so it apologizes. In the second, however, it uses we not to indicate itself, but to indicate the entire institution it represents.

An alternative wording might simply go: Sorry. We need to get your account number. In this instance, sorry isn’t so much an apology—which might sound unnatural coming from a machine—but a synonym for pardon me, a way of acknowledging that something in the discourse went wrong. This avoids the possibility that the machine is making a bigger deal out of an error than it should.

"Errors in communication are rampant in every conversation," Polkosky says. "If you design [the system] right, your initial prompts and error recovery structure appropriately, then [callers] won’t perceive them as errors."

Putting It Together
So how do you design a system so that it makes for a pleasing customer experience? For Kilgore, whose background working with popular media often had him assembling music and film, the entire process is like putting together a plot line in which the writer has to decide whether a given sentence pushes the narrative forward. "When somebody calls up a system," he says, "they’re listening to a three- or five-minute story, not thinking of the atomic details."

When Polkosky revised a technical support call center for a major consumer electronics company, she faced a creaky system with a verbose, aggressive style. So she resequenced the prompts to offer the easy questions first and the harder questions later. "One of the pieces of feedback when we implemented that [new] system was that it delivered customers to reps much calmer and ready to be helped," she says, "as opposed to being even angrier than they were before."

Polkosky is particularly in tune with the pragmatic, social uses of language. A system that states Please say X, Y, or Z might seem too dictatorial. Would you like X, Y, or Z? aligns more with the customer-friendly environment a caller expects. While some VUI designers might object that such a prompt is too indirect, Polkosky argues that the softer tone leaves the system room to be more direct where necessary, such as in error recovery.

Regardless, VUI designers tend to agree that the initial prompts should be clear and not too numerous, capped at around three or four options to avoid confusing the caller, particularly at such an early stage in the interaction. SpeechUsability’s Hura, however, disputes the cap, saying that short-term memory is more complex than a simple count suggests. In her view, a more important factor than the total number of options in a list is how distinct each option is from the others.

Whichever side is correct, both agree that the clarity of the prompts is paramount. "If somebody calls into a bank application and wants to get a copy of last month’s bank statement, you have to make sure it’s pretty obvious at that top menu which option they need to choose to get to their bank statement," Pelland says.

With every rule of thumb, there is a caveat, alternative, or addendum. Part of this is because the approach to each VUI should be assessed individually and design-based decisions should be dependent on caller demographics and each individual client’s wishes.

The industry’s increasing focus on customer satisfaction has changed the way designers approach their craft. Truman Capote said that "Writing has laws of perspective, of light and shade, just as painting does, or music. If you are born knowing them, fine. If not, learn them. Then rearrange the rules to suit yourself."

VUI design has a different set of rules than the art of fiction, but it’s also a markedly newer form, still too immature, most designers contend, for any rules except the broadest to be firmly entrenched.
