June 30, 2003
By Bill Byrne Senior Voice Interface Engineer - Google, Inc.
Features

"Conversational" Isn't Always What You Think It Is

During the past several years, many in the speech industry have focused a great deal of energy on promoting "conversational" interfaces in the name of usability and profitability. The claim is that this type of speech interface provides the best user experience and therefore will produce the best return on investment. In general, this is correct: Designs based on common conversational schemas should improve usability, because callers are able to base their interaction on well-established mental models. However, the term "conversational" has been overgeneralized to refer to only one kind of conversation: the kind you'd expect to have with a consumer-facing call center agent who has never met you. This is not surprising, since many of the most prominent speech applications have been deployed in consumer-facing call centers. But conversation has many different forms, from the familiar, informal style one might use with a spouse or a close friend, to the terse, matter-of-fact style one might expect to hear between a chef and the wait staff in a busy restaurant, to the ultra-polite, highly formal style one would experience in a British bank. As speech moves beyond the call center and into the enterprise, we are finding user requirements that do not fit the familiar conversational mold we're used to. To achieve a truly usable speech interface, conversational style must be appropriate both for the task at hand and for the type of relationship the caller expects to have with the virtual agent. What's more, for some speech applications, the basic tenets of conversational interface design may not even apply.

Conversation comes in many forms

As mentioned above, human conversation is very dynamic and not all speech applications should emulate interacting with the familiar consumer-facing call center agents often heard on show floor demos or speech application vendor Web sites. In fact, the best speech interfaces can sometimes be rather terse or even seem impolite when taken out of context. And yet they are still quite "conversational." Let's look at two different examples.

First, take the following recommendations found in the recently published book The Art and Business of Speech Recognition: Creating the Noble Voice by Blade Kotelly (p.28, Addison-Wesley 2003):

In the first dialog below, the clerk fails to explicitly acknowledge that he's heard and understood the driver's responses, which, as Kotelly points out, would most likely be perceived as impolite in a scenario where the clerk worked for the department of motor vehicles and was attending to customers he didn't know.

CLERK: Make of car?

DRIVER: Uh Mercedes

CLERK: Model year?

DRIVER: It's a 1970.

CLERK: Color?

Kotelly's second dialog below shows how the situation could be improved.

CLERK: What's the make of your car?

DRIVER: Uh Mercedes

CLERK: OK. And the model year?

DRIVER: It's a 1970.

CLERK: Got it. What's the color?

He then goes on to say, "If the motor vehicle department employee were acting more like a human and less like a machine, the conversation would feel more polite, friendly and natural." But the first, more terse dialog is not any more machine-like or less human than the second. Rather, it's simply less appropriate for the context of this particular application. Suppose instead that the two participants are co-workers on an automobile assembly line. Suppose the "clerk" in this case has the job of writing the specifications down for each vehicle that passes by and the "driver's" job is to read the specifications back to him. Let's also say that the two are paid by the number of forms they complete each day. In this case, the first dialog is much closer to what we would actually expect and the second would in fact be quite odd. In other words, the first dialog is just as "human" and "conversational" as the second. It's just that the type of conversational language we use depends on the scenario and task at hand.

Another example: When pilots do a pre-flight check of an airplane, they must go through a lengthy list of items to ensure the airplane is safe and legal for flight. When a co-pilot or passenger is available, the pilot will often ask for assistance with this task. The resulting conversation sounds much like the following:

Passenger: Fuel?

Pilot: Check.

Passenger: Flaps?

Pilot: Check.

Passenger: Ailerons?

Pilot: Check.

Passenger: Antennae?

Pilot: Check.

Now, if the conversation described above were a speech recognition application, many would say that it was not "conversational" at all. After all, it doesn't sound as "friendly" as what you'd hear from a good call center agent. But that's irrelevant. The conversational schema for this task is well-entrenched in the minds of those who use it and it's the designer's job to make sure the application style takes advantage of this.

"Personality" does not mean "a lot" of personality

We've now seen how "conversational" can be misinterpreted by designers, which may prevent them from using the language and style most appropriate for a given speech application task. A related problem is found in "personality" or "persona" design. In particular, there is a growing tendency for interfaces to feature overly chatty characters, which erodes usability instead of enhancing it. The reason for this unfortunate trend is related to a basic misunderstanding of what "personality" refers to, as Byron Reeves, professor and chair of the communication department at Stanford University, acknowledges in a recent e-mail exchange. He writes, "Personality (at least on the street) usually means 'a lot' of personality. That often results in over-the-top interfaces that can overdo what real people (even those with great personality) would do in similar face-to-face encounters." A brief synopsis of how persona design came to be such a prominent part of speech application development can shed further light on this problem.

The requirement for speech application development to include "personality" or "persona" design first entered the speech industry after the 1996 publication of the book The Media Equation: How People Treat Computers, Television, and New Media Like Real People by Byron Reeves and Clifford Nass (CSLI Publications). In brief, the book put forth the strong hypothesis that "mediated life equals real life." In other words, people can't help but behave the same way in computer-human interactions as they do in human-human interactions. The experiments described took well-documented human-to-human behavior and then replaced one of the humans with a computer (or other medium). The results? Humans displayed the same behavior with the computer as they did with each other.

This research helped to improve speech interfaces in two particular ways: First, interacting with a well-defined persona facilitates our innate tendency to attribute human qualities to non-human agents. That is, if humans can't help but perceive the voice interface as human, it behooves designers to make sure it is both consistent and natural. Second, personas can be designed to artfully and accurately complement a company's brand identity. For example, the decisions companies make in choosing the right actor and image for a TV or radio commercial should also be reflected in designing the speech interface.

Despite the benefits described above, we cannot conclude from Reeves and Nass' research that interacting with speech applications is the same as interacting with real people, as some mistakenly have done.

Designing applications for frequent use

An oversight in conversational interface design has been the tendency to focus on design for first-time callers. Whether featured on company Web sites, in design texts or on speech conference show floors, application demonstrations are almost always geared to "wow" potential customers with witty phrases and responses that are impressive the first time you hear them but become annoying as early as the second experience. For example, the following supply chain management application has logic to respond sympathetically to callers who report a delay by triggering the prompt "Oh, sorry to hear that" and includes similarly intelligent-sounding prompts at the end of the dialog:

System: This is the delivery tracking center. Tell me your four-digit delivery number or enter it on the keypad.

Caller: 4-8-3-3

System: 4-8-3-3. Is that right?

Caller: Yes.

System: OK, hold on…(logs into system)…What's your status? You can say arrived, departed or delayed.

Caller: I'll be delayed two days. There's a big storm.

System: Oh, sorry to hear that! Let me confirm. I have delivery number 4-8-3-3 delayed for 48 hours due to weather. Is that right?

Caller: Yes it is.

System: Great. Hold on…OK. It's in the system. Hopefully you'll be on your way soon. I'll talk to you when you arrive. Drive safely.

This application has never failed to impress both internal and external partners and customers who have interacted with it in our offices or at trade shows. After all, the prompts are well designed and professionally recorded, directed and processed. The prompt concatenation accurately recreates the rhythm and intonation of spoken discourse. The grammars are well tuned to accept common variations of different phrases and also allow the caller to fill three fields in one phrase (e.g. delayed, two days, big storm). The application is "live", i.e. it runs against the real back-end and posts a "delay" that can be seen immediately on the corresponding Web application after the delay is recorded in the system.

But while the design of this application might be perfect for occasional use (perhaps one call every month or two), it's not appropriate for its true target users: drivers who need to update the status of their delivery at least once a day, if not more. In fact, after responding with overwhelmingly positive comments after the first call, 8 out of 12 participants in a usability study done on this particular application complained about having to hear prompts such as "Oh, sorry to hear that" and "Drive safely" after they had called it approximately five times. For them it had lost its effectiveness and now just seemed to get in the way. This is not surprising. Humans are extremely attuned to the dynamic nature of language. In fact, in an entire lifetime, you'll never hear another human utter the same thing twice in the exact same way, especially within a span of a few minutes. If you did, it would seem eerily startling and not at all "conversational." And yet this is what we experience over and over again in what are touted to be today's "best" speech interfaces. (You might want to try this yourself. Give your favorite speech recognition application a call right now, and then call it back every hour for the rest of the day.)

Finding a balance

We're left with an interesting problem: How do we maintain the conversational flavor of speech applications while designing for more frequent use?

First, remember that "conversational" doesn't have to emulate the typical first-time call center agent interaction. Rather, an application's conversational style should be based on the caller's expected relationship with the system and on the task to be accomplished. Second, there are techniques one can use to alleviate some of the repetitiveness found in applications that experience semi-frequent use by the same caller. For example, when history trackers are built into the code (or application design tool), they allow designers to specify prompts that should be randomized, changed or deleted to fit the needs of novice vs. intermediate vs. expert users. As long as the same caller doesn't need to use the application on an hourly basis, these design "extras" can significantly enhance the user experience.
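To make the history-tracker idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the implementation used in the delivery tracking application: the class name HistoryTracker, the call-count thresholds, and every prompt variant other than the two taken from the sample dialog are hypothetical. It counts completed calls per caller, maps that count to a novice, intermediate or expert tier, picks a prompt at random without repeating the previous one, and drops the prompt entirely where the expert tier has none.

```python
import random

# Prompt variants per experience tier. The novice wording comes from the
# sample dialog; the intermediate and expert variants are hypothetical.
PROMPTS = {
    "status": {
        "novice": ["What's your status? You can say arrived, departed or delayed."],
        "intermediate": ["What's your status?", "And your status?"],
        "expert": ["Status?"],
    },
    "sympathy": {
        "novice": ["Oh, sorry to hear that!"],
        "intermediate": ["Sorry to hear that.", "That's too bad."],
        "expert": [],  # prompt deleted for frequent callers
    },
}


class HistoryTracker:
    """Tracks per-caller history and selects prompts accordingly."""

    def __init__(self, intermediate_after=2, expert_after=5, rng=None):
        # Thresholds are assumptions; tune them from usability data.
        self.intermediate_after = intermediate_after
        self.expert_after = expert_after
        self.rng = rng or random.Random()
        self.calls = {}        # caller_id -> number of completed calls
        self.last_prompt = {}  # (caller_id, slot) -> last prompt played

    def record_call(self, caller_id):
        """Call once at the end of each completed call."""
        self.calls[caller_id] = self.calls.get(caller_id, 0) + 1

    def tier(self, caller_id):
        """Map the caller's completed-call count to an experience tier."""
        n = self.calls.get(caller_id, 0)
        if n < self.intermediate_after:
            return "novice"
        if n < self.expert_after:
            return "intermediate"
        return "expert"

    def choose(self, caller_id, slot):
        """Return a prompt for this slot, or None if it should be skipped."""
        variants = PROMPTS[slot][self.tier(caller_id)]
        if not variants:
            return None
        last = self.last_prompt.get((caller_id, slot))
        # Avoid replaying the previous prompt when an alternative exists.
        candidates = [v for v in variants if v != last] or variants
        choice = self.rng.choice(candidates)
        self.last_prompt[(caller_id, slot)] = choice
        return choice


if __name__ == "__main__":
    tracker = HistoryTracker()
    for call_number in range(1, 8):
        print(call_number, tracker.tier("driver-4833"),
              tracker.choose("driver-4833", "sympathy"))
        tracker.record_call("driver-4833")
```

The demo loop simulates one driver reporting a delay on seven successive calls; the sympathetic prompt varies after the first couple of calls and disappears once the caller reaches the expert tier. Where the thresholds belong is a design decision that should come from usability testing with the actual target users.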

Finally, industry designers will have to admit that, for certain high frequency-of-use applications, some of the basic tenets of conversational interface design described above may no longer even apply. Or, perhaps other basic user requirements take on a higher priority. In the cases described below, the human expectation of what a conversational interface will deliver is simply more than what today's technology can provide. For example, imagine using a speech application for four hours at a time to take inventory in a warehouse where the system prompts the employee with a bin number and the employee responds with a natural number. Or imagine using another speech application to unload a truck, where the employee says the box number and the system responds with the number corresponding to the shelf location where the box should be stored. How would our ideas of persona design and conversational prompting change in these contexts?

Conversational interface design is crucial to usability, but only if the basic tenets are well understood: Conversation has many forms, its personality should never be overdone, and it doesn't sound like a broken record. However, while we hate to admit it, new applications of speech technology will force us to let go of some of the favorite design pastimes in order to find new ways to make speech truly valuable to our customers.

Dr. Bill Byrne is the manager, Voice Center, SAP Labs and consulting assistant professor, Symbolic Systems, Stanford University. He can be reached at william.byrne@sap.com.

"Conversational" Isn't Always What You Think It Is
