The Great Debate

Some debates are so historic and complex they will never be resolved—storied arguments destined to be passed down from generation to generation, epic disputes that will outlive us all: liberalism versus conservatism, capitalism versus socialism, pro-life versus pro-choice, Coke versus Pepsi, the Yankees versus the Red Sox, tastes great versus less filling.

In the world of speech technology, an ongoing dispute about voice user interface (VUI) design standards and universal commands rages on with just as much furor. And while the debate over the value and necessity of VUI standards and universal commands is unlikely to be resolved soon, the landscape of the speech industry is always changing, and with it the factors and dynamics that surround the creation and assessment of standards.

With a chorus of differing voices—some louder and more strident than others—the speech industry finds itself with few answers and a host of questions about the issue of standards: What should these standards look like? How should they be determined? Who should determine them? How would they be enforced? How would they evolve over time?

On a very basic level, the lack of a regulatory organization prevents the creation of any true VUI design standard, according to Jim Larson, an independent consultant, VoiceXML trainer, and co-chair of the World Wide Web Consortium’s Voice Browser Working Group.

“There is not a standards group to specify what these standards might be and bless these as official standards that everybody ought to follow,” says Larson, who notes that vendors also fail to agree on the necessity of standards. “The VoiceXML Forum has never suggested that there ought to be standard voice user interface guidelines. AVIOS has internally talked about it, but has never done anything. The lack of a standards body is the primary [reason] why we don’t have any official standards.”

Larson points to the Association for Voice Interaction Design (AVIxD), an organization formed less than a year ago out of an informal voice user interface designers’ group at Yahoo!, and Bruce Balentine’s books as good sources of informal guidelines and best practices, but stresses that none of these are hard and fast recommendations or industry standards.

Larson readily acknowledges both the positive and negative aspects of standards and universal commands. He says universal commands would simplify the work of developers by requiring them to implement only a basic set of commands while also simplifying speech for users who would simply learn a set of commands universal to every application. However, he notes that every application would likely need additional words and commands that would not be a part of the universal set.

Getting Weaker

“So, in some sense, the standard is weakened by different applications because different applications will require different words,” Larson says.

He further notes that additions to the universal set would make porting applications to other platforms difficult and cause confusion for users. “Every vendor will extend the basic set for what they feel are crucial extra commands that will make their products different from other competing products. And of course, this diminishes the concept and the usefulness of the standard.”

Deborah Dahl, principal at speech and language technology consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group, is also less than optimistic about the viability of design standards. “I find it hard to believe that you could ever have a hard standard for the user interface because applications are so different, and people are different, and demographics are different,” she says. “You can have best practices or guidelines or rules of thumb, things like that, but when you’re dealing with human beings, it’s hard to have a hardcore standard like VoiceXML…. People are just not that cut-and-dry.”

Dahl says rigid standards would prevent the adaptation of applications to the specific needs of different users, pointing to the different requirements of various demographic groups, such as the elderly. However, Dahl adds that without some guidelines or best practices, designers are relearning the same lessons again and again via trial and error, which can be extremely expensive.

“Having some general best practices is a really good idea to prevent that kind of reinventing the wheel,” she adds.

Who Needs It?

Blade Kotelly, a visiting lecturer at MIT and author of The Art and Business of Speech Recognition: Creating the Noble Voice, takes the argument against design standards a step further, questioning their basic necessity. “There are, of course, already guidelines involved in creating these systems…in the same way we have social guidelines that tell us what to do when walking down the street,” he says, noting that an unwritten set of guidelines is born out of firsthand experiences with speech systems.

Kotelly says anyone who has designed speech systems, listened to what callers are saying, performed usability testing, or analyzed calls already knows the basics. “We know this intuitively,” he says. “So there are already de facto guidelines in people’s heads. The real question is should we move beyond the set of guidelines we have.”

Kotelly questions the viability of creating standards that meet the needs of every population, demographic, service, question, and vertical. He also asserts that standards and universal commands will not make people better designers.

“The standards can only go so far,” he says, noting that a standard guaranteeing a caller’s right to a live agent is somewhat superfluous. “That’s just insane. Of course you should have the ability to get to the agent if the company can afford that and if it’s their business desire. The speech community can’t enforce that. It doesn’t make any sense.”

Kotelly is equally unconvinced about the value of universal commands. He points out that having a single, universal command to connect callers with an agent isn’t necessary because of the sophistication of today’s speech technology.

“Because it’s not 1995 anymore, we don’t need to agree on one command,” he says. “Because the technology is so robust, you can have as many commands as you’d like to have to connect to an agent, and people know what to do intuitively.”

Kotelly asserts that standards only tackle low-level problems and fail to address complex issues. “Most standards are only going to cover the most basic stuff, like how to get to an agent or transferring a call,” he says. “At that point, if you don’t know how to do it already and you’re selling your services, you’re pretty weak.”

Within the industry, opposition to standards appears to be the norm. Both Mike Ahnemann, principal voice user interface designer at Angel.com, and Dave Pelland, director of design collaborative and relationship technology management at Convergys, expressed resistance to hard standards.

Pelland points out that if standards existed for cell phone design, we might not have the iPhone today. “It’s pretty clear that the iPhone is a good leap forward in terms of user interfaces in a lot of areas,” he says.

The same applies to speech systems. “You see what Nuance and these other technology vendors are coming out with on a yearly basis. It’s really cool stuff that we can do really cool, new things with,” he says. “I’m not personally prepared to lock myself down with a set of standards and say this is how we do it from here on out.”

Ahnemann agrees. “On a more general level, I’m very wary of trying to standardize things,” he says.

A Different View

Ahnemann suggests looking at VUI design standards through the lens of Web standards, which he says ensure Web pages work in any browser. “If you take that parallel in the speech world, we’re already meeting that standard. Applications work on any phone,” he says.

Ahnemann is currently working on a bill of rights for end users. “It’s a subtly different term, but the idea is rather than having hard and fast standards, we’re starting to pull together a list of guidelines, or a bill of rights, for what you should include in an IVR,” he explains.

Susan Hura, principal and co-founder of SpeechUsability and a founding member of AVIxD, takes a measured approach to standards. “There are places where I think some amount of standardization would really benefit not just us as designers. It would benefit the entire community if there were a little more predictability, for example, about what a VUI design specification is going to look like,” she says. “It would be so terrific for all of us in the industry for there to be some standard format for documenting designs.”

In sifting through the differing opinions about standards, Dahl notes that consumers and customers generally support the idea of guidelines—something that could lead to an improved public perception of speech applications. “People would like universal commands because they make the system predictable so that you know what to say. You know how this thing works; you don’t have to fumble around and guess,” she says.

At the same time, though, she argues that people wouldn’t like having to learn the universal commands.

Despite the controversy surrounding design standards, Larson says the speech industry would be very different with VUI standards and universal commands. “Users would be able to pick up and use new applications easily because they know how to work them,” he says. “Just as you can pick up a new PC application and start to use it because you understand the concept of Windows and pulldown menus and dialogue boxes. So what you learn using one system can be applied to another system.”

Additionally, Larson says users would be able to switch between applications more easily and no longer have to remember the idiosyncrasies of each individual application. “It would be very positive, but the companies that build user interfaces have not felt the need to create standards,” he says. “They always feel that I can build a user interface that’s better than anybody else’s user interface so people migrate toward mine. And that causes a lot of confusion in the marketplace.”

In looking to the future, Larson sees speech system designers evolving toward similar practices, but does not anticipate the creation of a set of standards any time soon. “I don’t see a body to do that,” he says. “I see difficulty in finding a collection of user interface designers who can ever agree on what the best practices should be—at least in a meaningful way. And as technology advances, we’ll discover new and innovative ways to do some of the things that would be standardized, which would become out of date.”

Dahl echoes Larson’s sentiments, noting the lack of a single standards organization. “Some leader has to start a project to pull together all the knowledge,” she says. “I don’t think a standards organization like W3C really has the right knowledge to put together guidelines.”

Dahl doesn’t foresee the creation of a formal standards organization. “I don’t see that happening, mainly because you can’t really have hard and fast standards when you’re dealing with people,” she says. “Standards should be reserved for connections between technology, not between people and technology.”

Larson also admits the idea of standards and universal commands is more popular among end users, citing the positive reaction to the guidelines originally suggested by Paul English and the GetHuman movement. “When [English] gave these guidelines, all the consumers were ecstatic because they all have trouble using IVR systems,” Larson says.

He and many others also note, however, that GetHuman was formed primarily as a consumer advocacy group and that its suggestions have been largely ignored by the business community. “Some of those guidelines imply a business strategy that a lot of businesses don’t agree with,” Larson states emphatically.

Forget GetHuman?

Larson says many businesses try to prevent users from reaching live agents in an effort to save money. “In general, those guidelines were not accepted, because some of them violated current business practices or current business strategies,” he suggests.

Kotelly is less enthusiastic about the GetHuman perspective, calling it “both useful and useless.” He points to business policies hindering the acceptance of the GetHuman principles, and says many companies purposely make getting to an agent difficult as a way to save money, a fact that diminishes GetHuman’s usefulness.

“All these companies did this knowingly,” Kotelly says, adding that GetHuman would be more useful if it provided some real answers to issues and problems facing designers. “Does [GetHuman] have a set of standards that’s comprehensive, that can be universally agreed upon, that goes beyond his 10 platitudes?” he asks. “Of course you should be able to get in touch with an agent. Obviously. These things are obvious and self-evident. And if we’ve gone so far away from that, then that points to a different kind of problem to solve.”

Still, Larson sees the GetHuman guidelines as a good starting point in the creation of some sort of standards. “I would use the term ‘guidelines’ instead of standards. Standards get people riled up, with their backs against the wall, and they start being negative, so I would suggest guidelines, which can occasionally be violated,” he says. He also says the speech community needs to work through the issue, decide what guidelines would be useful, determine how standards might be implemented, and do research to test and validate the usefulness of standards.

Kotelly takes a different tack. He suggests that it’s not designers, even the bad ones, who need the guidelines, but rather the companies and executives for whom speech systems are built. “There are a bunch of terrible designers out there, but even the bad ones know the general rules of what they should be doing,” he says. “But a lot of them have been beaten down by the fact that they haven’t been allowed to do the right things, they’re not listened to, and they don’t get the support from executives above them.”

He argues that standards are not going to make a company change the way it is run. “It’s about getting executive support on issues that appear to be small but that will affect user experience adversely,” he says.

Pelland agrees. He says it is hard to argue against the basic goals of GetHuman, but describes some of its suggestions as “ideals in a world where money doesn’t matter.”

“Some of these are more pie-in-the-sky-type ideals where, in a world where businesses [didn’t] exist to try to make money, you could do some of these,” he states. “But it’s not the case…. In a practical world, guess what, this isn’t how it works.”

Understanding People

Hura doubts the controversy surrounding standards will be resolved any time soon. She says she would welcome “strongly worded recommendations for best practices in doing design in terms of actual process,” but notes that too many designers still spend too much time trying to justify everything they do each time they get feedback from users.

“Maybe if there was some kind of standard methodology out there we would spend less time spinning our wheels,” she says. “And we’d be able to spend more time doing the design if there was a standard to point to and say, ‘See, this is the right way to do it.’ Then that might be helpful.”

But for Dahl, the issue comes back to callers—the people on the other end of the phone. In creating standards or guidelines, Dahl says general principles should be derived from designers’ knowledge of human beings. “You should understand how long it takes people to understand something and start speaking,” she says. “It’s based on the fact that you have a human being with all their physical and psychological properties that evolved over a billion years ultimately.”

To this end, Dahl suggests basic concepts, such as ensuring that people can understand and hear prompts, providing adequate response time, minimizing cognitive load to prevent confusion and mistakes, and paying attention to users’ memory and attention spans. “There are too many possibilities with people,” she adds. “We should be thinking of these things as best practices.”

GetHuman, Version 1.0

The GetHuman standards have been designed with simplicity and directness to eliminate ambiguity and enable testing and certification. There might be more than one way to accomplish each, but the result must be as follows:

The caller must always be able to dial 0 or say “operator” to queue for a human.
An accurate estimated waiting time, based on call traffic statistics at the time of the call, should always be given when the caller arrives in the queue. A revised update should be provided periodically during hold time.
Callers should never be asked to repeat any information (name, full account number, description of issue, etc.) provided to a human or an automated system during a call.
When a human is not available, callers should be offered the option to be called back. If 24-hour service is not available, the caller should be able to leave a message, including a request for a call back the following business day. Gold Standard: Call back at a time that she has specified.
Speech applications should provide touch-tone fall-back.
Callers should not be forced to listen to long/verbose prompts.
The caller should be able to interrupt prompts (via dial-through for touch-tone applications and/or via barge-in for speech applications) whenever doing so will enable him to complete his task more efficiently.
Do not disconnect for user errors, including when there are no perceived key presses (the caller might be on a rotary phone). Instead, queue for a human operator and/or offer the choice for a call back.
The default language should be based on consumer demographics for each organization. The primary language should be assumed, with the option for the caller to change that language. In the United States, for example, English should generally be assumed with a specified key available for Spanish. Gold Standard: Remember the caller's language preference for future calls. Gold Standard: Organizations should ideally support separate toll-free numbers for each individual language.
All operators/representatives of the organization should be able to communicate clearly with the caller. Accents should not hinder communication, and representatives should have excellent diction and enunciation.
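
Because the list above is written to be testable, a few of its rules can be expressed as a small decision function. The following is only a rough sketch of how a call flow might honor the first rule (0 or "operator" always queues for a human) and the rule against disconnecting on user errors. The names CallState, next_action, and MAX_ERRORS, the action strings, and the error threshold are all hypothetical and do not come from GetHuman or any IVR platform.

```python
from dataclasses import dataclass
from typing import Optional, Set

# The source names only dialing 0 and saying "operator" as escape inputs.
AGENT_INPUTS = {"0", "operator"}

# Assumed threshold for illustration; the GetHuman text gives no number.
MAX_ERRORS = 2


@dataclass
class CallState:
    """Tracks consecutive unrecognized or missing inputs on one call."""
    errors: int = 0


def next_action(state: CallState, caller_input: Optional[str],
                valid_options: Set[str]) -> str:
    """Return 'queue_for_human', 'route:<option>', or 'reprompt'.

    There is deliberately no 'disconnect' outcome: repeated errors,
    including silence, end in a human queue instead.
    """
    normalized = (caller_input or "").strip().lower()

    # Rule: the caller can always reach the human queue.
    if normalized in AGENT_INPUTS:
        return "queue_for_human"

    # Recognized menu choice: reset the error count and route the call.
    if normalized in valid_options:
        state.errors = 0
        return "route:" + normalized

    # Silence or unrecognized input counts as an error.
    state.errors += 1
    if state.errors > MAX_ERRORS:
        return "queue_for_human"
    return "reprompt"
```

A flow built this way treats silence (for example, a rotary-phone caller who cannot send key presses) the same as any other error, so the caller eventually lands in the queue for a human rather than being dropped.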
