When Not to Standardize
This column has always talked about voice standards, both established and in progress, and how they can save money, improve applications, and open new doors for innovation. But not every aspect of an application should be standardized. Let’s talk about what shouldn’t be standardized and why. To keep things simple, we’ll focus on standards whose goal is interoperability between systems, as opposed to standards with other goals, such as safety, privacy, and accessibility.
There are three general cases where interoperability standardization shouldn’t be done:
- When standardization has no real effect on interoperability.
- When technologies are changing so fast that any standards will soon become obsolete.
- When standardization requires changing human behavior, which is very hard to do.
If we want standards to improve the interoperability of voice systems, then it’s clear there is very little benefit in standardizing low-level details of system components like speech recognition algorithms. These components will likely never need to interoperate with other voice systems. And because these components are undergoing extensive technical development, standardization can actually limit innovation.
On the other hand, the results of speech recognition could very well be used by other systems. It is in fact quite valuable to develop standards for speech recognition results, like the W3C’s Extensible MultiModal Annotation (EMMA), allowing results from one speech recognizer to be used by a different vendor’s natural language processing system. But internal processing within speech recognition systems should be left up to the system developers.
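To make that concrete, here is a minimal sketch of what handing off a recognition result in EMMA form might look like, using Python’s standard library to build the XML. The utterance, confidence score, and identifiers are illustrative stand-ins, not values taken from the specification.

```python
# Minimal sketch: serialize one recognizer's hypothesis as an EMMA 1.0
# document so a different vendor's NLU component could consume it.
# The utterance, confidence, and ids below are made up for illustration.
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"  # EMMA 1.0 namespace
ET.register_namespace("emma", EMMA_NS)

def to_emma(tokens: str, confidence: float) -> str:
    """Wrap a single recognition hypothesis in a bare-bones EMMA document."""
    root = ET.Element(f"{{{EMMA_NS}}}emma", {"version": "1.0"})
    interp = ET.SubElement(
        root,
        f"{{{EMMA_NS}}}interpretation",
        {
            "id": "int1",
            f"{{{EMMA_NS}}}medium": "acoustic",
            f"{{{EMMA_NS}}}mode": "voice",
            f"{{{EMMA_NS}}}confidence": str(confidence),
            f"{{{EMMA_NS}}}tokens": tokens,
        },
    )
    interp.text = tokens
    return ET.tostring(root, encoding="unicode")

print(to_emma("book a flight to boston", 0.87))
```

The point isn’t this particular snippet; it’s that the output format is the piece worth standardizing, while everything inside the recognizer stays vendor-specific.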
Likewise, we shouldn’t try to standardize technical areas where very active research will make any proposed standards quickly obsolete. I would argue this is true of the voice user interface, especially dialogue design. Dialogue processing is an active and rapidly evolving field, and the technologies on which it relies, speech recognition and natural language understanding (NLU), are also advancing rapidly. A good example is the evolution away from prompts that present limited voice menus, like a doctor’s office that asks, “Are you calling to schedule an appointment, check your lab results, or get a flu shot?” It’s much more common now to use an open-ended prompt: “Can I ask your reason for calling today?”
In the early days of voice user interfaces, user choices had to be presented in limited menus because speech recognition and NLU weren’t powerful enough to cope with the variety of responses they’d get from open-ended prompts. Because of these limitations, indispensable voice user interface guidelines evolved to address how many choices should be in voice menus, based on how many options users could remember. The recommendations were usually around three to five choices. Since then, speech recognition and NLU have improved so much that open-ended prompts are routine, and the responses they elicit can be processed quite well. If there had been a standard requiring voice menus to be limited to, say, five options, that standard would have become essentially obsolete by now.
Similarly, any dialogue standards designed to coax users into utterances that speech recognition and NLU can handle would become obsolete as the technologies become better able to manage increasingly complex utterances and intents. New abilities such as handling multi-intent utterances (“I want to book a flight and a hotel”), off-topic utterances, multilingual utterances, and utterances that require determining user intent are all active topics of current research, and these capabilities will soon find their way into commercial systems. It would be pointless to try to standardize any dialogue strategies that attempt to guide users into avoiding these kinds of utterances.
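As a purely hypothetical illustration of why such strategies would age badly, here is one way a modern NLU component might already represent the multi-intent utterance above; the intent names, slots, and scores are invented for this sketch and don’t come from any standard.

```python
# Illustrative only: a possible NLU result for a multi-intent utterance.
# Intent names, slot names, and confidence scores are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Intent:
    name: str
    confidence: float
    slots: dict = field(default_factory=dict)

utterance = "I want to book a flight and a hotel"
interpretation = [
    Intent("book_flight", 0.91, {"destination": None}),  # slot filled later in the dialogue
    Intent("book_hotel", 0.88, {"check_in": None}),
]

# A dialogue manager could queue both intents instead of forcing the user
# to state them one at a time.
for intent in interpretation:
    print(intent.name, intent.confidence)
```

A dialogue standard written when recognizers could return only a single intent per utterance would have nothing useful to say about a result like this.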
The last aspect of voice applications that shouldn’t be standardized is anything that requires users to change their behavior, that is, anything that requires them to learn specific commands. After all, the promise of voice interaction is that users are free to speak their requests in their own words. A good example of a standard that requires the user to remember a list of spoken commands is ETSI ES 202 076 V2.1.1 (2009). It defines dozens of commands in 30 languages, ranging from simple control words (such as “stop,” but not “quit” or “halt”) to complex media control. It’s doubtful that most users would take the time to learn all of these commands, and it’s not clear how widely this standard has been adopted.
Low-level system components, rapidly changing technologies, and human behavior shouldn’t be corralled by attempts at standardization. Reserve standardization efforts for where they belong—improving the interoperability of systems.
Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at dahl@conversationaltechnologies.com.