Standards in the Voice User Interface
Voice user interface (VUI) design is one of the liveliest topics at every speech industry conference. Questions like “How many items can I put in a voice menu?” “Should I use a male or female voice?” and “What kind of persona should the system have?” are frequently the subjects of active discussion. These are extremely important questions because correcting bad user interface decisions after a system is implemented can be very expensive and time-consuming. Prompts might have to be rerecorded, grammars rewritten, or a new voice talent found.
It’s also important to get the VUI right because problems can easily make the difference between a system’s success and its failure. It’s time to move the industry toward developing a reliable body of answers to these questions.
Voice and multimodal interaction technologies are relatively new, so there are no generally accepted standards that address the user interface. In addition, because human beings are so complex, it’s unlikely we’ll ever have truly hard-and-fast standards that apply under all circumstances. We’re more likely to end up with guidelines and best practices that might need to be adjusted for specific users and applications.
But a reliable set of guidelines and best practices would nevertheless be extremely useful. We have two complementary sources of information about what works and what doesn’t work in the VUI from which we can draw to establish guidelines. On the one hand, best practices are emerging as applications are implemented and the industry gains more hands-on experience. On the other, complementing this in-the-field experience, a large body of relevant psychological research has accumulated on people and their perceptual, motor, cognitive, and social characteristics. This research can be exploited to lay the foundations of design principles.
Best practices in VUI design will be different from the standards that define how computers interoperate. Obviously, people are different from computers. People are very good at learning, and they are much more flexible than computers in what they can understand. On the other hand, their basic abilities, like memory and attention, are difficult or impossible to change, and they have to be accommodated in the application. On the computer side of an application, you can define requirements such as processor speed, bandwidth, and memory, but you can’t do that with people. It’s not possible to ask for “a garden-variety human being” as a system component because not everyone is the same. Disabilities, normal variations, cultural differences, and age-related differences all mean that the VUI has to take into account not just specific capabilities, but a range of capabilities.
Short Is Sweet
One of the best-known findings in voice interaction is that the limitations of working memory require voice-only menus to be fairly short. Both hands-on experience and cognitive psychology tell us that menus should contain only three or four items so users can remember the choices. However, there are individual differences in working memory capacity, so this can’t be a hard-and-fast rule. We also know that working memory capacity tends to decline with age, meaning applications designed for older users might work better with even shorter menus.
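To make the guideline concrete, here is a minimal sketch of a short voice menu in VoiceXML, the W3C’s markup language for voice dialogs. The menu content and the target dialog names (balance, transactions, service) are hypothetical, chosen only to illustrate a three-item menu that stays within the bounds of working memory:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- A three-item voice-only menu: short enough for users
       to hold all the choices in working memory. -->
  <menu id="mainMenu">
    <prompt>
      Say one of the following: account balance,
      recent transactions, or customer service.
    </prompt>
    <choice next="#balance">account balance</choice>
    <choice next="#transactions">recent transactions</choice>
    <choice next="#service">customer service</choice>
    <!-- If the user says nothing, repeat the choices
         rather than adding new ones. -->
    <noinput>
      <prompt>Sorry, I didn't hear you.</prompt>
      <reprompt/>
    </noinput>
  </menu>
</vxml>
```

Keeping the prompt to a single, parallel list of options, and handling silence by repeating the same short list, reflects the working-memory guideline rather than any particular product’s design.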
And although the guideline about menu size is very reliable for voice-only interfaces, it does not apply if we’re talking about multimodal applications. In a multimodal application—where choices can be displayed on a screen—the screen can supplement the user’s memory.
All of these findings can serve as a starting point in developing best practices in menu size for different types of users and applications.
A number of resources are available on these emerging principles. Several books have been published on VUI design. Industry conferences provide a forum for discussing the newest ideas and talking to practitioners. A new professional organization, the Association for VUI Designers, is also a valuable resource. And from the World Wide Web Consortium (W3C), graphical user interface-oriented standards, such as the Web Content Accessibility Guidelines and Mobile Web Best Practices, can also provide insight into the graphical side of multimodal applications. A W3C article, “Common Sense Suggestions for Developing Multimodal User Interfaces,” offers useful suggestions for multimodal design as well.
It’s clear that a lot of knowledge is out there. As a speech technology community, it’s time to put some effort into codifying our current experience-based knowledge, and into bringing in relevant research findings, to provide a body of truly useful best practices in VUI design.
Deborah Dahl, Ph.D., is the principal at speech and language technology consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at firstname.lastname@example.org.