Accessibility in Voice and Multimodal Applications
Tim Berners-Lee, the inventor of the World Wide Web, said, “The power of the Web is in its universality. Access by everyone, regardless of disability, is an essential aspect.” This applies to the voice Web as well as the more typical graphical Web. Ensuring information on the Web is available to everyone is the goal of the World Wide Web Consortium’s Web Accessibility Initiative. One of the key W3C accessibility standards is the “Web Content Accessibility Guidelines” (WCAG 2.0).
WCAG 2.0 (www.w3.org/TR/WCAG20) is organized around four high-level principles: Content should be 1) perceivable, 2) operable, 3) understandable, and 4) robust. Of course, these principles apply to every application, no matter who is using it, but they are applied in different ways depending on users' needs. The guidelines are primarily oriented to the traditional graphical Web, but they also apply to voice and multimodal interfaces.
Though it seems excruciatingly obvious that users should be able to perceive and understand application content, we’ve all encountered (too often) applications that are difficult to perceive and understand. This is because it’s not always obvious how to apply these accessibility principles in practice. The WCAG specification discusses each principle and, more important, offers concrete guidelines that explain how to address them.
For voice-only applications, consider Guideline 1.1, which states, in part, “Provide text alternatives for any non-text content.” At first it seems this would be difficult to address in applications designed for traditional telephones that don’t have a display. But the vast majority of current telephones, especially mobile phones, do have a display. This guideline could be met if the application provides text alternatives to spoken prompts when a display is available.
Another example is Guideline 2.2: “Provide users enough time to read and use content.” Though this guideline is worded toward visual content, it clearly also applies to audio content. For voice applications, this means systems should accommodate users who need more time to hear, understand, and respond to speech.
Although such capabilities are not yet common in current voice applications, they are listed in the requirements document for VoiceXML 3.0 (www.w3.org/TR/vxml30reqs), which should make them easier to support in future systems. Another consequence of this guideline for voice applications is that systems should be able to provide longer timeouts when necessary. If the application detects that the speech recognizer is timing out frequently, it should be able to dynamically lengthen the timeout period to give the user more time to speak.
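As a rough sketch of what is already possible, VoiceXML 2.x supports tapered prompts, where the timeout attribute can grow on successive no-input events. This is static tapering rather than true recognizer-driven adjustment, and the grammar file name and prompt wording below are hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="citySearch">
    <field name="city">
      <!-- Hypothetical grammar file, for illustration only -->
      <grammar src="cities.grxml" type="application/srgs+xml"/>
      <!-- First attempt: a standard no-input timeout -->
      <prompt count="1" timeout="5s">Which city would you like?</prompt>
      <!-- Later attempts: a longer timeout for users who
           need more time to respond -->
      <prompt count="2" timeout="12s">
        Take your time. Which city would you like?
      </prompt>
      <noinput><reprompt/></noinput>
    </field>
  </form>
</vxml>
```

Here the field's prompt counter increments after each no-input event, so the reprompt automatically carries the longer timeout; adjusting timeouts based on a user's behavior across an entire call would require server-side logic or the capabilities anticipated for VoiceXML 3.0.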
This could be very helpful, for example, for people who stutter (1 percent of the U.S. population). Anecdotal evidence from people who stutter indicates that speech timeouts can be a significant source of frustration when they interact with a speech-enabled interactive voice response system; that frustration, in turn, can make their speech even more difficult for the recognizer to process. Other people who might not usually be thought of as having a disability, such as non-native speakers, older users, and even people temporarily distracted by another task, might also need more time to process and respond to prompts. The benefits of this capability therefore extend to a much wider population.
When multimodality enters the picture, the potential for making applications more accessible is tremendous. Multimodal interfaces also offer plenty of ways to make applications less accessible, however. Better accessibility is achievable if a person can use the application through different modalities, such as voice or a graphical interface. On the other hand, poorly designed multimodal applications can decrease accessibility by requiring the use of multiple modalities. Imagine the accessibility consequences of a mobile voice search application that requires a user to log in with the keyboard, enter searches by voice, and then read the search results on the screen.
A specific guideline that applies to multimodal applications is Guideline 1.2: “Provide alternatives for time-based media.” For video, this could include captioning for people who can’t hear, or audio or text descriptions for people who can’t see. For audio, it could include a transcript or text description for people who can’t hear.
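On the graphical side of a multimodal application, standard HTML markup can supply several of these alternatives at once; the file names below are hypothetical:

```html
<video controls>
  <source src="product-tour.mp4" type="video/mp4">
  <!-- Captions for users who can't hear the audio -->
  <track kind="captions" src="product-tour.en.vtt"
         srclang="en" label="English captions">
  <!-- Text descriptions of the visuals for users who can't see them -->
  <track kind="descriptions" src="product-tour.desc.vtt"
         srclang="en" label="Descriptions">
</video>
<!-- A linked transcript serves users who cannot play the audio at all -->
<a href="product-tour-transcript.html">Read the transcript</a>
```

Caption and description tracks are delivered as timed-text files, while a transcript is an ordinary page that any user, or assistive technology, can read at leisure.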
These are only a few examples of the WCAG guidelines and how they apply to speech applications. Designers of speech and multimodal applications should take a look at WCAG and study the guidelines as they apply to their applications.
Deborah Dahl, Ph.D., is principal at speech and language technology consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at firstname.lastname@example.org.