The Role of Speech in Multimodal Applications
The visually oriented graphical user interface (GUI) is a powerful, familiar, and highly functional approach to interacting with computers. But as speech technology becomes increasingly available, it's natural to think about how speech could be used in GUI interfaces as well as voice-only interfaces. Combining speech with graphical input in this way creates a multimodal interface, where the user has different ways, or modes, of interacting with the system. Some people don't see the additional value that speech can bring to a GUI interface, others are concerned that there are no agreed-upon standards for authoring multimodal applications, and still others simply see multimodality as new and unproven. This article addresses those concerns by explaining what speech adds to the graphical interface, arguing that organizations shouldn't wait for the final standards to get involved in multimodality, and dispelling the idea that multimodality is new and unproven with some examples of multimodal applications that are being deployed now. The aim is to show that multimodality can add real value to applications in the near term and to get you thinking about how you can apply it to your own organization's needs. We'll start by looking at three promising kinds of applications for speech in a multimodal environment.
Speech as the Object of the Application
Sometimes an application would be impossible without speech recognition. The clearest examples are systems where speech itself is the point of the application. Products like reading tutors, speech therapy systems, systems for assisting deaf users in learning to speak, and systems for foreign language learning all fall into this category. Some examples:
- Products from Rosetta Stone and Auralog are applying speech recognition to foreign language learning.
- Speech therapy systems based on speech recognition are also being used successfully with patients who have aphasia, an impairment of speech and language abilities caused by brain injury.
Hands-Busy/Eyes-Busy Applications
Speech-based applications, as discussed above, apply to relatively limited markets. A much more general class of applications where speech is valuable is in hands-busy/eyes-busy situations. Driving, in particular, is a good example of a hands- and eyes-busy task. Features are continually being added to cars, and each one requires more controls, making it harder to operate the car and the controls at the same time. For this reason, car companies are now looking at multimodal interfaces that can be used while driving. For example, SpeechWorks has worked with Ford to add a multimodal interface to the Model U concept SUV. The multimodal interface has a number of functions - setting drivers' personal settings, enabling the driver to use a cell phone, controlling an MP3 player and adjusting the heating and air conditioning.
Hands-busy/eyes-busy considerations also apply to people with disabilities. There are many people whose disabilities prevent them from making full use of a GUI interface - they can't use their hands, they have difficulty using their hands, or they can't see the display well enough to use a GUI interface. The limitation may be permanent, as with a lasting visual disability, or temporary, as when someone just can't find their reading glasses. Similarly, people who have difficulty walking can benefit from using speech in applications like home control.
Mobile Applications
While speech-based applications and hands-busy/eyes-busy applications provide strong motivations for speech input, some of the most exciting applications for multimodality today are in the mobile environment. Adding speech to a GUI interface on a mobile device is particularly compelling for three reasons:
- The decreasing size of mobile devices is making keypads and screens smaller and more difficult to manipulate.
- Users are often standing or walking while using mobile devices. This makes it difficult to use the hands to access the interface.
- Because navigating through nested menus is difficult on small devices, the dramatically flattened menu structures that speech makes possible are especially useful. As an example, consider a simple function like setting the time on a cell phone. On my cell phone, it takes ten key presses just to access the time-setting function, four more key presses to set the time, and three more key presses to get back to the main menu. (This is the best case, assuming that I remember which menus to use.) It would be much faster and easier if I could just say "set the time" to access the time-setting function, and then say "ten a.m." to set the time.
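The arithmetic in the cell phone example can be made explicit with a short sketch. The step counts are the ones quoted above; the variable names are illustrative only, and the comparison assumes that both spoken commands are recognized correctly on the first try, which a real deployment could not guarantee.

```python
# Key-press counts quoted in the article for setting the time from the keypad.
KEYPAD_STEPS = {
    "reach the time-setting function": 10,
    "enter the time": 4,
    "return to the main menu": 3,
}

# The two spoken commands suggested in the article.
VOICE_STEPS = ["set the time", "ten a.m."]

# Total user actions on each path (best case for both).
keypad_total = sum(KEYPAD_STEPS.values())
voice_total = len(VOICE_STEPS)

print(f"keypad: {keypad_total} key presses, voice: {voice_total} utterances")
```

A key press and an utterance are not equal units of effort, so the comparison is only indicative, but it shows how much of the keypad cost comes from menu navigation rather than from entering the time itself.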
Kirusa, a company focusing on multimodal wireless platforms and applications, identified mobile applications as the most compelling area for multimodality when they were starting their business. Inderpal Singh Mumick, the CEO and one of the founders of Kirusa, points out, "Clearly, multimodality can be used in desktop applications just as well as with mobile applications. However, we believed (and still believe) that the pain is felt most acutely for mobile devices, and hence the business value is significantly higher for mobile applications."
What Applications Aren't Right for Speech?
Speech isn't right for every application. Those considering using speech for a specific application should consider the following:
- Privacy - it's impossible to make speech input completely private, and there's some information that people would just prefer not to say out loud. This includes PINs and security information as well as personal information such as weight or age. Unless you can be certain that the interaction will take place in a private setting, avoid applications that require speech input of private information.
- Fear of disturbing others - the ubiquity of annoying cell phone usage in public might lead one to believe that people aren't typically concerned about disturbing bystanders when they speak. However, there are still many situations where speech input is considered socially inappropriate, such as meetings and presentations.
- Cost - Speech input has to add enough value to justify its additional cost.
- Accuracy - Speech recognition isn't perfect, so error correction has to be designed into applications. Applications that require very large or complex grammars may not have high enough accuracy to realize the benefits of speech.
These considerations argue that when speech is added to a GUI application, there has to be a solid requirement for speech that overcomes any potential drawbacks.
About Multimodal Standards
Because there aren't yet any agreed-upon standards for interoperable multimodal applications, it can be tempting to wait for the standards before getting involved in multimodality. Adopters of a new technology always face a tradeoff between the risk that a new standard will make their chosen approach obsolete and the advantage of being early in the marketplace. But in the case of multimodal standards, there are two considerations that make it worth getting involved in multimodality now:
- There's a tremendous amount of interest, hard work and energy being devoted to developing good multimodal standards. So standards will be coming. As the chair of the Multimodal Interaction Working Group of the World Wide Web Consortium, which is developing these standards, I know this first-hand.
- As important as standards are, they represent only a small part of a full deployment. They don't address technical concerns such as the application development process, best practices in user interface design, platform architecture and scalability, or marketing questions such as understanding what applications are the most compelling and profitable. By waiting too long for standards, companies will lag in developing an understanding of these important practical questions.
Moving Forward: Real Experiences with Multimodality
How is the industry actually moving forward with multimodality? Many companies are beginning to supply the platform infrastructure for multimodality, based on two available multimodal specifications - Speech Application Language Tags (SALT), developed by the SALT Forum (http://www.saltforum.org), and XHTML+Voice (X+V), developed by IBM, Opera Software and Motorola (http://www.w3.org/TR/xhtml+voice). Just a few examples include:
- SALT platforms coming from HeyAnita, Intervoice, MayWeHelp.com, Microsoft, Philips and SandCherry/Kirusa, and an open source SALT browser from Carnegie Mellon University
- X+V browsers announced by IBM, Opera Software and V-Enable
- A platform that will support both SALT and X+V announced by VoiceGenie
In the area of applications, there have been desktop products using speech recognition for language learning for over ten years. In the more general marketplace, the industry is rapidly moving forward with multimodal applications. There are numerous demonstrations and prototypes, and some applications are deployed or nearing deployment. Here are some examples of multimodal applications that are deployed or are very near deployment.
- Kirusa has been involved in trials with France Telecom, testing multimodal applications, since September 2002. Actual Orange and France Telecom subscribers are interacting with multimodal applications, including messaging (email, voicemail) and Citiguide applications.
- Kirusa is also working with Bouygues Telecom in France with consumer applications, such as searching for movies. For example, when searching for a movie of interest playing in a nearby theater, a subscriber can speak the name of the city and the genre of the movie rather than trying to type them on the 12-key telephone keypad on the mobile device.
- LogicTree is also taking multimodality in interesting directions. They're building multimodal systems that not only allow one user to use multiple modes for input, but also allow two or more users to work together, each accessing the system with different modalities. As an example, LogicTree has built a system for the Ann Arbor Transit Authority in Ann Arbor, Michigan. This system allows riders who can't use the normal fixed-route buses to confirm pickups and check the status of their pickups. For the end user, this system is a voice-only system, but it's integrated with a GUI system used by the operator. If a call needs to be transferred, the system passes the current state of the call from the voice system to the operator. A screen appears at the operator's terminal with the confirmed information from the voice interaction filled in. The operator can then fill in the remaining fields using the GUI interface while talking to the customer. Prefilling fields with information collected from the voice interaction allows the call to be completed more quickly and is less frustrating to users, who appreciate not having to repeat the information they just provided to the system. This system is now in the final stages of acceptance testing.
- LogicTree is also working on a similar system for the Washington D.C. Metro Authority. This system allows the operator to fill in the remaining fields using the GUI interface while talking to the customer, send the information to the application, and then transfer the caller back to the automated system to hear the result of the transaction. This system is fully deployed and released to the public.
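The voice-to-operator handoff described for the LogicTree systems can be illustrated with a minimal sketch. This is not LogicTree's implementation: the field names, the `CallState` class, and both helper functions are hypothetical, chosen only to show how information confirmed in the voice interaction might prefill the operator's screen and leave only the remaining fields to be completed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical form fields; the article does not list the real ones.
OPERATOR_FORM_FIELDS = ["rider_id", "pickup_date", "pickup_time", "pickup_address"]


@dataclass
class CallState:
    """Information the caller has already confirmed in the voice system."""
    confirmed: Dict[str, str] = field(default_factory=dict)


def build_operator_screen(state: CallState) -> Dict[str, Optional[str]]:
    """Prefill the operator's form; fields not yet confirmed stay empty (None)."""
    return {name: state.confirmed.get(name) for name in OPERATOR_FORM_FIELDS}


def remaining_fields(screen: Dict[str, Optional[str]]) -> List[str]:
    """Fields the operator still has to fill in while talking to the caller."""
    return [name for name, value in screen.items() if value is None]


# Example: the caller confirmed two items by voice before the transfer.
state = CallState(confirmed={"rider_id": "12345", "pickup_date": "2003-05-01"})
screen = build_operator_screen(state)
print(remaining_fields(screen))
```

The point of the design is that the call state travels with the transfer, so the caller never has to repeat what the voice system already collected.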
Insights into User Behavior in Multimodal Systems
Whenever new interfaces get into users' hands, interesting and unanticipated things happen. During its trials, Kirusa has learned a lot about the human factors of good multimodal interface design. For example:
- Feedback to the user on the system's status is important. With mobile devices, narrowband wireless links can cause unpredictable delays. Users who aren't aware that the system is in a network delay will do things like impatiently push the push-to-talk button because they don't understand the source of the delay. Icons and earcons that make the state of the system clearer have been found to be very helpful.
- Customizability is important. To operate the system in eyes-free or hands-free settings, users should be able to switch the speech interface or the GUI interface on or off. In these situations, experienced users gravitate towards a speech-in/visual-out mode for a majority of tasks.
LogicTree has also identified some important factors about working with multimodality in contrast to a voice-only interface:
1. Dialog structure may need to be different in different modes. One example is dealing with complex inputs that may be misrecognized in the speech mode. A user can type an address all at once, but when speaking the same address, the user may be required to speak the city and state first, then the zip code, street name, and finally the street number for the best recognition accuracy.
2. Confirmation strategies for complex information also change with the input mode. It makes sense to confirm a complex input when the user speaks it in voice mode, but this would be meaningless on the screen, where the user can type the information and see the results.
3. When users go through a voice interaction to obtain information (such as a phone number), it makes sense to offer them an electronic copy via email or SMS, but this doesn't make sense in GUI mode, as they already have an electronic copy.
The synergy of platforms, progress toward standards and experience with applications is starting to provide a solid infrastructure for multimodality. Speech input isn't for every GUI application, but there are many applications where speech can be very important and in many cases vital to the usability of a system.
Dr. Deborah Dahl is the principal at Conversational Technologies. She is the chair of the World Wide Web Consortium's Multimodal Interaction Working Group and is also active in the Voice Browser Working Group. She can be reached at email@example.com.