Multimodal Interfaces Discussed at SpeechTEK/AVIOS 2006

Speech technology discussions often focus on recognition accuracy and on voice user interface design for IVRs that handle reservations, customer care, and banking. However, the field has started to couple speech with graphics, and several cutting-edge multimodal applications have been developed. Multimodal usability is becoming more important to ensure that the different modes are not just velcroed together but work symbiotically to improve the service. The SpeechTEK West/AVIOS 2006 conference featured several papers and presentations that illustrate how voice is combined with graphics in successful products and services, building on the complementary strengths of the different modalities.

Most presentations on multimodal technology described handheld, mobile, PDA-like devices and calls made to call centers. A call to an airline reservation center downloaded video clips of instructions on how to use the service, displayed airline schedules on the screen, and then accepted voice input for flight requests and reservations. A call to a car rental agency displayed pictures of available car types, accepted voice input for making reservations, and, if the caller then chose to discuss the reservation with a live agent, showed more video clips of the car. Even calls to a generic call center sent video clips to the caller's device, driven by business rules that selected content based on the customer's record. A consumer research study found that multimodal applications are meeting customer needs. Indeed, the study showed a strong preference for receiving both voice and visual (video) data on handheld devices.

Multimodal applications on mobile telephones can provide a "one stop" messaging capability: voice dialing, plus voice input to create and send text messages and to manage contact lists, settings, and call status. A complementary visual interface displays the names and addresses that are entered. Future mobile phone capabilities include taking a picture or video and sending it using voice commands. Another capability is surfing the Web by speaking a query to a search engine that returns the results as speech or graphics.

A display-screen telephone uses a multimodal user interface to select song or movie downloads. When a song is selected using the graphical display, a short video clip of the song is played, and a video tutorial shows how to use voice to order the song or video. ASR accuracy over the mobile telephone was about 81 percent, so users were encouraged to use a smaller set of keywords, and an n-best list was displayed so that the user could resolve low-confidence results. The application is better received if voice is not required at every screen and the keypad is always available for data entry.
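The n-best fallback described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the presenters' implementation: the `Hypothesis` class, the `resolve` function, the 0.8 acceptance threshold, and the three-item display limit are all hypothetical, and a real recognizer would supply its own scores and result format.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One recognition candidate (hypothetical structure)."""
    text: str
    confidence: float  # assumed recognizer score in the range 0.0 to 1.0


def resolve(n_best, accept_threshold=0.8, max_choices=3):
    """Decide what to do with a recognizer's n-best list.

    Returns a pair (action, choices):
      "accept"   - the top hypothesis is confident enough to use directly
      "ask_user" - show the top few candidates on screen for selection
      "reprompt" - nothing was recognized, so ask the user to try again
    The threshold and list length are illustrative values only.
    """
    ranked = sorted(n_best, key=lambda h: h.confidence, reverse=True)
    if not ranked:
        return ("reprompt", [])
    if ranked[0].confidence >= accept_threshold:
        return ("accept", [ranked[0].text])
    return ("ask_user", [h.text for h in ranked[:max_choices]])


if __name__ == "__main__":
    candidates = [
        Hypothesis("order song", 0.55),
        Hypothesis("order video", 0.30),
        Hypothesis("older song", 0.10),
    ]
    action, choices = resolve(candidates)
    print(action, choices)
```

The design point is that the screen, not a second voice prompt, carries the correction step: the user taps or keys the intended item instead of repeating an utterance that was already misrecognized.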

Multimodal interfaces also work well for the mobile workforce (sales, craftspersons, real estate agents, etc.). Handheld devices are valuable for entering field data, placing orders, or requesting troubleshooting instructions. They support travel management, brokerage transactions, health care status and reports, and inventory control. A graphical interface presents the fields to be filled, and a voice interface is used to supply the input data, navigate between screens, or request additional information.

To support and encourage development of multimodal applications, standards have been developed and continue to evolve. The next draft of the W3C multimodal (MM) specifications was presented for review in March 2006. The standard defines four components of a multimodal user interface: the container, the Interaction Manager, data, and delivery media. Considerable effort is being directed toward the Interaction Manager, the component that evaluates input and state in order to communicate asynchronously using the available modalities. EMMA (Extensible MultiModal Annotation) notation is preferred for communicating the user's input to the Interaction Manager. The next steps in standards are to define the API between the Interaction Manager and the modalities and to decide on the structure of the Interaction Manager. In the meantime, multimodal applications can successfully use VoiceXML, which is capable of handling video as well as audio clips.
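To make the role of EMMA more concrete, the following Python sketch shows how an Interaction Manager might read the interpretations a speech recognizer hands to it. It is a rough illustration, not part of any specification text: it assumes the namespace and the `emma:confidence` and `emma:tokens` attributes as published in EMMA 1.0, while the sample document, the `destination` payload element, and the `read_interpretations` function are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace assumed from EMMA 1.0; drafts may differ.
EMMA_NS = "http://www.w3.org/2003/04/emma"

# Hypothetical EMMA document: two competing interpretations of one utterance.
SAMPLE = """<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="nbest" emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1" emma:confidence="0.62"
                         emma:tokens="flights to boston">
      <destination>Boston</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.31"
                         emma:tokens="flights to austin">
      <destination>Austin</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>"""


def read_interpretations(xml_text):
    """Return each emma:interpretation as a plain dictionary."""
    root = ET.fromstring(xml_text)
    results = []
    # ElementTree spells namespaced names as "{namespace}local-name".
    for interp in root.iter(f"{{{EMMA_NS}}}interpretation"):
        results.append({
            "id": interp.get("id"),
            "confidence": float(interp.get(f"{{{EMMA_NS}}}confidence", "0")),
            "tokens": interp.get(f"{{{EMMA_NS}}}tokens"),
            # Application-specific payload; element name is hypothetical.
            "destination": interp.findtext("destination"),
        })
    return results


if __name__ == "__main__":
    for item in read_interpretations(SAMPLE):
        print(item)
```

Because every modality reports its input in the same annotated form, the Interaction Manager can compare or merge results from voice, keypad, or stylus without knowing how each one was produced.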

It is still too early to create guidelines for successful multimodal user interface designs, because so few applications are available for evaluation and because target devices vary in size and capability. Other issues must also be addressed, such as which modality is preferred for specific tasks and how error correction can be supported with a complementary modality. With strong interest developing in multimodal products and services, expect rapid development of applications and solid suggestions for successful multimodal user interfaces. Be sure to attend the SpeechTEK West/AVIOS 2007 conference next spring and prepare to enjoy presentations by industry leaders about the latest developments in this rapidly evolving area.

Matt Yuschik is a human factors specialist at the Convergys Corporation in Cincinnati, Ohio. Yuschik has a doctorate in electrical and computer engineering. He has been on the AVIOS Board of Directors for eight years and is currently its treasurer.
