July 15, 2008
By Deborah Dahl, Principal, Conversational Technologies
Standards

A Framework for Multimodal Apps

You’re about to leave your hotel, on your way to corporate headquarters for a business meeting. You just found out that the CEO is coming and you need to make copies of your presentation. Where should you go? You pull out your mobile phone and say, “Show me copy stores near here.” A map pops up listing locations of nearby copy stores. You point to one and say, “This one.” A new screen shows the hours, location, and a store rating. That store is definitely close by, but it only has one star. You tilt your phone and a new screen shows you a second store. This one is a little farther away, but it has five stars. You say, “Route me there,” and a map appears with your route. A voice interface guides you to the store. You make your copies, your presentation wows the CEO, and you get backing for your project, all thanks to a multimodal application that combines the modalities of speech, graphics, and motion.

As useful as this application would be, it turns out that actually building it isn’t so easy. Coordinating different modalities and managing the technologies behind them, all while providing a seamless user experience, is a complex job.

The World Wide Web Consortium (W3C) Multimodal Interaction and Voice Browser Working Groups are working on standards that will simplify the process of building multimodal applications like this one and lay the groundwork for even more compelling applications. Multimodal applications can certainly be built with proprietary approaches, but proprietary solutions can be so much more complex than standards-based ones that all but the simplest become extremely expensive.

That’s why the W3C has drafted a specification for multimodal applications, called the Multimodal Interaction (MMI) Architecture. The keys to this MMI Architecture are a careful definition of system components, a clear delineation of their responsibilities, and a well-defined communication process. These features allow new modalities to be easily added. In addition, they make it possible for application functions to be distributed over local devices, like cell phones, as well as remote servers. The four major parts of the MMI Architecture are:
1. Modality Components (MCs), which provide modality-related services, such as speech recognition, handwriting recognition, speaker verification, and visual display.
2. The Runtime Framework (RF), which coordinates the MCs and manages communications.
3. The Interaction Manager (IM), which defines the overall flow of an interaction between the user and the system. A good example of a standard for implementing an IM is State Chart XML (SCXML).
4. Life-cycle events, which allow the components of the MMI Architecture to communicate. These events include generic functions like start, stop, and pause, plus more detailed ones that are specific to particular components, such as providing the grammars for a speech recognizer.
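To make the division of labor among these four parts concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the specification’s API: the class names, the event names (start, pause, stop), the component names, and the grammar payload are all hypothetical, and the MMI Architecture defines its own event set and message format.

```python
from dataclasses import dataclass, field


@dataclass
class LifeCycleEvent:
    """A hypothetical life-cycle event; names are illustrative, not from the spec."""
    name: str                                  # e.g. "start", "pause", "stop"
    target: str                                # which modality component is addressed
    data: dict = field(default_factory=dict)   # component-specific payload, e.g. a grammar


class ModalityComponent:
    """Provides one modality-related service and reacts to life-cycle events."""

    def __init__(self, name):
        self.name = name
        self.state = "idle"
        self.received = []

    def handle(self, event):
        # Record the event, then apply the generic state change it implies.
        self.received.append(event)
        transitions = {"start": "running", "pause": "paused", "stop": "idle"}
        self.state = transitions.get(event.name, self.state)


class RuntimeFramework:
    """Coordinates the components and delivers events to them."""

    def __init__(self):
        self._components = {}

    def register(self, component):
        self._components[component.name] = component

    def send(self, event):
        self._components[event.target].handle(event)


class InteractionManager:
    """Decides the overall flow; here it is a fixed two-step script."""

    def __init__(self, framework):
        self.framework = framework

    def begin_search(self):
        # Start speech recognition with a made-up grammar, then start the display.
        self.framework.send(LifeCycleEvent("start", "speech", {"grammar": "copy-stores"}))
        self.framework.send(LifeCycleEvent("start", "display"))


framework = RuntimeFramework()
speech = ModalityComponent("speech")
display = ModalityComponent("display")
framework.register(speech)
framework.register(display)

InteractionManager(framework).begin_search()
framework.send(LifeCycleEvent("pause", "speech"))
print(speech.state, display.state)  # prints "paused running"
```

In this sketch the Interaction Manager never touches a component directly. It only hands events to the Runtime Framework, which is what would let a component live on the phone or on a remote server without changing the flow logic.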

One of the architecture’s most important features is that MCs can be built independently from other parts of the system by experts in that modality. That is, speech components can be built by speech experts, handwriting recognition components can be built by handwriting experts, and biometric components can be built by biometrics experts. Once developed, these independent components can then be combined as needed to create multimodal applications. All that is necessary is for a modality component to support the life-cycle events and to provide ways to access its modality-specific functions.
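The idea that components need only share the life-cycle contract can be sketched as follows. This is a hedged illustration: the `SupportsLifeCycle` protocol, the `on_event` method, the `load_grammar` function, and the acknowledgment strings are hypothetical names chosen for the example and do not come from the specification.

```python
from typing import Iterable, List, Protocol


class SupportsLifeCycle(Protocol):
    """The only contract the rest of the system relies on (hypothetical)."""

    def on_event(self, name: str) -> str: ...


class SpeechComponent:
    """Built by speech experts; grammar loading is its modality-specific function."""

    def __init__(self) -> None:
        self.grammars = []
        self.active = False

    def load_grammar(self, grammar: str) -> None:
        self.grammars.append(grammar)

    def on_event(self, name: str) -> str:
        if name == "start":
            self.active = True
        elif name in ("pause", "stop"):
            self.active = False
        return f"speech:{name}"


class HandwritingComponent:
    """Built separately by handwriting experts; shares no code with SpeechComponent."""

    def __init__(self) -> None:
        self.active = False

    def on_event(self, name: str) -> str:
        self.active = name == "start"
        return f"handwriting:{name}"


def broadcast(components: Iterable[SupportsLifeCycle], name: str) -> List[str]:
    """Send one life-cycle event to every component, whatever its modality."""
    return [component.on_event(name) for component in components]


speech = SpeechComponent()
speech.load_grammar("copy-stores")      # modality-specific access
handwriting = HandwritingComponent()

acks = broadcast([speech, handwriting], "start")
print(acks)  # prints ['speech:start', 'handwriting:start']
```

The two classes share no base class and no code. `broadcast` works with both because each one answers the same life-cycle call, while modality-specific functions such as `load_grammar` remain available to whoever needs them.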

Multimodal interaction will play an important role in realizing the dream of the Web anytime, anywhere, and on any device. The problems of using small keyboards and endless menus for input are solved by speech. Visual displays allow information to remain available long after speech has faded away, and enable applications to present visual information that can’t be conveyed by speech.

Although multimodal applications are important for mobile users, they can also be useful on the desktop. In the call center space, consider a multimodal application where customers can view instructional or troubleshooting graphics on a desktop browser while a coordinated voice application on the phone talks them through the process. Think about kiosks in retail stores where customers can ask about the location of a product and see it on a map, or similar kiosks for finding services in an airport. Multimodal applications also have tremendous potential for literacy and language training.

The MMI Architecture is still in a working draft stage; comments are welcome and can be sent to www-multimodal@w3.org. The full specification can be found at www.w3.org/TR/mmi-arch.

Deborah Dahl, Ph.D., is the principal at speech and language technology consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at dahl@conversational-technologies.com.
