May 1, 2011
By Deborah Dahl, Principal, Conversational Technologies
Standards

Making Modalities Play Nicely Together

Multimodal applications have incredible potential to enrich the interaction between users and applications. But multimodal capabilities, like speech recognition, gesture recognition, and biometrics, are themselves highly complex, and few companies have the resources to create independent versions of each technology they want to use in an application. Simply mastering separate APIs for multiple third-party components can be problematic.

Standards for integrating technologies from different vendors would make it easier for third-party technology experts to supply components to integrate into multimodal systems. The Multimodal Architecture and Interfaces specification (MMI architecture) is one such standard. The architecture enables various vendors’ technologies to work together to create innovative applications, opening the door for smaller players.

At its essence, the MMI architecture defines a set of cooperating components for processing modalities (modality components), coordinated by a central interaction manager and communicating through a standard set of life cycle events.
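To make that division of labor concrete, here is a minimal Python sketch of the pattern: components register with a central interaction manager, and all communication passes through it as events. The class names, the event fields, and the "-response" naming are assumptions made for illustration only; the MMI architecture defines its own event names and XML syntax.

```python
from dataclasses import dataclass, field


@dataclass
class LifeCycleEvent:
    """A message passed between the interaction manager and a component."""
    name: str    # hypothetical event label, e.g. "start"
    target: str  # name of the component the event is addressed to
    data: dict = field(default_factory=dict)


class ModalityComponent:
    """Illustrative base class for a component that handles one modality."""

    def __init__(self, name):
        self.name = name

    def handle(self, event):
        # Acknowledge the event; a real component would do modality-specific work here.
        return LifeCycleEvent(
            name=event.name + "-response",
            target="interaction-manager",
            data={"source": self.name, "status": "success"},
        )


class InteractionManager:
    """Central coordinator; components never call each other directly."""

    def __init__(self):
        self._components = {}

    def register(self, component):
        self._components[component.name] = component

    def send(self, event):
        component = self._components.get(event.target)
        if component is None:
            # No component is registered under that name, so report a failure.
            return LifeCycleEvent(
                name=event.name + "-response",
                target="interaction-manager",
                data={"source": event.target, "status": "failure"},
            )
        return component.handle(event)
```

Because the manager only ever sees the shared event interface, swapping one vendor's speech component for another's would not require changes to the rest of the application.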

By standardizing interfaces between modality components and the interaction manager, the MMI architecture lets vendors offer pluggable components that can fit into other companies’ multimodal applications and provide specific functionality, such as speech or handwriting recognition. Because the architecture is high-level, however, designers need more specifics to create pluggable components. To make it easier to design interoperable components, the W3C Multimodal Interaction Working Group recently published a companion document to the MMI architecture, called “Best Practices for Creating MMI Modality Components.” It includes eight important guidelines, three design suggestions, and three examples of components.

The eight guidelines identify the information that a component’s specification should include so that third parties can use the component in MMI architecture-based systems. Think of the guidelines as a “spec for a spec” for modality components.

The first, and most important, guideline is that a modality component must implement all of the life cycle events. That is, it must accept commands to start, stop, report its status, pause, and resume, as well as the other life cycle events defined in the MMI architecture. In addition, any events specific to the component must be defined.
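One way to picture this guideline is as a completeness check on a component's interface. The Python sketch below is hypothetical: the command names are taken from the list in the paragraph above rather than from the event names in the MMI architecture, and the recognizer class is invented for illustration.

```python
# Commands named in the text above; the labels are illustrative, not the spec's event names.
REQUIRED_COMMANDS = ("start", "stop", "status", "pause", "resume")


class SpeechRecognizerComponent:
    """Hypothetical component that tracks its state across commands."""

    def __init__(self):
        self.state = "idle"

    def start(self):
        self.state = "running"
        return self.state

    def stop(self):
        self.state = "idle"
        return self.state

    def pause(self):
        # Pausing only makes sense while the component is running.
        if self.state == "running":
            self.state = "paused"
        return self.state

    def resume(self):
        # Resuming only makes sense after a pause.
        if self.state == "paused":
            self.state = "running"
        return self.state

    def status(self):
        return self.state


def missing_commands(component):
    """Return the required commands that the component does not implement."""
    return [name for name in REQUIRED_COMMANDS
            if not callable(getattr(component, name, None))]
```

An integrator could run a check like `missing_commands` before plugging in a third-party component; an empty list means every required command is at least present.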

Supported natural languages, audio formats for speech recognition, communication protocols such as HTTP, scripting languages, and error codes also are important to define. With that information, someone who’s building an application and considering using another vendor’s component can know whether it would be suitable.
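As a rough illustration of how an application builder might use such a description, the Python sketch below records those properties and checks them against an application's needs. The field names and every sample value (the language tags, the audio format, the error strings) are assumptions for the example and are not drawn from the best practices document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentDescription:
    """What a vendor might publish about a component (all values hypothetical)."""
    name: str
    languages: frozenset
    audio_formats: frozenset
    protocols: frozenset
    error_codes: frozenset


def unmet_requirements(description, language, audio_format, protocol):
    """List the application's requirements that the component does not meet."""
    checks = [
        ("language", language, description.languages),
        ("audio format", audio_format, description.audio_formats),
        ("protocol", protocol, description.protocols),
    ]
    return [f"{label} {value!r} not supported"
            for label, value, supported in checks
            if value not in supported]


# Sample description of an imaginary speech recognizer.
recognizer = ComponentDescription(
    name="example-speech-recognizer",
    languages=frozenset({"en-US", "de-DE"}),
    audio_formats=frozenset({"audio/wav"}),
    protocols=frozenset({"HTTP"}),
    error_codes=frozenset({"language not supported", "audio format not supported"}),
)
```

Calling `unmet_requirements(recognizer, "fr-FR", "audio/wav", "HTTP")` would flag the missing language before any integration work begins.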

The three design suggestions are also helpful for designers of interoperable components. They involve the internal architecture of components. As discussed in the MMI architecture spec, components can be simple, complex, or nested.

Simple components are internally black boxes, but in some cases it may be useful to create components with internal structure. A complex component combines two or more distinct functions that are bundled for convenience or efficiency. An example would be an avatar that needs to communicate closely with a text-to-speech (TTS) engine to synchronize lip movements to the synthesized speech. The third type of component, nested, may include an internal interaction manager. The best practices document gives suggestions for deciding which architecture (simple, complex, or nested) is appropriate.

Finally, the best practices document has three examples of definitions of modality components that follow the guidelines. The definitions describe APIs based on the MMI architecture for each component. Sample components include face recognition, handwriting recognition, and video display.

The video display component explains such details as the fact that it is designed to use the H.264 codec, employs HTTP for transmitting media, and defines error messages such as “codec not supported” if it is asked to start displaying a video file that uses an unsupported codec. Similarly, the face-recognition example defines error messages such as “known users not found” if the database of previously enrolled faces is not found. The handwriting component example includes information such as the fact that the handwriting recognition is based on InkML input and that it uses EMMA for representing the recognition results. Those three examples also include specific XML examples of life cycle events used with each example modality.
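The video display example's error behavior can be sketched in a few lines of Python. Only the H.264 codec and the “codec not supported” message come from the description above; the function name, the dictionary shape of the response, and the sample URI are assumptions for illustration, and the actual examples in the best practices document are expressed as XML life cycle events.

```python
# The example component is described as designed for H.264 only.
SUPPORTED_CODECS = {"h264"}


def start_video(uri, codec):
    """Hypothetical handler for a request to start displaying a video."""
    # Normalize spellings such as "H.264" and "h264" to a single key.
    normalized = codec.lower().replace(".", "")
    if normalized not in SUPPORTED_CODECS:
        # Report the documented error instead of attempting playback.
        return {"status": "failure", "error": "codec not supported"}
    return {"status": "success", "uri": uri}
```

Publishing the error string in the component's specification lets an application decide in advance how to react, for example by falling back to a different file or a different component.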

The MMI working group is interested in hearing your feedback; post your ideas to the public mailing list at www-multimodal@w3.org.

You can find the MMI architecture spec at http://www.w3.org/TR/mmi-arch/ and the best practices spec at http://www.w3.org/TR/mmi-mcbp/.

Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at dahl@conversational-technologies.com.