EMMA's Success Leads to New Challenges
Extensible MultiModal Annotation 1.0 (EMMA), the W3C standard for representing user input, became an official standard in 2009. Since then it's been incorporated into a variety of systems, including commercial speech systems, research systems, and Microsoft Office. As the standard becomes more extensively used and additional modalities become available in devices, new requirements have emerged, motivating updated versions of the standard. In this column, I'll discuss three exciting ideas the W3C Multimodal Interaction Working Group is working on for future versions of EMMA. We have two current publications discussing these ideas—a Working Draft of EMMA 1.1 that includes incremental features and a note on use cases for future versions of EMMA that includes broader changes.
One proposal in EMMA 1.1 is to include information on the location of the user's device in EMMA results. With the increasing availability of GPS on devices, adding GPS information to EMMA is a natural extension. The Working Draft of EMMA 1.1 supports the inclusion of location information in EMMA documents, in a format based on the W3C's Geolocation specification. The location information includes not only latitude and longitude, but also altitude, heading, and speed. This should make building location-aware applications much simpler. It also opens up interesting possibilities for analytics that integrate information obtained from the user's location as well as the user's speech.
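As a rough sketch of what this could look like, here is a hypothetical EMMA document carrying location metadata alongside a recognized utterance. The `emma:location` element and its attribute names are illustrative only, modeled on the fields of the W3C Geolocation specification; the actual names in the EMMA 1.1 Working Draft may differ.

```xml
<emma:emma version="1.1" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="int1"
      emma:medium="acoustic" emma:mode="voice"
      emma:confidence="0.92"
      emma:tokens="find coffee near me">
    <!-- Hypothetical location element; field names follow the
         W3C Geolocation specification's Coordinates interface -->
    <emma:location latitude="39.9526" longitude="-75.1652"
        altitude="12.0" heading="270.0" speed="1.4"/>
    <command action="search" query="coffee"/>
  </emma:interpretation>
</emma:emma>
```

With location carried in the same document as the interpretation, a location-aware application can act on both in a single step, and an analytics pipeline can correlate what users said with where they said it.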
A second proposal in EMMA 1.1 is based on the recognition that EMMA addresses two different but important use cases. Initially, EMMA was envisioned primarily as a way to represent speech recognition output in a vendor-neutral way. EMMA addresses this use case by making it much easier for different recognizers to interoperate with different VoiceXML platforms. However, as EMMA began to be used more widely, it became clear that a second important use case was supporting offline analysis of data from deployed systems. These two use cases place different requirements on EMMA. An EMMA document that's going to be used on a mobile device should include only the minimal information the application needs to conduct a real-time dialogue, but an EMMA document that's going to be used for analytics needs to be much more detailed, with a wider variety of metadata describing the input and its context. To satisfy both use cases, a reference mechanism is proposed for EMMA 1.1. Using the new reference features, a minimal EMMA document used in real time by a client dialogue manager can point back to a fuller EMMA document stored on a server, making detailed analysis of users' inputs possible without burdening a mobile device with information it doesn't need.
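A minimal client-side document under this proposal might look something like the sketch below. The `emma:result-ref` attribute name and the URL are hypothetical, invented here to illustrate the idea of the proposed reference mechanism rather than taken from the Working Draft.

```xml
<emma:emma version="1.1" xmlns:emma="http://www.w3.org/2003/04/emma">
  <!-- emma:result-ref is a hypothetical name for the proposed
       reference feature pointing back to the full server-side result -->
  <emma:interpretation id="int1"
      emma:confidence="0.92"
      emma:result-ref="https://dialog.example.com/logs/session42/int1.emma">
    <flight origin="BOS" destination="SFO"/>
  </emma:interpretation>
</emma:emma>
```

The client gets just what the dialogue manager needs, while the referenced server-side document could retain the richer metadata an analyst would want, such as n-best lists, timestamps, and pointers to the original audio.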
The third feature is a longer-term effort to look at extending EMMA to represent system output as well as user input. EMMA 1.0 was focused on representing user inputs, but it would also be valuable to represent system outputs in EMMA. The clearest situation is when the system is speaking. An EMMA representation of a system's spoken output would support the development of a variety of dialogue analytics, because then we would have both sides of the conversation represented in a common format. For example, an EMMA output document would make it easier to study how variations in system prompts affect user responses. With a uniform representation of user input and system output, it would be easy to look at metrics such as the length of time between the end of the system prompt and the beginning of the user's response, which could be an indicator of user confusion. It would also simplify finding the answers to questions such as which system prompts most often lead to the user asking for an operator. Even more fine-grained analytics would become possible: we could correlate system prompts and user responses at specific times, and even across different locations, using the new location information. EMMA support for system output other than speech also raises interesting possibilities for representing graphical outputs such as HTML or Scalable Vector Graphics and synchronizing those with speech and text output.
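To make the timing metric concrete, here is a hypothetical sketch pairing a system prompt with the user's reply. The `emma:output` element is invented for illustration, since no output representation has been standardized yet; `emma:start` and `emma:end`, however, are existing EMMA 1.0 attributes giving absolute timestamps in milliseconds.

```xml
<emma:emma version="1.1" xmlns:emma="http://www.w3.org/2003/04/emma">
  <!-- Hypothetical wrapper for a system prompt -->
  <emma:output id="prompt7" emma:medium="acoustic" emma:mode="voice"
      emma:start="1262300000000" emma:end="1262300003500"
      emma:tokens="Where would you like to fly?"/>
  <!-- The user's response, represented as in EMMA 1.0 -->
  <emma:interpretation id="reply7" emma:medium="acoustic" emma:mode="voice"
      emma:start="1262300005200" emma:end="1262300006900"
      emma:tokens="Boston">
    <destination>BOS</destination>
  </emma:interpretation>
</emma:emma>
```

With both sides in one format, the response latency falls out of simple arithmetic on the timestamps: the user began speaking 1,700 milliseconds after the prompt ended (1262300005200 minus 1262300003500), the kind of figure an analyst might track across prompts as a possible signal of user confusion.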
This overview only scratches the surface of the new features we're working on for EMMA. We welcome new participants and ideas about how EMMA can be used. The simplest way to get started is to join the public mailing list at email@example.com and comment on the specs and proposals for new features.
Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium's Multimodal Interaction Working Group (www.W3.org/2002/mmi). She can be reached at firstname.lastname@example.org.