June 14, 2016
By Deborah Dahl, Principal, Conversational Technologies
Standards

EMMA 2.0 Lets Applications Decide What to Tell You—and How

To carry out user requests, dialogue applications like IVRs and virtual assistants need a standard, vendor-independent way to describe the meanings of what their users say. The Extensible Multimodal Annotation language, known as EMMA, was designed by the World Wide Web Consortium (W3C) with this need in mind, and now it has received an exciting update.

EMMA was created as a way to represent user inputs, particularly the kinds of rich, complex inputs possible with spoken natural language. EMMA 1.0 became a W3C standard in 2009 and has since been used to link processors like speech recognizers and natural language understanding systems with platforms such as VoiceXML and Web browsers.

Since the publication of EMMA 1.0, the W3C Multimodal Interaction Working Group has received much feedback from implementers, who suggested new features based on their experiences with the language. The Working Group soon realized that the scope of the changes amounted to a new version; the first working draft of EMMA 2.0 was published in September 2015.

A significant new feature, support for system output using a new “output” tag, opens up some intriguing possibilities. Just as EMMA 1.0 allows an application to deal with the meaning of multimodal user input, abstracted away from the user’s exact words or even whether the user spoke, typed, or used a touch screen, EMMA 2.0 provides similar benefits for system output. A single format for user input and system output makes it much easier for an application to decide what should be presented to the user, without worrying about how to present it. If this goal sounds familiar, it’s because it echoes a Web development style called responsive design, but the EMMA approach is much broader. Responsive design involves adapting the same content gracefully to screens of varying sizes; EMMA 2.0 supports adapting content not only to various screen sizes but to entirely different presentation formats—including speech, graphics, combined speech and graphics, even robot actions.

Consider a travel planning system that finds three flights that satisfy a user’s request. At the application level, the system generates an EMMA document that simply specifies that the user should be informed of the flights, without specifying exactly how. The next stage of processing would refine the EMMA document, adding details on presenting the information. This stage can account for the user’s general preferences as well as current context. If a visual presentation is appropriate for the user and context, then a graphical display is generated. If a spoken presentation is appropriate, the output is spoken. The presentation could also include both graphics and speech.
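
For readers who like to see the idea in code, here is a minimal Python sketch of those two stages. It is an illustration, not markup taken from the spec: the namespace is the EMMA 1.0 one and is assumed to carry over, the flight element, the context flags, and the flight numbers are hypothetical, and the use of emma:mode values such as "voice" and "gui" on an output element should be checked against the working draft.

```python
import xml.etree.ElementTree as ET

# EMMA 1.0 namespace; assumed here to carry over to EMMA 2.0 documents.
EMMA_NS = "http://www.w3.org/2003/04/emma"
ET.register_namespace("emma", EMMA_NS)


def q(name):
    """Return a namespace-qualified name in ElementTree's {uri}name form."""
    return f"{{{EMMA_NS}}}{name}"


def application_stage(flights):
    """Say WHAT to tell the user: the flights, with no presentation detail."""
    root = ET.Element(q("emma"), {"version": "2.0"})
    output = ET.SubElement(root, q("output"), {"id": "flights1"})
    for flight in flights:
        # <flight> is hypothetical application markup, not part of EMMA.
        ET.SubElement(output, "flight", flight)
    return root


def refinement_stage(root, context):
    """Say HOW to present it, using hypothetical context flags."""
    output = root.find(q("output"))
    if context.get("eyes_busy") or not context.get("has_screen", True):
        mode = "voice"      # no usable screen: speak the flights
    elif context.get("noisy"):
        mode = "gui"        # speech would be hard to hear: display them
    else:
        mode = "voice gui"  # assumed way to request speech plus graphics
    output.set(q("mode"), mode)
    return root


flights = [
    {"number": "XX100", "departs": "08:15"},
    {"number": "XX200", "departs": "11:40"},
    {"number": "XX300", "departs": "17:05"},
]
doc = refinement_stage(application_stage(flights), {"has_screen": False})
print(ET.tostring(doc, encoding="unicode"))
```

The point of the split is that application_stage never mentions speech or screens; only refinement_stage knows about the device and the user's situation.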

How do users gain from this type of adaptation? Devices with small screens, like smart watches, benefit from spoken output, and devices without screens, like the Amazon Echo, require it. Spoken output also suits eyes-busy tasks like exercising or driving. Applications designed to be used in public or noisy environments, on the other hand, will profit from graphical output.

Another benefit is accessibility. If the application treats user inputs and system outputs as generic meanings, the core user-system interaction logic doesn’t have to change much to accommodate the different types of presentations that users with disabilities might prefer.

Representing input and output in the same format is also valuable for developers. It becomes easier to develop applications that use different devices, modalities, and presentation formats because, again, the basic application logic can be separated from the presentation.

Finally, this approach is useful for the maintenance and tuning of running applications. Because the user input and system output can be contained in the same EMMA document, with basically the same metadata (confidences, timestamps, alternatives, and so on), uncovering problematic relationships between the system outputs and the users’ responses becomes easier. Is there an unusually long lag between the end of the system prompt and the beginning of a user’s response? Perhaps that prompt is confusing and needs to be reworded. These kinds of correlations are easy to spot when the prompts and user utterances are packaged together.
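
As a rough sketch of that kind of check, the Python below reads one logged document and measures the gap between the end of the prompt and the start of the reply. It rests on several assumptions: the sample document is hypothetical, the output element is assumed to carry the same emma:start and emma:end millisecond timestamps that EMMA 1.0 defines for interpretations, and the five-second threshold is an arbitrary placeholder, not a recommended value.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"
NS = {"emma": EMMA_NS}

# Hypothetical log entry: one system prompt followed by the user's reply.
SAMPLE = """
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:output id="prompt1"
               emma:start="1465900000000" emma:end="1465900003000"/>
  <emma:interpretation id="reply1"
               emma:start="1465900009500" emma:end="1465900011000"/>
</emma:emma>
"""


def response_lag_ms(doc_text):
    """Milliseconds between the end of the prompt and the start of the reply."""
    root = ET.fromstring(doc_text.strip())
    prompt = root.find("emma:output", NS)
    reply = root.find("emma:interpretation", NS)
    prompt_end = int(prompt.get(f"{{{EMMA_NS}}}end"))
    reply_start = int(reply.get(f"{{{EMMA_NS}}}start"))
    return reply_start - prompt_end


def is_suspicious(lag_ms, threshold_ms=5000):
    """Flag lags above an arbitrary, application-chosen threshold."""
    return lag_ms > threshold_ms


lag = response_lag_ms(SAMPLE)
if is_suspicious(lag):
    print(f"Prompt may need rewording: user took {lag} ms to respond")
```

Run over many logged exchanges and grouped by prompt, a check like this would point to the prompts that consistently leave users hesitating.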

Take a look at the spec (https://www.w3.org/TR/emma20/) for more details about the output tag and all of the other useful new features—support for location information, incremental results, and partial results are just a few—of EMMA 2.0.

Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at dahl@conversational-technologies.com.