W3C Tackles Emotion and Multimodality
In addition to its efforts on VoiceXML 3.0, which it released in working draft form in early December (see Deborah Dahl's Standards column for an interview with Dan Burnett, one of VoiceXML 3.0’s primary authors), the World Wide Web Consortium (W3C) has been hard at work on a number of other specifications that will have serious implications for the speech community.
The first is Emotion Markup Language (EmotionML) 1.0, released as a first public working draft, which provides a way to represent emotions and related states in technological applications. The language is conceived as a plug-in language suitable for use in manual annotation of data, automatic recognition of emotion-related states from user behavior, and generation of emotion-related system behavior.
“That might be useful, for example, in a call center for representing the output of an emotion detector that would detect an angry or upset customer,” explains Deborah Dahl, chair of the W3C’s Multimodal Interaction Working Group, which released the draft.
Outside of the call center, the standard would be valuable for generating emotional output in a text-to-speech system and for making an avatar's facial expression match its voice, according to Dahl.
In programming such technology, however, one of the main problems is that the vocabulary needed for an application often depends on the context. Using EmotionML, programmers can build some emotions into pre-established vocabularies and do custom work for the others, Dahl explains.
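To make that concrete, here is a rough, illustrative sketch in Python of how an application might assemble two EmotionML-style annotations, one drawn from a pre-established vocabulary and one pointing at a custom, application-defined vocabulary. The namespace, attribute names, and vocabulary URIs shown are assumptions and may not match the draft's exact syntax.

# Illustrative only: the namespace, attribute names, and vocabulary URIs
# below are assumptions and may not match the working draft's exact syntax.
import xml.etree.ElementTree as ET

NS = "http://www.w3.org/2009/10/emotionml"  # assumed draft namespace
ET.register_namespace("", NS)

doc = ET.Element(f"{{{NS}}}emotionml")

# Annotation drawn from a pre-established vocabulary, as an emotion
# detector flagging an angry caller might emit.
angry = ET.SubElement(
    doc, f"{{{NS}}}emotion",
    {"category-set": "http://www.w3.org/TR/emotion-voc/xml#big6"})
ET.SubElement(angry, f"{{{NS}}}category", {"name": "anger"})

# Annotation that points at a custom, application-defined vocabulary
# (this URI is purely hypothetical).
custom = ET.SubElement(
    doc, f"{{{NS}}}emotion",
    {"category-set": "http://example.com/contact-center-vocab#set1"})
ET.SubElement(custom, f"{{{NS}}}category", {"name": "mild-frustration"})

print(ET.tostring(doc, encoding="unicode"))

Because each emotion annotation names the vocabulary it draws from, output based on a standard emotion set and an application's custom categories can sit side by side in the same document, which is how the plug-in design accommodates context-dependent vocabularies.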
The Multimodal Interaction Working Group also recently published an updated working draft of the Multimodal Architecture, which creates a framework for all the components of a multimodal application to work together.
Dahl says this draft is significantly different from the previous one, with clarifications to its relationship with the Extensible Multimodal Annotation (EMMA) specification, simplified architecture constituents, a description of HTTP transport of life cycle events, and an example of a handwriting recognition modality component.
“This is the W3C’s language for handling the communication among the components of a distributed multimodal application. For example, speech recognition and dialogue logic might take place in the cloud, while the GUI display would happen locally on the device,” Dahl explains.
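A rough sketch of what that communication might look like follows: an interaction manager posting a life-cycle event over HTTP to a cloud-hosted speech recognition component. The markup loosely follows the draft's life-cycle events rather than reproducing them verbatim, and the endpoint URL and identifiers are illustrative assumptions.

# Illustrative only: the endpoint URL, identifiers, and exact element and
# attribute names are assumptions loosely modeled on the draft's
# life-cycle events, not copied from the specification.
import urllib.request

MMI_NS = "http://www.w3.org/2008/04/mmi-arch"  # assumed draft namespace

start_request = f"""<mmi:mmi xmlns:mmi="{MMI_NS}" version="1.0">
  <mmi:StartRequest mmi:Source="im@example.com"
                    mmi:Target="asr@cloud.example.com"
                    mmi:Context="ctx-1234"
                    mmi:RequestID="req-0001"/>
</mmi:mmi>"""

# POST the StartRequest to the speech component's (hypothetical) HTTP
# endpoint. The component would reply with a StartResponse and later push
# a DoneNotification carrying its result back to the interaction manager.
req = urllib.request.Request(
    "http://asr.cloud.example.com/mmi",
    data=start_request.encode("utf-8"),
    headers={"Content-Type": "application/xml"},
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to send against a real endpoint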
The architecture, she adds, will be open so that as new input methods become available, they can be incorporated easily into the larger application. “You can add modalities. It would be just one more application layer that fits into the architecture,” she says.