W3C Tackles Emotion and Multimodality

In addition to its work on VoiceXML 3.0, which it released in working draft form earlier this month, the World Wide Web Consortium (W3C) has been hard at work on a number of other specifications that will have serious implications for the speech technology community.

The first is Emotion Markup Language (EmotionML) 1.0, released as a first public working draft, which represents emotions and related states in technological applications. The language is conceived as a plug-in language suitable for manual annotation of data, automatic recognition of emotion-related states from user behavior, and generation of emotion-related system behavior.

“That might be useful, for example, in a call center, for representing the output of an emotion detector that would detect an angry or upset customer,” explains Deborah Dahl, chair of the W3C’s Multimodal Interaction Working Group, which released the draft.

Outside the call center, the standard would be valuable for generating emotional output in a text-to-speech system and for driving avatars so that facial expression and voice match, according to Dahl.

One of the main problems in programming such technology, however, is that the emotion vocabulary an application needs often depends on its context of use. With EmotionML, programmers can draw some emotions from pre-established vocabularies and do custom work for the others, Dahl explains.
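As a rough illustration of that split between shared and custom vocabularies, the Python sketch below tags a detected emotion and records which vocabulary it came from. It is not taken from the working draft: the `emotion` and `category` element names, the `vocabulary` attribute, the `annotate` function, and the word lists are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical pre-established vocabulary; the terms are illustrative only.
SHARED_VOCABULARY = {"angry", "happy", "sad", "afraid"}


def annotate(category, custom_vocabulary=frozenset()):
    """Return an XML string tagging one detected emotion category."""
    if category in SHARED_VOCABULARY:
        source = "shared"
    elif category in custom_vocabulary:
        source = "custom"
    else:
        raise ValueError(f"unknown emotion category: {category!r}")
    # Element and attribute names are assumptions, not quoted from the draft.
    emotion = ET.Element("emotion")
    ET.SubElement(emotion, "category", {"name": category, "vocabulary": source})
    return ET.tostring(emotion, encoding="unicode")


# A call-center detector might emit a shared term, while an
# application-specific term has to come from a custom list.
print(annotate("angry"))
print(annotate("impatient", custom_vocabulary={"impatient"}))
```

A term found in neither list raises an error here, which is one possible design choice; an application could just as easily fall back to a neutral label.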

The Multimodal Interaction Working Group also recently published an updated working draft of the Multimodal Architecture and Interfaces specification, which creates a framework for all the components of a multimodal application to work together.

Dahl says this draft is significantly different from the previous one, with clarifications to its relationship with the Extensible Multimodal Annotation (EMMA) specification, simplified architecture constituents, a description of HTTP transport of lifecycle events, and the addition of an example of a handwriting recognition modality component.

“This is the W3C’s language for handling the communication among the components of a distributed multimodal application. For example, speech recognition and dialogue logic might take place in the cloud, while the GUI display would happen locally on the device,” Dahl explains.

The architecture, she adds, will be open so that new input methods can be incorporated easily into the larger application as they become available. “You can add modalities. It would be just one more application layer that fits into the architecture,” she says.
