Speech Technology Magazine

 

The Right Standard Makes Developers’ Jobs a Lot Easier

You'll never know whether a standard's a good fit if you don't try it out
By Deborah Dahl - Posted Jun 18, 2018

We know that using standards can have huge advantages in application development. Vendor independence, the availability of knowledgeable developers, standard tooling, and good documentation and training options are just a few of the benefits. But every application has its own requirements, and the available standards might not be a good fit. How do you decide if they are? 

You can plunge right into the (possibly dense!) specs and start coding away, but if the standard isn’t quite right, you’ll have to chalk up the wasted effort to a lesson learned. Fortunately, plenty of resources for learning and exploring standards don’t require a significant commitment. General information can be found in the usual places: blogs, sites like Stack Overflow, and books like W3C Standards for Multimodal Interaction (which I edited). There are also code resources like open-source implementations, compilers, editors, and visualizers, which can often be found in places like GitHub.

Let’s look at some of the more prominent standards for speech, natural language, and multimodal systems. 

Speech Synthesis Markup Language (SSML) is a W3C standard for fine-tuning how synthesized speech is pronounced. SSML tells the text-to-speech (TTS) synthesizer about specifics of pronunciation, such as which words should be emphasized and where pauses should fall. These adjustments can make a huge difference in how the synthesized speech affects listeners, much like the difference between a line read by a skilled actor and the same line read by an ordinary person. A moving speech recited by James Earl Jones will likely sound flat and unconvincing in the voice of your neighbor’s 15-year-old son. With SSML, developers aren’t limited to a TTS system’s default pronunciations; they can make the output sound exactly as they want.
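To give a flavor of what this markup looks like, here is a minimal sketch in Python that builds a small SSML document with one emphasized word and one pause. The helper name build_ssml, its parameters, and the sample sentence are made up for illustration; only the speak, emphasis, and break elements and the namespace come from the SSML specification. Individual TTS services may support only a subset of SSML, so check your vendor's documentation before relying on any particular element.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(lead_in, stressed, closing, pause_ms=500):
    """Return SSML that stresses one phrase, pauses, then finishes the sentence.

    Hypothetical helper for illustration; not part of any TTS vendor's SDK.
    """
    # Serialize SSML elements without a prefix (xmlns="...").
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.1", f"{{{XML_NS}}}lang": "en-US"},
    )
    speak.text = lead_in + " "

    # <emphasis level="strong"> asks the synthesizer to stress the phrase.
    emphasis = ET.SubElement(speak, f"{{{SSML_NS}}}emphasis", {"level": "strong"})
    emphasis.text = stressed

    # <break time="500ms"/> inserts a pause of the given length.
    pause = ET.SubElement(speak, f"{{{SSML_NS}}}break", {"time": f"{pause_ms}ms"})
    pause.tail = " " + closing

    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("I have", "never", "seen anything like it."))
```

The resulting string can be pasted into any online demo that accepts SSML input, which makes it easy to hear what each element does before committing to the standard.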

SSML is widely supported by TTS systems; the TTS engines behind the Amazon Alexa Skills Kit, Microsoft Cognitive Services, IBM Watson, and Nuance cloud services all accept SSML as an option, and most of these products offer online demos. An open-source TTS platform, the Mary system (http://mary.dfki.de/), also lets you experiment with SSML. Authoring SSML directly can be difficult, but authoring tools like the Chant VoiceMarkup Kit or the open-source SSML builder on GitHub can help.

State Chart Extensible Markup Language (SCXML) is another popular standard with open-source support. SCXML is a powerful tool for defining state-based speech and multimodal dialogues. When a user says something or interacts with the screen, an SCXML-based system can react and move to a new state, triggering a display change or a spoken prompt. The state-based approach is helpful for defining how users progress through an app.
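The state-based idea can be sketched in a few lines of Python. The SCXML document below describes a made-up ordering dialogue; the state ids and event names are hypothetical. The small walker that follows it is not a conforming SCXML interpreter: it assumes flat states, matches event names exactly, and ignores conditions and executable content. For real work, use one of the interpreters mentioned in this column.

```python
import xml.etree.ElementTree as ET

SCXML_NS = "{http://www.w3.org/2005/07/scxml}"

# Hypothetical dialogue: a spoken order, confirmed by a tap on the screen.
DIALOGUE = """
<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0" initial="greeting">
  <state id="greeting">
    <transition event="user.said.order" target="taking_order"/>
  </state>
  <state id="taking_order">
    <transition event="user.tapped.confirm" target="confirmed"/>
    <transition event="user.said.cancel" target="greeting"/>
  </state>
  <final id="confirmed"/>
</scxml>
"""


def load_transitions(scxml_text):
    """Return the initial state id and a {(state, event): target} table."""
    root = ET.fromstring(scxml_text.strip())
    table = {}
    for state in root.iter(f"{SCXML_NS}state"):
        for transition in state.findall(f"{SCXML_NS}transition"):
            key = (state.get("id"), transition.get("event"))
            table[key] = transition.get("target")
    return root.get("initial"), table


def run(scxml_text, events):
    """Feed events in order and return the id of the state we end up in."""
    state, table = load_transitions(scxml_text)
    for event in events:
        # Events with no matching transition leave the state unchanged.
        state = table.get((state, event), state)
    return state


if __name__ == "__main__":
    print(run(DIALOGUE, ["user.said.order", "user.tapped.confirm"]))
```

Even a toy like this shows why the approach appeals to dialogue designers: the flow of the conversation lives in a declarative document that can be edited and visualized separately from the application code.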

SCXML resources are available for many platforms, including server, desktop, and mobile apps. On a server or desktop, Apache Commons SCXML, a Java-based interpreter, and PySCXML (for Python) are options. JavaScript-based SCXML interpreters like SCION can be used directly in browsers.

To make authoring SCXML easier, several editors and visualizers have been developed, such as SCXMLGUI and VisualSC.

VoiceXML, the standard for defining voice dialogues, is widely implemented and needs no introduction. An open-source implementation, JVoiceXML, is available and would be a good way to start experimenting with VoiceXML.

While SSML, SCXML, and VoiceXML are probably the most widely implemented speech standards, implementations of some of the newer standards can also be found. 

The Multimodal Architecture is an approach to integrating multimodal functions like speech recognition, emotion recognition, and face recognition into interactive systems. Java, JavaScript, and ActionScript libraries for this standard are available at https://github.com/w3c/mmi.

Emotion Markup Language (EmotionML), a language for representing emotion, has also been implemented in the Mary TTS system. EmotionML can change pronunciation at a higher level than SSML; rather than emphasizing a word, EmotionML can make a voice sound angry or happy. The Mary website has an online demo for using EmotionML to tweak the emotions expressed by a synthetic voice.
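As a rough illustration of that higher-level markup, the Python sketch below wraps a sentence in an EmotionML emotion element using the "big six" category vocabulary published by the W3C. The helper name annotate and the sample sentence are invented for this example, and whether a given synthesizer accepts this exact document, or this vocabulary, is an assumption you should verify against that system's documentation.

```python
import xml.etree.ElementTree as ET

EMOTIONML_NS = "http://www.w3.org/2009/10/emotionml"
BIG6_VOCABULARY = "http://www.w3.org/TR/emotion-voc/xml#big6"
BIG6_NAMES = {"anger", "disgust", "fear", "happiness", "sadness", "surprise"}


def annotate(text, emotion_name, intensity):
    """Return an EmotionML document tagging `text` with one emotion category.

    Hypothetical helper for illustration; `intensity` maps to the optional
    `value` attribute, which EmotionML restricts to the range 0.0 to 1.0.
    """
    if emotion_name not in BIG6_NAMES:
        raise ValueError(f"{emotion_name!r} is not a big6 category")
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must be between 0.0 and 1.0")

    # Serialize EmotionML elements without a prefix (xmlns="...").
    ET.register_namespace("", EMOTIONML_NS)
    root = ET.Element(
        f"{{{EMOTIONML_NS}}}emotionml",
        {"version": "1.0", "category-set": BIG6_VOCABULARY},
    )
    emotion = ET.SubElement(root, f"{{{EMOTIONML_NS}}}emotion")
    category = ET.SubElement(
        emotion,
        f"{{{EMOTIONML_NS}}}category",
        {"name": emotion_name, "value": str(intensity)},
    )
    # The annotated text follows the category element inside <emotion>.
    category.tail = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(annotate("Your order has shipped.", "happiness", 0.8))
```

Note the contrast with SSML: nothing here says which word to stress or where to pause. The document only states the intended emotion and leaves the acoustic details to the synthesizer.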

Because of all these resources, you can explore and learn about voice and multimodal standards without making a huge investment in time and money. Even if you find a standard doesn’t fit the bill, what you learn might be relevant for the next application. And if a standard does meet your requirements, then your project will have a big leg up over applications where the components are proprietary. 

Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at dahl@conversational-technologies.com.
