
In the Future Everything Will Be Modular—Or at Least VoiceXML


As we barrel toward 2010—the year we drop “two thousand” from our pronunciation and start saying “twenty-ten,” the year General Motors promised us flying cars, and the year economists promise we’ll come out of this recession—the future for speech fanatics holds one shining possibility: the dawn of VoiceXML 3.0, a new standard to rule them all.

VoiceXML 3.0 has been in the works at the World Wide Web Consortium (W3C) for some time now. It remains in what Jim Barnett, a member of the architecture team at Genesys Telecommunications Laboratories and editor-in-chief of the SCXML specification for the W3C, calls the “early stages,” but its high-level workings have all been mapped out. Just this past August, in fact, the W3C released its third draft of the proposed language for consideration by the world at large. VoiceXML 3.0 aims to add a number of new features to the language and to make some of its stickier points easier to digest.

But before we get into that, a little history is in order. 

Long ago, back in 1999—when the Internet was finally being called just “The Internet” and not “The World Wide Web” or, even worse, “The Information Superhighway,” and when, with that name shift, the Internet was starting to prove its value as a potent way for people to interact with information—various telecommunications companies got together and decided something analogous ought to be done with voice. Players including Motorola, AT&T, Lucent, and IBM formed a consortium devoted to achieving a standard language for voice browsing. Prior to that, they each had their own separate, proprietary languages. Within a year, they had realized their goal, and VoiceXML 1.0 was a blinking new babe in the world.

From there, the vendor consortium turned the language over to the W3C, which set about standardizing it and in 2004 produced the version that became VoiceXML 2.0. 

During 2.0’s development, and before it even became a recommended standard, there was a lot of uptake of the language, says Deborah Dahl, principal at speech and language consulting firm Conversational Technologies and chair of the W3C’s Multimodal Interaction Working Group. “Before VoiceXML was on the horizon, if you wanted to develop an IVR application with speech, it was a really big job,” she explains. “You had to do a lot of low-level integration of the controller and the speech engine, and it was a huge effort. Not that many people could do that.”

The Next Iteration

A few years later, based on reactions and recommendations from users, some minor changes and added functionalities were incorporated into a new version, and VoiceXML 2.1 reached recommendation status in 2007. Now, just a few years later, a larger change is under development, and the W3C is looking to release version 3.0.

The need for VoiceXML 3.0 evolved out of the development and increasing complexity of current voice applications and the acknowledgment that the future of speech is going to rest with multimodal applications. While the implications here might suggest that a major shift in the language is coming, the hard changes are actually more modest—something akin to Windows 98 over Windows 95 rather than Windows Vista over Windows 3.1. Instead of changing the underlying architecture of the language dramatically, 3.0’s developers are reconceptualizing how it’s going to be used and where it could be plugged in.

VoiceXML 3.0 will be, above all, a superset of VoiceXML 2.0/2.1, meaning everything contained in the old language will remain available to developers. They will still be able to code the same way they always have and produce the same results. While it retains that legacy compatibility, though, 3.0 is being pushed out of its standalone container and partitioned into several modules, or units of functionality. For instance, there will be grammar, prompt, speech synthesis markup language, form, and field modules. The modules themselves will be organized into profiles, each of which combines several modules into a single syntax.

“For example, a ‘legacy’ profile incorporates the syntax of most of the modules corresponding to the VoiceXML 2.1 functionality. Thus, most VoiceXML 2.1 applications can be easily ported to VoiceXML 3.0 using the legacy profile,” explains Jim Larson, a consultant, VoiceXML trainer, and co-chair of the W3C’s Voice Browser Working Group. “Profiles enable platform developers to select just the functionality that application developers need for a platform or class of applications. Multiple profiles enable developers to use just the profile [language] needed for a platform or class of applications.”
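
To make the idea concrete, here is a minimal sketch, assuming a simple directed-dialogue menu: a plain VoiceXML 2.1 form of the sort that, under the legacy profile Larson describes, would be expected to carry over to 3.0 largely untouched. The form name, grammar file, and submit target are invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- A hypothetical main menu: one field, one grammar, one prompt. -->
  <form id="main_menu">
    <field name="choice">
      <prompt>Say sales or support.</prompt>
      <grammar type="application/srgs+xml" src="menu.grxml"/>
      <filled>
        <!-- Hand the recognized choice to a placeholder server-side script. -->
        <submit next="route_call.jsp" namelist="choice"/>
      </filled>
    </field>
  </form>
</vxml>
```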

In the new version, programmers will gain increased modularity and explicit access to new functionality, such as speaker verification and multimodal capabilities.

The first of these, the increased modularity, is among the most anticipated changes awaiting VoiceXML coders. The new version of the language is being built to work with a number of other standard languages, like EMMA (Extensible MultiModal Annotation), which renders input from various modalities, from voice to handwriting, into a common annotation format, and XHTML, to support voice-enabled Web pages. In short, VoiceXML 3.0 could supply the voice functions for mashup applications built on virtually any XML language, along with other standards like the Pronunciation Lexicon Specification (PLS), which standardizes pronunciations for text-to-speech applications. 
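
As a small, hedged illustration of the kind of companion standard involved, here is a minimal PLS lexicon with a single invented entry, the sort of document a text-to-speech engine could consult so that an unusual name is pronounced consistently across applications.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <!-- One illustrative entry: how to say "VoiceXML". -->
  <lexeme>
    <grapheme>VoiceXML</grapheme>
    <phoneme>vɔɪs ɛks ɛm ɛl</phoneme>
  </lexeme>
</lexicon>
```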

Creating Partnerships

Its most significant XML partner language, however, may be SCXML (State Chart XML), a standard control flow language being developed by the W3C in conjunction with VoiceXML. One of SCXML’s aims is to correct one of the major headaches that VoiceXML 2.0 developers have wrestled with for some time: the form interpretation algorithm (FIA).

In versions 2.0 and 2.1, the FIA defines the mechanism by which a VoiceXML form is interpreted to control the interaction with the end user. Essentially, the FIA defines the call flow of a contact center application and moves a caller through its various stages. It is, however, loosely defined and can be difficult to manage in complex applications.

“It was loosely defined to the point where it confused a lot of people,” says Bill Scholz, president of the Applied Voice Input/Output Society (AVIOS). “It didn’t have a clear, rigid definition of how it would behave given any sort of abnormal event, such as misrecognitions, or a long silence period, or barge-in enabled or disabled. It was somewhat difficult to anticipate, from a given segment of VoiceXML code, how the FIA might behave.”
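
By way of illustration, here is a sketch of the kind of VoiceXML 2.1 field the FIA drives, with the barge-in, no-input, and no-match behaviors Scholz mentions spelled out explicitly; the grammar file and submit target are placeholders. Exactly which handler fires, and in what order the FIA revisits the field afterward, is the behavior he found hard to predict.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="get_account">
    <field name="account">
      <!-- bargein="true" lets the caller interrupt the prompt. -->
      <prompt bargein="true">Please say your account number.</prompt>
      <grammar type="application/srgs+xml" src="account.grxml"/>
      <!-- Handlers for the "abnormal" events the FIA must sequence. -->
      <noinput>
        <prompt>Sorry, I did not hear anything.</prompt>
        <reprompt/>
      </noinput>
      <nomatch>
        <prompt>Sorry, I did not understand.</prompt>
        <reprompt/>
      </nomatch>
      <filled>
        <submit next="lookup_account.jsp" namelist="account"/>
      </filled>
    </field>
  </form>
</vxml>
```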

The new iteration of VoiceXML, however, alleviates that by letting developers hand control to an entirely separate and more robust standard language, SCXML. SCXML has no built-in speech capabilities. It is a pure workflow language that could, in principle, orchestrate anything, even the commands for a nuclear power plant or a rocket ship. What it provides is the capability to call on other languages, like VoiceXML, for functionality such as speech recognition.

Genesys’ Barnett provides an example for the contact center. “If you want to do something like get [a caller’s] credit card, validate the credit card, then check to make sure he’s really ordering what he wants and that you have his address, you put that high-level logic in SCXML,” he says. “But then the specific stuff, like asking ‘What is your credit card number?’ and the speech recognition, that’s in VoiceXML. The idea is that SCXML is the controlling language that calls VoiceXML. This keeps your VoiceXML simpler and smaller.”
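
A rough sketch of that division of labor might look like the state chart below: the SCXML holds the high-level logic Barnett describes, and each state hands off to a separate VoiceXML document for the prompting and recognition. The invoke type, document names, and the card.valid/card.invalid events are illustrative assumptions, not anything drawn from the specifications themselves.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0"
       initial="collect_card">
  <!-- High-level call flow lives here; the prompts and the speech
       recognition live in the invoked VoiceXML documents. -->
  <state id="collect_card">
    <invoke type="vxml" src="collect_card.vxml"/>
    <transition event="done.invoke" target="validate_card"/>
  </state>
  <state id="validate_card">
    <invoke type="vxml" src="validate_card.vxml"/>
    <transition event="card.valid" target="confirm_order"/>
    <transition event="card.invalid" target="collect_card"/>
  </state>
  <state id="confirm_order">
    <invoke type="vxml" src="confirm_order.vxml"/>
    <transition event="done.invoke" target="finished"/>
  </state>
  <final id="finished"/>
</scxml>
```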

SCXML has a go-back capability that lets users back up and start over if the interaction goes astray, Barnett adds. It also has the ability to run parallel tasks simultaneously, like making special offers to a customer while she is having information validated. In addition to this, Barnett points to time savings that developers can reap by using SCXML. 
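
For the parallel-tasks case, SCXML’s parallel element is the relevant construct. A minimal sketch, again with invented document names and an assumed VoiceXML invoke type, might run the validation dialogue and the special offer side by side:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0"
       initial="serve_caller">
  <parallel id="serve_caller">
    <!-- Two regions run at the same time. -->
    <state id="validation" initial="validating">
      <state id="validating">
        <invoke type="vxml" src="validate_details.vxml"/>
        <transition event="done.invoke" target="validated"/>
      </state>
      <final id="validated"/>
    </state>
    <state id="offers" initial="offering">
      <state id="offering">
        <invoke type="vxml" src="special_offer.vxml"/>
        <transition event="done.invoke" target="offered"/>
      </state>
      <final id="offered"/>
    </state>
    <!-- Fires once both regions have reached their final states. -->
    <transition event="done.state.serve_caller" target="wrap_up"/>
  </parallel>
  <final id="wrap_up"/>
</scxml>
```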

“If you have been inside a company for a while and look inside the products, you will find dozens and dozens of ad hoc state machines that people have written,” he says. “SCXML is a standard state machine language. We can save people an enormous amount of time if you write SCXML once and just plug it in anywhere someone wants a state machine. You will save a lot of time.”

The modularity of VoiceXML 3.0 and its relationship with SCXML will also make it easier to reuse snippets of VoiceXML code. If a coder already has code for collecting credit card numbers, she can reuse it in new applications without rewriting the VoiceXML; only the portions of SCXML around it would need to change.

This separation of labor between VoiceXML 3.0 and SCXML could also help the language fit more organically into most development cultures. In larger applications, groups of programmers already work on an application’s logic somewhat separately from the user interaction experts. The division between VoiceXML and SCXML, at least ostensibly, recognizes that arrangement and builds on it. 

“I think that separating the abstract routing information into SCXML versus the low-level turning the recognizer off and on will make it a lot easier to find dialogue,” Dahl says.

Direct Contact

Given that, if VoiceXML coding eventually becomes a matter of working only with a recognizer or a TTS engine, so specialized that it has dominion over only the speech functions of an application, whatever the modality, we might see people with less code-heavy backgrounds working directly in VoiceXML rather than directing the high-level architecture and leaving the nitty-gritty to coders.

Further pushing things along in that direction, Larson foresees a time when tools and code generators will be built to hide some of the intrinsic “warts” and “uglies” of writing in VoiceXML, as is the case with many of the other XML languages. He expects this will enable voice user interface designers to quickly create prototypes and experiment with them.

All of this might mean increased specialization in speech application development and a change in the kinds of people who make their way into the field.

VoiceXML 3.0 developers will also have access to new modules being developed by the W3C that handle database access, speaker identification and verification, and more robust media (video and audio) playback.

There is little doubt among most technologists that version 3.0 will dominate speech capabilities in the contact center and other traditional spheres of influence, as its predecessors already have. How quickly it will come into use, however, is another story. Because the new version is a more modest refresh, don’t expect developers to jump through hoops to port into 3.0. If an application works well in 2.1 or even 2.0, it’s doubtful that enterprises will want to commit the resources to refresh it. In fact, business cycles being what they are, some speech systems operating in the world today are more than a decade old: dusty beeping boxes sitting in a closet somewhere, as Nancy Jamison, principal analyst at Jamison Consulting, referred to them at August’s SpeechTEK 2009 conference.

Also in question is whether VoiceXML 3.0 will make the same kind of impact in the multimodal space as it has in speech. While VoiceXML’s boosters hope its modularity will make it part of the multimodal solution, others think it might not be robust enough for that role.

“It’s a contradiction of terms,” says Juan Gilbert, a professor of computer science and software engineering and chair of the Human Centered Computing Lab at Clemson University. “It’s VoiceXML. If it has multimodality in it, it’s MultimodalXML. If you’re going to take VoiceXML and make it more object-oriented, then you run into serious issues. The issues would be things we’ve seen before. We had the [graphical user interface] and the desktop, and then people came along and wanted to add voice. It was an afterthought. And here we are doing the same thing in reverse. We have a language for [voice user interface], and we’re going to add multimodality. That’s an afterthought. Inherently, when you do these kind of things, they don’t work well.”

Gilbert thinks a stronger push in multimodality will come from within EMMA because it is innately a multimodal-facing language. 

Scholz agrees, but with some hesitation. “There is not a single strong conclusion,” he explains. “I would lean in the direction of Juan Gilbert’s remarks, but I don’t think that leaning needs to be taken as highlighting an insufficiency of V3. It’s even there to some extent in V2. I’m doing multimodal applications in VoiceXML 2.1 every day, and they work fine.”

Other competing languages exist, according to Scholz. For instance, many multimodal applications being developed today are for Apple’s iPhone, and those applications are written in Objective-C. If the big multimodal push keeps coming from the iPhone, then we might see contact centers, as they become multimodal themselves, looking to be compatible with it.

“There’s a slight chance of [Objective-C making its way into IVRs], but I don’t think the V3 developers and IVR developers need to cower in fear,” Scholz says.

Such hesitation about looking to the future is entirely justified at this point. The language’s release is still some way off. While its contours are laid out, the particulars of its inner workings are still being hashed out. Clearly, the language will face challenges as it moves ahead, but for many in speech development, it holds a tremendous amount of promise and potential.


A Look at EMMA

EMMA, or Extensible MultiModal Annotation, is a markup language that renders multiple modes of input (haptic, voice, visual, etc.) into common output. According to Deborah Dahl, principal at speech and language consulting firm Conversational Technologies, chair of the World Wide Web Consortium’s Multimodal Interaction Working Group, and author of the EMMA standard, the language began as an effort to create a standard way of representing the results from speech recognition to other platforms, like VoiceXML. Before its release, the results were often proprietary, and the code to integrate them was tedious and “kludgy”—programmer-speak for lacking elegance and grace.

“The interesting thing about EMMA,” Dahl says, “is that it evolved to support more than just speech. It can represent any kind of human language input: handwriting that might be recognized by a handwriting recognizer, text typing, even motions like mouse clicks or the accelerometer on your iPhone. The really interesting power of that is that you can do these combined multimodal inputs, like having a map on your screen and being able to circle it and say, ‘Zoom in here.’”
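
As a hedged illustration, a single EMMA result for that spoken command might look roughly like the markup below; the confidence value and the application-specific payload elements (command, location) are invented for the example. Combined inputs, such as the circled map region plus the spoken phrase, would be tied together with additional EMMA elements such as emma:group.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<emma:emma version="1.0"
           xmlns:emma="http://www.w3.org/2003/04/emma">
  <!-- A hypothetical recognition result: the user said "zoom in here". -->
  <emma:interpretation id="int1"
                       emma:medium="acoustic"
                       emma:mode="voice"
                       emma:confidence="0.87"
                       emma:tokens="zoom in here">
    <command>zoom</command>
    <location>here</location>
  </emma:interpretation>
</emma:emma>
```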

That agnostic approach to modalities is what makes the standard powerful today. VoiceXML 3.0’s ability to leverage it might be the key to its longevity in a multimodal future. 
