NLU Results Shouldn’t Be Proprietary
The growth in voice interfaces is amazing. According to the Smart Audio Report, the number of smart speakers in the United States grew by 78% last year, with 115 million smart speakers in U.S. households at the end of 2018. We’re also talking to our mobile devices, cars, TVs, and so on.
It’s even more exciting to see all the new speech and natural language understanding (NLU) tools becoming available to developers. The Alexa Skills Kit, Microsoft LUIS, Nuance Mix, SAP Conversational AI, and IBM Watson are all very capable and easy to use. In most cases, simple demos and prototypes can be put together in a few hours. So far, so good.
The hard work comes in when we want to scale up these capabilities and use them for real commercial or enterprise applications. We have to collect data, build and test models, and—perhaps most important but often overlooked—integrate the NLU results with back-end enterprise systems. For example, let’s say we have a voice-enabled shopping application and a user says, “I’m looking for boys’ sweaters, size 4T, under $25.” A well-trained system will have no problem interpreting this utterance and delivering the information in a structured format, something like “category: boys’; size: 4T; price: under $25; article: sweater.”
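Serialized as JSON, a structured result of this kind might look roughly like the following. This is an illustrative sketch only; the intent name "search_product" and the field names are invented, and real vendors structure their results differently:

```json
{
  "intent": "search_product",
  "entities": {
    "category": "boys",
    "article": "sweater",
    "size": "4T",
    "price": "under $25"
  }
}
```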
But what happens if an enterprise decides to switch NLU vendors? This can happen for a lot of reasons—cost, missing functionality, compatibility issues with other systems. Now the natural language result will look different, and the integration process will have to be repeated. In short, changing vendors requires a lot of work. In our example, the “size” feature might be called an “entity,” a “slot,” or a “concept,” depending on the vendor, and different vendors’ results will be structured in different ways.
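To make the divergence concrete, here is a sketch of how two hypothetical vendors might package the very same interpretation. Both formats and all field names here are invented for illustration, not taken from any actual product:

```python
# Hypothetical vendor A: the interpretation is a flat list of named "slots".
vendor_a = {
    "intent": "search_product",
    "slots": [
        {"name": "category", "value": "boys"},
        {"name": "size", "value": "4T"},
        {"name": "price", "value": "under $25"},
        {"name": "article", "value": "sweater"},
    ],
}

# Hypothetical vendor B: the same information as a nested "entities" mapping.
vendor_b = {
    "topIntent": "search_product",
    "entities": {
        "category": "boys",
        "size": "4T",
        "price": "under $25",
        "article": "sweater",
    },
}

def size_from_a(result):
    """Pull the size 'slot' out of vendor A's format."""
    return next(s["value"] for s in result["slots"] if s["name"] == "size")

def size_from_b(result):
    """Pull the size 'entity' out of vendor B's format."""
    return result["entities"]["size"]

# Identical information, but every consumer needs vendor-specific code.
print(size_from_a(vendor_a))  # 4T
print(size_from_b(vendor_b))  # 4T
```

Switching from vendor A to vendor B means rewriting every access path like these, even though nothing about the user's utterance or its meaning has changed.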
There’s no technical need to have vendor-specific ways for expressing what is essentially the same information; this realization points to an area where a standard is needed. Agreeing on a common format would make it much easier to migrate from one platform to another, or to mix and match different natural language tools in the same application.
Where could a common format come from? An existing standard could show the way forward.
The Extensible MultiModal Annotation (EMMA) standard was published by the World Wide Web Consortium (W3C) a few years ago as a format for natural language processing results. It can represent the same kinds of natural language information as current toolkits—after all, the human side of the computer-human interface hasn’t changed and isn’t going to anytime soon. EMMA can represent simple entity-value results like the sweater example and is also powerful enough to represent multimodal inputs combining voice, GUI, biometrics, even emotions. EMMA can also include many kinds of metadata, such as confidences, alternative results, and times and locations of utterances.
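For the sweater query, an EMMA 1.0 result might look roughly like this. The `emma:` elements and attributes follow the published standard, but the application namespace, the confidence value, and the semantic element names are invented for illustration:

```xml
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns="http://example.com/shopping">
  <emma:interpretation id="int1"
      emma:medium="acoustic" emma:mode="voice"
      emma:confidence="0.92"
      emma:tokens="I'm looking for boys' sweaters size 4T under $25">
    <category>boys</category>
    <article>sweater</article>
    <size>4T</size>
    <price>under $25</price>
  </emma:interpretation>
</emma:emma>
```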
One drawback of EMMA is that results are defined using the older and less popular XML format, not JSON. So the question arises: Can the industry agree on a standard JSON format for natural language results? A draft proposal being discussed in the W3C’s Voice Interaction Community Group, “JSON Representation of Semantic Information,” shows how EMMA concepts could be formatted as JSON with a fairly direct mapping. Beyond the basic intents and entities of natural language inputs, the proposal also shows how EMMA’s more advanced features, such as confidences and alternative results, could be represented.
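Mapped into JSON, the same interpretation might look something like the sketch below. This is illustrative only, not the draft proposal’s actual schema; alternative recognition results would simply be additional entries in the array:

```json
{
  "interpretations": [
    {
      "confidence": 0.92,
      "intent": "search_product",
      "entities": {
        "category": "boys",
        "article": "sweater",
        "size": "4T",
        "price": "under $25"
      }
    }
  ]
}
```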
An obvious concern is whether vendors are willing to abandon proprietary formats, in which they may have put significant investment, for a standard. Even if they aren’t, it wouldn’t be difficult for third parties to write code that converts the vendor-specific formats to the common one, sparing enterprises from rewriting complex integration code every time they switch.
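A converter of that kind would be a thin mapping layer. Here is a minimal sketch, assuming a hypothetical vendor format built from a “slots” list and a hypothetical common format built from an “entities” object; none of these names come from an actual product or from the draft proposal:

```python
def to_common_format(vendor_result):
    """Convert a hypothetical vendor-specific NLU result (an intent plus a
    list of named 'slots') into a hypothetical common JSON shape (an intent
    plus an 'entities' mapping)."""
    return {
        "intent": vendor_result["intent"],
        "entities": {
            slot["name"]: slot["value"] for slot in vendor_result["slots"]
        },
    }

vendor_result = {
    "intent": "search_product",
    "slots": [
        {"name": "category", "value": "boys"},
        {"name": "size", "value": "4T"},
        {"name": "price", "value": "under $25"},
        {"name": "article", "value": "sweater"},
    ],
}

common = to_common_format(vendor_result)
print(common["entities"]["size"])  # 4T
```

With a shim like this sitting in front of the integration layer, the back-end code only ever sees the common shape, and changing NLU vendors means swapping one converter rather than rewriting the integration.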
Anyone who’s interested should take a look at the draft and send comments to the group’s public mailing list, email@example.com. Better yet, join the Voice Interaction Community Group and help create the standard yourself!
Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at firstname.lastname@example.org.