The Internet of Things Needs a Lingua Franca
With the proliferation of smart speakers, voice interaction with home devices is becoming increasingly common. On the horizon are voice interactions with an ever greater number of smart environments—cities, offices, classrooms, factories, and healthcare settings.
Voice interaction with devices is simple and natural from the user’s perspective; behind the scenes it relies on a complex stack of technologies, from microphones to cloud servers back to devices and speakers. Everything has to work together seamlessly. Because of the complexity of each individual interaction, along with the huge variety of possible devices, the connections between the components have to be uniform. What if turning on the lights in a house, a car, and an office all required a completely different set of interfaces? This would make the developer’s job very difficult and slow down the adoption of voice interactions.
In theory, this uniformity could be achieved if one company made every smart device and everyone used its technology. But it’s hard to imagine that a proprietary approach could ever cover all the options. A much more realistic approach would be for device components to be based on standards that have been agreed upon among device manufacturers, smart speaker vendors, and mobile device manufacturers.
There’s some progress here, but there’s also a lot of work to do.
The W3C Web of Things (WoT) Working Group has published some specifications for standardizing interfaces for smart devices. In May 2019 it published two documents on WoT standards: (1) the WoT Thing Description document, which specifies how to describe metadata and interfaces that apply to physical objects like smart devices; and (2) the WoT Architecture document, which describes the abstract architecture for the W3C WoT, including scripting for device control.
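To make the idea of a Thing Description a little more concrete, here is a minimal sketch in Python of what such a description might look like for a lamp, loosely modeled on the JSON format the Thing Description document defines. The lamp itself, its identifier, its title, and every URL are hypothetical, and the sketch has not been validated against the specification. The helper function `list_affordances` is also an invention for illustration; it simply lists what a client could read or invoke on the device.

```python
import json

# Hypothetical lamp. The id, title, and URLs are invented for illustration.
lamp_td = {
    "@context": "https://www.w3.org/2019/wot/td/v1",
    "id": "urn:example:lamp-1",
    "title": "Kitchen Lamp",
    "securityDefinitions": {"nosec_sc": {"scheme": "nosec"}},
    "security": ["nosec_sc"],
    "properties": {
        "on": {
            "type": "boolean",
            "forms": [{"href": "https://lamp.example.com/properties/on"}],
        }
    },
    "actions": {
        "toggle": {
            "forms": [{"href": "https://lamp.example.com/actions/toggle"}],
        }
    },
}


def list_affordances(td):
    """Return sorted (kind, name, href) tuples for each property and action."""
    result = []
    for kind in ("properties", "actions"):
        for name, spec in td.get(kind, {}).items():
            for form in spec.get("forms", []):
                result.append((kind, name, form["href"]))
    return sorted(result)


# Serialize the description as it might be published by the device.
td_json = json.dumps(lamp_td, indent=2)
```

Because the description is plain JSON, a client that has never seen this particular lamp could still discover that it has an "on" property and a "toggle" action, which is the kind of uniformity the standard is after.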
These documents are important steps toward standardizing the interactions among things, but they don’t cover human interaction with things. There are a few ideas in progress here. One proposal from the W3C Voice Interaction Community Group describes a standardized representation of semantic information (JSON Representation of Semantic Information). This would define a uniform interface connecting natural language systems to back-end application servers. It was published in February 2019 and is an early draft that needs to be expanded and tested.
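As a rough illustration of what a uniform interface between a natural language system and a back-end server could buy developers, the Python sketch below routes a JSON result to a handler. The field names (`utterance`, `intent`, `entities`, `confidence`), the handlers, and the confidence threshold are all assumptions made for this example; they are not taken from the Community Group draft.

```python
import json

# Hypothetical output of a natural language system for
# "turn on the kitchen lights". Field names are illustrative only.
nlu_result_json = """
{
  "utterance": "turn on the kitchen lights",
  "intent": "turn_on",
  "entities": {"device": "lights", "location": "kitchen"},
  "confidence": 0.92
}
"""


def turn_on(entities):
    return f"Turning on the {entities['location']} {entities['device']}"


def turn_off(entities):
    return f"Turning off the {entities['location']} {entities['device']}"


# Back-end handlers keyed by intent name (hypothetical).
HANDLERS = {"turn_on": turn_on, "turn_off": turn_off}


def dispatch(result_json, min_confidence=0.5):
    """Route a semantic result to the matching back-end handler."""
    result = json.loads(result_json)
    if result.get("confidence", 0.0) < min_confidence:
        return "Sorry, I didn't catch that."
    handler = HANDLERS.get(result.get("intent"))
    if handler is None:
        return "Sorry, I can't do that yet."
    return handler(result.get("entities", {}))
```

If every natural language system emitted results in one agreed format, a back end like `dispatch` would not need to be rewritten for each vendor.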
Standardization efforts are also needed in other areas of human-computer interaction, including dialogue authoring languages and voice user interface guidelines. What’s available for these?
VoiceXML is a dialogue authoring language that was first developed in the early 2000s. It is still pretty sophisticated compared to most authoring tools. VoiceXML provides useful built-in capabilities such as slot-filling, tapered prompts, and barge-in. It remains the standard authoring language for IVR applications, but current virtual assistants use their own proprietary authoring formats, possibly because the developers of the proprietary systems weren’t aware of VoiceXML or wanted to use JSON instead of XML. But even if VoiceXML isn’t used directly, it remains a rich source of dialogue management concepts that could be used in other systems. Another very recent proposal for dialogue management is from the W3C Conversational Interfaces Community Group. In April 2019, this group published Dialogue Manager Programming Language (DMPL), a draft of a declarative dialogue authoring language. Like the JSON Representation of Semantic Information proposal, this draft needs to be expanded and tested.
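Two of the VoiceXML concepts mentioned above, slot-filling and tapered prompts, are easy to carry over to other systems. The sketch below is plain Python, not VoiceXML or DMPL, and the slot names, prompt wording, and scripted replies are invented for illustration. It asks for each unfilled slot in turn and switches to a shorter prompt when the user fails to answer; barge-in is not modeled.

```python
# Prompts per slot, indexed by attempt count (tapered prompts).
# Slot names and wording are hypothetical.
PROMPTS = {
    "device": ["Which device would you like to control?", "Which device?"],
    "action": ["Would you like to turn it on or off?", "On or off?"],
}


def next_prompt(slots, attempts):
    """Return (slot, prompt) for the first unfilled slot, or None when done."""
    for slot, prompts in PROMPTS.items():
        if slots.get(slot) is None:
            count = attempts.get(slot, 0)
            # Reuse the last prompt once the list is exhausted.
            return slot, prompts[min(count, len(prompts) - 1)]
    return None


def run_dialogue(replies):
    """Fill slots from scripted replies; None simulates no usable answer."""
    slots, attempts, transcript = {}, {}, []
    for reply in replies:
        step = next_prompt(slots, attempts)
        if step is None:
            break
        slot, prompt = step
        transcript.append(prompt)
        if reply is None:
            attempts[slot] = attempts.get(slot, 0) + 1
        else:
            slots[slot] = reply
    return slots, transcript
```

For example, `run_dialogue([None, "lights", "on"])` simulates a user who misses the first question, then answers the shorter follow-up and the next question.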
As for user interface guidelines, which govern how users interact directly with systems, hard-and-fast standards can be difficult to define. For example, a good voice user interface (VUI) guideline is “keep system prompts short.” “Short” is a subjective concept, though (is 10 words short enough?), and prompts can’t be so short that a user won’t understand them. This principle has to be more of a guideline than a standard. In any case, since the VUI is where the user meets the system (and many users probably equate the interface with the entire system), user interface guidelines are critical for application success. The Association for Conversational Interaction Design (ACIXD) has done a comprehensive job of collecting guidelines for designing VUIs, based on both human factors research and many years of hard-won experience with VUI implementations. These guidelines are available on the ACIXD website.
Voice interaction with the Internet of Things will be pervasive and transformative as smart devices and smart environments become more ubiquitous. Establishing voice interaction standards for all the components of smart environments is critical for accelerating this process and expanding the range of applications.
Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at firstname.lastname@example.org.