WebRTC and WebAudio add speed and simplicity.
For most of the history of the World Wide Web, users have primarily used the keyboard and mouse (and, more recently, touch) to interact with Web applications. Spoken input, whether directed to an application or to another person, has always been a niche option, restricted to proprietary technologies and specific applications, such as video conferencing. Similarly, handling real-time audio other than speech, such as music, has also been very limited, although there are many potential applications for this kind of functionality. The music search service SoundHound, which can identify hummed or sung music, is one example, but potential applications include remote music lessons and distributed musical groups. Such applications are becoming more feasible as Web interaction in general moves from the desktop to mobile devices and wearables, with their ubiquitous and continually improving video and audio capture capabilities. However, proprietary technologies for audio and video capture, such as Flash, have real disadvantages: they require downloads, don't allow developers to reuse their Web application development skills, and are often restricted to specific platforms.
Two emerging standards from the World Wide Web Consortium and Internet Engineering Task Force are starting to make audio and video first-class user interface options. WebRTC and WebAudio provide standards for capturing live audio and video from a device and sending it to a server (or even directly to another browser in a peer-to-peer connection). These standards will make it much easier to integrate speech and graphical interfaces. Standard audio capture functionality will make it possible to easily send speech to a cloud-based speech recognition service directly from a standard Web browser.
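To make this concrete, here is a minimal sketch of browser-side audio capture using the getUserMedia API from the WebRTC family. The helper names (buildConstraints, startCapture) are illustrative, not part of any standard; the sketch simply requests microphone access and returns the resulting media stream.

```javascript
// Build the media constraints object getUserMedia expects:
// audio only by default, no video.
function buildConstraints({ audio = true, video = false } = {}) {
  return { audio, video };
}

// In a browser, prompt the user for microphone permission and
// resolve to a MediaStream. Outside a WebRTC-capable browser
// (e.g., under Node.js), resolve to null instead of throwing.
async function startCapture(constraints = buildConstraints()) {
  if (typeof navigator === "undefined" || !navigator.mediaDevices) {
    return null; // no WebRTC media capture available here
  }
  return navigator.mediaDevices.getUserMedia(constraints);
}
```

Once the stream is available, it can be attached to an audio element, processed with WebAudio nodes, or sent to a peer or server over an RTCPeerConnection.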
With these new standards, it should become much simpler to build multimodal Web-based applications for customer service. This applies to both automated applications and interactions with agents. Automated applications can take advantage of the graphical capabilities of desktop computers, smartphones, and tablets to overcome some of the limitations of purely voice-based IVR applications. For example, a multimodal IVR application using WebRTC could show users speech recognition results directly, instead of having to go through time-consuming confirmation dialogues. Enabling voice-in, text-out capabilities could speed up automated IVR interactions because the user wouldn't have to wait for the system to finish speaking in order to see what the system's response is. Support for interactions with live agents could also be made much simpler and more effective. For example, the user could show the agent a damaged part that might be difficult to describe in words. Or the agent could show the user how to operate a device, rather than having to try to describe it. This will speed user/agent conversations such as:
"How can I reset my printer?"
"Do you see the red button on the top left, next to the green button?"
"Do you mean the one that says 'power'?"
"No, the one under that."
"I don't think I have a red button there."
"Oh, are you sure you have model 617T5?"
"It looks like I actually have model 617C5."
"OK, then the red button will be on the top right..."
Instead, the agent could just say, "Show me your printer." Then, "You need to press the red button on the top right."
Although WebRTC is still in a working draft stage (which means the specification might still change), it has been implemented in Chrome, Firefox, Android, and Opera. With that broad base of browser support, it is well worth exploring the standard and prototyping applications now.
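As a starting point for such prototyping, the sketch below records a short utterance with MediaRecorder and posts it to a cloud speech recognition service, as described earlier. The endpoint URL and the HTTP POST protocol are purely hypothetical assumptions for illustration; real recognition services define their own protocols, often streaming over a WebSocket rather than posting a finished clip.

```javascript
// Hypothetical recognizer endpoint -- replace with a real service URL.
const SPEECH_ENDPOINT = "https://example.com/recognize";

// Pick a container format this browser's MediaRecorder supports.
function pickMimeType(candidates = ["audio/webm", "audio/ogg"]) {
  if (typeof MediaRecorder === "undefined") {
    return candidates[0]; // non-browser fallback for illustration
  }
  return candidates.find((t) => MediaRecorder.isTypeSupported(t)) || candidates[0];
}

// Record `ms` milliseconds from a microphone MediaStream (e.g., one
// obtained from getUserMedia), then POST the audio and return the
// recognizer's text response.
async function recognizeFromMicrophone(stream, ms = 3000) {
  const recorder = new MediaRecorder(stream, { mimeType: pickMimeType() });
  const chunks = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.start();
  await new Promise((resolve) => setTimeout(resolve, ms)); // capture the utterance
  recorder.stop();
  await new Promise((resolve) => (recorder.onstop = resolve));
  const blob = new Blob(chunks, { type: recorder.mimeType });
  const response = await fetch(SPEECH_ENDPOINT, { method: "POST", body: blob });
  return response.text();
}
```

A multimodal page could call recognizeFromMicrophone and display the returned text directly, giving the voice-in, text-out interaction discussed above.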
Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium's Multimodal Interaction Working Group. She can be reached at email@example.com.