With Conversational AI, the Standards Work Heats Up
The World Wide Web Consortium, or W3C (www.w3.org), has been developing and publishing standards for the World Wide Web since 1994. Over that time, the W3C has published more than 150 formal standards, called Recommendations, written by dozens of focused working groups. In addition to well-known foundational web standards like XML and HTML, the W3C has also published many standards aimed at improving the interoperability of voice applications. We have covered many of these voice standards in previous Standards columns.
Speech Technology magazine readers will be most familiar with the W3C’s work on conversational AI standards such as VoiceXML (https://www.w3.org/TR/voicexml21/) and EMMA (https://w3c.github.io/emma/emma2_0/emma_2_0_editor_draft.html). But the W3C continues to publish other standards directly and indirectly related to conversational artificial intelligence. Here, let’s review some of this recent work and talk about how it can be used in speech applications.
One of the most interesting recent efforts is being carried out by the Verifiable Credentials Working Group. Verifiable credentials are an open-standards approach to representing digital credentials. The information in a verifiable credential can correspond to a physical credential, like a driver’s license, or to an electronic one, like ownership of a bank account. Verifiable credentials usually take the form of JSON objects that identify the holder of the credential and its issuer and carry cryptographic information that can be used to ensure the credential’s integrity. Verifiable credentials could be used in all kinds of identification scenarios, including authorizing access to voice applications.
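As a sketch, a verifiable credential for a driver’s license might look like the following JSON. The core property names (@context, type, issuer, issuanceDate, credentialSubject, proof) come from the W3C Verifiable Credentials Data Model; the issuer URL, the DriverLicenseCredential type, and the subject’s fields are hypothetical examples, and the proof is abbreviated:

```json
{
  "@context": ["https://www.w3.org/2018/credentials/v1"],
  "id": "http://example.edu/credentials/3732",
  "type": ["VerifiableCredential", "DriverLicenseCredential"],
  "issuer": "https://example.gov/issuers/565049",
  "issuanceDate": "2022-03-03T00:00:00Z",
  "credentialSubject": {
    "id": "did:example:ebfeb1f712ebc6f1c276e12ec21",
    "licenseNumber": "D1234567"
  },
  "proof": { "type": "Ed25519Signature2020" }
}
```

A voice application could accept such a credential during enrollment, verify the proof against the issuer’s public key, and then grant the holder access.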
You can find the latest verifiable credentials specification (published March 3, 2022) at https://www.w3.org/TR/vc-data-model/.
The W3C has a long history of promoting accessibility for web applications and includes an active Accessible Platform Architectures (APA) Working Group. Accessibility, of course, is an important use case for speech technology since speech recognition and text-to-speech (TTS) can make it possible for blind or visually impaired users to use the web. This working group has created a task force on spoken pronunciation that is working to ensure that written text is spoken correctly when read by programs like voice assistants or screen readers. The task force has recently published the specification “Spoken Presentation in HTML” (https://www.w3.org/TR/spoken-html/), which reviews markup strategies for enabling web pages to be pronounced as intended by their authors.
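The specification surveys several attribute-based approaches. As an illustrative sketch (the exact attribute syntax is still under discussion and may differ in the final document), a page could annotate a span so that a screen reader spells out an abbreviation character by character rather than attempting to pronounce it as a word:

```html
<!-- Hypothetical sketch of attribute-based pronunciation markup -->
<p>
  The accessibility community often abbreviates "accessibility" as
  <span data-ssml='{"say-as": {"interpret-as": "characters"}}'>A11Y</span>.
</p>
```

A TTS engine that honors the annotation would read the span as “A-one-one-Y” instead of guessing at a pronunciation.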
Web Neural Network
The Web Machine Learning Working Group is working at the intersection of the web and machine learning. On May 19, 2023, it published a draft specification for a browser application programming interface (API) that supports hardware acceleration of machine learning models. What does this mean in practice? Using this specification, a browser could support local, privacy-preserving neural net applications like computer vision, natural language processing, and speech processing without requiring any data to be transmitted to the cloud. This API also allows applications to take advantage of the local computer’s hardware to speed up processing. Beyond the advantage of local processing, having a standard, browser-based API means that neural net applications will be able to run locally on any web browser that supports the neural net API, as long as the host computer has enough processing power. The most recent specification can be found at https://www.w3.org/TR/webnn/.
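The flavor of the API can be seen in a minimal sketch like the one below. The names (navigator.ml, MLGraphBuilder) follow the draft specification, but the API is still evolving, so details such as the context options and tensor descriptors may change; the sketch feature-detects WebNN so it runs harmlessly outside a supporting browser.

```javascript
// Hedged sketch of the draft Web Neural Network (WebNN) API.
// Feature-detect so this is a no-op outside a supporting browser.
const hasWebNN = typeof navigator !== "undefined" && "ml" in navigator;

async function addTensors() {
  // Ask the browser for an ML context, preferring hardware acceleration.
  const context = await navigator.ml.createContext({ deviceType: "gpu" });
  const builder = new MLGraphBuilder(context);

  // Build a trivial graph that adds two 1x4 tensors: c = a + b.
  const desc = { dataType: "float32", dimensions: [1, 4] };
  const a = builder.input("a", desc);
  const b = builder.input("b", desc);
  const c = builder.add(a, b);
  const graph = await builder.build({ c });

  const inputs = {
    a: new Float32Array([1, 2, 3, 4]),
    b: new Float32Array([5, 6, 7, 8]),
  };
  const outputs = { c: new Float32Array(4) };
  const result = await context.compute(graph, inputs, outputs);
  return result.outputs.c; // element-wise sums: 6, 8, 10, 12
}

if (hasWebNN) {
  addTensors().then((c) => console.log("sum:", c));
} else {
  console.log("WebNN is not available in this environment.");
}
```

A real speech application would build a much larger graph (or load a model through a library layered on WebNN), but the pattern is the same: construct the graph once, then run it repeatedly on local audio data.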
Web of Things
The W3C is actively working on specifications for the Web of Things (WoT). This is an important use case for speech technology since speech enables users to control things in their environments without the need to touch them or use remote controls. The Web of Things Working Group has recently published several specs related to the WoT, including “Thing Description 1.1,” which describes the metadata and interfaces of entities that are part of the Web of Things. The group has also published an architecture specification for Web of Things entities and a discovery specification, which provides a way to find these entities in a distributed environment. You can find out more about the work of the WoT Working Group at https://www.w3.org/WoT/.
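To give a feel for “Thing Description 1.1,” here is a hedged sketch of a description for a voice-controllable lamp. The @context URI and the top-level members (title, securityDefinitions, properties, actions) come from the Thing Description specification; the device itself and its hostnames are hypothetical:

```json
{
  "@context": "https://www.w3.org/2022/wot/td/v1.1",
  "title": "KitchenLamp",
  "securityDefinitions": { "basic_sc": { "scheme": "basic" } },
  "security": "basic_sc",
  "properties": {
    "on": {
      "type": "boolean",
      "forms": [{ "href": "https://lamp.example.com/on" }]
    }
  },
  "actions": {
    "toggle": {
      "forms": [{ "href": "https://lamp.example.com/toggle" }]
    }
  }
}
```

A speech interface could use the discovery specification to find this description, then map an utterance like “turn on the kitchen lamp” to the lamp’s toggle action.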
Immersive Web
The goal of the Immersive Web Working Group is to bring high-performance virtual reality and augmented reality to the web. The group develops APIs to interact with virtual reality and augmented reality devices within web browsers. Last November, the Immersive Web Working Group published a new specification, the Augmented Reality Module. This publication can be found at https://www.w3.org/TR/webxr-ar-module-1/.
Web Audio
The Web Audio API, produced by the W3C Audio Working Group, defines methods for processing audio data within the web browser environment. This includes reading, writing, modifying, and synthesizing audio for games, music, educational applications, and accessibility; these methods could be used, for example, to slow down audio output for language learners in web browsers. The group’s most recent publication is the Web Audio API specification (https://www.w3.org/TR/webaudio/).
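The language-learner example can be sketched in a few lines. This is a minimal illustration, not a complete player: AudioContext and its node types are standard Web Audio API interfaces, but they exist only in browsers, so the sketch feature-detects the API; note also that adjusting playbackRate alone lowers the pitch along with the speed.

```javascript
// Hedged sketch: slow down playback for a language learner
// using the Web Audio API.
function playSlowed(ctx, audioBuffer, rate = 0.75) {
  const source = ctx.createBufferSource();
  source.buffer = audioBuffer;       // decoded audio to play
  source.playbackRate.value = rate;  // 0.75 = three-quarters speed
  source.connect(ctx.destination);   // route to the speakers
  source.start();
  return source;
}

// AudioContext exists only in browsers, so feature-detect it.
const webAudioSupported = typeof AudioContext !== "undefined";
if (webAudioSupported) {
  // In a browser: fetch audio, decode it with ctx.decodeAudioData,
  // then call playSlowed(new AudioContext(), decodedBuffer).
} else {
  console.log("Web Audio API is not available in this environment.");
}
```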
Web Speech API
This proposal by the Web Platform Incubator Community Group (WICG) provides an API for accessing speech recognition and TTS functionality in web browsers. The most recent draft of the specification is located at https://wicg.github.io/speech-api/.
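A short sketch shows both halves of the API: recognizing an utterance and speaking it back with TTS. SpeechRecognition, SpeechSynthesisUtterance, and speechSynthesis are the interfaces named in the draft (Chrome currently exposes recognition under the webkitSpeechRecognition prefix); the sketch feature-detects support so it is a no-op elsewhere.

```javascript
// Hedged sketch of the Web Speech API: recognize speech, then echo it via TTS.
// Feature-detect, since support (and prefixing) varies by browser.
const SpeechRecognitionImpl =
  typeof window !== "undefined"
    ? window.SpeechRecognition || window.webkitSpeechRecognition
    : undefined;

function listenAndEcho() {
  const recognition = new SpeechRecognitionImpl();
  recognition.lang = "en-US";
  recognition.interimResults = false;

  recognition.onresult = (event) => {
    const transcript = event.results[0][0].transcript;
    // Speak the recognized text back with the synthesis half of the API.
    speechSynthesis.speak(new SpeechSynthesisUtterance(transcript));
  };
  recognition.start();
}

if (SpeechRecognitionImpl) {
  listenAndEcho();
} else {
  console.log("Web Speech API is not available in this environment.");
}
```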
Putting Them Together
One of the most exciting things about web standards is the way they all can work in the same browser environment and build on each other’s capabilities in new combinations. We can easily see this by looking at the standards we’ve gone over in this column. For example, speech interaction with Web of Things devices could be built by combining the Web Audio API, the Web Neural Network API, and the Web of Things specifications. It would also be very natural to combine speech technology with the immersive web to provide a more natural and realistic way to interact with virtual environments than pressing buttons. What kinds of innovations can you come up with?
Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at firstname.lastname@example.org.