November 10, 2014
By Leonard Klie, Editor, Speech Technology and CRM magazines
Features

HTML5 Is Live

says Glen Shires, a software engineer and speech specialist at Google and one of the driving forces behind the Web Speech API.

Shires adds that the Web Speech API “provides great flexibility in configuration and usage of speech recognition and speech synthesis, and it provides developers the freedom to define and implement custom [voice-only and multimodal] user interfaces.”

It also allows programmers to decide whether to use a browser’s default speech engine or select from other available engines. “The Web Speech API provides very direct use of all of a computer’s speech recognizers,” Burnett adds.

The Web Speech API will have very broad appeal for use in voice Web search; voice command and control of applications and devices; collecting domain-specific inputs, including those where later inputs depend on those given earlier; continuous recognition of open dialogues, such as dictating an email; and filling in Web forms. Other use cases will include translations, reading incoming email messages and dictating outgoing ones, dialogue systems, multimodal interactions, and driving directions.

The W3C Speech API Community Group published the Web Speech API in October 2012, largely at Google’s behest. A few months later, Google threw its weight behind the Web Speech API, which developers can now use to integrate speech recognition capabilities into their Chrome and Android Web apps in more than 30 languages.

Shires notes that Google uses the Web Speech API for voice input with its search and translation engines. Google Chrome users can also fill in text fields using speech. When speech input is enabled, a small microphone icon appears on the screen. Clicking on this icon will launch a small tool to show that voice recording is taking place.

The one caveat, though, is that the browser still requires an external service to handle speech-to-text conversion, so users need an Internet connection for speech input to work. Developers should keep this in mind if they plan for their Web applications to work offline.
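One way to account for this caveat is to check for both API support and connectivity before offering speech input. The sketch below is illustrative only. The helper name canUseSpeechInput and its return shape are hypothetical, the webkit-prefixed constructor is the name Chrome has used for its implementation, and navigator.onLine reports only whether the browser believes it has a network connection, not whether the recognition service itself is reachable.

```javascript
// Hypothetical helper: decides whether a page should offer a microphone button.
// `env` defaults to the browser globals; passing it in keeps the check testable.
function canUseSpeechInput(env = globalThis) {
  const Recognition = env.SpeechRecognition || env.webkitSpeechRecognition;
  if (typeof Recognition !== "function") {
    return { ok: false, reason: "unsupported" };
  }
  // onLine === false means the browser knows it has no network connection.
  if (env.navigator && env.navigator.onLine === false) {
    return { ok: false, reason: "offline" };
  }
  return { ok: true, reason: null };
}
```

A page could call canUseSpeechInput() at load time and again on the browser's online and offline events, hiding the microphone control whenever ok is false.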

Some Smaller APIs

A subset of the Web Speech API is the Speech Synthesis API, which will allow Web sites to speak or, more specifically, read aloud their content. The number of voices and languages available will depend on the browser used: Google’s Chrome browser, for example, currently supports nine languages and several male and female voices.
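Because the available voices differ from browser to browser, an application can ask at run time what is installed. The following is a minimal sketch, assuming the standard speechSynthesis.getVoices() method; the function name voicesByLanguage is made up for illustration.

```javascript
// Groups the voices a browser reports by language tag (for example "en-US").
// `synth` defaults to the browser's speechSynthesis object when one exists.
function voicesByLanguage(synth = globalThis.speechSynthesis) {
  const groups = {};
  if (!synth) return groups; // speech synthesis is not available here
  for (const voice of synth.getVoices()) {
    if (!groups[voice.lang]) groups[voice.lang] = [];
    groups[voice.lang].push(voice.name);
  }
  return groups;
}
```

In some browsers, Chrome among them, getVoices() can return an empty list until the voiceschanged event has fired, so a real page would typically call this function again from that event's handler.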

Programmers can also adjust the speed, pitch, and tone of the voice directly in code. Other controls, such as volume, start, stop, and pause, can likewise be built into the application itself.

With the Speech Synthesis API, it takes only two lines of code to get Web apps talking to users. All the programmer needs to do is create a new instance of SpeechSynthesisUtterance, pass in the text to be spoken, and hand the object to the browser's speechSynthesis.speak() method. This utterance object also contains information about how the text should be spoken, including the language, voice, pitch, volume, and speed (called rate in the API).
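In practice, those two lines look like the comment at the top of the sketch below. The wrapper underneath it is a hypothetical convenience function, not part of the API; it sets the standard lang, rate, pitch, and volume properties of the utterance (rate is the API's name for speed) and takes the browser globals as a parameter so that it can be exercised outside a browser.

```javascript
// In a browser, the bare minimum is:
//   const utterance = new SpeechSynthesisUtterance("Hello, world");
//   speechSynthesis.speak(utterance);

// Hypothetical wrapper that also sets how the text should be spoken.
function say(text, { lang = "en-US", rate = 1, pitch = 1, volume = 1 } = {}, env = globalThis) {
  const utterance = new env.SpeechSynthesisUtterance(text);
  utterance.lang = lang;     // BCP 47 language tag
  utterance.rate = rate;     // speaking speed; 1 is normal
  utterance.pitch = pitch;   // 0 to 2; 1 is normal
  utterance.volume = volume; // 0 (silent) to 1 (full volume)
  env.speechSynthesis.speak(utterance); // queues the utterance for playback
  return utterance;
}
```

A call such as say("Welcome back", { rate: 1.2 }) would queue the greeting slightly faster than normal. Playback can then be controlled with speechSynthesis.pause(), resume(), and cancel().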

Developers can choose to visually highlight the words or phrases that the application is synthesizing or coordinate the synthesis with animations, such as an avatar speaking.
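Highlighting of this kind can be driven by the utterance's boundary event, which reports the character position the synthesizer has reached. The sketch below is one possible approach under stated assumptions: wordAt and trackWords are hypothetical helper names, and support for boundary events varies by browser and voice.

```javascript
// Returns the run of non-space characters starting at `charIndex` in `text`.
// Trailing punctuation, if any, stays attached to the word.
function wordAt(text, charIndex) {
  const match = /^\S+/.exec(text.slice(charIndex));
  return match ? match[0] : "";
}

// Attaches a boundary listener that reports each word as it is reached.
function trackWords(utterance, onWord) {
  utterance.onboundary = (event) => {
    // Boundary events are named "word" or "sentence"; only words matter here.
    if (event.name === "word") {
      onWord(wordAt(utterance.text, event.charIndex), event.charIndex);
    }
  };
  return utterance;
}
```

The onWord callback receives the word and its offset, which is enough to wrap the matching span of on-screen text in a highlight or to cue an avatar's animation.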

Via the associated HTML5 Speech Recognition API, programmers can create situations where JavaScript will have access to a browser’s audio stream and be able to convert it to text. The user will be asked to allow the app access to his computer’s microphone. If he allows access, he can start talking, and when he stops, the results of the speech capture will be made available as a JavaScript object.
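The flow described above can be sketched as follows. This is a simplified illustration rather than a complete implementation: the helper name listen and its options are hypothetical, and the code falls back to the webkit-prefixed constructor that Chrome has used. The lang, continuous, and interimResults properties and the result and error events are part of the Web Speech API.

```javascript
// Hypothetical helper: starts a recognition session and passes each final
// transcript to `onText`. Returns the recognizer so the caller can stop() it.
function listen(onText, { lang = "en-US", continuous = false } = {}, env = globalThis) {
  const Recognition = env.SpeechRecognition || env.webkitSpeechRecognition;
  if (!Recognition) throw new Error("Speech recognition is not supported here");

  const recognizer = new Recognition();
  recognizer.lang = lang;
  recognizer.continuous = continuous; // keep listening after the first result
  recognizer.interimResults = false;  // deliver final results only

  recognizer.onresult = (event) => {
    // event.results accumulates; resultIndex marks the first new entry.
    for (let i = event.resultIndex; i < event.results.length; i++) {
      const result = event.results[i];
      if (result.isFinal) onText(result[0].transcript, result[0].confidence);
    }
  };
  recognizer.onerror = (event) => console.error("Recognition error:", event.error);

  recognizer.start(); // the browser asks for microphone permission at this point
  return recognizer;
}
```

Setting continuous to true suits open-ended dictation, while the default of false fits single commands or form fields, where recognition should end once the user stops talking.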

A WebRTC World

Another companion piece to HTML5 is Web Real-Time Communications (WebRTC), an open-source project that enables video and voice communications between computers over the Internet.

WebRTC seeks to embed real-time voice, text, and video communications directly into Web browsers. End users do not have to download special software or even use the same Web clients or browser plug-ins to communicate directly with one another.

The WebRTC specification allows Web apps access to the microphones and cameras on computers or mobile devices. “It is important for Web apps that want to use speech recognition, because they do not need special plug-ins that give them access to the audio and video,” Burnett explains. “It allows for communication between devices. It allows Web applications to do something
