Standards Hit Speech on the Web and Voice Biometrics
The American National Standards Institute (ANSI) and the World Wide Web Consortium (W3C) each spent a busy summer working on standards that will affect the global market for speech.
ANSI was the first to come out with a speech-related standard when it published INCITS 456: Speaker Recognition Format for Raw Data Interchange (SIVR-1), which governs the type and format of data that should be included with shared audio files used for speaker identification and verification (SIV).
According to Judith Markowitz, president of J. Markowitz Consultants and editor of the standard, that data includes the bandwidth used to make the recording; date and time of the recording; type of channel that was used to record the data, such as a wireless or landline phone; information about the speaker, such as gender, age, language, and accent; input device used; security used, such as the type of encryption; and sampling rate. “The standard is for creating a way of describing the data so that one organization or part of an organization can effectively communicate with another and share data,” Markowitz explains.
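To picture what such a shared record might contain, consider the following sketch. The element names and values here are invented for illustration only; they are not drawn from the published INCITS 456 schema, which defines its own XML structure for this metadata.

```xml
<!-- Hypothetical record: element names are illustrative,
     not the actual SIVR-1 schema -->
<SIVRecord>
  <Recording>
    <DateTime>2010-06-14T09:30:00Z</DateTime>
    <Channel>wireless</Channel>           <!-- e.g., wireless or landline -->
    <Bandwidth unit="Hz">3400</Bandwidth>
    <SamplingRate unit="Hz">8000</SamplingRate>
    <InputDevice>mobile handset</InputDevice>
    <Security encryption="AES-256"/>
  </Recording>
  <Speaker>
    <Gender>female</Gender>
    <Age>34</Age>
    <Language>en-US</Language>
    <Accent>midwestern</Accent>
  </Speaker>
</SIVRecord>
```

Whatever the exact schema, the point of the standard is that both the sending and receiving organizations agree on where each of these facts lives in the file, so the audio arrives with its context intact.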
The data format is generic: it can be applied across a wide range of application areas in which SIV is performed, whether by automated systems or by human examiners, and it is intended to be vendor-neutral. The standard's XML orientation also reflects how thoroughly XML-based formats, VoiceXML among them, have permeated speech processing.
Among the organizations that stand to benefit most from the standard are military and intelligence operations, law enforcement, and security, though it could also prove useful for financial services and telecommunications organizations that share data about suspected fraud and known fraudsters.
“There’s a lot of sharing now between state, local, and federal law enforcement agencies, and when there’s a threat, this information is useful to have,” Markowitz says.
The standard, according to Markowitz, does not apply to real-time data, but rather to recordings that are shared “after the fact.”
A few weeks after the release of the ANSI standard, the W3C released its latest version of Speech Synthesis Markup Language (SSML), which extends speech on the Web to an enormous new market by improving support for Asian languages and multilingual voice applications. The SSML 1.1 recommendation, released September 7, provides control over voice selection as well as such speech characteristics as pronunciation, volume, and pitch.
SSML is part of W3C’s Speech Interface Framework for building voice applications, which also includes VoiceXML and the Pronunciation Lexicon Specification (PLS), which gives speech engines guidance on proper pronunciation.
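A pronunciation lexicon in this framework is itself a small XML document. The sketch below follows the PLS 1.0 recommendation's structure; the particular word and its IPA transcription are chosen here purely as an example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- A minimal PLS 1.0 lexicon: maps a written form (grapheme)
     to a pronunciation (phoneme) in the IPA alphabet -->
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>tomato</grapheme>
    <phoneme>təˈmeɪtoʊ</phoneme>
  </lexeme>
</lexicon>
```

An SSML document can then reference a lexicon like this one by URI, letting the synthesis engine apply the supplied pronunciations when it encounters the listed words.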
“With SSML 1.1 there is an intentional focus on Asian language support, including Chinese languages, Japanese, Thai, Urdu, and others, to provide a wide deployment potential,” says Dan Burnett, co-chair of the Voice Browser Working Group, director of speech technologies and standards at Voxeo, and co-author of the standard. “With SSML 1.0 we already had strong traction in North America and Western Europe, so this focus makes SSML 1.1 incredibly strong globally.”
The multilingual enhancements in this version of SSML result from discussions at W3C workshops held in China, Greece, and India.
SSML 1.1 also extends TTS control to more parameters. Trimming attributes on the audio element, for example, enable different extracts of prompts or audio files to be rendered according to context; a languages attribute on the voice element allows any voice to speak any language; and lexicon activation and deactivation supports the use of multiple, potentially conflicting lexicons in different contexts.
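Several of these 1.1 features can be seen together in a short document. The markup below uses elements and attributes from the SSML 1.1 recommendation (lexicon with lookup for scoped lexicon activation, the languages attribute on voice, the lang element, prosody, and clipBegin/clipEnd trimming on audio); the URIs and spoken text are placeholders.

```xml
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- Declare an external pronunciation lexicon; it is applied
       only inside the lookup element below -->
  <lexicon uri="http://www.example.com/names.pls" xml:id="names"/>

  <!-- Request a voice declared as capable of Japanese, then mark
       the span's language without forcing a further voice change -->
  <voice languages="ja">
    <lang xml:lang="ja">こんにちは</lang>
  </voice>

  <prosody volume="loud" pitch="high">Welcome back!</prosody>

  <!-- clipBegin/clipEnd render only the three-second extract needed
       in this context, with fallback text if the file is unavailable -->
  <audio src="http://www.example.com/prompt.wav"
         clipBegin="2s" clipEnd="5s">
    Sorry, the recording is unavailable.
  </audio>

  <lookup ref="names">Dr. Nguyen will see you now.</lookup>
</speak>
```

Because the lexicon is activated only within the lookup span, an application can hold several lexicons with conflicting entries and switch among them as context demands, which is the scenario the specification's activation mechanism was designed for.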
“SSML is an important part of the overall ecosystem of W3C standards enabling speech across a variety of applications,” Burnett says. “SSML, in particular, provides a key way to render richer, more natural-sounding speech.”