Tomorrow's Technologies, Today's Applications
Innovative speech applications require new speech technologies. Our current speech recognition technologies, context-free grammars and statistical language models, have served us well for today's voice recognition applications. But these technologies limit our thinking and the kinds of speech applications that are possible.
In addition to using context-free grammars and statistical language models to convert speech to text, two basic technologies will greatly extend the capabilities of speech applications: natural language understanding algorithms, which derive meaning from the word sequences produced by speech recognition engines, and semantic representation languages, which represent the meaning, not just the words, spoken by one or more users.
The World Wide Web Consortium’s Semantic Interpretation for Speech Recognition (SISR) language is already widely used within VoiceXML applications to convert sets of synonyms (such as yes, affirmative, sure, or of course) to a single representation that means "yes." SISR is also being used to convert complex user utterances, such as Move $200 from checking to savings, to a format that represents the meaning.
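A minimal SRGS grammar sketch (the rule name and file layout here are illustrative, not from any particular application) shows how SISR tags collapse a set of synonyms into the single semantic value "yes":

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US"
         root="affirm" tag-format="semantics/1.0">
  <rule id="affirm" scope="public">
    <one-of>
      <item>yes</item>
      <item>affirmative</item>
      <item>sure</item>
      <item>of course</item>
    </one-of>
    <!-- Whichever phrase matched, the semantic result is simply "yes" -->
    <tag>out = "yes";</tag>
  </rule>
</grammar>
```

The application sees only the value of `out`, so the dialogue logic never needs to enumerate the synonyms itself.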
In effect, SISR parses uttered phrases into sophisticated semantic representations useful for processing by the underlying application. This is the academic meaning of the term "natural language processing," not the one that has been corrupted by marketers to mean using statistical language models and clever dialogue specification techniques to make a dialogue appear natural.
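For a complex utterance such as Move $200 from checking to savings, SISR tags can assemble a structured result rather than a single value. A sketch (the rule names, and the `amount` and `account` subrules it references, are assumed to be defined elsewhere in the grammar):

```xml
<rule id="transfer" scope="public">
  <item>move</item>
  <ruleref uri="#amount"/>  <tag>out.amount = rules.amount;</tag>
  <item>from</item>
  <ruleref uri="#account"/> <tag>out.source = rules.account;</tag>
  <item>to</item>
  <ruleref uri="#account"/> <tag>out.target = rules.account;</tag>
  <!-- The finished result is a frame: action, amount, source, target -->
  <tag>out.action = "transfer";</tag>
</rule>
```

In SISR, `rules.account` refers to the most recently matched occurrence of that rule, which is why each `<tag>` is placed immediately after the `<ruleref>` whose value it captures.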
Other W3C languages can augment SISR to extract the meaning from sets of utterances. The Resource Description Framework (RDF) and the Web Ontology Language (OWL) describe words and their relationships. This information can be used to locate additional information related to the words recognized in user utterances. For example, if the user says checking, an OWL ontology could provide additional information indicating that checking is a type of account.
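Such a relationship can be stated in a few lines of OWL. A sketch in RDF/XML (the example.org URIs and class names are hypothetical):

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://example.org/banking#Account"/>
  <owl:Class rdf:about="http://example.org/banking#CheckingAccount">
    <!-- "checking" denotes a kind of account -->
    <rdfs:subClassOf rdf:resource="http://example.org/banking#Account"/>
  </owl:Class>
</rdf:RDF>
```

Given this ontology, an application that recognizes the word checking can infer that the user is talking about an account even though the word account was never spoken.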
Extensible MultiModal Annotation (EMMA) represents information derived from various modes, like voice, pen, or keyboard, so the information can be combined and processed. For example, in a multimodal environment, the user could say Transfer $200 from, click the checking icon, say the word to, and then click the savings icon. EMMA represents the first click as checking account and the second click as savings account. From this, a general-purpose unification algorithm would derive the semantic representation.
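A sketch of how the spoken fragment and the two clicks from that example might be grouped in EMMA before unification (the application payload elements and the particular medium/mode values are illustrative):

```xml
<emma:emma version="1.0"
           xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:group id="transfer-turn">
    <!-- Partial meaning from the spoken fragment "Transfer $200 from ... to ..." -->
    <emma:interpretation id="speech1" emma:medium="acoustic" emma:mode="voice">
      <action>transfer</action>
      <amount>200</amount>
    </emma:interpretation>
    <!-- First click, on the checking icon -->
    <emma:interpretation id="click1" emma:medium="tactile" emma:mode="gui">
      <source>checking</source>
    </emma:interpretation>
    <!-- Second click, on the savings icon -->
    <emma:interpretation id="click2" emma:medium="tactile" emma:mode="gui">
      <target>savings</target>
    </emma:interpretation>
  </emma:group>
</emma:emma>
```

Because every partial interpretation carries its medium and mode, a unification algorithm can merge the three fragments into one complete transfer request without caring which modality supplied each slot.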
Semantic representations are important because they represent the meaning of sets of utterances. From the meaning, algorithms can be applied to derive the following useful representations of the semantic information:
- Summarize—generate an abstract or summary for the content of utterances of one or more speakers;
- Search—select segments of a set of utterances that are relevant to specific search criteria;
- Personalize—tailor the contents of a set of utterances to the specific needs and desires of the user, including simplifying the language for easier understanding;
- Question/answer—identify information from a set of utterances that is relevant for a specific query; and
- Translate—convert an utterance spoken in one language to another language.
These technologies will enable future speech applications beyond today’s elementary interactive voice response (IVR) and voice search capabilities. Imagine speaking into a telephone and receiving a summary of an important news event; asking about blister rust, a disease that attacks trees, and listening to the cause and cure for it; listening to the contents of a legal contract expressed in simple, everyday English; asking for a document about Shakespearean sonnets on your computer that you wrote three years ago; or verbally exchanging information with a college student who does not speak your language.
Much work still needs to be done to transfer these technologies from the lab to commercial applications. They depend on natural language understanding and meaning representation. Let’s stop kidding ourselves that we have natural language processing today. Until we address the representation of what a user means, not just the words he says, we will be limited by the IVR and voice-search dialogues that respond to what he says, not what he means.
James Larson, Ph.D., is co-chair of the World Wide Web Consortium's Voice Browser Working Group and author of the home-study guide The VoiceXML Guide (www.vxmlguide.com). He can be reached at firstname.lastname@example.org.