November 9, 2006
By James A. Larson, program co-chair, SpeechTEK 2021
Forward Thinking

Toward Natural Language Processing

When many people hear the words "natural language," they immediately think of Star Trek's android, Data, who speaks and understands everyday English. Some software vendors, however, describe anything beyond discrete speech recognition (in which users must pause after each word) as "natural." Here are four technologies that are more natural than discrete speech recognition:

System-directed dialogues: Carefully crafted prompts encourage users to answer with one of a small number of specified words or phrases. If users fail to respond appropriately, error handlers steer them toward an acceptable response by rephrasing the question and providing hints and examples. To users, the dialogue seems somewhat natural, much like a teacher helping a student perform a difficult task. This approach is supported by VoiceXML 2.0 and is widely used.
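The reprompting behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not VoiceXML: the prompts, the allowed words, and the `recognize` callback (a stand-in for a speech recognizer) are all assumptions made for the example.

```python
from typing import Callable, List, Optional

# Hypothetical prompts. Each retry rephrases the question or adds a hint.
PROMPTS = [
    "Would you like checking or savings?",
    "Please say either checking or savings.",
    "For example, say 'checking' to hear your checking balance.",
]
ALLOWED = {"checking", "savings"}


def run_directed_dialogue(recognize: Callable[[str], str]) -> Optional[str]:
    """Play each prompt in turn until the caller says an allowed word.

    Returns the accepted word, or None if every prompt has been used.
    """
    for prompt in PROMPTS:
        heard = recognize(prompt).strip().lower()
        if heard in ALLOWED:
            return heard
    return None


def scripted_caller(replies: List[str]) -> Callable[[str], str]:
    """Simulate a caller who gives the listed replies, one per prompt."""
    remaining = iter(replies)
    return lambda prompt: next(remaining)


# The caller misses on the first prompt and succeeds on the rephrased one.
print(run_directed_dialogue(scripted_caller(["um, my money", "savings"])))
```

A real application would hand off to an operator or another dialogue when the function returns None instead of simply giving up.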

Statistical grammars: Users respond to the prompt, "How may I help you?" with a phrase, which the automatic speech recognition (ASR) system compares to statistical representations of thousands of phrases previously spoken in response to the same prompt. The ASR system assigns the user's request to one of a small number of categories. The user then engages in a traditional, system-directed dialogue specific to that category. To the user, the dialogue seems natural, much like responding to a telephone operator who asks, "How may I help you?" A special extension to VoiceXML 2.0 is needed to use the "How may I help you?" approach, which can be expensive to implement because of the large number of user utterances that must be collected and analyzed.
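To make the categorization step concrete, here is a toy sketch in Python. It substitutes a small word-count (naive Bayes) classifier over already-transcribed text for whatever statistical model a real ASR product uses, and the nine training utterances and three categories are invented for the example. A production system would be trained on thousands of collected utterances.

```python
import math
from collections import Counter, defaultdict

# Hypothetical transcribed responses to "How may I help you?"
TRAINING = [
    ("i want to check my balance", "balance"),
    ("how much money is in my account", "balance"),
    ("what is my account balance", "balance"),
    ("i need to pay my bill", "payment"),
    ("make a payment on my bill", "payment"),
    ("pay what i owe", "payment"),
    ("i lost my card", "card"),
    ("my card was stolen", "card"),
    ("replace my card", "card"),
]


def train(examples):
    """Count words per category and utterances per category."""
    word_counts = defaultdict(Counter)
    category_counts = Counter()
    vocabulary = set()
    for text, category in examples:
        words = text.split()
        word_counts[category].update(words)
        category_counts[category] += 1
        vocabulary.update(words)
    return word_counts, category_counts, vocabulary


def classify(text, model):
    """Return the category with the highest smoothed log-probability."""
    word_counts, category_counts, vocabulary = model
    total = sum(category_counts.values())
    best, best_score = None, -math.inf
    for category, count in category_counts.items():
        score = math.log(count / total)
        denominator = sum(word_counts[category].values()) + len(vocabulary)
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((word_counts[category][word] + 1) / denominator)
        if score > best_score:
            best, best_score = category, score
    return best


model = train(TRAINING)
print(classify("someone stole my card", model))
```

Once the category is known, the application can branch to the system-directed dialogue written for that category.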

Mixed-initiative dialogues: Using special techniques available in VoiceXML 2.0, implementers can turn system-directed dialogues into dialogues in which the user takes more control. For example, the user may interrupt a prompt before it has finished playing by speaking the answer (barge-in), which speeds up the dialogue for experienced users who are already familiar with the prompts.

Users may also respond to prompts that have not yet been presented. Using an advanced VoiceXML technique called "form-level grammars," users speak values for multiple fields of a VoiceXML form in a single utterance. This can dramatically speed up the dialogue. If the user fails to provide a value for a specific field, the dialogue manager presents the prompt associated with that field, in effect reverting to a system-directed dialogue for that field. Although supported by VoiceXML 2.0, mixed-initiative dialogues are not yet widely used because they can be difficult to implement and slow to process.
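The fill-what-you-can, then prompt-for-the-rest behavior can be sketched as follows. This Python example is an assumption-laden illustration rather than VoiceXML: regular expressions stand in for a form-level grammar, and the travel form, its field names, and its prompts are hypothetical.

```python
import re

CITIES = "boston|denver|chicago"
DAYS = "monday|tuesday|wednesday|thursday|friday|saturday|sunday"

# Hypothetical form: each field has the prompt played while it is empty.
FIELD_PROMPTS = {
    "origin": "Which city are you leaving from?",
    "destination": "Which city are you flying to?",
    "day": "What day do you want to travel?",
}

# Stand-in for a form-level grammar: every pattern is tried on every utterance.
FIELD_PATTERNS = {
    "origin": re.compile(rf"\bfrom ({CITIES})\b"),
    "destination": re.compile(rf"\bto ({CITIES})\b"),
    "day": re.compile(rf"\b(?:on )?({DAYS})\b"),
}


def fill_fields(utterance, fields):
    """Fill every still-empty field that the utterance supplies a value for."""
    text = utterance.lower()
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match and name not in fields:
            fields[name] = match.group(1)
    return fields


def next_prompt(fields):
    """Return the prompt for the first empty field, or None when complete."""
    for name, prompt in FIELD_PROMPTS.items():
        if name not in fields:
            return prompt
    return None


fields = fill_fields("I want to fly from Boston to Denver", {})
print(fields)               # two fields filled by one utterance
print(next_prompt(fields))  # the dialogue falls back to prompting for the day
```

One utterance fills two fields here, so only the third field is prompted for individually.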

Complex grammars and semantic interpretation: Grammars may contain special instructions called "semantic interpretation" instructions. These instructions examine the text returned by the ASR system to extract and transform specific words.

For example, consider the following sentence: "Move xyz.doc into History." Semantic interpretation instructions extract the word "move" and place it into a field called Command, extract the word "xyz.doc" and place it into a field called Source, and finally extract the word "History" and place it into a field called Target.

Now consider the sentence, "Place xyz in the History folder." The word "place" is translated into "move," which is placed into the Command field; "xyz" is translated into "xyz.doc" and placed into the Source field; and "History folder" is translated into "History" and placed into the Target field. Using this approach, users are given wide flexibility in how they structure their words and sentences.
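The two examples above can be reproduced with a small Python sketch. It is not the tag syntax used inside a real speech grammar; a single regular expression plays the role of the grammar, and the synonym and file-name tables are assumptions chosen to match the two sentences.

```python
import re

# Hypothetical lookup tables used to normalize what the user said.
COMMAND_SYNONYMS = {"move": "move", "place": "move", "put": "move"}
KNOWN_FILES = {"xyz": "xyz.doc", "xyz.doc": "xyz.doc"}

# Stand-in grammar: <verb> <source> into|in [the] <target> [folder]
PATTERN = re.compile(
    r"^(?P<verb>\w+)\s+(?P<source>\S+)\s+(?:into|in)\s+"
    r"(?:the\s+)?(?P<target>\w+)(?:\s+folder)?\.?$",
    re.IGNORECASE,
)


def interpret(sentence):
    """Return the Command, Source, and Target fields, or None if no match."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    verb = match.group("verb").lower()
    if verb not in COMMAND_SYNONYMS:
        return None
    source = match.group("source")
    return {
        "Command": COMMAND_SYNONYMS[verb],                  # "place" becomes "move"
        "Source": KNOWN_FILES.get(source.lower(), source),  # "xyz" becomes "xyz.doc"
        "Target": match.group("target"),                    # "folder" is dropped
    }


print(interpret("Move xyz.doc into History."))
print(interpret("Place xyz in the History folder."))
```

Both sentences yield the same three field values, which is the point of semantic interpretation: differently worded requests are reduced to one representation the application can act on.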

A topic of ongoing research is how to extend this approach to handle a variety of complex utterances, including the following:

References to objects or actions spoken in previous sentences: In the request, "Rename the file abc.doc as xyz.doc. Move it into the History folder," the word "it" refers to xyz.doc, not to abc.doc or the History folder.
Disambiguation of words with multiple meanings: For example, in "Move xyz.doc into History," the word "History" refers to a folder, not a file.
User corrections: For example, if the user says, "Move def.doc, err, xyz.doc into the History folder," then xyz.doc replaces def.doc in the Source field.
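As a narrow illustration of the last item only, the Python sketch below removes a single corrected word before interpretation. It assumes the transcription marks the correction with commas and uses one of a few hypothetical editing terms. Real spoken corrections are far less regular, which is why this remains a research topic.

```python
import re

# Hypothetical editing terms that signal "replace the word I just said."
EDIT_TERMS = r"(?:err|er|uh|um|i mean)"

# A word, a comma, an editing term, a comma: drop the word and the term.
CORRECTION = re.compile(rf"\S+,\s*{EDIT_TERMS},\s*", re.IGNORECASE)


def apply_corrections(utterance: str) -> str:
    """Remove each corrected word together with its editing term."""
    return CORRECTION.sub("", utterance)


print(apply_corrections("Move def.doc, err, xyz.doc into the History folder"))
```

The cleaned sentence can then be passed to the normal semantic interpretation step, so that xyz.doc rather than def.doc ends up in the Source field.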

Embedding semantic interpretation instructions into complex grammars is both time-consuming and tedious. Dialogue designers attempt to do this only for specific tasks in specific domains in which the grammar and semantic interpretation complexity can be managed.

Recommendation

Many techniques can be called natural language. When vendors claim that their products support natural language, ask them precisely what they mean. System-directed dialogues with barge-in may provide just enough natural language and are easy to implement. There may not be a strong need for more sophisticated types of natural language dialogue. Proceed with caution when jumping on the natural language bandwagon.

James A. Larson is co-chair of the SpeechTEK Conference and Exposition and is the author of the home study guide and reference "The VXML Guide" (www.vxmlguide.com). He can be reached at jim@larson-tech.com.
