Speech Technology Magazine

 

Grammatically Speaking

Setting the standard for determining what a caller wants
By James A. Larson - Posted Mar 5, 2010

Conversational speech dialogue systems built with VoiceXML require developers to specify context-free grammars (CFGs) for all of the possible words or phrases a user might say in response to each prompt. CFGs are typically written in a grammar language, such as the W3C's Speech Recognition Grammar Specification (SRGS), the Java Speech Grammar Format (JSGF), or Nuance Communications' Grammar Specification Language. Sometimes these grammars are small, containing only a few words. Other times, they can be large and complex, such as those needed to denote dates and times. To make their platforms more attractive, VoiceXML platform vendors supply sets of reusable CFGs. However, these grammars often work only on the vendor's own platform, making applications difficult to port.
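To make the contrast concrete, here is a toy sketch (in Python, not an actual SRGS grammar) of what a small CFG does: it enumerates every acceptable phrase, and anything outside that list is rejected. The phrases and tags below are invented for illustration.

```python
# Illustrative sketch only: a CFG written in a grammar language such as
# SRGS lists every phrase the recognizer will accept. This toy matcher
# mimics that behavior for a small yes/no confirmation grammar.
YES_NO_GRAMMAR = {
    "yes": ["yes", "yeah", "yes please", "sure"],
    "no": ["no", "nope", "no thanks"],
}

def match(utterance, grammar=YES_NO_GRAMMAR):
    """Return the semantic tag for an in-grammar phrase, or None.
    Anything outside the listed phrases is rejected -- the brittleness
    that makes large CFGs (dates, times) hard to write by hand."""
    phrase = utterance.lower().strip()
    for tag, phrases in grammar.items():
        if phrase in phrases:
            return tag
    return None

match("yeah")        # -> "yes"
match("I guess so")  # -> None (out of grammar, so the recognizer fails)
```

A real SRGS grammar adds weights, rule references, and repeats, but the core idea is the same: the developer must anticipate every utterance in advance.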

For some situations, there is a promising alternative to CFGs: statistical language models (SLMs). Developers are beginning to use SLMs to map user utterances into a small number of categories, which makes them ideal for the first prompt in a dialogue system, such as How may I help you? The SLM maps any user utterance onto one of a small number of categories, such as departments, offices, or sections. The VoiceXML platform then routes the call to the department matching that category, or to a corresponding VoiceXML application, which continues the conversation to determine what the user needs.

Generally, developers follow these steps to create an SLM:

  1. capture user utterances (often numbering hundreds or thousands);
  2. manually tag each utterance with one of the target categories; and
  3. apply sophisticated statistical analysis routines that automatically create the SLM.

SLMs can often handle utterances that were never captured during training, using the results of the statistical analysis to map them to one of the target categories. For example, suppose the initial utterances Repair my Ford engine and My Taurus has broken down are both tagged with the category auto repair shop. The resulting SLM could then map the unseen utterances I need to have my Ford Taurus repaired, My Ford broke down, and The engine in my Taurus sounds ragged to that same target category. Developers don't need to specify complex CFGs to determine the topic category of a user utterance.
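As a rough illustration of the three steps above, here is a minimal sketch that uses a hand-rolled naive Bayes classifier as a stand-in for the "sophisticated statistical analysis routines" of step 3. Real SLM tooling is far more sophisticated and is trained on thousands of utterances; the training utterances here are the article's auto-repair examples plus invented banking filler.

```python
# Minimal sketch only: a bag-of-words naive Bayes classifier standing in
# for a real SLM training toolkit. Steps 1-2 (capture and hand-tag
# utterances) produce the (utterance, category) pairs fed to train().
import math
from collections import Counter, defaultdict

def tokenize(utterance):
    return utterance.lower().split()

def train(tagged_utterances):
    """Step 3: build per-category word statistics from tagged utterances."""
    word_counts = defaultdict(Counter)   # per-category word frequencies
    category_counts = Counter()          # how often each category was tagged
    vocab = set()
    for utterance, category in tagged_utterances:
        category_counts[category] += 1
        for word in tokenize(utterance):
            word_counts[category][word] += 1
            vocab.add(word)
    return word_counts, category_counts, vocab

def classify(model, utterance):
    """Map any utterance -- even one never seen in training -- to the
    most probable target category (Laplace-smoothed log probabilities)."""
    word_counts, category_counts, vocab = model
    total = sum(category_counts.values())
    best_category, best_score = None, float("-inf")
    for category in category_counts:
        score = math.log(category_counts[category] / total)
        denom = sum(word_counts[category].values()) + len(vocab)
        for word in tokenize(utterance):
            score += math.log((word_counts[category][word] + 1) / denom)
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Steps 1-2: captured utterances, hand-tagged with target categories.
tagged = [
    ("repair my ford engine", "auto repair shop"),
    ("my taurus has broken down", "auto repair shop"),
    ("transfer funds between my accounts", "banking"),
    ("i want to open a savings account", "banking"),
]
model = train(tagged)

# The model generalizes to utterances it never saw during training.
print(classify(model, "my ford broke down"))        # auto repair shop
print(classify(model, "move funds to my account"))  # banking
```

The classifier never "understands" either utterance; it only picks the category whose word statistics best match, which is exactly the strength and the limitation discussed below.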

SLMs can be problematic, though. They are expensive, complex, and notoriously time-consuming to design and deploy because of the thousands of utterances that must be collected and tagged. They can also be tricky to update when developers need to add, replace, or remove categories: it may be necessary to capture additional utterances, tag them, revise the tags on existing utterances, and run the statistical routines again.

SLMs are not perfect. Sometimes the user is routed to the wrong category. The conversational system then has to recognize that the user has reached the wrong department or application and determine where the call should go instead.

A Lighter Workload 

SLMs work because they don't need to understand all of the words in an utterance. All they need to do is identify the topic category that the user wants. SLMs are like a telephone operator who can route you to the appropriate department, but can't tell you if the Black & Decker toaster is on sale or help you transfer funds between accounts. To do these tasks, a language processor would have to produce a detailed semantic representation of the user's request, which SLMs don't do.

In VoiceXML, semantic representations can be constructed for utterances processed with CFGs by using Semantic Interpretation for Speech Recognition (SISR). However, sophisticated semantic interpretation is very complex, and few VoiceXML applications use SISR to analyze spoken input except to identify simple commands and their operands.

Do SLMs support natural language? Some vendors say yes; it is very natural to respond to the How may I help you? prompt with any reasonable statement and have a high probability of being routed to the right place. But SLMs can’t produce a semantic representation of the user’s utterance; they have no understanding of what the user desires. Without understanding, real natural language processing can’t occur. 

While some vendors say SLMs support natural language, I suggest they support only part of what most users consider natural. Vendors who claim that their SLM-based systems support full natural language are misleading customers.

SLMs are more than just a passing fancy. They provide a real solution to the difficult problem of routing calls. But without producing a semantic representation of user utterances, there is no understanding of what users want. With no understanding, there is no natural language.


Jim Larson, Ph.D., is a speech application consultant, co-chair of the World Wide Web Consortium’s Voice Browser Working Group, and program chair for SpeechTEK 2010 and SpeechTEK Europe 2010. He can be reached at jim@larson-tech.com.
