January 25, 2008
By James A. Larson, program co-chair, SpeechTEK 2008
Forward Thinking

Escaping from Directed Dialogues

Most of today’s IVR dialogues are system-directed, where the system asks questions that users answer. Novice users might be comfortable with directed dialogues, but experienced users often find this style tedious and time-consuming. The answer to serving both user types could be mixed-initiative dialogues that allow users to take charge and speak commands, but many designers do not use the form-level grammars and other techniques required to implement them.

Call routing is a special type of automatic classification system that prompts users with a general phrase, such as "How may I help you?" The classification system maps each user utterance to one of several predefined destinations based on a statistical model derived from a corpus of typical user phrases and their corresponding categories.

Every automated classification system must be trained by analyzing a corpus of typical phrases and their corresponding categories. Developers collect thousands of user responses to the prompt "How may I help you?" and annotate each response with an appropriate category. Statistical algorithms analyze the phrases and annotations and create a statistical model of how to map the phrases to categories. The automated classification system then uses this statistical model to transform user-spoken phrases into categories.
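
The training procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the corpus is tiny and invented, the category names (billing, tech_support, cancellation) are hypothetical, and the statistical algorithm is a naive Bayes model with add-one smoothing, chosen here for brevity rather than because any particular call router uses it.

```python
import math
from collections import Counter, defaultdict

# Hypothetical annotated corpus: (caller utterance, routing category).
# A real corpus would contain thousands of transcribed responses.
CORPUS = [
    ("i want to check my account balance", "billing"),
    ("how much do i owe on my bill", "billing"),
    ("i have a question about my bill", "billing"),
    ("my internet connection is not working", "tech_support"),
    ("the modem keeps dropping the connection", "tech_support"),
    ("my phone is not working", "tech_support"),
    ("i want to cancel my service", "cancellation"),
    ("please close my account", "cancellation"),
]


def tokenize(text):
    """Split an utterance into lowercase words."""
    return text.lower().split()


def train(corpus):
    """Count categories and per-category word frequencies."""
    category_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for utterance, category in corpus:
        category_counts[category] += 1
        for word in tokenize(utterance):
            word_counts[category][word] += 1
            vocabulary.add(word)
    return category_counts, word_counts, vocabulary


def classify(model, utterance):
    """Return the category with the highest naive Bayes log score."""
    category_counts, word_counts, vocabulary = model
    total = sum(category_counts.values())
    best_category, best_score = None, float("-inf")
    for category, count in category_counts.items():
        score = math.log(count / total)  # log prior of the category
        denominator = sum(word_counts[category].values()) + len(vocabulary)
        for word in tokenize(utterance):
            if word in vocabulary:  # words never seen in training are ignored
                # Add-one smoothing keeps unseen word/category pairs nonzero.
                score += math.log((word_counts[category][word] + 1) / denominator)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


model = train(CORPUS)
print(classify(model, "my bill seems too high"))
```

In this sketch, the utterance "my bill seems too high" is routed to billing even though that exact sentence never appears in the training corpus, because the words it shares with the annotated examples are most strongly associated with that category.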

Automated call routing is one of the most successful applications of automatic classification. Other examples of automatic classifiers include:
• Speech recognition: Maps strings of phonemes to strings of text.
• Optical character recognition: Maps shapes into characters.
• Handwriting recognition: Maps pen gestures into text.
• Computer vision: Maps captured images or video into text.
• Information retrieval: Maps a query into one or more documents that are likely to match the query.
• Case-based reasoning: Maps a collection of symptoms and behaviors into a probable diagnosis.
• Spam detection: Maps text sequences into one of two categories: spam or potentially interesting messages.

Automated classification systems look like magic to many people, who sometimes use the misleading, emotionally loaded term "artificial intelligence." I prefer the term "supervised learning," which is a machine learning technique for creating a function from training data. Speech recognition is not mysterious to someone who understands the basics of neural networks or hidden Markov models. Likewise, automatic classifiers use off-the-shelf technologies, such as neural networks, support-vector machines, k-nearest neighbors, Gaussian models, and decision trees. Automatic classifiers can replace traditional Speech Recognition Grammar Specification (SRGS) grammars.

Given a corpus of annotated user utterances, statistical algorithms compute the internal parameters of the model that maps phrases to categories. A dialogue designer doesn't need to write a grammar, which is a complex and tedious process that requires extensive testing. Writing grammars for large, complex sentences may be humanly impossible.

Classification systems can often successfully categorize sentences that were not in the original training corpus, so users are not limited to uttering sentences covered by a prespecified grammar.

While automatic classifiers sound easy to create and implement, they are not. Collecting a sufficient number of utterances—sometimes in the thousands—and annotating each with the appropriate category can be a large task for dialogue designers. Each change to the corpus or categories requires repeating the entire automated training process.

Can we generate a training corpus or reduce the annotation effort? Several approaches have been tried:
• Generating a corpus: Write a simple SRGS grammar and then use the grammar to generate sentences and the corresponding category annotations. The resulting automatic classification system should correctly classify not only all sentences that were generated from the grammar, but also other related sentences not generated from the grammar.
• Active learning: Annotating each phrase in the corpus can be time-consuming and expensive. Researchers have modified the training system to carefully select which examples are presented to human annotators. Because the system avoids presenting similar examples, the number of examples to be annotated decreases significantly.
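
The corpus-generation approach in the first bullet can be sketched as follows. This is a rough illustration, not a real SRGS processor: the grammar is a hypothetical Python dictionary standing in for an SRGS grammar, with one rule per category and a list of alternatives per slot (an empty string marks an optional word). The category names and phrases are invented for the example.

```python
from itertools import product

# Hypothetical toy grammar standing in for an SRGS grammar.
# Each category maps to a sequence of slots; each slot lists its alternatives.
# An empty string means the slot may be omitted.
GRAMMAR = {
    "billing": [
        ["", "please"],
        ["check", "tell me"],
        ["my"],
        ["balance", "bill"],
    ],
    "tech_support": [
        ["my"],
        ["modem", "phone"],
        ["is broken", "is not working"],
    ],
}


def generate_corpus(grammar):
    """Expand every path through each rule into an annotated sentence."""
    corpus = []
    for category, slots in grammar.items():
        # product() enumerates one alternative per slot, in every combination.
        for choice in product(*slots):
            sentence = " ".join(part for part in choice if part)
            corpus.append((sentence, category))
    return corpus


corpus = generate_corpus(GRAMMAR)
for sentence, category in corpus:
    print(f"{category}\t{sentence}")
```

Each generated pair already carries its category annotation, so the output could be fed directly to a statistical trainer. Whether a classifier trained this way generalizes well to sentences outside the grammar depends on how closely the grammar resembles what real callers say.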

If classification systems can process user responses to the "How may I help you?" prompt, they could similarly process additional user requests, resulting in very flexible, mixed-initiative dialogues. Users would no longer have to play a game of "20 questions" to get the appropriate response from an IVR. Instead, they could formulate a request using simple, open-ended natural language; then, the magic happens.

James Larson, Ph.D., is co-program chair for the SpeechTEK 2008 Conference, co-chair of the World Wide Web Consortium’s Voice Browser Working Group, and author of the home-study guide The VoiceXML Guide (www.vxmlguide.com). He can be reached at jim@larson-tech.com.
