September 26, 2019
By James A. Larson, program co-chair, SpeechTEK 2021

Q&A: Bruce Balentine on the Basics of Conversational Chatbots

Bruce Balentine, Chief Scientist Emeritus, Enterprise Integration Group and Chief HCI Engineer, Intelligently Interactive, recently answered the following questions about building conversational chatbots:

Q: Tell us about the three-hour workshop you will present on April 29 at the SpeechTEK Conference in Washington, DC.

A: Features needed for future conversational advancement (turn-taking, prosody, multi-leveled confidence, and noise characterization) are limited or non-existent in today’s most popular streaming-text black-box recognizers. We are at risk of losing experiential skill with basic speech technologies. Knowledge of speech matters. For this hands-on interactive session, we will use newly developed tools from Intelligently Interactive to explore basic ASR technology, learning about terminology, error types, prompting methods, turn-taking, and dialogue design philosophies. The tools use open-source Sphinx technology running on iPhone/iPad platforms to support under-the-hood examination of speech recognition, voice-activity detection, basic dialogue design, and usability testing methods. If you are an iOS user, you can keep the tools and continue your research after you leave. If not, we will provide shared resources and you can apply the knowledge to other products and platforms. Attendees will be provided with instructions for downloading software to their iPhones or MacBooks prior to this workshop.

Q: Why are today’s streaming-text black-box recognizers not sufficient for some of today’s innovative applications?

A: A conversation – like all user interfaces – has an input side and an output side. Spoken input needs to be delivered upward to domain-knowledgeable layers in as rich and flexible a format as possible. The speech industry keeps talking about intelligent conversations that are emotion-aware, with tightly-coupled turn-taking protocols, effective social skills, and human-like awareness of both self and user. But the big ASR vendors have moved to a black-box model wherein they simply convert speech to text and then stream it to application servers. Such architectures are insufficient.

In addition to words spoken, designs today need information about (1) prosody, (2) affect, (3) self-confidence, (4) environment (noise classification), and (5) input-output alignment. What’s more, they need this information in real time. In the reverse direction, applications need downward parametric control over speech recognition processes to properly manage feedback.

Q: What skills do developers need to develop new conversational applications?

A: Developers alone cannot possess the skills required. Only an interdisciplinary team of generalists/specialists will move to the next stage. And each team member must possess both interest and skills at layers above and below their comfort zone. For developers (programmers), that means improved knowledge of the signal-processing and acoustical characteristics of human speech and speech technologies. Skill with music and audio engineering has also proven helpful to both designers and developers. Designers need to move down a level or two and develop a better understanding of engineering and technology issues (as opposed to merely script-writing and task-sequencing). Speech scientists need to develop respect for HCI and usability. Business analysts need to be better at goal-setting, documentation, and respect for testing.

All need to have some hands-on experience with raw speech technology.

Q: What are some of the complexities of turn-taking that designers face?

A: There are two broad approaches to turn-taking. In the half-duplex model – the most common today – user and machine trade turns one after the other. I speak, then you speak. You hand the turn to me, and then I bat it back to you. This is known as the “tennis-match” model and is limiting. In the class, we will explore the important implications of half-duplex turn-taking. The full-duplex model allows users to interrupt the machine at any time, allowing speech to overlap. This is more like a “three-legged race,” requiring cooperative protocols if user and machine are to synchronize and coordinate the conversation. Full-duplex is the next required advance and is very difficult to accomplish. Mere “barge-in” won’t cut it. This class will touch on and explore the issues of full-duplex as time permits.
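The timing difference between the two models can be pictured with a small sketch. It assumes, purely for illustration, that prompts and utterances are logged as (start, end) times in seconds and that a half-duplex recognizer only starts listening once the prompt has finished. The function names are hypothetical and are not taken from the workshop tools.

```python
def overlap(a, b):
    """Length in seconds of the overlap between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def heard_by_half_duplex(prompt, user):
    """Part of the user's utterance that a half-duplex system captures.

    Assumption: the recognizer opens only when the prompt ends, so any
    user speech that overlaps the prompt is lost.
    """
    start = max(user[0], prompt[1])
    if start >= user[1]:
        return None  # the user spoke entirely over the prompt
    return (start, user[1])


prompt = (0.0, 3.0)   # machine speaks from 0.0 s to 3.0 s
user = (2.5, 4.0)     # user starts talking before the prompt ends

print(overlap(prompt, user))               # seconds of simultaneous speech
print(heard_by_half_duplex(prompt, user))  # utterance with its onset clipped
```

In this example the user's first half second is clipped, which is the kind of loss a full-duplex design would have to detect and manage rather than ignore.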

Q: What can designers learn about user interface shortcomings by using the new tools introduced in this workshop?

A: We will explore several error-types, exemplifying them with a batch-recognizer tool that replicates ASR tests and logs user speech. We will then discuss behavioral solutions for detection, recovery, correction, and stabilization. Half- and full-duplex turn-taking will be studied with a “SmartWindow” tool that allows attendees to specify various controlling parameters that induce, demonstrate and replicate various turn-taking errors. Attendees will leave with a deeper understanding of timing and its role in coordinating user-machine interaction.

Q: Is there a taxonomy of error types and strategies for resolving each type?

A: Every ASR exhibits various types of errors, and the terminology within the industry has never been standardized. The key is to understand that speech recognition is not simply "right" or "wrong." There are several categories of speech error, and each is managed with different interventions. For example, practitioners must distinguish between substitution, false acceptance and false rejection – including acceptance/rejection of user speech versus background noise (two very different kinds of error) – as well as insertions and deletions caused by segmentation errors. What’s more, certain turn-taking errors and prompt-response stumbles occur at levels above the base ASR, and must be fixed in different ways. In the class, we will use an ASR tool called TSBR to explore these error types – discussing each in turn and exploring design solutions for handling them.

Table 1—Speech Errors (Examples)

Error/Condition	Description & Example	Comments
Substitution	One word is substituted for another. User says: "I hate blue food." ASR recognizes: "I hate red food."	This is often what is meant by "misrecognition." Some assume that this is the only kind of error, calculating accuracy as a "percentage" by measuring this class of error—a misleading characterization.
Insertion	A word is inserted into the result string. User says: "3-1-9." ASR recognizes: "3-1-1-9."	This is sometimes a false acceptance (usually of noise), and sometimes a segmentation error.
Deletion	A word is deleted from the result string. User says: "Port Saint Lucie Florida." ASR recognizes: "Port Lucie Florida."	Often caused by over-compensating for possible insertion errors.
Segmentation Error	The ASR fails to correctly detect the boundaries around speech segments. User says: "May." ASR recognizes: "May 8."	Over-segmenting causes insertion errors, while under-segmenting causes deletion errors. Such issues may apply to levels above and below the word level.
False Acceptance (noise)	A noise is recognized as a legal word. Input: "<cough> seven-two-eight." Output: "six-seven-two-eight."	ASR can have difficulty distinguishing speech from noise. This error is sometimes the cause of an insertion.
False Acceptance (OOG speech)	Out-of-grammar speech is accepted as a legal word or words. User says (to somebody else in the room): "Actually, that was funny." Alexa replies: "I've added butter to your shopping list."	It's very difficult to detect that the user is saying something not represented in the grammar. ASR should reject the input (refuse to recognize it) but instead accepts it as in-grammar speech.
Score (confidence) Rejection	A phrase or a word within a phrase is accepted, but the scores make it suspect. User says: "I need a ticket to Austin." ASR hears “Austin” followed by “Boston” on the nBest list. Scores are close (confidence is low).	A more abstract version of rejection—usually implemented by the application rather than built into the ASR—in which scores and nBest list identify a potential error, using dialogue design to correct it.
False Rejection	An in-grammar word or phrase that should be recognized is instead rejected. User says: "I really like red cars." ASR recognizes: NULL hypothesis	False rejection is similar to deletion, and some practitioners use the word deletion to describe false rejection (the phrase was legal and yet was deleted).
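The first three rows of Table 1 can be told apart automatically when a reference transcript is available. The sketch below is a generic word-level edit-distance alignment, not the method used by TSBR or any particular recognizer, and the function name is hypothetical. It assumes whitespace-separated words and exact string matching.

```python
def count_word_errors(reference, hypothesis):
    """Count word-level substitutions, insertions and deletions.

    reference  -- what the user actually said
    hypothesis -- what the recognizer returned
    """
    ref, hyp = reference.split(), hypothesis.split()
    rows, cols = len(ref) + 1, len(hyp) + 1

    # cost[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i
    for j in range(cols):
        cost[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diagonal, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the bottom-right corner to label each edit.
    counts = {"substitution": 0, "insertion": 0, "deletion": 0}
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            if mismatch:
                counts["substitution"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["deletion"] += 1
            i -= 1
        else:
            counts["insertion"] += 1
            j -= 1
    return counts


print(count_word_errors("I hate blue food", "I hate red food"))
print(count_word_errors("3 1 9", "3 1 1 9"))
print(count_word_errors("Port Saint Lucie Florida", "Port Lucie Florida"))
```

Text alignment alone cannot say why a word was inserted or dropped. Telling a noise-driven false acceptance from a segmentation error, for example, requires the audio and the recognizer's own logs.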

Q: What are the three key learnings attendees will take away from this workshop?

A: The key messages in the “back-to-the-basics” course are (1) ASR technology does not work the same way that human speech processing does; getting to know the differences helps with design and development; (2) What you do as a puppet master (pulling the strings of your machine) has a direct effect on what your end-user does when speaking back to that machine – giving you a great deal of power to shape user perceptions and speech behaviors; and (3) detection and proper classification of recognized results (including turn-taking cues) is the first step in managing conversations.
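As a rough illustration of point (3), an application can classify each recognition result before deciding what to say next. The sketch below follows the "Score (confidence) Rejection" row of Table 1. It assumes the recognizer returns an nBest list of (text, score) pairs, best first, with scores between 0 and 1. The thresholds and the function name are invented for the example and would need tuning against real data.

```python
def classify_result(nbest, accept_score=0.80, reject_score=0.40, min_margin=0.15):
    """Return (action, text) where action is 'accept', 'confirm' or 'reject'.

    nbest -- list of (text, score) pairs sorted best-first (assumed format)
    """
    if not nbest:
        return ("reject", None)          # NULL hypothesis: nothing recognized

    top_text, top_score = nbest[0]
    if top_score < reject_score:
        return ("reject", None)          # too weak to act on at all

    runner_up = nbest[1][1] if len(nbest) > 1 else 0.0
    if top_score >= accept_score and top_score - runner_up >= min_margin:
        return ("accept", top_text)      # strong and unambiguous

    return ("confirm", top_text)         # plausible but suspect: ask the user


# "Austin" and "Boston" score close together, so the dialogue should confirm.
print(classify_result([("Austin", 0.62), ("Boston", 0.58)]))
```

The three outcomes map onto different dialogue moves: continue, ask a confirming question, or re-prompt.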

Q: How does knowledge of speech, using tools from only one ASR, transfer to other platforms – especially those that do not support the visibility exposed by this class?

A: Knowledge of speech is transferable to other platforms. With direct experience, designers and developers will better understand the “under-the-hood” behaviors of the technology and can anticipate and design for many otherwise-unexpected conditions. Understanding the back-and-forth sequences that constitute spoken interaction will allow attendees to build discourse units that anticipate, detect, and recover from typical conversational problems.

Register for the SpeechTEK Conference and Bruce Balentine's workshop. There are still openings for SpeechTEK University workshops and presentations. Submit proposals here by October 11, 2002.
