
Moving from the Art to the Science of Voice User Interfaces (VUIs)

Voice User Interfaces (VUIs) are moving from an art form to an applied science. Many ASR vendors include toolkits with modules for common interactions (such as entering a telephone number). While some aspects of the voice transaction between a person and a computer remain uncertain, the requirements for successful basic interactions are increasingly well understood. The overall guiding principle is that the interaction follow the rules of normal human behavior. This leverages user expectations within the dialog and spares the user the burden of training while attempting a goal-oriented task.
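
To make the idea of a reusable module concrete, here is a minimal sketch of what a "telephone number" module might look like. It is written in plain Python rather than any particular vendor's toolkit, and the recognize() helper is a hypothetical stand-in for a vendor's ASR call:

```python
# A sketch of a reusable "telephone number" speech module. The
# recognize() helper is a hypothetical stand-in for a vendor ASR call;
# here it just reads typed text and fakes a confidence score.

def recognize(prompt: str) -> tuple[str, float]:
    return input(prompt + " "), 0.9

def collect_phone_number(max_tries: int = 3) -> str | None:
    """Prompt for and validate a ten-digit telephone number."""
    for attempt in range(max_tries):
        utterance, confidence = recognize("Please say your telephone number.")
        digits = "".join(ch for ch in utterance if ch.isdigit())
        if len(digits) == 10 and confidence >= 0.6:
            return digits
        # Re-prompt briefly; repeating full instructions adds auditory load.
    return None  # give up: route to an agent or another input method
```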

VUIs fall into two dialog styles: directed or conversational. Directed dialogs generally require one-word responses and are best viewed as “multiple choice” prompts, where the user speaks one item from a list of active alternatives (some choices may be unprompted). Conversational dialogs accept sentence-like responses that may contain more than one key application term, and are best viewed as “fill in the blank” responses. In short, directed dialogs elicit one-at-a-time choices from a closed set of options, while conversational dialogs accept many words in response to open-ended prompts.
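
The contrast is easy to see in code. The sketch below assumes toy vocabularies and uses deliberately naive keyword spotting; real conversational systems rely on full language-understanding grammars rather than this kind of heuristic:

```python
# Directed style: the caller picks exactly one item from a closed set.
DIRECTED_MENU = {"balance", "transfer", "operator"}

def directed_turn(utterance: str) -> str | None:
    word = utterance.strip().lower()
    return word if word in DIRECTED_MENU else None

# Conversational style: one sentence can fill several slots at once.
CITIES = {"boston", "denver", "chicago"}

def conversational_turn(utterance: str) -> dict:
    words = utterance.lower().split()
    slots = {}
    for i, w in enumerate(words):
        if w in CITIES:
            # Crude heuristic: "from X" marks the origin, otherwise destination.
            key = "origin" if i > 0 and words[i - 1] == "from" else "destination"
            slots[key] = w
    return slots

print(directed_turn("Transfer"))
# -> 'transfer'
print(conversational_turn("I want to fly from boston to denver"))
# -> {'origin': 'boston', 'destination': 'denver'}
```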

The application itself often determines the dialog model (e.g., choosing from a small set of voice mail or call center options, versus making an airline reservation or stock transaction). However, as the complexity of the transaction increases, the dialog is shaped by the experience of the user (novice or power user), the particular subtask (yes/no versus date and time) and the frequency with which a feature is used (greeting versus changing a fax number). Because users bring different degrees of experience to different parts of a transaction, they also use different dialog styles.

Why worry about the dialog model? Because it affects throughput rate and, consequently, customer satisfaction. The strongest justification for an appropriate dialog is that it matches the user’s expectations. When a service interacts predictably, with logical steps toward the final goal, it is inconsequential to the caller whether the transaction was handled by a computer or a live person. Further, when expectations are met, little or no learning is needed, since the user follows a known mental model. This is the best of all cases.

A service fails when the core technology fails or the user goes off the correct path. The counterpoint of throughput is error rate; specifically, the location, causes and remedy of errors. Errors will occur. My research indicates that core ASR technology achieves word accuracy between 97 and 99 percent over wireless connections. ASR errors are minimized by careful selection of acceptance and rejection thresholds, and by transparent handling using a “yes or no” question about the most likely choice.

Dialog-based errors occur when the user becomes confused in following the steps, because of poor logic or confusing choices. The best way to reduce logic (call flow) errors is to observe existing procedures and follow them consistently. Good logic is robust enough to be embedded in other, more complex steps; these become the “dialog” or “speech” modules of code for a common action. Confusing choices are caused by unclear prompts or by terms the caller does not normally speak. Wordy prompts carry information not immediately required (such as offering “help” at every step). This leads to auditory memory overload which, coupled with the cognitive load of a response that is ambiguous or not “on target,” produces user errors.
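
As a concrete illustration of threshold-based error handling, the sketch below shows the accept/confirm/reject pattern described above. The threshold values and helper functions are illustrative assumptions, not vendor defaults:

```python
# A sketch of acceptance/rejection thresholds with transparent yes/no
# confirmation. ACCEPT and REJECT are illustrative values; real systems
# tune them against field data.

ACCEPT = 0.85   # above this, take the top hypothesis silently
REJECT = 0.40   # below this, re-prompt rather than guess

def ask_yes_no(question: str) -> bool:
    return input(question + " (yes/no) ").strip().lower().startswith("y")

def reprompt() -> str:
    return input("Sorry, please say that again. ")

def handle_result(hypothesis: str, confidence: float) -> str:
    if confidence >= ACCEPT:
        return hypothesis                  # confident: accept silently
    if confidence < REJECT:
        return reprompt()                  # hopeless: ask again
    # Middle band: confirm the most likely choice with a yes/no question.
    if ask_yes_no(f'Did you say "{hypothesis}"?'):
        return hypothesis
    return reprompt()
```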

Errors are reduced through usability tests that, by proper design, decouple the effects of multiple error sources and isolate the issues to be corrected. For example, ASR accuracy of the vocabulary is tested independently by having a group of about 30 native speakers say the commands (or sentences) in the target language at specific steps of the call flow. While this does not cover out-of-vocabulary utterances, it is sufficient to identify specific ASR errors requiring tuning, and it provides baseline recognizer performance. Another 30 subjects then perform common application tasks that cover all typical steps. This provides a solid assessment of throughput – the metric that correlates with satisfaction – and flags areas of concern for tuning.
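
The two baseline metrics from these tests can be summarized in a few lines of code. The sketch below uses simple exact-match scoring per utterance for brevity; production scoring aligns each hypothesis to its reference and counts word-level substitutions, insertions and deletions:

```python
# Baseline metrics from the two test groups described above.

def recognition_accuracy(references: list[str], hypotheses: list[str]) -> float:
    """Fraction of test utterances recognized exactly as spoken."""
    correct = sum(r.lower() == h.lower() for r, h in zip(references, hypotheses))
    return correct / len(references)

def throughput(tasks_completed: int, tasks_attempted: int) -> float:
    """Task completion rate across the scripted application tasks."""
    return tasks_completed / tasks_attempted

# Example: 30 speakers each saying the same command.
refs = ["check balance"] * 30
hyps = ["check balance"] * 28 + ["check bills", "check balance please"]
print(f"baseline accuracy: {recognition_accuracy(refs, hyps):.0%}")  # 93%
print(f"throughput: {throughput(27, 30):.0%}")                       # 90%
```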

The strongest payback for testing VUI usability is extending the results to multimedia interactions. Test procedures, call flow logic and vocabulary selection are key components that maximize learning transfer from a VUI to a Multimedia User Interface (MMUI). Keeping extensibility in mind avoids conflicts of logic and terminology when another modality is added. An MMUI must permit transparent use of any supported modality, since the user or the subtask may be better suited to one modality than another.

VUI guidelines, test methodologies, test results, standard vocabularies, ASR baseline performance, application integrations, standardization and more are typical topics discussed in AVIOS publications and conferences, and in the International Journal of Speech Technology. VUI design is not a risky art form. There is a large body of knowledge available to help ensure successful voice-activated applications.


Dr. Matthew Yuschik is campus college chair, Information Technology, at the University of Phoenix, Greater Boston Campus. He can be reached at matt.yuschik@phoenix.edu.