November 7, 2005
By Melanie Polkosky Human Factors Psychologist & Consultant - IBM/Center for Multimedia Arts (University of Memphis)

What Is Speech Usability Anyway?

Perhaps one of the great ironies of the field known variously as human factors, human-computer interaction, or user-centered design is that some of its central concepts are exceedingly difficult to define. Take usability, for instance. Walk into any enterprise that claims to know something about technology (or uses it for business operations) and take an informal poll about what the term really means. In all likelihood, you’ll get as many answers as people you ask. The problem is no less pronounced in the speech technology industry, where almost everyone is well versed in the importance of usability and its association with operational cost savings. But what is it, exactly? That’s a tougher nut to crack.

Take a brief tour of the literature and you’ll find many definitions of usability. However, a substantial number of them have a decidedly visual interface orientation, or one that doesn’t specifically target the usability issues central to speech technology design. Wickens, Gordon, and Liu (1998) assert that "usability is… one of the greatest concerns for those working on software interface design" (p. 453). One of the better-known definitions of usability was proposed by Nielsen (1993), who stated that it is made up of five factors: 1) learnability, or how rapidly a user learns to use a system; 2) efficiency, or the extent to which a system supports user performance; 3) memorability, or the extent to which appropriate system use can be recalled; 4) errors, or incorrect actions performed during system use; and 5) the subjective factor of user satisfaction. Although usability is a very important variable in the use of any technology, defining it has been problematic because reliable and valid measurements are rare (Gray and Salzman, 1998a, 1998b; Olson and Moran, 1998) and the definition itself has been controversial (Bevan, 1995; Hartson, Andre, and Williges, 2001; Hassenzahl, 2001; Hertzum and Jacobsen, 2001). Thus, there has been no generally agreed-upon method of measuring usability, which makes it difficult to compare findings across studies.

This problem is compounded when you consider the relatively new field of speech technology design. After all, speech does seem to draw on different aspects of user behavior than other technologies do: it is conveyed auditorily, not visually, and it seems to have something to do with our (largely taken for granted) social communication skills. The previous definitions of usability also don’t seem to account for another favorite term in the speech industry, that of persona, or the "personality" of a speech interface as conveyed primarily through speech and linguistic cues. In addition, many applications of speech technology are used for customer service, which would seem to bring in another set of expectations that might influence usability. It’s fairly easy to test whether the average person expects service providers to behave in a certain way: simply find someone who has had bad service and they’ll be able to list a number of things the provider didn’t do, or didn’t do correctly. Intuitively, we know that there is a standard of comparison for service, something service-marketing researchers call service quality, which is closely related to customer satisfaction (Gilmore, 2003). Because customer service automation makes up a significant part of speech technology usage, it seems that any definition of usability must encompass the concepts of persona and customer service behavior.

Toward a Definition of Speech Usability
Over the past five years, the definition of usability for speech interfaces has evolved through several different ways of understanding and measuring this variable. The first attempt at measuring users’ perceptions of speech technology was the construction of an expanded version of the Mean Opinion Scale or MOS (Polkosky and Lewis, 2003). This scale was initially used to measure the intelligibility and naturalness of human voices heard on the telephone; however, these two factors did not adequately measure users’ perceptions of the differences in synthetic (text-to-speech) voices. Thus, using statistical and diagnostic methods borrowed from speech-language pathology, we developed an expanded version of this scale that measured four factors: intelligibility, naturalness, prosody (the musical quality of speech, including its intonation and emphasis), and social impression. We found that the Mean Opinion Scale – Expanded (MOS-X) did indeed differentiate users’ perceptions of synthetic voices and was an important first step in understanding which speech-based cues matter for users’ impressions of speech technology. It was also able to differentiate between human voices (Polkosky, 2003). When the MOS-X was used to measure users’ reactions to working speech interfaces (not just synthetic voices reading a static script), results demonstrated that all four factors were associated with intent to use the technology again, as well as with users’ liking of the system (Polkosky, 2003).

Despite this initial success, the MOS-X was designed to measure only perceptions of speech, so it could not capture the linguistic cues that were also thought to affect users’ reactions to speech technology. Again, methods and theory used in the diagnosis of human communication provided a statistically sound means of creating another measurement, called the Pragmatic Scale for Dialogues (Polkosky, 2003). This scale focused on the social-linguistic aspects of conversation (known as pragmatics), including the contingency of responses, turn-taking, completeness and amount of information, use of detail, and helpfulness of messages. It measured four factors: quantity, quality, manner and relation, which were related to the principles of normal conversational behavior proposed by the philosopher Grice (1975). When the Pragmatic Scale was used to measure working speech interfaces, its factors were also correlated with liking and intent to use speech technology in the future, but the effect sizes were very small (Polkosky, 2003).

The third attempt at developing a measure of usability was broader in scope. It drew on methods used in services marketing research to measure not only service quality but also the behavior and personality of service providers (Polkosky, 2005). In addition, it used methods of measuring usability and ease of use similar to those used by human factors engineering researchers who work with other technologies. An initial set of 76 potential items measured a broad variety of speech interface characteristics, including customer service behavior, pragmatics, recognition of user input, users’ affective responses, efficiency, accuracy of information provided, prompt wording, usefulness, the mental model of the system, and the impression created by the system voice. Then, a group of linguistics and psychology experts rated the quality of 16 speech systems, and a set of interfaces was selected to represent low, average and high quality designs. Finally, the systems were rated by 862 individuals, who completed the 76 items (plus several other scales). These ratings were subjected to several statistical analyses to determine the set of items that best defines usability for speech technology. As a result of these analyses, only 25 items (grouped into four factors) were shown to measure speech usability with sufficient validity and reliability (Polkosky, 2005).

A Four-Factor Definition
As Nielsen’s (1993) definition suggested, usability is a complex variable, related both to the characteristics of the user and to the characteristics of the technology. Based on this research, speech usability is both a hybrid and an extension of previous definitions, yet it is also unique to speech technology. Results showed that four factors define speech usability:

User Goal Orientation, or the efficiency, control, and confidence allowed by the system as a user completes his task;
Customer Service Behavior, or the expectations associated with customer service, such as use of everyday words, friendliness, helpfulness, and politeness;
Verbosity, or the talkativeness, use of repetitive messages, and amount of detail; and
Speech Characteristics, or the enthusiasm, naturalness, and pleasantness of the system voice.
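
To make the four-factor structure concrete, here is a minimal sketch of how one respondent’s questionnaire ratings might be rolled up into one score per factor. The article names the factors but does not list the 25 items or the response scale, so the item IDs, the two-items-per-factor grouping, and the 1-to-7 scale below are all placeholder assumptions, and averaging is only one plausible scoring rule.

```python
from statistics import mean

# Hypothetical item-to-factor assignment. The factor names come from the
# article; the item IDs are placeholders, not the actual questionnaire items.
FACTOR_ITEMS = {
    "User Goal Orientation": ["ugo_1", "ugo_2"],
    "Customer Service Behavior": ["csb_1", "csb_2"],
    "Verbosity": ["verb_1", "verb_2"],
    "Speech Characteristics": ["sc_1", "sc_2"],
}


def factor_scores(ratings):
    """Return the mean item rating per factor for a single respondent."""
    return {
        factor: mean(ratings[item] for item in items)
        for factor, items in FACTOR_ITEMS.items()
    }


# Made-up ratings on an assumed 1-7 agreement scale, for illustration only.
example_ratings = {
    "ugo_1": 6, "ugo_2": 5,
    "csb_1": 4, "csb_2": 6,
    "verb_1": 2, "verb_2": 3,
    "sc_1": 5, "sc_2": 7,
}

scores = factor_scores(example_ratings)
for factor, score in scores.items():
    print(f"{factor}: {score}")
```

Keeping the factors as separate scores, rather than collapsing them into a single number, preserves the distinction that matters in the findings: Verbosity moves in the opposite direction from the other three.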

As in the earlier MOS-X and Pragmatic Scale for Dialogues, system voice and social-linguistic behaviors continued to figure prominently in the definition of usability. But interestingly, participant ratings also demonstrated that the amount of talk in a system and the rate at which a user is able to complete his or her task are other vitally important components of usability. Another implication of this research is that the expectations people hold of human customer service providers and of other technologies (e.g., Internet, TV, radio) also apply to speech systems. Perhaps the most interesting finding is that for customer service, speech usability is largely synonymous with service quality, and the expected behaviors are drawn from both human interaction and Web-based e-service.

Speech Usability and Customer Satisfaction
Each of the usability factors relates to customer satisfaction differently. The strongest relationship is between User Goal Orientation and customer satisfaction: the correlation between these two variables is 0.71 (p<0.01), which indicates that as User Goal Orientation increases, so does customer satisfaction. Customer satisfaction showed weaker positive correlations with Customer Service Behavior (r=0.40, p<0.01) and Speech Characteristics (r=0.43, p<0.01), suggesting that satisfaction is also greater when a system shows more stereotypical human communication behavior and uses an engaging voice. Finally, Verbosity showed a negative correlation with customer satisfaction (r=-0.26, p<0.01), which suggests that wordier prompts are associated with lower satisfaction.
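
For readers who want to see what a correlation coefficient like the ones above summarizes, here is a minimal sketch of the Pearson r calculation, assuming that is the statistic behind the reported values (the article does not name it). The ratings are invented purely for illustration and are not data from the study; the function name `pearson_r` is likewise just a label for this example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from each mean.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the summed squared deviations.
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cross / (dev_x * dev_y)


# Invented ratings from five imaginary respondents (not study data).
goal_orientation = [2, 3, 4, 5, 6]
verbosity = [6, 5, 5, 3, 2]
satisfaction = [1, 3, 3, 5, 6]

r_goal = pearson_r(goal_orientation, satisfaction)   # positive relationship
r_verbosity = pearson_r(verbosity, satisfaction)     # negative relationship
print(f"goal orientation vs. satisfaction: r = {r_goal:.2f}")
print(f"verbosity vs. satisfaction: r = {r_verbosity:.2f}")
```

The sign carries the direction of the relationship and the magnitude its strength, which is why a value of 0.71 is read as a strong positive association and -0.26 as a weaker negative one.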

This research also permitted a view of the effect of expert quality ratings on users’ ratings of usability (see Figure 1). As the linguistics experts’ ratings of quality increased, so did all of the usability factor ratings except Verbosity (higher quality interfaces show less Verbosity). The most striking differences among high, average and low quality systems were in their Speech Characteristics, which had a larger effect size (η²=0.373) than the other three factors (User Goal Orientation η²=0.196, Customer Service Behavior η²=0.191, Verbosity η²=0.184). This finding means that the system voice exerts the most influence on user and expert ratings, but it is important to keep in mind that User Goal Orientation has a stronger correlation with customer satisfaction. Generally, though, these results indicate that high-quality speech systems are more in line with user needs, make better use of user expectations about conversation and customer service, and are more efficient in their use of language than lower-quality systems.
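
The effect sizes above can be read as the proportion of variance in a factor’s ratings that is accounted for by the quality group (low, average, or high). As a rough sketch, assuming the reported values are eta squared from a one-way comparison of the three groups (the article does not spell out the analysis), the calculation is the between-group sum of squares divided by the total sum of squares. The ratings below are invented for illustration only.

```python
def eta_squared(groups):
    """Eta squared for a one-way design: SS_between / SS_total."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    # Total variability of every rating around the grand mean.
    ss_total = sum((value - grand_mean) ** 2 for value in all_values)
    if ss_total == 0:
        raise ValueError("eta squared is undefined when all values are equal")
    # Variability of the group means around the grand mean, weighted by size.
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    return ss_between / ss_total


# Invented Speech Characteristics ratings for three quality groups
# (not data from the study).
low_quality = [2, 3, 3]
average_quality = [4, 4, 5]
high_quality = [6, 6, 7]

effect = eta_squared([low_quality, average_quality, high_quality])
print(f"eta squared = {effect:.2f}")
```

A value near 0 would mean the quality grouping tells you almost nothing about the ratings, while a value near 1 would mean the grouping accounts for nearly all of the variation.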

Decision-Making for Speech Interface Development
What do these findings mean for decision-making during development? In essence, there are four ideas that should be kept in mind if you want to design a high-quality speech system:

Figure 1: Comparison of usability scores based on expert ratings of UI quality

Prioritize Getting to Know the User – Because users want a system designed to meet their goals, and because User Goal Orientation is strongly associated with customer satisfaction, getting to know the user should be the first priority of any speech interface development effort. In practice, however, this is the step most often minimized or overlooked because it is thought to be too costly and time-consuming. Many companies obtain information not directly from users but from internal employees, assuming that employees are similar to their users and would view and design speech technology in the same way. Unfortunately, there are likely to be significant differences between employees’ and customers’ knowledge of the company, its products, processes, history and organizational structure. These knowledge differences can lead to design decisions that make a speech system work only for employees, not end users. Instead, at the earliest stages of development, information about what to automate and how to structure a user interface should be obtained directly from the user group, using focus groups, market research, surveys, user analysis or other questionnaires. Only then can you be sure that the resulting system really is oriented toward user goals.
Be Concerned About Persona, But Not Too Much – While it is true that the speech and linguistic characteristics of an interface are important to user perceptions, persona should be kept in perspective: it is a secondary consideration after catering to users’ needs. Undoubtedly, the friendliness and naturalness of the system’s voice are important characteristics that need to be controlled, and the prompts should convey helpfulness and politeness. But don’t let a blind focus on these design issues lead you to neglect the more important design decisions that enable a clear, simple, and efficient UI based on user goals.
There’s More to Persona Than Male and Female – Many companies are extremely concerned about whether their system voice is male or female. Although this is a decision that’s often fraught with anxiety, the reality is that the gender of a system voice is only a single aspect of a user interface design and only part of what creates usability. There are many other speech, linguistic and psychological variables that must be adequately controlled to create usability and a positive user experience. A quality UI designer will know how to work with the voice talent and scripts to create quality, regardless of the gender of the talent.
Speech Usability Is a Hybrid of Service Quality in Other Channels – The current research shows that usability for speech technology is related to other forms of customer communication, including Web-based e-service, human customer service, and mass communication (e.g., TV and radio). Practically, many companies have placed their IVR systems in the IT realm, assuming they have more in common with the servers hiding in their back rooms than their customer-facing staff. The current definition of usability places speech squarely in the realm of other marketing efforts that are intended to produce customer satisfaction and loyalty. If you don’t give it the same time and attention as these other channels, your customers will know (and eventually, you will too, but not because you’ve achieved the desired outcomes!).

References
     Bevan, N. (1995). Measuring usability as quality of use. Software Quality Journal, 4, 115-130.
     Gilmore, A. (2003). Services marketing and management. London: Sage.
     Gray, W. and Salzman, M. (1998a). Damaged merchandise? A review of experiments that compare usability evaluation methods. Human-Computer Interaction, 13(3), 203-261.
     Gray, W. and Salzman, M. (1998b). Repairing damaged merchandise: A rejoinder. Human-Computer Interaction, 13(3), 325-335.
     Grice, H. (1975). Logic and conversation. In P. Cole and J. Morgan (Eds.), Syntax and semantics 3: Speech acts (pp. 41-58). New York: Academic.
     Hartson, H.R., Andre, T., and Williges, R. (2001). Criteria for evaluating usability evaluation methods. International Journal of Human-Computer Interaction, 13(4), 373-410.
     Hassenzahl, M. (2001). The effect of perceived hedonic quality on product appealingness. International Journal of Human-Computer Interaction, 13(4), 481-499.
     Hertzum, M. and Jacobsen, N. (2001). The evaluator effect: A chilling fact about usability evaluation methods. International Journal of Human-Computer Interaction, 13(4), 421-444.
     Nielsen, J. (1993). Usability engineering. San Diego: Academic Press.
     Olson, G., and Moran, T. (1998). Commentary on "damaged merchandise?" Human-Computer Interaction, 13(3), 263-323.
     Polkosky, M. (2003). Measuring the interpersonal consequences of interactive speech technology. Unpublished manuscript, University of South Florida.
     Polkosky, M. (2005, in press). Toward a social-cognitive psychology of speech technology: Affective responses to speech-based e-service (Doctoral dissertation, University of South Florida, 2005). Dissertation Abstracts International.
     Polkosky, M. and Lewis, J. (2003). Expanding the MOS: Development and psychometric evaluation of the MOS-R and MOS-X. International Journal of Speech Technology, 6, 161-182.
     Wickens, C., Gordon, S., and Liu, Y. (1998). An introduction to human factors engineering. NY: Addison Wesley Longman.

Melanie D. Polkosky is a social-cognitive psychologist and speech-language pathologist who has researched and designed assistive and enterprise communication technologies for over 10 years. She is currently a senior human factors psychologist for IBM Conversational Solutions Group, with expertise in social cognition, interpersonal communication, usability measurement, and social impacts of technology.
