Making Text-to-Speech Work

It is becoming common to see telephony applications that use speech to give mobile users access to information that they would not be able to retrieve otherwise, such as e-mail, news items or stock quotes. Application developers attempt to deliver that information using naturally recorded speech whenever possible, but there are times when another method is advisable. For example, it would not be feasible to record all of the names in the phone book for a voice dialing application. In these cases, it is necessary to use text-to-speech (TTS).

User testing of these applications needs to address not only whether the speech is easy to understand, but also whether the speech is acceptable. User acceptance of TTS is a multifaceted issue. Comprehensibility is clearly a critical component: if users can't understand the speech, they won't find it acceptable. However, users may also reject TTS even when they can understand it. Applications that use speech with an unpleasant sound could suffer the same fate as applications that use incomprehensible speech: People won't use them.

Many traditional tests of speech quality, such as the Diagnostic Rhyme Test, focus entirely on comprehensibility by using experimental tasks that bear no resemblance to actual application use. With tests such as these, it is difficult for test participants to judge the acceptability of the speech in the context of the application. When planning a user evaluation of TTS, it is preferable to present TTS in the context of an application and ask participants to judge whether the speech is good enough, given the potential benefits of the application.

The sections that follow outline the basic steps for planning and conducting a user evaluation of the acceptability of text-to-speech. Depending on the application, it may be necessary to elaborate on the experimental design or procedure outlined here. Nevertheless, these steps summarize many of the important experimental design issues that arise in user evaluations of speech quality.

Step 1: Choose the task and measurement
As you decide what you are going to ask test participants to do in the experiment, it may help to consider these questions:

What will participants listen to? Participants should listen to passages of speech that are similar to the types of things that users will actually hear while using the application, such as e-mail messages or news passages.
What will participants do after each passage? After each passage, participants need to do something that provides a measurement of the quality of the speech. For example, you may ask participants to make a subjective rating of the quality of the speech. Alternatively, participants could complete a task such as transcription or answering a question about the content of the speech. Again, try to choose a measurement that is consistent with the demands of the application. If the application is one where users have to respond quickly and accurately to the speech, you may want to measure speed or accuracy in completing a behavioral task. On the other hand, you may have the sort of application where users don't have to listen to the speech if they don't want to, such as a news reader. In this case, you may choose to collect subjective ratings. For example, you could have participants rate the extent to which they agree with the statement, "The quality of the speech is acceptable to me."

Step 2: Write the test materials
After you have chosen the task, you'll need to write the actual items that participants will listen to. The number of items you need will be a multiple of the number of conditions in your experiment. A good rule of thumb is to use at least four items for every speech condition (so, for example, if you had three speech conditions, you would write 12 items). If you have a small number of speech conditions, it is good to use as many as 10 items per condition. After you have written the items, record every item in every speech condition.

Step 3: Counterbalancing
Most user evaluations compare the acceptability of two or more speech conditions. The point of counterbalancing is to ensure that differences in scores among the speech conditions are attributable to differences in the speech and not to experimental artifacts.

In language experiments, a key question is whether to use the same items in different speech conditions. If you don't, then you can't rule out the possibility that the items in one condition were somehow harder than the ones in the other condition. But if you repeat items in the different conditions, you risk a practice effect, where a sentence gets easier to understand every time participants hear it. Experiments in psycholinguistics commonly use a counterbalancing scheme that avoids both types of artifact.

To illustrate this counterbalancing scheme, consider a simple experiment designed to compare comprehension and acceptability of names spoken by two TTS systems. The experimenters have chosen 10 names as items (10 can be divided evenly by two, which is the number of conditions). The figure below illustrates the counterbalancing scheme. In this figure, the different colored backgrounds represent the different speech conditions. Separate groups of test participants hear the items in each column: For example, five participants would hear the items in the first column, and a different five participants would hear the items in the second column. The types of speech are rotated through the experiment using a Latin square design. As a result, the number of groups of participants (i.e., the number of columns) is a function of the number of speech conditions in the experiment.

This counterbalancing scheme has many benefits, including:

Participants never hear the same name twice.
Speech conditions are not confounded with different names. Across the experiment as a whole, every name is included in every condition.
The experiment controls for potential effects of presentation order. Half of the participants hear one speech condition first, and half hear the other condition first.
Every participant hears both types of speech, which allows for within-subject comparisons.
By including the same items in every speech condition, it is possible to analyze data both by participants and by items. (The section on data analysis describes this further.)
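
The rotation described above can be sketched in a few lines of Python. This is a minimal illustration, not the exact layout of the original figure: the function name `latin_square_lists`, the placeholder names, and the condition labels "TTS-A" and "TTS-B" are all assumptions made for the example. The items are split into as many blocks as there are conditions, and each participant group hears the blocks with the conditions shifted by one position.

```python
from typing import Dict, List


def latin_square_lists(items: List[str], conditions: List[str]) -> List[List[Dict[str, str]]]:
    """Build one presentation list per participant group.

    Items are split into len(conditions) equal blocks. Group g hears block b
    in condition (b + g) mod n, so conditions rotate across groups.
    """
    n = len(conditions)
    if len(items) % n != 0:
        raise ValueError("item count must be a multiple of the condition count")
    block_size = len(items) // n
    groups = []
    for g in range(n):
        trials = []
        for i, item in enumerate(items):
            block = i // block_size
            trials.append({"item": item, "condition": conditions[(block + g) % n]})
        groups.append(trials)
    return groups


# Hypothetical placeholder names and condition labels, for illustration only.
names = ["Name01", "Name02", "Name03", "Name04", "Name05",
         "Name06", "Name07", "Name08", "Name09", "Name10"]
groups = latin_square_lists(names, ["TTS-A", "TTS-B"])

for number, trials in enumerate(groups, start=1):
    print(f"Group {number}:")
    for trial in trials:
        print(f"  {trial['item']}  ->  {trial['condition']}")
```

With two conditions this produces two groups. Each group hears every name exactly once, each name appears in both conditions across the experiment, and the two groups start with different conditions.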

Step 4: Collect the data
When you are ready to run the test, recruit participants who match the profile of the people who will use the application. Again, the number of participants you need will be a multiple of the number of speech conditions in your experiment. A good rule of thumb is to include at least five participants for each speech condition. Begin the session by giving participants a standard set of instructions telling them what to do and what to expect during the test. Your instructions should include a short description of the application so participants have a context for judging the acceptability of the speech. Depending on the application, you may want to simulate various acoustic factors during the test, such as car noise or wireless network conditions.

Step 5: Analyze and interpret the data
For each participant, you can calculate an aggregate score for all of the items that he or she heard in each speech condition. For example, if you collected ratings for each item, an aggregate score might be the average of all of the ratings that a participant gave in a given speech condition. If you collected responses such as answers to multiple choice questions, an aggregate score might be the percentage of questions that each participant answered correctly in each speech condition. When you are finished, you can perform within-subject statistical tests, using every participant's aggregate scores for each speech condition.

In most experiments, statistical testing is performed with participants as the unit of analysis. This tells you whether differences in scores between the speech conditions will generalize beyond the sample of listeners that participated in the test. For language experiments, it is also important to analyze the data by items. An items analysis will tell you whether the pattern of results will generalize beyond the sample of items that you used in the experiment. To conduct an items analysis, calculate an aggregate score from the scores that every item received within every speech condition. Then, perform within-item statistical tests on the aggregate scores. Averages for each condition will be the same, regardless of whether data is analyzed by participants or items. What will be different is the variance within each condition, which is a factor in significance testing.

In addition to the results of significance testing, it is important to consider the actual scores that each condition received. For example, one TTS system may receive better scores than another, but both systems may receive fairly low scores overall. It is a good idea to include a "baseline" speech condition, which provides a point of reference for interpreting scores in other speech conditions. Often, experimenters will use natural speech as a baseline speech condition. However, including a natural speech condition can sometimes result in artificially low scores for the TTS conditions. In other words, asking participants to judge the acceptability of natural speech may lead them to believe that natural speech is an option for the application. As a result, participants may judge TTS more harshly than they might otherwise. Selecting an appropriate baseline condition may depend on the speech solutions that are possible for the application.

As TTS continues to improve, it may become more important to evaluate the acceptability as well as the comprehensibility of speech. When TTS is evaluated in the context of an application, users can judge whether the potential benefits of the application outweigh any difficulty or unpleasantness they experience in listening to the speech. Text-to-speech is still a long way from being as good as natural speech. But for many applications, TTS is good enough.
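
As a supplement to Step 5, the by-participants and by-items aggregation can be sketched in Python using only the standard library. Everything here is illustrative: the ratings are made-up numbers for a tiny two-condition design, the labels "A" and "B" are placeholders, and the paired t statistic is one possible within-subject test, chosen as an assumption because the article does not name a specific test.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

# Made-up ratings for illustration only: (participant, item, condition, rating).
ratings = [
    ("P1", "I1", "A", 5), ("P1", "I2", "A", 4), ("P1", "I3", "B", 3), ("P1", "I4", "B", 2),
    ("P2", "I1", "A", 4), ("P2", "I2", "A", 4), ("P2", "I3", "B", 2), ("P2", "I4", "B", 3),
    ("P3", "I1", "B", 3), ("P3", "I2", "B", 2), ("P3", "I3", "A", 5), ("P3", "I4", "A", 3),
    ("P4", "I1", "B", 2), ("P4", "I2", "B", 3), ("P4", "I3", "A", 4), ("P4", "I4", "A", 4),
]


def aggregate(records, unit_index):
    """Mean rating per (unit, condition); unit_index 0 = participant, 1 = item."""
    cells = defaultdict(list)
    for record in records:
        cells[(record[unit_index], record[2])].append(record[3])
    return {key: mean(values) for key, values in cells.items()}


def condition_mean(scores, condition):
    """Average of the aggregate scores for one condition."""
    return mean(value for (_, cond), value in scores.items() if cond == condition)


def paired_t(scores, cond_a, cond_b):
    """Paired t statistic on per-unit differences between two conditions."""
    units = sorted({unit for unit, _ in scores})
    diffs = [scores[(unit, cond_a)] - scores[(unit, cond_b)] for unit in units]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


by_participant = aggregate(ratings, 0)
by_item = aggregate(ratings, 1)

t_participants = paired_t(by_participant, "A", "B")
t_items = paired_t(by_item, "A", "B")

print("Mean A by participants:", condition_mean(by_participant, "A"))
print("Mean A by items:       ", condition_mean(by_item, "A"))
print("t by participants:", round(t_participants, 3))
print("t by items:       ", round(t_items, 3))
```

In this balanced toy design the condition averages match across the two analyses, while the t statistics differ because the variance of the difference scores differs. Turning a t statistic into a significance level requires a t distribution with n - 1 degrees of freedom, which a statistics package can supply.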

Kate Dobroth, Ph.D., is an independent usability consultant for Mindframe Design Consulting. She can be reached at kate@mindframedesign.com.
