Speech Technology Magazine


TTS and Personalities: Expressing True Attitude

In order for expressive TTS to be effective, the voice, script, and affective tone all must support the intent and match the customer.
By Judith Markowitz and Caroline Henton - Posted May 8, 2006

In "Show Some Emotion" (January/February, 2006), Judith Markowitz discussed advances in text-to-speech (TTS) technology that are enabling automated systems to express joy, anger, and other strong emotions.  Most speech applications don't require raw emotions, but developers, researchers, and customers believe that the addition of expressive nuances would enhance the effectiveness of more mundane TTS applications. There is a great deal of work being done to create TTS that is capable of expressing affect.

At the standards level, Kazuyuki Ashimura, team contact for the W3C Voice Browser Working Group, agrees that expressive TTS has the ability to enable automated systems to communicate intention and adds that "the W3C Voice Browser Working Group is looking at ways to extend the Speech Synthesis Markup Language (SSML) that will take into account such elements; although, there is no concrete standardization for it in W3C yet."
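SSML 1.0 already provides a coarse `prosody` element with `rate`, `pitch`, and `volume` attributes; what the working group is contemplating is finer-grained expressive control beyond that. As a minimal sketch of the current baseline, the following Python snippet generates standard SSML 1.0 prosody markup (the specific attribute values an engine honors vary by vendor, and any dedicated "affect" markup remains hypothetical):

```python
# Sketch: wrapping a prompt in standard SSML 1.0 <prosody> markup.
# SSML 1.0 defines no "affect" or "style" element, so anything more
# expressive than rate/pitch/volume is vendor- or future-spec-specific.
import xml.etree.ElementTree as ET

def make_ssml(text, rate="medium", pitch="medium", volume="medium"):
    """Return an SSML document applying coarse prosodic settings to text."""
    speak = ET.Element("speak", version="1.0",
                       xmlns="http://www.w3.org/2001/10/synthesis")
    prosody = ET.SubElement(speak, "prosody",
                            rate=rate, pitch=pitch, volume=volume)
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")

# A "brilliant, glamorous" welcome might raise pitch and volume:
print(make_ssml("Welcome to Federated Bank!", pitch="high", volume="loud"))
```

Note that this only labels *how* to say the words; deciding *which* settings convey a given attitude is the open problem the rest of this article examines.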

This article examines why creating expressive dialogues that can handle a full range of attitudes is a multi-faceted psycholinguistic, acoustic, and engineering challenge.

Can You Be a Little More Sorry?

In order for expressive TTS to be effective, the voice, script, and affective tone must all support the intent and match the customer. Silvia Quazza, a TTS researcher at Loquendo, indicates:

If I want to welcome the listener or advertise a product I would choose a brilliant, glamorous intonation. If I want to alert people in case of danger I would use an imperious tone.…An inventory of such conventional patterns can be considered an extension of the set of patterns for 'normal,' linguistic prosody.

According to Vytas Kisielius, consultant to Adeptra, a call center doing collections will use a non-threatening reminder, such as the following, for customers who have never been delinquent.

Hi. This is Susie calling from Federated Bank.  Sorry to bother you, but we haven't received a payment.  By the way, if it's convenient for you and you'd like to pay right now we can do that for you. 

It needs to be spoken in a friendly, sympathetic voice. In contrast, friendliness and sympathy are not appropriate for high-risk customers with a history of defaulting. Messages to them need to be more assertive - even aggressive.

Hey! You owe us some money.  You've already incurred some fees.  This is going to hurt your credit rating which is going to stink! 

The voice and tone that deliver this message must be as stern and no-nonsense as the content.

Research studies have demonstrated the effectiveness of such expressive TTS for a full range of call-center interactions. Cavalluzzi, Carofiglio, and de Rosis found, for example, that the "chance of success increases if the system establishes an empathic relationship with the user: even in domains which might appear as neutral, empathy contributes to make advice more acceptable."[1]  The reason is that this kind of expressive TTS emulates the expertise of effective call-center agents who know how to communicate the appropriate intent and outcomes for an interaction.

Brand Value

Voice designers at Nuance, BeVocal, Voxify, Tellme, and other companies have incorporated the demand for imperious or pleasing, but always appropriate, vocal behavior into their development work. They attempt to use specific affect in voices to create moods and personalities that add 'brand value' to IVR scenarios. For example, Nuance created a special character for the Metropolitan Transportation Commission in San Jose that delivers traffic information to taxi drivers in a "calm and soothing" voice.2 The 'persona' envisioned for this scenario was a "retired Highway Patrol officer, whose deep warm voice never sounds rushed or flustered."3

Voice timbre, rate of speech, and overall affect are clearly the most salient aspects of such artificial personalities. But for call-center clients who need to flesh out the being behind the voice, voice designers may also concoct mock résumés for their voices. These bios generally include statistics such as height, weight, and age, as well as education and the 'person's' likes and dislikes. One twenty-something female persona, who reads e-mail over the phone to Yahoo customers, has a bio that includes a dog and a new boyfriend, all of which probably sprang fully formed from a voice-brander's head.

These customized affective agents for specific discourse interactions are engineered attempts to create more empathetic 'involvement' for the user. So when someone calls to get health care or insurance information, the answering voice should sound more efficient and direct than, say, the voice that delivers a daily horoscope.  A banking persona might be more patient; a blackjack dealer or other game player may sound snappy or lugubrious depending on the game.

The Question at Hand

All this designer-psychology creativity still leaves open the question: "What are the exact acoustic and prosodic correlates of 'direct,' 'snappy,' 'patient,' 'drab,' or 'bright and sunny'?"

Creating richly expressive TTS is not as easy as recording more speech samples or expanding existing unit-selection algorithms. We can see why just by examining our own everyday utterances, which contain a multitude of articulatory and tonal adjustments at the topic, utterance, phrase, and even the word levels of our speech.

Kim Silverman, a principal research scientist at Apple Computer, describes an up-and-down flow of pitch and loudness that is tied to the topic under discussion. "Whenever we start a topic we raise our voice a bit to let the listener know that we've started something new. As we keep talking about that topic we lower our voice back down to the normal pitch range. Towards the end of the topic we lower our voice range down even more. Then for the next topic we raise our voice again." When Silverman demonstrated TTS systems using comparable pitch and loudness patterns and other topic-related variations, listeners reported that the TTS voice sounded more natural and appeared more interested in what it was discussing. Research has further shown that listeners rely on these suprasegmental and tonal variations to infer the overall 'block' and topic structure of a sequence of sentences. They do not merely make the speech sound more natural; they are essential for listeners to follow the meaning.
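Silverman's topic-level contour can be sketched as a simple rule: raise the pitch baseline for a topic's opening sentence, return to the normal range mid-topic, and drop further for the closing sentence. The multipliers below are illustrative placeholders, not measured values from his work:

```python
# Illustrative sketch of a topic-level pitch baseline, following the
# contour Silverman describes. The 1.15x and 0.90x factors and the
# 120 Hz baseline are made-up placeholder values, not measured data.
def baseline_pitch_hz(sentence_index, topic_length, normal_hz=120.0):
    """Return a baseline pitch for one sentence within a topic."""
    if sentence_index == 0:                  # topic opener: raised baseline
        return normal_hz * 1.15
    if sentence_index == topic_length - 1:   # topic closer: extra lowering
        return normal_hz * 0.90
    return normal_hz                         # mid-topic: normal range

topic = ["First, let's talk about flights.",
         "There are two options on Tuesday.",
         "That covers the flight choices."]
for i, sentence in enumerate(topic):
    print(f"{baseline_pitch_hz(i, len(topic)):.0f} Hz  {sentence}")
```

A real system would also shape loudness and within-sentence declination; this captures only the block-structure cue listeners use to track topic boundaries.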

IBM researchers Aaron, Eide, and Pitrelli4 offer the following example of contrastive stress at the utterance level that is typical of transaction dialogues between humans.  

Caller: "I'd like a flight to Boston Tuesday morning."
Computer: "I have two flights available on Tuesday afternoon."4

They point out that

The software's ability to emphasize the word "afternoon" would simplify the exchange enormously. The caller implicitly understands that no flights are available in the morning, and that the computer is offering an alternative. In contrast, a completely unexpressive system could cause the caller to assume that the computer had misunderstood him, and he would probably end up repeating the request.4
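The markup half of this problem is already standardized: SSML 1.0 defines an `emphasis` element. What remains hard is deciding which word carries the contrast. The sketch below assumes the dialogue manager supplies that word explicitly (the function name and its simple string replacement are illustrative, not any vendor's API):

```python
# Sketch: marking contrastive stress with SSML's standard <emphasis>
# element. The word to stress must come from the dialogue manager;
# choosing it automatically is the genuinely hard part.
def emphasize(sentence, word):
    """Wrap the first occurrence of a word in an SSML <emphasis> element."""
    marked = sentence.replace(
        word, f'<emphasis level="strong">{word}</emphasis>', 1)
    return f'<speak version="1.0">{marked}</speak>'

print(emphasize("I have two flights available on Tuesday afternoon.",
                "afternoon"))
```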

The W3C's Ashimura adds distinguishing among homonyms and confusable sequences to the list. "The same phoneme sequences can have opposite meanings. Those can not be identified without prosodic information such as speech duration or intonation." He points out that such patterns exist in every language. "For example, in Japanese /uN/ can have two meanings: One is 'YES' with short duration, another is 'NO' or hesitation with long duration." A corollary in English is the different affect of a speaker who says "Mmmm" with a rising tone (interested/enthusiastic), versus a falling tone (disinterested/bored).

These examples provide a glimpse of how intonation patterns used in normal communication arise from a myriad of interrelated sources including dialogue and topic structure, semantics, and interpersonal dynamics.

Act and React

The example from Aaron, Eide, and Pitrelli highlights the fact that the generation of appropriate and effective synthetic speech is a function of the dialogue. This is extremely difficult whether the system is handling strong emotions or subtler affective coloration. Rate of speech, overall pitch level and intensity, and intonation tunes all contribute to conveying vocal emotion, but there is much debate about the particular mix of these components for emotions in different speakers of different dialects and differing languages. Some individuals speak slowly when they're content; others slow down when they're so angry that they enunciate individual words in an attempt to contain their frustration. If a system's emotion-detection algorithms get the emotion wrong, and consequently respond to an irate male caller with a chirpy, upbeat female persona, they will only exacerbate the situation and lead to repeated digital pounding of the star key. It is even harder to detect irritation, sarcasm, amusement, and other more subtle affective colorations. Proper identification of those situations can prevent them from escalating into full-blown outbursts.

Don't Worry, Be Happy!

It appears that adding expressive color to TTS to make it capable of handling mundane, everyday communications constitutes a far more difficult challenge than the one posed by creating emotional TTS.  Creating emotional colorations in synthetic speech depends on the manipulation of many acoustic parameters. 

No commercial TTS engine currently produces the natural long-term voice qualities needed to synthesize emphasis, attitude, or differing emotions adequately. Rule-driven synthesizers (such as those from Fonix/DECtalk or Sensory) potentially offer more versatility for synthesizing emotional speech because parameters that control vocal quality and articulation precision can be adjusted easily in those systems. Unfortunately, the shortcomings of parametric models persist, which means that developments using parametric synthesizers are destined to remain somewhat crude, since the voice source is entirely artificial. That is, no human has recorded any of the segments or intonation tunes. Furthermore, attempts to create emotions by manipulating, inter alia, speaking rate, intonation tones, boundary placement and frequency, and other suprasegmental parameters are insufficient to model emotions. They are more likely to produce confusion in listeners and ultimately detract from the naturalness of the interaction.

Henton and Edelman5 developed a means for authoring text in several visual dimensions that allows the dialogue author to mark up the words for long-term or isolated affect. Vocal emotions were added to concatenative synthetic speech using a limited number of prosodic parameters, namely average speaking pitch, pitch range, pitch movements, speech rate, segment duration, volume, and silence. Using these seven parameters, some commonly agreed vocal emotions that they defined included angry (threatening) and angry (frustration), happy, emphatic, bored, and sad. The size and boldness of the text characters could be altered to indicate stress, or emphasis. Colors were also associated with the text to indicate the 'mood' of the utterance, so that "My goldfish died yesterday" could be authored in blue to indicate sadness, and "That goldfish was absolutely delicious" could appear in a 'happy' yellow.

Using this type of approach, it is possible to partially answer the question posed above.  There are no exact acoustic or prosodic correlates, but Henton and Edelman suggest the following correlates for the attitudes exemplified by shorthand epithets, without TTS-engine or voice-specific values:

                     Pitch mean/range    Volume     Speaking rate (w.p.m.)

Bright and sunny     Neutral/wide        Neutral    200
Drab                 Neutral/narrow      Low        195
Patient              Neutral             Neutral    170
Snappy               High/wide           High       220
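The table above is, in effect, a lookup from attitude names to qualitative prosodic recipes. A minimal sketch of how an application might encode it (the dictionary name and lookup function are illustrative; mapping the qualitative settings onto a particular TTS engine's parameter scales is engine-specific and deliberately left open, as the authors note):

```python
# The attitude table encoded as data. Values are the qualitative
# settings from the table; translating "neutral/wide" etc. into a
# given engine's numeric parameters is engine-specific.
ATTITUDES = {
    "bright and sunny": {"pitch": "neutral/wide",   "volume": "neutral", "wpm": 200},
    "drab":             {"pitch": "neutral/narrow", "volume": "low",     "wpm": 195},
    "patient":          {"pitch": "neutral",        "volume": "neutral", "wpm": 170},
    "snappy":           {"pitch": "high/wide",      "volume": "high",    "wpm": 220},
}

def settings_for(attitude):
    """Look up the prosodic recipe for a named attitude."""
    return ATTITUDES[attitude.lower()]

print(settings_for("Snappy"))
# {'pitch': 'high/wide', 'volume': 'high', 'wpm': 220}
```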

Ellen Eide and colleagues6 have recently been working on creating "flexible" speaking styles using the IBM TTS engine. These styles are: neutral declarative, conveying good news, conveying bad news, asking a question, and showing contrastive emphasis. Linguistically, it is debatable whether "declarative" and "interrogative" are speaking 'styles' or should be classified simply as grammatical 'moods.' The IBM TTS system can also generate paralinguistic events such as sighs, breaths, and filled pauses, which further enrich the perception of expressive affect and naturalness.

Where Do We Go From Here?

We have suggested some approaches to address this challenge, but the road is bound to be long and bumpy. As Loquendo's Quazza explains, developers are traveling it in stages:

First, we are releasing something that is immediately useful: the possibility of enriching synthetic messages with expressive phrases and sounds, which may convey expressive intentions and spread their emotional color all over the message. Then we hope to release a more general means of assigning a given expressive intention to any text, independently of its content.

It will be interesting to see exactly what paths are taken and which vehicles are used to traverse them.

1 Cavalluzzi, Addolorata, Valeria Carofiglio, and Fiorella de Rosis. 2004. Affective Advice Giving Dialogs. Tutorial and Research Workshop on Affective Dialogue Systems, June 2004.

2 Wong, Nicole C. They speak, thereby they brand. San Jose Mercury News, March 21, 2005. p. E1.

3 Ibid.

4 Aaron, Andy, Ellen Eide, and John Pitrelli. 2003. Making Computers Talk. ScientificAmerican.com (originally published in Scientific American, March 17, 2003).

5 Henton, Caroline and Bradley Edelman. 1996. Generating and manipulating emotional synthetic speech on a personal computer.  Multimedia Tools and Applications, 3(2):  105-125.

6 Eide, Ellen, Andy Aaron, Raimo Bakis, Raul Fernandez, Wael Hamza, Michael Picheny, John Pitrelli, Zhi Wei Shuang, and Wei Zhang. 2006. Text-to-Speech: Bridging the Flexibility Gap between Humans and Machines. Proceedings of SpeechTEK/AVIOS.

Judith Markowitz is president of J. Markowitz, Consultants and technology editor of Speech Technology Magazine. Caroline Henton is the founder and CTO of Talknowledgy.

Both authors are members of the Editorial Advisory Board of Speech Technology Magazine.
