Show Some Emotion
Humans are emotional beings. We are highly skilled at crafting voice output - on the fly - that expresses a broad spectrum of emotions and intensities that range from minor irritation and mild amusement to tumultuous outbursts. Managing the expression of emotion in the speech applications that our customers build is also part of our work as speech-technology professionals. Today, we rely on trained actors (voice talent) to craft emotion-laden speech, but some of our customers, notably the entertainment industry, have been waiting for us to provide them with commercial-grade text-to-speech synthesis capable of conveying emotions (expressive TTS).
A Long Road
When I asked TTS researcher Quazza Silvia at Loquendo why Loquendo made commercialization of expressive speech a priority, Silvia's first response was "because this is universally considered the new frontier in speech synthesis research." Creating TTS that can communicate emotion may be a new frontier, but the path to it is more than 130 years old. In his 1872 publication, "The Expression of the Emotions in Man and Animals," Charles Darwin critiques the work of other emotion researchers and posits an evolutionary basis for emotion. Darwin links emotional speech to ancestral singing: "consequently, when the voice is used under any strong emotion, it tends to assume, through the principle of association, a musical character."1 Since then, other researchers have proposed theories of emotion that explain and categorize emotions and emotional expressions from a variety of perspectives.
According to most experts, systematic research on the bond between emotion and speech began with the work of Grant Fairbanks in the 1930s and 1940s.2 By the 1980s, computational linguists and TTS system developers, including Roddy Cowie, Nick Campbell, Caroline Henton, Klaus Scherer, Marc Schröder, Kim Silverman, and Christof Traber, had become involved. In 1988, Janet Cahn of MIT's Media Laboratory spoke on emotional computer voices at Speech Tech.3 Today, work on speech and emotion has become an active international effort involving researchers from the U.S., Canada, Europe, Asia/Pacific, and the Middle East.
Despite the years of research and development on speech and emotion, Loquendo's 2005 release is the first commercial expressive TTS product using concatenative synthesis.4 Even that system, according to Loquendo's Silvia, is only a first-generation tool: "something that is immediately useful: the possibility of enriching synthetic messages with expressive phrases and sounds, which may convey expressive intentions and spread their emotional color all over the message."
In fact, most TTS systems, says David Nahamoo, head of IBM's Human Language Technology Group, "lack the ability to be 'context-sensitive' like humans so they can't modulate the spoken output according to the context and the situation."
Why don't we already have a fully-functional, commercial, expressive system using concatenated TTS? "It's a complex and difficult task that involves many factors," responds Loquendo's lead researcher, Paolo Baggia. Marc Schröder (Research Center for Artificial Intelligence in Germany) agrees, and adds that "the complexity starts with the fuzziness of the emotion concept itself…Multiple approaches stress different interesting aspects, but they cannot easily be integrated and sometimes seem to contradict each other."5
Kim Silverman, Apple Computer's principal research scientist, lists other factors:
- The words that the speaker selects
- The words within the utterance that the speaker chooses to emphasize
- The speaker's facial expressions
- The interaction between the intonation and grammar the speaker uses
- The range of pitches used within that intonation pattern
Figure 1: Intensification of I told you not to do that using expanded pitch range
For example, increasing the range of pitches within an intonation pattern intensifies the emotion that is being expressed (see Figure 1). According to Silverman, "Higher overall pitch range means you are more aroused; you are more involved in what you are saying. You are more concerned about it and it produces more of a physiological response within you." This arousal is communicated to the listener who, hopefully, perceives the same emotion the speaker intends to express. Since this isn't always the case for human-human communication, intensification of expressive TTS needs careful crafting.
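To make the idea of an expanded pitch range concrete, here is a minimal sketch in Python. It is not drawn from any of the systems discussed in this article. It assumes a pitch contour is available as a list of voiced frequency values in Hz, and it widens the contour by scaling each value's distance from the contour's average on a logarithmic (musical) scale. The function names, the sample contour, and the factor of 1.5 are all hypothetical, chosen only for illustration.

```python
import math


def expand_pitch_range(f0_hz, factor):
    """Scale a pitch contour's excursions around its log-domain mean.

    f0_hz:  list of pitch values in Hz, all greater than zero
            (a hypothetical input format, assumed for this sketch).
    factor: above 1 widens the range, below 1 narrows it,
            and exactly 1 leaves the contour unchanged.
    """
    logs = [math.log2(f) for f in f0_hz]
    center = sum(logs) / len(logs)
    return [2 ** (center + factor * (v - center)) for v in logs]


def pitch_range_semitones(f0_hz):
    """Distance between the highest and lowest pitch, in semitones."""
    return 12 * math.log2(max(f0_hz) / min(f0_hz))


# Made-up contour for a short utterance; the values are illustrative only.
contour = [180.0, 220.0, 260.0, 200.0, 150.0]
wider = expand_pitch_range(contour, 1.5)

print(round(pitch_range_semitones(contour), 2))
print(round(pitch_range_semitones(wider), 2))
```

Working in the logarithmic domain keeps the overall pitch level where it was while stretching the peaks and valleys, so the widened contour spans 1.5 times as many semitones as the original. A real system would also have to handle unvoiced stretches and keep the result within a plausible range for the voice, which this sketch ignores.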
Silverman and other researchers add voice quality to the list as an indicator of whether a speaker is positively or negatively disposed towards the content of an utterance.
IBM's Nahamoo points out that in order to have an effective concatenative TTS system, you need to have all necessary elements fully represented in the database. Unfortunately, according to Nahamoo,
…creating a reasonably small database that captures all of these varieties of emotions is a challenge. The challenge is that we might not have a complete set of segments for an emotion. If you don't find an appropriate segment, how do you create the utterance? For example, can we take something from a happy mode and transform it into a sad mode? This kind of morphing from one kind of expressiveness to another is extremely difficult. It's probably the most difficult part of the research that we have done.
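The coverage problem Nahamoo describes can be illustrated with a small Python sketch. It is a toy model, not IBM's method: it assumes the recorded segments are stored as a set of labels per emotion and simply reports which segments an utterance needs but the database lacks for the requested emotion. The inventory, the segment labels, and the function name are hypothetical.

```python
def find_missing_units(needed_units, database, emotion):
    """Return the needed segments that were never recorded for this emotion."""
    available = database.get(emotion, set())
    return [unit for unit in needed_units if unit not in available]


# Hypothetical inventory: emotion -> set of recorded segment labels.
database = {
    "neutral": {"h-e", "e-l", "l-o"},
    "happy": {"h-e", "l-o"},
}

# Segments required to build one made-up utterance.
needed = ["h-e", "e-l", "l-o"]

missing_happy = find_missing_units(needed, database, "happy")
missing_sad = find_missing_units(needed, database, "sad")

print(missing_happy)
print(missing_sad)
```

Every segment that turns up in the missing list is one that would have to be recorded afresh or transformed from another emotion, which is the morphing step Nahamoo calls the hardest part of the research.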
Silverman bemoans the fact that "most people who want to explore how they can add emotion to speech synthesis believe that there are tunes associated with different emotions." They look for the happy tune, the polite tune, the angry tune, the sarcastic tune, the irony tune, and so on. This is a relic from the original Darwinian theory that, according to Silverman, is woefully wrongheaded.
The assumption that there are tunes associated with emotions would seem to make sense intuitively, but as you dig deeper it turns out to not be the case. The same tune can have different emotional overtones associated with different utterances containing different words. So, what are thought to be angry tunes are usually angry word sequences where the tune is aligned in such a way to emphasize words that are associated with conveying to the listener what the speaker is angry about. "I TOLD you not to do that!" - Silverman
Figure 2: The impact of different words on the same tune
Figure 2 illustrates Silverman's point. It shows how the same tune can express different emotions based on the words in the utterance.
Silvia agrees that "you can realize different emotions with the same prosodic pattern, provided that the result is plausible and coherent with the expressive intention," but doesn't see that as a problem for expressive TTS. First of all, "you don't have to implement all the alternatives. A single realization of each emotion is all that is necessary." Incorporation of a single representation for a few emotions is an advance over existing technology and would be useful for video and computer games. The more complex systems of the future will require more. By then, expressive TTS should have advanced beyond the single-representation stage and may even contain non-linguistic expressions of emotion, such as laughter and tears.
The Sounds of Gladness
The vocal expression of emotions is not limited to words and melodies. It includes sobs, whimpers, screams, growls, and a spectrum of laughter types from demure titters to prolonged, snorting belly laughs. Although we sometimes produce inappropriate laughter and crocodile tears, for the most part, we humans are adept at incorporating the entire non-linguistic repertoire into our dialogues. A fully-stocked expressive TTS system should be able to perform as well.
Figure 3: Standalone and embedded laughter
Incorporation of these emotion vocalizations is remarkably complex. The variability of non-linguistic expressions matches that of the expressive speech. In addition, the expressive TTS system needs to be able to generate them both as standalone vocalizations and as part of the linguistic output (see Figure 3).
Furthermore, facial expressions, which would normally not be a factor in unimodal auditory systems, make themselves "heard" when emotions are involved. This includes the blubbered speech of a sobbing speaker and smiled speech.
Lip spreading, characterized by horizontal labial expansion, and sometimes combined with vertical constriction can be distinguished from the neutral voice quality and will be perceived as smiling - the facial expression which accompanies happiness.6
Even when these acoustic obstacles are overcome, developers must also address functional and contextual issues. Emotion researchers Shiva Sundaram and Shrikanth Narayanan note that "The same speaker may laugh differently while reacting to different situations…Hence, there are both speaker-to-speaker variations and variations within speaker."7 These variations can even extend to sobbing with joy and laughing in despair.
Even if the synthesis isn't going to cover the outer reaches of emotional outbursts, researchers like Jürgen Trouvain and Marc Schröder know that they still face the basic problem of determining "when to add laughter in synthetic speech (everybody knows examples of laughter in inappropriate situations)" and they caution that while "laughter can be added to synthetic speech so that listeners have the feeling of higher social bonding… inappropriate type or intensity of the laugh can destroy the desired effect in this socially sensitive area."
Automating emotional, expressive speech is a difficult undertaking, but according to all the developers with whom I spoke, this research is moving forward quickly. We already have Loquendo's 2005 commercial system, which is admittedly a first-generation offering, and Nuance recently introduced its Sculptured Speech toolkit, which can be used to make TTS more expressive. We can expect to see second and third generations of increasingly powerful expressive TTS in the near future.
Eventually, developers will produce fully-expressive automated systems. When such systems will become commercially viable is still hard to tell.
1 Darwin, Charles. 1899. The Expression of the Emotions in Man and Animals. Project Gutenberg EText p38. (Electronic reprint of the 1899 re-release of Darwin's 1872 publication by D. Appleton and Co. of New York/London.) NOTE: In 1998, Darwin's 1899 version was re-released with afterword and commentaries by emotion researcher Paul Ekman.
2 Fairbanks, Grant, and W. Pronovost. 1938. Vocal pitch during simulated emotion. Science 88 (2286). Pp. 382-3.
Fairbanks, Grant, and W. Pronovost. 1939. An experimental study of the pitch characteristics of the voice during the expression of emotion. Speech Monographs 6. Pp. 87-104.
3 Cahn, Janet E. 1988. From sad to glad: emotional computer voices. Proceedings of Speech Tech 1988. Pp. 35-36. New York.
4 Some expressive parametric TTS systems and tools appeared in the 1990s, notably Apple Computer's MacinTalk, which included tools to generate emotional TTS. Developers included Kim Silverman and Caroline Henton. Apple also released a labeling system called ToBI.
5 Schröder, Marc. 2004. Speech and Emotion Research: An Overview of Research Frameworks and a Dimensional Approach to Emotional Speech Synthesis. PhD Thesis, Saarland University, Saarbrücken. P. 1.
6 Robson, Julie, and Janet Mackenzie Beck. 1999. Hearing Smiles - Perceptual, Acoustic And Production Aspects Of Labial Spreading. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco, USA. P. 222.
7 Sundaram, Shiva, and Shrikanth Narayanan. 2004. Synthesis of Human-Like Laughter: An Initial Evaluation. Paper presented at the 148th meeting of the Acoustical Society of America. November 15, 2004. P. 5.
is the technology editor of Speech Technology Magazine
and is a leading independent analyst in the speech technology and voice biometric fields. She can be reached at (773) 769-9243 or firstname.lastname@example.org