Speech Technology Magazine


Avatars Meet the Challenge

A SpeechTEK Europe session showed the appeal of the technology.
By Caroline Leathem, Nava Shaked, and Susan L. Hura - Posted Nov 1, 2010

An avatar is a computer representation of a person using animated graphics and a real or synthesized voice. The term originates from the Sanskrit word avatara, a concept similar to incarnation. The appeal of animated avatars is obvious: Avatars put a face to the voice of automated service and allow organizations to further personify their self-service offerings. 

Avatars are increasingly being used to communicate with users on a variety of electronic devices, such as computers, mobile phones, PDAs, kiosks, and game consoles. Avatars can be found across many domains, such as customer service and technical support, as well as in entertainment. Some of the many uses of avatars include the following:

  • reading news and other information to users;
  • guiding users through Web sites by providing instructions and advice;
  • presenting personalized messages on social Web sites;
  • catching users’ attention in advertisements and announcements;
  • acting as digital assistants and automated agents for self-service, contact centers, and help desks;
  • representing character roles in games;
  • training users to perform complex tasks; and
  • providing new branding opportunities for organizations.

SpeechTEK Europe 2010, held in late May in London, offered an Avatar Challenge that presented a collection of newly developed avatars designed for a range of business applications. The purpose of the challenge was to expose the SpeechTEK audience to the range of possible uses of avatars and spark new ideas for their expansion and refinement. We also aimed to understand the current state of the art in avatars by reviewing a variety of avatar implementations and evaluating them for the overall experience they provide users. Because avatars personify an organization more vividly than voice-only applications do, issues such as usability and conformance to social conversational norms are even more important. 

In the contest we were asked to evaluate seven avatars of different kinds, genders, and languages, all used for customer service or human-machine interaction. We measured each against the following criteria:

• The avatar’s appearance is human-like. Because the avatar is a form of customer service personalization, its look and feel are very important. A human-like appearance is necessary to create a sense of familiarity and confidence and to foster a natural rapport rather than a stiff, robotic interaction. Users accept a variety of appearances as long as they are pleasing to the eye.

• The avatar’s movements are human-like and non-jerky. An avatar should not only look somewhat human but also move in the same manner. The design should ensure the avatar’s movements are not stiff and appear as natural as a human figure’s.

• The voice is expressive. The voice is the avatar’s means of communication as well as part of its personalization. If the voice is synthesized, it must be clear and free of any metallic, robotic quality. Appropriate stress and intonation are also imperative for a clear understanding of the content. 

• The avatar effectively uses facial and body gestures. We checked whether facial and body gestures contributed effectively to the understanding of the dialogue. We considered how well the content and dialogue structure aligned with the gestures to determine whether the avatar presented a believable character.

• The voice is well-synchronized with lip and other facial movements. Keeping the voice, lip movements, and facial gestures synchronized is challenging yet essential. Where synchronization is poor, the avatar becomes unintentionally comical, and the mismatch distracts the user from following the dialogue.

The judges rated each avatar against the above criteria on a scale of 1 to 5 (with 1 being the worst and 5 the best), discussed the characteristics of each avatar, and selected avatars for awards.
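A rubric of this kind is easy to operationalize. The sketch below shows one minimal way to average a panel's 1-to-5 ratings per criterion; the criteria names come from the article, but the scores and the three-judge panel are made-up illustrations, not the actual contest data.

```python
# Illustrative sketch of the judges' 1-5 rubric. Criteria are from the
# article; the ratings below are hypothetical, not the real contest scores.
from statistics import mean

CRITERIA = [
    "human-like appearance",
    "human-like, non-jerky movement",
    "expressive voice",
    "effective facial and body gestures",
    "voice/lip synchronization",
]

def score_avatar(ratings):
    """Average each criterion's 1-5 ratings across the judging panel.

    `ratings` maps a criterion name to a list of per-judge scores.
    """
    for criterion, scores in ratings.items():
        if criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {criterion}")
        if not all(1 <= s <= 5 for s in scores):
            raise ValueError("ratings must be on the 1-5 scale")
    return {c: round(mean(s), 2) for c, s in ratings.items()}

# A hypothetical three-judge panel rating one avatar:
ratings = {c: [4, 3, 5] for c in CRITERIA}
print(score_avatar(ratings)["expressive voice"])  # prints 4.0
```

Averaging per criterion, rather than collapsing to a single number up front, preserves the profile the judges actually discussed (an avatar can look superb yet synchronize poorly).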

After reviewing Avatar Challenge submissions, the judges agreed that avatars can be an effective addition to an organization’s overall self-service offerings. We also acknowledge that getting avatars to look, sound, and feel right poses a formidable challenge. A number of factors played an important role in our subjective impressions of the avatars, including facial expressions, overall personality and likability, and synchronization of mouth movements with speech. Characteristics such as the quality of synthetic speech (if used) and how well the avatar served its intended purpose were also important, but the first group of factors weighed quite heavily on our reactions. The challenge for those of us in the speech community is that we are generally less familiar with, and possibly even less aware of, the first group of factors than the second. Many speech professionals can ensure synthetic speech is intelligible and appropriate, but far fewer in our community have the background and skills needed to generate appropriate facial expressions or synchronize facial movements with synthetic speech. To create effective, appealing, engaging avatars, organizations must be cognizant of gaps in their expertise and enlist the help of specialists able to provide expert guidance. 

Just as the transition from touch-tone to speech systems involved more than rewriting prompts, the move from speech to avatars involves far more than marrying an animated face to a synthetic voice. Understanding and mastering these complexities will challenge our industry for many years to come.
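One concrete piece of that complexity is lip synchronization: a TTS engine can report when each phoneme is spoken, and the animation layer must map those timed phonemes onto mouth shapes ("visemes"). The sketch below illustrates the idea; the phoneme set, viseme groupings, and timings are simplified inventions for illustration, not any vendor's actual API.

```python
# A minimal sketch of one piece of the lip-sync problem: mapping timed
# phonemes (as a TTS engine might report them) to viseme keyframes.
# Phoneme labels, viseme groups, and timings are illustrative only.

# Coarse phoneme-to-viseme grouping: many phonemes share one mouth shape.
PHONEME_TO_VISEME = {
    "p": "closed", "b": "closed", "m": "closed",
    "f": "teeth-on-lip", "v": "teeth-on-lip",
    "aa": "open", "ae": "open",
    "uw": "rounded", "ow": "rounded",
    "s": "narrow", "z": "narrow", "t": "narrow", "d": "narrow",
}

def phonemes_to_keyframes(timed_phonemes):
    """Turn (phoneme, start_ms, end_ms) tuples into viseme keyframes,
    merging consecutive phonemes that share the same mouth shape."""
    keyframes = []
    for phoneme, start, end in timed_phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme, "neutral")
        if keyframes and keyframes[-1][0] == viseme:
            # Same mouth shape as the last keyframe: extend it in time
            # instead of emitting a duplicate frame.
            keyframes[-1] = (viseme, keyframes[-1][1], end)
        else:
            keyframes.append((viseme, start, end))
    return keyframes

# The word "map" ~ /m ae p/, with made-up timings in milliseconds:
print(phonemes_to_keyframes([("m", 0, 80), ("ae", 80, 200), ("p", 200, 260)]))
```

Even this toy version hints at why the judges weighted synchronization so heavily: a small timing drift between the audio clock and these keyframes is immediately visible on a face, in a way it never is in a voice-only application.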

The avatars that were submitted for review were the following:

H-care
Purpose: “Simge” (“icon” in Turkish) provides daily financial information to mobile users.
Appearance: An animated cartoon headshot of a professional-looking female with short, brown hair. The background changes with the time of day, showing an office building either in daylight or at night.
Language: Turkish
Format of submission: A prerecorded demo. 
Text is converted to speech by Loquendo TTS. The video is prepared by H-CARE software. 

AmegoWorld
Purpose: An Amego is an animated talking head that can be used to deliver spoken messages and integrate with a variety of social networking APIs. The submitted avatar was the most casual of the entries and aimed to provide a fun interface that encourages the user to engage with it, thus promoting the novelty of an avatar for use in online communications (e.g., bringing tweets and messages to life).
Appearance: An animated 3D representation created from a photograph of the user. In addition to the read-out of text, the user can choose from four moods that alter the facial expressions of the character.
Language: U.K. English 
Format of submission: A real-time demo that allowed the judges to enter text that was then read back. 
AmegoWorld uses a unique TTS engine designed to be portable to any electronic device. It is capable of a wide range of voice dynamics to cope with national, regional, and foreign accents and languages. 

ejTalk (Experts’ Prize Winner)
Purpose: “Cassandra” aims to make a conversational experience friendly and fun for the user while broadening her ability to behave naturally and with contextual appropriateness. Her main purpose appears to be not to serve as a digital assistant but to act as one manifestation of the ejTalk conversation engine.
Appearance: Very different from the other avatars. She was the least human-like, with a bald, floating head and neck, and no eyebrows, yet with startling blue eyes in an otherwise pink-gray face. 
Language: U.S. English
Format of submission: A prerecorded demo of Cassandra answering questions about the Shakespeare play Hamlet.
ejTalk coordinates all of the aspects of the conversation, including speech recognition, synthesis, avatar display, and multimodal interactions (e.g., touch screens). 

H-care (People’s Choice Winner)
Purpose: “Myda” (My Digital Assistant) is a virtual personal assistant who can notify users about appointments, tasks, email, and social network updates.
Appearance: An animated cartoon headshot of a professional-looking female with short, brown hair. The background was a black-and-white view of a room containing office chairs; this contrasted well with the avatar, which was in color.
Language: U.K. English
Format of submission: A real-time demo that allowed the judges to enter text that was then read back. 
Myda is powered by H-care’s Face Engine and is created from more than 20,000 polygons with 3D rendering. The voice synthesis is powered by Loquendo.

Humanity Interactive (Experts’ Prize Winner)
Purpose: “Jerry” is a virtual assistant that can be used on a Web page to greet customers, provide support, or present content.
Appearance: A cartoon representation of the top-half of a casually professional young man with short, brown hair, glasses, and incredibly expressive eyebrows. The background appears to be a modern office seating area. 
Language: U.S. English with a laid-back style of speaking 
Format of submission: A real-time demo that allowed the judges to enter text that was then read back. 
Humanity Interactive provides its own patented natural language processing and animation technologies. 

Umanify
Purpose: “Juan” is an interactive digital assistant for the Web site of the Spanish Institute of International Trade.
Appearance: Photorealistic headshot of a professional-looking young man with short, brown hair in an office environment. 
Language: Spanish
Format of submission: A real-time demo that allowed the judges to enter text that was then read back. 
Umanify’s multimodal interface combines digital assistants that speak more than 20 languages and interact directly with users in real time using natural language.

VoxWeb Voice Solutions (Judges’ Special Award Winner)
Purpose: The VoxWeb avatar has been developed as a virtual assistant to support navigation of Web sites. 
Appearance: A 3D animated photograph of the top half of a casually dressed lady wearing a striking necklace. The background was plain white. In addition to entering text, the user can make her nod, shake her head, or wink; add vocal items, such as “ooh,” that have been preset with facial expression and intonation; and use preset phrases, such as “thank you so much.” 
Language: U.K. English
Format of submission: A real-time demo that allowed the judges to enter text that was then read back. 

For views of the seven deployed avatars, go to www.speechtek.com/europe2010/avatar.  

Susan Hura is principal and founder of SpeechUsability, a VUI design firm; Caroline Leathem is an interaction specialist in the Communications Team at Verizon Business’ operations in Europe, the Middle East, and Africa; and Nava Shaked is CEO of BBT, a professional practice focusing on voice applications.
