Let's Get Creative!

Speech Applications Can Do More Than Impersonate Agents
Speech application developers have long understood the importance of good voice user-interface design. The telephone is the most intimate form of communication most consumer product and service companies have with their customers. Unlike advertising, or even the Web, it's the contact medium where customers are the most engaged and prone to form impressions, good or bad, of a company. Given the public's inexperience with speech technology, their previous encounters with frustrating DTMF (touch tone) applications, and the technology's inherent limitations and subtleties, the voice user-interface clearly deserves substantial attention and effort.

The prevailing concept of voice user-interface design today is to replicate, to the greatest extent possible, the experience of speaking with a friendly, helpful human agent. While the intent is fine, it misses a much greater potential made possible by new speech technologies: to transform the interactive voice response (IVR) system into a powerful new customer communications and branding medium.

Think about the early days of radio, when commercials simply consisted of an announcer reading some copy about the advertiser's product. Today, there's hardly a radio commercial that doesn't feature creative audio production - a musical soundtrack, a voice actor's oral performance and sound effects - to build a stronger and richer impression than is possible with spoken words alone. Or consider computer games. Try playing one with the sound turned off. Though barely noticeable if you don't think about it, the music and action sounds (running, doors opening, guns firing) play a huge role in constructing the player's mental model of the game and enhancing the overall experience.

Now think about the telephone. While live agents can provide excellent service, the customer's experience is essentially the same for every company they call. The only distinguishing characteristics are how friendly the agents are (which may vary from agent to agent and day to day), and the scripts used for different tasks. Speech applications designed to mimic agents are similarly one-dimensional, even those where great effort has been made to create a pleasing "persona." The goal of the "persona" is essentially to impersonate an appealing agent.

On the other hand, speech applications with creative audio production offer countless possibilities for forming rich, unique caller experiences. Like radio commercials, these applications feature musical themes, voice actors' performances and perhaps sound effects and audio "logos" that together create vivid caller experiences. But unlike commercials, they engage in intimate, one-on-one contact with customers, day after day.

Think of them as audio infomercials you can talk to. Beyond enabling callers to accomplish their transactions, they can entertain, reinforce the company's brand, and generally have a qualitatively different effect than just speaking with a real or virtual agent. Moreover, they can include non-verbal audio cues to help callers navigate the user-interface. The icing on the cake is that the price tag for all this should be small in relation to the overall costs of a speech application development project.

Entertainment, Marketing and Branding

Why hasn't the creative audio concept been applied before to DTMF-based IVR applications? Maybe because the mechanical experience of pressing keys and navigating DTMF menus can't be improved much by creative audio production. Speech technology, by contrast, elicits more natural interactions and so provides a good foundation for this approach. And the greater complexity of speech applications calls for more sophisticated user-interfaces that can benefit from non-verbal audio cues.

Whatever the reason, it's incredible that many companies that spend millions of dollars building their brands through advertising and promotion apparently give very little creative thought to the impressions they make when actual customers call them.

Actually, a primitive form of creative audio has been around in telephone systems for a long time. "Music-on-hold" is often played while callers wait in queue for an agent. Besides its entertainment value, music-on-hold serves a user-interface function: it lets callers know they're still connected and their calls will be answered eventually. It can also support branding. At one time, Delta Air Lines' reservation system featured a south-seas-sounding music-on-hold that suggested exotic destinations. The same music was played on their aircraft during boarding and deplaning, effectively acting as a musical theme for the company.

A somewhat different example is AOL's Moviefone, available in a few major U.S. cities at 777-FILM. It features a frenetic voice actor who offers information on local movies, theaters and show times. It's fairly effective at building excitement for the different shows, despite the fact that it's a DTMF application and the actor's constant over-the-top delivery has no particular correlation to the different kinds of films he's describing.

Verizon has also taken a small step with creative audio. A few years ago, they ran a series of television commercials featuring James Earl Jones. At that time, callers to a Verizon operator or directory assistance would hear his very distinctive voice say "Verizon" when their call was answered. That one-second audio hook instantly identified the Verizon brand and tied the services to the advertising campaign. Today, Mr. Jones' voice still reinforces the Verizon brand. In the Boston area, callers to directory assistance hear him say, "Welcome to Verizon nationwide 4-1-1. (musical audio logo.) Make progress every day."

More deliberate and extensive use of creative audio can more effectively engage and entertain callers. Callers can interact with virtual characters and environments to produce experiences that reinforce branding messages and extend those from other media. If this is done well, customers will be left with very strong, positive impressions.

One last point is that it's important not to overdo it. The overall effect should usually be subtle - enhancing, rather than overpowering, the application content. People won't want to feel like they've called just to be bombarded with an advertisement.

Navigation, Grounding and Audio Punctuation

Creative audio production also helps build callers' mental models of applications through non-verbal cues. The "earcon" - a distinctive sound played at a certain point in an application - is a well-known form of audio punctuation. It helps "ground" callers by letting them know where they are; for example, a chime when entering the "main menu." However, music can be an alternative and a more subtle way to achieve the same end. Distinct music or other audio themes can subliminally reveal to the caller when she's entered, or returned to, each functional area and give clues about what's coming next. Music also helps set the desired emotional tone and creates the perception of ease-of-use.

Music can blend with selected prompts as earcon-like introductions to application functions or punctuate certain points in a dialog. It might come at the end of a bit of dialog as a transition to the next one or to convey a "waiting" quality that invites the caller to speak.

Sound effects, like the crack of the bat and cheers before playing baseball scores, can identify and emphasize content; the possibilities for adding, sequencing and layering different audio components are limited only by the creativity of the people doing the production.

Many of these creative audio concepts are employed in Tellme's voice portal, a free service offering news, sports, stock quotes, movie listings, etc. As part of a simple directed-dialog voice user-interface, distinct voices, music and audio effects are used to distinguish among the "main menu" and the assorted information categories. Normal spoken prompts and prompts containing background music are punctuated with musical and audio effects to make calls flow easily while entertaining and leaving little doubt in the callers' minds about where they are and what to say next.

Calls begin with a musical "Tellme" logo (no ring tone is heard - a nice touch). A welcome prompt in a cheerful female voice is followed by a five-second advertisement. Then comes a two-note "main-menu" earcon and the same voice says, "Main menu - here are all the categories you can choose from." She then lists the categories over an upbeat musical accompaniment. When selected, each category has its own musical theme: for example, "Entertainment" is spoken along with an orchestral flourish and "Horoscopes" with an other-worldly synthesized trill. Each has its own voice talent together with appropriate pacing and sound effects - a manic "Mark 'Just-the-Highlights-Please' Vandretti" for sports; an intimate, this-is-just-between-you-and-me new-age female voice for horoscopes. There's even a convincing Sean Connery impersonator as the tongue-in-cheek dealer in a telephone version of blackjack. All together, the experience is far more amusing and compelling than a conventional DTMF or speech application.

Design and Development

The range of design processes and skills required for creative audio production is broader than that needed for conventional speech applications. But assuming that people with the right skills can be found, the incremental costs relative to conventional application development projects should be small. The extra effort will center mostly on user-interface design, prompt recording and audio post-production.

User-interface designs will ideally incorporate input from marketers and creative talent, as well as speech user-interface designers who have a feel for the effects of non-verbal content on navigation and usability. Voice actors will need direction in prompt recording sessions to get the desired performances.

Audio post-production will be required to edit and mix the various audio components into the final prompts. For example, the pacing of prompt wording should match the rhythm of background music. This can be accomplished by recording the wording in the usual way and then breaking it into phrases. Each phrase can be mixed with the music so it starts, for example, on a downbeat. Clearly, some amount of musical editing ability is necessary.
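As a rough illustration of the phrase-by-phrase alignment described above, the following Python sketch computes where each recorded phrase could be placed so that it begins on a downbeat of the background music. It assumes the music has a steady tempo and that its first downbeat falls at time zero; the function name, parameters and sample values are hypothetical, chosen only for this example, and do not come from any particular editing tool.

```python
import math


def downbeat_starts(phrase_durations, bpm, beats_per_bar=4, min_gap=0.0):
    """Return a start time in seconds for each phrase, snapped to a downbeat.

    Assumptions (illustrative only): the music has a constant tempo of
    `bpm` beats per minute, `beats_per_bar` beats in every bar, and its
    first downbeat at t = 0.
    """
    bar_seconds = beats_per_bar * 60.0 / bpm
    starts = []
    earliest = 0.0  # earliest moment the next phrase may begin
    for duration in phrase_durations:
        # First bar whose downbeat falls at or after `earliest`.
        # The tiny epsilon guards against floating-point round-off.
        bar_index = math.ceil(earliest / bar_seconds - 1e-9)
        start = bar_index * bar_seconds
        starts.append(start)
        # The next phrase must wait until this one has finished.
        earliest = start + duration + min_gap
    return starts


if __name__ == "__main__":
    # Three phrases of a prompt over music at 120 beats per minute in 4/4.
    for start in downbeat_starts([1.5, 2.5, 0.8], bpm=120):
        print(f"phrase starts at {start:.1f} s")
```

In practice an audio editor would make these placements by ear; the calculation only shows why a phrase that runs slightly past a bar line pushes the next phrase out by a full bar.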

Usability testing should be expanded to include perception testing: gauging the effectiveness of the application at creating the desired caller experience and customers' reactions to it - not just, "Was it easy to use?" but also, "What impression did it give of our company?" Similarly, post-deployment tuning might be extended to gather feedback from marketing intelligence sources and refine the branding messages.

Application testing should also include evaluations under conditions of poor audio quality. For instance, distortion typical of bad cell-phone connections will adversely affect music and sound effects. However, these factors can be minimized so that functionality is not compromised. Low-volume background music, for instance, will often just disappear into the background noise, while the prompt words remain clear. With a bit of trial and error, the audio can be refined so the application still performs reasonably well under these circumstances.
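The idea of keeping background music well below the spoken prompt can be sketched in a few lines of Python. This is a minimal example under stated assumptions, not a production mixer: audio is represented as lists of floating-point samples between -1.0 and 1.0, both tracks share one sample rate, and the -18 dB music level is an arbitrary starting point to be tuned by listening tests. The function names are hypothetical.

```python
def db_to_gain(db):
    """Convert a level in decibels to a linear amplitude multiplier."""
    return 10 ** (db / 20.0)


def mix_under_prompt(prompt, music, music_db=-18.0):
    """Mix music beneath a prompt at a reduced level.

    Assumptions (illustrative only): `prompt` and `music` are lists of
    float samples in the range -1.0..1.0 at the same sample rate, and
    `music_db` is the attenuation applied to the music before mixing.
    """
    gain = db_to_gain(music_db)
    length = max(len(prompt), len(music))
    mixed = []
    for i in range(length):
        p = prompt[i] if i < len(prompt) else 0.0
        m = music[i] if i < len(music) else 0.0
        # Sum the two signals and clip to the valid sample range.
        mixed.append(max(-1.0, min(1.0, p + gain * m)))
    return mixed
```

Rendering the same prompt at several values of `music_db` and auditioning each over a deliberately degraded connection is one way to carry out the trial and error described above.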

Only time will tell how these new possibilities for enriching the media mix of speech applications will evolve and be accepted by companies and customers. But it's a good bet that in a few years, many applications will sound and feel a lot different than those of today.

Mark Levinson is the president of VoxMedia Consulting. He can be reached at 781-259-0404 or mark@voxmediaconsulting.com.
