Designing Speech Recognition for the Next Generation

The world of speech recognition technology is vastly different for me and my contemporaries (I am 22) than it is for older generations. Having grown up with cell phones, PCs, and mp3 players, we constantly interface with the world via technology. Most of us have computers. You'd be hard pressed to find anyone who doesn't have both email and instant messenger (IM), and you'd search for days to find a twenty-something without a cell phone. At any given moment, most of us are using one, if not all, of the technological means of communication available to us: multitasking is in our blood. Add the Internet, with popular sites like Facebook and MySpace, and we are 100 percent wired. We are both the current and future markets for all things digital, and appealing to our wants and needs will spell success for those who get it right. Just take a look at how a younger generation communicates - we use written text, our voices, and even symbols called emoticons: for instance, this :-) which looks like a smiley face oriented sideways.

My generation is at ease with technology in a way that many of those older than us are not. We like the new, we're restless; MTV has spurred a culture with short attention spans. Technology is an organic part of our world -- an extension of our bodies, and our interaction with speech recognition is no different. Because of this, we're open to integrating speech technology into our everyday lives, but successful speech applications must be designed to our unique perspective and needs.

The growth in the technology of speech recognition has taken place almost entirely in the last 25 years. In a sense, my generation and this technology have grown up together; our relationship began even before IBM ran Aptiva ads that featured a user telling the computer to speak. We are old enough to remember the Knight Rider TV series of the 1980s, whose talking car was considered futuristic at the time; a car that responds to speech is now an option in today's car navigation systems.

Older generations are not only less familiar with speech technology, but can be reluctant to embrace it because of lingering fear and distrust of the security of technology in general. My parents are, I think, typical of many of their contemporaries: they use computers every day in their respective professions, but for them interfacing with technology (however proficient they are) is still akin to making themselves understood in an acquired language (one in which I and my friends are native speakers). They see technology primarily as a tool, a means to an end, whereas we experience it as a tool, as well as entertainment and part of our everyday environment.

Speech technology appeals to our generation precisely because of its newness and the cool factor associated with it. While careers have been made trying to find what represents cool, being cool shouldn't be underestimated. Cool technologies mean engaging technologies. Products that are cool and intuitive form an intellectual connection with the user because of the power the technology affords, as well as an emotional connection because of the way the interaction occurs. To be cool is to make my generation want it, spend money on it, and form a positive relationship with its brand.

We want interactions with all of our technologies to appear normal and unfold hassle-free. Anyone in my generation will hang up on a speech recognition system if they feel like they are talking to a poorly designed system, or if the system doesn't understand them right away. In our wired, on-the-go world of cell phones and IM, we can't be wasting time trying to figure out how to get a system to work. We want to be quickly understood, as speed is of the essence in our lives. A well-crafted prompt and one-step correction technology are small things that nevertheless have a profound impact on how willing we are to engage with these phone systems.

However, ease of use is only one feature that makes a system appeal to us. The ever-intriguing cool-factor, embodied in a system with a unique and stylish interface, makes us much more likely to persist in interfacing with a system deficient in usability. Our familiarity with technology makes us more impatient with routine glitches than people who aren't as tech-savvy. In contrast, a sexy system that is, for example, authenticating our identity by voice or changing the song in our car, can engage our emotions enough to overcome our impatience until we figure out how to get the system to work.

The power of the combination of ease of use with style is undeniable. Don't believe me? Look at the success of the Motorola RAZR and Apple products. The 2004 release of the RAZR created a shopping frenzy among my contemporaries, most of whom spend more time on their cell phones than they do talking to their friends in person. The phone's sleek, compact design and styling made it a must-have item. The RAZR is still extremely popular among my generation, although its operating system is sluggish and for many carriers the operating system is also poorly designed. It's the industrial design that makes this item special and hard for other companies to emulate.

And while the RAZR doesn't have much of a unique user interface (it is a cell phone, after all), Apple products are a prime example of harmony between ease of use and style. If you visit any college campus, you will be hard pressed to find more than five people who have mp3 players that are not the stylish and easy-to-use iPods (32 million were sold in 2005 alone). Integrate voice recognition technology into iPods and not only would the object be even more desirable (if that's possible), its sudden ubiquity would also boost people's awareness of speech recognition as a whole. Similar obvious harmonies between style and functionality must be achieved in other applications of speech to guarantee high user satisfaction.

The products that embody this synergy of style and functionality today are primarily ones with a graphical user interface instead of a voice user interface (VUI). To successfully reach the youth market, this attention to detail in style and usability needs to be first transferred to technologies that already have a large youth base. An obvious choice is cell phones. Despite all the speech technologies available today, there is minimal speech recognition software installed on the cell phones that my generation uses, and while applications are by no means unheard of, they are hidden and ineffective.

While voicemail applications have been in existence for some time, they are badly designed and clumsy to use, off-putting to a group of users who crave speed, efficiency, and instant gratification in their tech lives. When I asked my friends what they thought about their voicemail, every person responded that it was boring and that tasks took too long. Think of how tiring it is to hear: "You have one new message. Sent today, April 17th. First message…" when we know we just missed the call a minute ago, and know what day it is, and know that we only have one message because that datum is displayed on the phone before we call in. (Someone out there, please shorten that prompt!)

Voice dialing is another aspect of cell phones that needs to be improved on and implemented in order to extend speech technology to my generation. Almost all phones sold have a voice-dial feature, but it is not advertised to the buyer and when we try using it, it is unreliable and poorly integrated with our contact list. A tech junkie myself, I immediately opened the speech function on my phone when I got it. Despite multiple attempts, the application got my command right zero percent of the time. Friends of mine whom I asked about the same function either had never tried it, or tried it once and disliked it so much they said they'd never use it.

While my contemporaries do spend a great deal of time on their cell phones, they use them primarily to call their friends, not to interface with large call centers or companies. In order for speech technology to have a profound impact on our generation, it must be integrated into systems that we use on a daily basis. Sean Brown, a senior at Tufts University and a staffer at Nuance, agrees: "I think we need to continue to embed this technology in the items and tools we use on a day-to-day basis. For example, expand it to be used with cable television, video games, cars, and even home appliances."

While speech technology is already implemented in many of today's cars, the technology and VUI are mediocre. My generation has grown up with MTV shows such as "Pimp My Ride" and "Cribs" where cars are central to each show's focus. Yet I have not seen a single episode of "Pimp My Ride" (a series dedicated to tricking out cars in every conceivable way) where anything voice-activated has been installed or referenced, nor have I seen any celebrity on "Cribs" (similarly, a series devoted to over-the-top luxury homes and lifestyles) point out a speech recognition feature of their car or home. If a speech application for a car - for example, a system that would read email on command or give directions when prompted - were done in a way that embodied ease of use, powerful functionality, and iPod-like style, the impact would be colossal.

I and my contemporaries want style and ease of use in our dwelling places, our cars, our communications - every interaction we have. We are comfortable with, rather than threatened by, machines that work for us. We look to technology to make our lives both easier and more interesting, and we're chronically impatient. My television guide, my alarm clock, my house security system and home stereo system are all essential parts of my life that could be enhanced dramatically by speech technology. Even my gym workout could be tracked and enhanced by a virtual personal trainer.

Speech technology is going in the right direction, but the real potential has yet to be tapped by today's technologists. You can't pay enough attention to the need for amazing technology to be combined with designs that make us crave it, the way some older generations might have craved the 1956 Cadillac, the Eames lounge chair, the Walkman with its orange headphones, or their first compact disc player.

Making it work is not enough. You need to find out how to make us connect with speech technologies on an emotional level. Once this is accomplished, the growth potential for speech recognition as a whole is endless.

David Donatelli is a recent computer science graduate from Tufts University who has made a number of speech recognition applications in VoiceXML, including a virtual fitness trainer.
