Virtual Assistants & Mobile Phones: How Speech Makes the Merger

Within the next decade, true interactive speech is expected to be pervasive, built into everything. Digital cameras, air conditioners, watches, televisions, PCs, printers, mobile phones, cash registers, kiosks, automobiles, and vending machines will all have voices to announce their status and function. Not only will they accept spoken commands, they will hold conversations with us. Your TiVo will discuss its programming, your car will tell you where to turn, and your mobile phone will remind you to pick up milk as you are about to drive past the grocery store. The once-despised phrase, "When you hear the option you want, press 1," will be replaced with "How may I help you?" By pressing a button on your mobile phone, using push-to-talk, you will instantly speak with your VCR's virtual assistant and ask it how to set the silly clock.

The central figure in the speech adoption process is an unheralded actor: the humble mobile phone. The mobile phone will provide the gateway for speech communication with practically everything. What is noteworthy is that it is not dictating text to your PC that will persuade the market to fund speech recognition technology development, but rather a groundswell of technical developments associated with mobility and pervasive computing that will provide both the market need and the funding. It's the proliferation of small electronic devices, especially wireless devices, that will nurture and guide the flowering of speech recognition. It's the intelligence locked in tiny devices, difficult to access with tiny buttons. It's the mobility that precludes typing. It's the networking that connects dumb mobile devices to smarter stationary ones. It's the technology for interactive conversations, as opposed to simple command recognition, that will soon be feasible. It's the ubiquity of mobile phones that will provide the speech transducers to access all these small devices. Most of the devices mentioned in the first paragraph will not have speech recognition embedded in them; they will be accessed through the mobile phone.

Economic and technical progress has come in waves over the last 30 years, each wave building on the preceding and seemingly larger than the last. First, integrated circuits (ICs) swept away the vacuum tube industry and started creating new genres of electronic systems, one of which was the PC. The second wave was the PC revolution; the PC then enabled the third wave - the Internet revolution; and now we have the fourth wave - the mobile phone revolution. Mobile phones depend on the infrastructure of ICs, PCs, and the Internet combined. We surmise that the next revolution will be built on the platform provided by mobile phones. Could interactive mobile phone conversations with virtual assistants be the fifth wave? We think so.

The basic service offered by mobile phones is sufficient to motivate most people to carry mobile phones with them most of the time.  From this seed, all else grows.

The mobile phone platform is going to be physically worn on our bodies most of the time; the myriad other things we pack around (keys, wallets, audio recorders, cameras, etc.) will become electronically integrated into the mobile phone; and the mobile phone will become a personal "sidekick" of sorts. Mobile phones already show Muslims which way to face Mecca during their daily prayers; they are replacing handheld gaming devices, replacing slips of paper when "boy meets girl" for the first time, showing directions when you're lost, paying for purchases of soft drinks and snacks, and functioning as keys to the front door of your house.

Mobile phones will augment the infrared controllers that switch your TV channel, turn on your air-conditioning, or dim the lights, and phones will have radio frequency identification (RF ID) tag sensors so we can find "stuff" and keep track of it.

RF ID tags will be embedded in items you normally carry (purse, wallet, watch, keys), and if you become separated from them, your mobile phone RF ID tag sensor will detect that fact and alert you to it.  For example, as you are about to leave your home, a gentle buzz and SMS message on your phone would say, "Wallet left behind!" 
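The detection logic described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical phone API; the names (`REGISTERED_TAGS`, `check_for_missing_items`) are inventions for illustration, not any real handset interface.

```python
# Hypothetical sketch of the "Wallet left behind!" alert: the phone's
# RF ID sensor periodically scans for the tags the owner has registered,
# and any tag that fails to respond triggers an alert message.

REGISTERED_TAGS = {
    "a1f3": "Wallet",   # tag IDs are made-up examples
    "b2e4": "Keys",
    "c3d5": "Watch",
}

def check_for_missing_items(tags_in_range):
    """Compare the tags currently detected against the registered set
    and return an alert message for each item that is out of range."""
    missing = [name for tag, name in REGISTERED_TAGS.items()
               if tag not in tags_in_range]
    return ["%s left behind!" % name for name in missing]

# As you step out the door, the wallet's tag no longer responds:
alerts = check_for_missing_items({"b2e4", "c3d5"})
print(alerts)  # ['Wallet left behind!']
```

In a real handset the scan would run on a timer or be triggered by leaving a geofenced location, and the alert would arrive as the buzz-plus-SMS described above.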

In my opinion, there is one particular application that stands out for its profound potential to affect the way we lead our lives.  It's a new method of accumulating and utilizing memories gathered throughout a lifetime: a tertiary memory.  I define tertiary memory to be extra-somatic, associative memory linking together data associated in time, location or content.  It can be created with mobile phones. 

Consider that properly equipped mobile phones can capture pictures, spoken notes, locations, and times of transaction.   This information can be forwarded to a personal media archive.  This media archive can become a tertiary memory system.  It could accomplish feats of recall that previously would have been impossible. 

For example, assuming you had the GPS feature on your phone activated on the date of your last anniversary, you would be able to ask, "Where were we on our last anniversary?" If you snapped any pictures that day, you could ask, "Do we have any pictures from our last anniversary?" Or, if you had recorded an audio note about the quality of the food at a restaurant, you could ask on your next visit, "What dishes did we most enjoy at this restaurant the last time we ate here?" In each of these cases, your request would be recognized and dispatched by your personal voice-activated virtual assistant, which knows where on the Internet your pictures are stored, when they were stored, where they were taken, and whether there are audio notes attached.
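The associative recall behind these queries can be sketched as simple filtering over linked records. This is a toy illustration under assumed names (`archive`, `recall`); a real tertiary memory system would sit behind the virtual assistant, which would translate the spoken question into such a query.

```python
from datetime import date

# Each archive record links one capture (photo or audio note) to the
# time and place it was made - the "data associated in time, location
# or content" that defines a tertiary memory. Records are made up.
archive = [
    {"type": "photo", "date": date(2003, 6, 14), "place": "Chez Luc"},
    {"type": "audio", "date": date(2003, 6, 14), "place": "Chez Luc",
     "note": "The duck confit was the best dish."},
    {"type": "photo", "date": date(2003, 9, 2), "place": "Beach"},
]

def recall(records, when=None, where=None, kind=None):
    """Associative lookup: filter by any combination of time,
    location, and content type."""
    return [r for r in records
            if (when is None or r["date"] == when)
            and (where is None or r["place"] == where)
            and (kind is None or r["type"] == kind)]

# "Do we have any pictures from our last anniversary?"
anniversary_photos = recall(archive, when=date(2003, 6, 14), kind="photo")

# "What did we enjoy the last time we ate here?" (here = Chez Luc)
notes = recall(archive, where="Chez Luc", kind="audio")
```

The hard part the sketch omits is, of course, the speech front end: recognizing the question and mapping "our last anniversary" onto a date.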

Speech recognition as the user interface will motivate the development of mobile phones with personal virtual assistants.

Enter the Virtual Assistant

Virtual assistants are synthetic "beings" that understand your speech, have pleasant voices, and can carry on conversations with you, but are completely artificial - the creations of clever computer programmers and speech scientists. They are more likely to be found in the Internet and telephone cloud than in mobile electronic devices themselves. Within the next decade, virtual assistants will start showing up everywhere, and they will be available to advise you 24/7. Chances are you will have your own personal virtual assistant, and it will take on your personality; assistants that mirror their users' personalities make us feel more comfortable communicating with them. (See the writings of Clifford Nass of Stanford University: The Media Equation: How People Treat Computers, Televisions, and New Media Like Real People and Places, New York: Cambridge University Press; and Voice Activated: The Psychology and Design of Interfaces that Talk and Listen, soon to be published by MIT Press.)

The reason that virtual assistants are critical to the vision described above has to do with the fundamentally ambiguous nature of the acoustic information in speech and the consequent need to ask for clarification. When a request is spoken and misrecognized, an intuitively clear exchange of clarifying counter-questions must take place to remove the ambiguity. Virtual assistants are defined here as those speech recognition systems capable of such dialogs.
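The clarification behavior can be sketched as a confidence-gated dialog step. This is a minimal sketch under assumed values - the threshold, prompts, and function names are illustrative, not taken from any deployed system.

```python
# Sketch of the clarification dialog: when the recognizer's confidence
# in its hypothesis is low, the assistant asks a confirming counter-
# question rather than acting on a possibly wrong interpretation.

CONFIDENCE_THRESHOLD = 0.7  # illustrative cutoff, not a standard value

def respond(hypothesis, confidence):
    """Return the assistant's next utterance given the recognizer's
    best hypothesis and its confidence score (0.0 to 1.0)."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return "OK, %s." % hypothesis          # confident: act on it
    return "Did you say '%s'?" % hypothesis    # uncertain: clarify

print(respond("call home", 0.92))    # OK, call home.
print(respond("call Jerome", 0.41))  # Did you say 'call Jerome'?
```

Real systems carry this further - offering the top few hypotheses, or narrowing by topic - but the principle is the same: the dialog itself removes the ambiguity the acoustics cannot.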

How do we get from where we are - speech recognition is seldom used by the average mobile phone owner - to the point where speech recognition is indispensable and available in every mobile phone? There are two possibilities: embedded and distributed.

Possibility 1: Embedded - speech recognition will be built into the phones themselves.  This thrust is being pursued by the major players in the speech industry with their "embedded" speech recognition technology.

Possibility 2: Distributed - the larger and more powerful solution is remote, network-based speech recognition, and the world of the virtual assistant.  Offloading processing from the mobile phone to remote devices with both computing power and relevant content provides accurate recognition results and up-to-date information.

We expect to see hybrid combinations of the embedded and distributed approaches. For example, embedded speech recognition built into the phone could recognize the name of a remote virtual assistant and place the call, while the distributed system handles the conversation itself. That said, the distributed solutions will represent a larger market and, in the overall scheme of things, be more important to delivering the vision of ubiquitous speech recognition.

Building Virtual Assistants

Building virtual assistants capable of rich conversation is cumbersome with today's state-of-the-art technology. However, we can already see how to do it with tools such as VoiceXML, SALT, and the word-spotting techniques used in AT&T's "How May I Help You?" (HMIHY) system.

The key to creating successful virtual assistants is to limit the conversation to specific topics, use word spotting or statistical grammars to recognize the gist of what the caller asks, and respond with appropriate content. The responses may be pre-recorded with appropriate emotions and inflections, or synthesized with TTS (text-to-speech) when textual material must be read in real time.
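The word-spotting step can be illustrated with a toy router in the spirit of HMIHY: rather than parsing the whole utterance, scan the recognized text for keywords that reveal the gist, then choose a response. The topics and keyword lists below are made-up assumptions, not AT&T's actual grammar.

```python
# Toy word-spotting router: map a caller's utterance to a topic by
# scanning for indicative keywords, ignoring everything else said.

TOPICS = {
    "billing":  ["bill", "charge", "payment"],
    "repair":   ["broken", "repair", "not working"],
    "operator": ["agent", "operator", "person"],
}

def spot_topic(utterance):
    """Return the first topic whose keywords appear in the utterance,
    or 'unknown' if nothing is spotted (triggering a clarifying prompt)."""
    text = utterance.lower()
    for topic, keywords in TOPICS.items():
        if any(keyword in text for keyword in keywords):
            return topic
    return "unknown"

print(spot_topic("I have a question about a charge on my bill"))  # billing
print(spot_topic("my phone is not working"))                       # repair
```

Production systems replace the keyword lists with statistical grammars trained on thousands of real calls, but the design philosophy - recognize the gist, not the sentence - is the one that makes constrained virtual assistants feasible today.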

VoiceXML (http://www.voicexml.org/) and SALT (http://www.saltforum.org/) are leading the attempts to create scripting and deployment environments that accommodate these needs.

Personality:  Experiments by professors Byron Reeves and Clifford Nass of the Stanford University communications department show that personality in virtual assistants strongly affects their effectiveness.  Reeves and Nass showed that people enjoy interacting with virtual beings with personalities like their own; people are more inclined to purchase products described by virtual agents with personalities like their own; and people are more trusting and forgiving of errors made by virtual assistants with personalities that match their own. 

Enter the world of immersive intelligence: Soon we will see developers springing up around the world, programming spoken language virtual assistants for well-defined - usually narrowly defined - applications matching their expertise. These applications will be immediately available worldwide through the capabilities of VoIP.

In aggregate these independent developers will fill in pieces of the larger natural language understanding puzzle. Collectively they will give the gift of speech to the Internet/telephony cloud.  No one individual, nor any single business entity, will orchestrate the technical development, but in the end, the entire system will present a unified face to the telephone users of the world, a face that is able to understand speech on a wide range of topics, a face that will show up on a plethora of digital devices, especially mobile phones. The dream of automatic speech recognition, in the true sense of the word, will become reality. 

In summary, interactive spoken language dialogs between people and the digital devices that fill their lives will become commonplace over the next decade or two.  However, people will not usually speak directly to devices but rather use their mobile phones as intermediaries.  Mobile phones will, in essence, become the electronic ears and tongues for many consumer electronic devices, especially those connected to the Internet.  Mobile phones will access remote virtual assistants which will have pleasing personalities personalized to suit individual taste; and these virtual assistants will control local digital devices and answer questions about them.

George M. White is principal scientist at I2R, the Institute for Infocomm Research, a Singapore government funded research center.  White started his career with post-doctoral studies of computer science at Stanford University. White has approximately 50 technical refereed publications and a number of publications in magazines, journals and conference proceedings.
