VoiceXML 2.0: A Real Standard for a Real World

New technology will change the way people interact with computers. PCs let users work with a keyboard and screen rather than review printed reports. The Xerox Star and Apple Macintosh introduced graphical user interfaces (GUIs), which made the mouse and other pointing devices popular. Now we are on the verge of a revolution in technology that makes computing portable. Separating user interface devices from the computing device will dramatically change how people interact with computers.

The Computing Device
A personal server about the size of a deck of cards will contain computer memory and Bluetooth communication for exchanging information with other Bluetooth-enabled devices in the vicinity.

This portable device can be used to store:

  • A backup of your files from your PC
  • Pictures and video from your digital camera
  • Music files from the Internet
  • Frequently visited Websites from the Internet
  • A personal Web: a geographical and temporal history of content received from other Bluetooth-enabled devices as the user moves through:
    • A museum, capturing additional information for later review
    • A factory, capturing production and status information as you walk from department to department
    • A mall, capturing advertisements, menus, and offerings from businesses as you walk through

User interface devices enable users to interact with the personal server. Ideally, these devices should connect wirelessly with the personal server. Example user interface devices include pens, microphones, speakers and displays.

A Special Pen
A special pen has a camera in the tip that captures small marks on specially printed paper. The markings indicate the position where the pen is writing on the paper. By tracking these positions, the pen captures the strokes written by the user. In Table 1, the "camera" column illustrates how users perform common tasks by using the pen to mark and write. The pen brings computing capabilities to paper while preserving the original use of paper, so existing office procedures are not disrupted, except that the labor-intensive data entry task is eliminated. This special pen might also contain a microphone to capture the writer's voice. The "microphone" column in Table 1 illustrates how users perform common tasks by speaking. Additional capabilities are enabled by combining writing and speaking, as summarized in the "combined" column in Table 1.
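The position tracking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not describe the pen's data format, so the `Sample` record (a position plus a pen-down flag) and the `group_strokes` function are hypothetical names. The sketch shows how a stream of decoded positions could be grouped into strokes, with a new stroke starting each time the pen touches the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    """One decoded pen position (hypothetical format, not from the article)."""
    x: float
    y: float
    pen_down: bool  # True while the tip is touching the paper


def group_strokes(samples: List[Sample]) -> List[List[Tuple[float, float]]]:
    """Group consecutive pen-down samples into strokes."""
    strokes: List[List[Tuple[float, float]]] = []
    current: List[Tuple[float, float]] = []
    for s in samples:
        if s.pen_down:
            current.append((s.x, s.y))
        elif current:
            # The pen was lifted: close the stroke in progress.
            strokes.append(current)
            current = []
    if current:
        # The stream ended while the pen was still down.
        strokes.append(current)
    return strokes


if __name__ == "__main__":
    stream = [
        Sample(0.0, 0.0, True),
        Sample(1.0, 1.0, True),
        Sample(1.0, 1.0, False),  # pen lifted
        Sample(5.0, 5.0, True),
    ]
    print(group_strokes(stream))
```

Each resulting stroke is an ordered list of points, which is the kind of input a handwriting or signature recognizer would typically consume.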

Microphones and Speakers
Speech will be the primary means for interacting with software agents residing in the personal server. While complete natural language processing is still in the research stage, users will speak and listen using command and control as well as conversational styles of dialogs.

There are a variety of microphones and speakers, including badges and headsets. Users will speak and listen to the personal server via the ubiquitous telephone and cell phone. Handheld computers with microphones and speakers will provide a multimodal user interface. With speech processing to provide a natural user interface, the software agents in the personal server become a "genie in a bottle." In addition to speaking with software agents, users can call and speak with other users much as they do today with telephones and cell phones.

One more ingredient is needed to make a user-friendly computer-supported environment: a display for presenting graphical information. Candidates for displaying information to the user include:

  • A small, portable display, either attached to a wristband (reminiscent of the two-way wrist radio introduced in 1946 and the two-way wrist TV introduced in 1964 in the Dick Tracy comic strip) or worn on a chain around the user's neck.
  • Existing computer and television screens, with wireless communication used to transfer information to any local display for presentation to the user. As the user moves from room to room, the presentation continues on different screens in different rooms.
  • A micro projector, possibly attached to a key chain, that projects images onto a convenient surface, such as a blank sheet of paper or even a paper form without the blank spaces filled in.
  • A flexible display that bends and rolls into a narrow tube about one inch in diameter, which can be controlled electronically, stays as sharp as regular ink, and doesn't require extra power to retain text or images.
  • Eyeglasses that contain a display.

The Opportunity
New technology allows computer users to escape from the office position (sitting in front of a computer with hands on the keyboard) and move from place to place in the world. No longer will users need to go to the computer; instead, the computer is always with the user, just as a wallet or watch is always with its owner.

Many new types of applications will be possible. Here are just a few examples:

  • Where is it? Ask aloud, "Where are my keys?" and the lost key chain starts blinking or buzzing.
  • Remind me to buy this. When you run out of toothpaste, just say, "Add toothpaste to my shopping list." The next time you are in a store that sells toothpaste, the personal server not only reminds you to buy it, but tells you where to find it in the store.
  • Who is he? When you are speaking to someone you recognize but whose name you can't recall, the personal server uses a speaker recognition system to identify the mystery person and speech synthesis to whisper the person's name in your ear.
  • Tell me about it. When you want more information about a landmark encountered while traveling, such as a statue of Sacagawea, just ask, "Tell me more about Sacagawea," and the personal server whispers a short paragraph about the life and times of this Indian guide who helped the Lewis and Clark expedition.

Get ready for new ways to communicate with computers — many involve speech.

Table 1: Common tasks performed by the Special Pen

User Task: Capture Data
  • Camera: Capture written doodles, drawings, notes, and illustrations for later presentation; for example, write notes during a lecture to study before the final exam.
  • Microphone: Record spoken words and phrases for later replay; for example, capture a verbal reminder to add to a "to-do" list.
  • Combined: Record both speech and pen gestures for later replay; for example, draw a map while speaking the directions, or verbally describe how to solve a math equation while rewriting the equation.

User Task: User Identification
  • Camera: Register the user's signature, verify that a signature belongs to a registered user, and identify a user from among registered users based on his/her signature.
  • Microphone: Register the user's voiceprint, verify that a voiceprint belongs to a registered user, and identify a user from among registered users based on his/her voiceprint.
  • Combined: Increase security and reliability by using both handwriting and voiceprints.

User Task: Interpret Content
  • Camera: Handwriting recognition: convert words and phrases written on paper to electronic text.
  • Microphone: Dictation: convert speech to electronic text.
  • Combined: The spoken text "The hidden treasure is at the spot I'm marking with an X," together with the pen mark, is converted to the electronic text "hidden treasure at coordinates (x,y)."

User Task: Interpret Requests
  • Camera: Convert pen strokes to commands; for example, save information when the user checks a box with the pen.
  • Microphone: Command and control by converting spoken commands to actions performed by a computer; for example, say "save now" to save information to a file.
  • Combined: Synchronized multimodal input; for example, "reword this (point) paragraph" or "e-mail this form to that person (point to a person's name or e-mail address)."
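
The synchronized multimodal input listed in Table 1 can be illustrated with a small Python sketch. It is not taken from the article or from any product: the `SpokenWord` and `PointEvent` records, the `resolve` function, and the one-second `max_gap` threshold are all assumptions made for illustration. The idea shown is that a word such as "this" or "that" is bound to whatever the pen pointed at closest in time to the moment the word was spoken.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpokenWord:
    text: str
    t: float  # time the word was spoken, in seconds (hypothetical format)


@dataclass
class PointEvent:
    target: str  # what the pen pointed at, e.g. "paragraph-3"
    t: float     # time of the pointing gesture, in seconds


DEICTIC = {"this", "that"}  # words that need a pointing gesture to resolve


def resolve(words: List[SpokenWord], points: List[PointEvent],
            max_gap: float = 1.0) -> str:
    """Attach the nearest-in-time pointing target to each deictic word."""
    out = []
    for w in words:
        if w.text.lower() in DEICTIC and points:
            nearest = min(points, key=lambda p: abs(p.t - w.t))
            if abs(nearest.t - w.t) <= max_gap:
                out.append(f"{w.text}[{nearest.target}]")
                continue
        # Ordinary word, or no gesture close enough in time.
        out.append(w.text)
    return " ".join(out)


if __name__ == "__main__":
    words = [SpokenWord("reword", 0.0), SpokenWord("this", 0.5),
             SpokenWord("paragraph", 0.9)]
    points = [PointEvent("paragraph-3", 0.6)]
    print(resolve(words, points))
```

A real system would need a more careful alignment policy (for instance, preventing one gesture from being reused for two different words), but the timestamp matching above captures the basic synchronization step.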

Dr. James A. Larson is Manager of Advanced Human Input/Output at Intel and author of the book, VoiceXML — Introduction to Developing Speech Applications. He can be reached at jim@larson-tech.com and his Web site is www.larson-tech.com.
