Speech Interfaces That Require Less Human Memory
Point. Click. Point. Click. Point. Double-click. Speech is the most natural way to communicate, but PC applications are easiest to use by mouse and keyboard. This is not because of technology limitations, but by design. The graphical user interfaces of Windows and Macintosh have evolved to make life as easy as possible for pointing and clicking. Common commands are immediately visible, reducing demands on the user's memory. The interface has off-loaded the command list from the user's brain onto the screen, making the application much easier to learn and use.

Most speech recognition users, by contrast, are reduced to using only a few verbal commands - the ones they can remember - or to constantly consulting a paper quick-reference card. Making people remember verbal commands is even more of a hindrance than making them remember keyboard commands. Research suggests that recalling verbal items from memory interferes with the verbal composition process.¹ If you're dictating a research paper and trying to edit it by voice, remembering what commands to say makes it harder to focus on the composition and creative work at hand.

Dictation software designers have addressed this human memory limitation with "natural language commands." The user can say commands as they occur to him or her, and the speech software is supposed to recognize any reasonable expression of the command. The problem is that the "natural" commands built into the software may not be the ones the user finds most natural. There is wide variation in how even simple commands can be expressed. In addition, as command variations multiply, recognition errors become increasingly common. Having the computer misrecognize a command is very frustrating to the user. After just a few command-and-control misrecognitions, she will quickly reach for the keyboard and mouse.

Besides, controlling a computer by voice is not natural. In speech with other people, we often make requests and sometimes give commands.
"Could you say that phone number again?" "Fax this, please." But few people speak as briefly and directly as "Save this file" or "Copy paragraph." The phrase "Bold the next three lines" is not an expression anyone ever used before speech recognition.

Mouse commands are visually - and thus mentally - available. A user can start with the button bar, pointing and clicking. Clicking the word "Format" in the menu bar makes the menu drop down, creating new graphical regions more focused on what the user is trying to achieve. Button bars provide the most accessibility, while the menus let the user "drill down" to more narrowly defined functions without having to remember them himself or use a quick-reference card. The screen display changes continually to remind the user, and to educate the user, about what commands are available. This gives maximum power and control without taxing the user's memory.

The drill-down approach also provides a natural learning curve as users advance. Users can start with button bars and then move on to menu commands. They will soon begin to remember in which menus common commands are located. In addition, they will learn the keyboard shortcuts for the commands they use most often. So designed into the interface is an effortless way for users, if they choose, to advance from clicking the mouse on the correct region to direct access to commands through keyboard shortcuts.

Putting More Words on the Screen
To make a successful voice command interface, we need to follow the same path of making voice commands mentally available and providing a simple learning curve so users can advance in skill. This means, in a nutshell, putting more words on the screen. We need to show users more examples of what they can say. Right now, if you want to save your document by voice, saying "Save document" should work. But to be sure, you either have to clumsily refer to a cardboard reference card or try the command and hope it works. You might as well just click on the Save Document button - it's much faster. But what we as voice interface designers could do is add the words "Save Document" to the button bar. The words would be in small letters, and always displayed. With labeled buttons for the most common commands, voice becomes an easy method of computer control. Want to save, cut, copy or print? The commands are always on-screen for the novice user to see.
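The labeled button bar can be sketched in a few lines of code. In this hypothetical Python sketch (all names are illustrative, not any real speech API), the text painted on each button is also the exact phrase the recognizer accepts - the screen itself is the command list:

```python
# Minimal sketch of a labeled button bar whose visible labels double as
# the voice vocabulary. All names here are hypothetical illustrations,
# not any real speech-engine API.

class ButtonBar:
    def __init__(self):
        # visible label -> action; the label text is also the exact
        # phrase the recognizer listens for, so users never guess
        self.buttons = {}
        self.log = []

    def add_button(self, label, action):
        self.buttons[label] = action

    def labels(self):
        # the words painted on screen - and the words that can be spoken
        return list(self.buttons)

    def hear(self, utterance):
        action = self.buttons.get(utterance)
        if action is None:
            return False   # not a visible command: don't try to guess
        action()
        return True

bar = ButtonBar()
bar.add_button("Save Document", lambda: bar.log.append("saved"))
bar.add_button("Print Document", lambda: bar.log.append("printed"))

bar.hear("Save Document")
print(bar.labels())   # whatever is displayed is exactly what works
print(bar.log)
```

The point of the design is the single source of truth: because the recognizer's vocabulary is generated from the on-screen labels, the display can never promise a command the engine will not accept.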
The Voice Command Bar
Dictation programs now support "say what you see" functionality: simply say "Format" or "Format menu" in Microsoft Word and the Format menu opens. You can then say "Font," "Paragraph" or other menu choices that are displayed. This is a great example of making voice commands mentally available. Users do not have to remember what to say; they just choose from the words on the screen.

To make users happy using voice commands, we need to create "say what you see" menus designed for voice activation. This would be an alternative menu structure, displayed on screen in addition to the GUI menu bar. I call it the "voice command bar." The voice command bar contains several menus, leading to a range of different command choices. When the user says a menu name, the menu opens to show more detail. Commands are visually, and thus mentally, available. After each utterance, the user is prompted with more detailed commands. When appropriate, prompting questions appear ("Which direction?" or "How many?") to help the user formulate a response.

This menu structure would not necessarily show all available commands, but it would show enough that users could learn the most common ones. Users would not be forced to use the menu structure: at any time they could say a complete command and the speech engine would recognize it as such. Nor would they be required to speak a complete command; they could say it in several segments as they see what additional information the computer needs.

With a hierarchical structure like this, some of the commands created might be far from natural speech (for example, "Make bold back three sentences"). But the commands would work! And they would be learnable. Currently, the only way to learn voice commands without memorizing a quick-reference card is guessing - trying a command to see if the computer recognizes it.
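One way to picture the voice command bar is as a tree of speakable words. The Python sketch below (with entirely hypothetical menu contents and function names - no real speech engine is assumed) shows how one interpreter can serve both styles of use: it matches as many leading words of an utterance as it can, then either reports a complete command or returns the next set of choices to display as a prompt.

```python
# Sketch of a "voice command bar" as a tree of speakable words.
# Menu contents and names are hypothetical illustrations.

COMMAND_TREE = {
    "Make Bold": {
        "Next": {"Word": None, "Line": None, "Paragraph": None},
        "Back": {"Word": None, "Line": None, "Paragraph": None},
    },
    "Copy": {"Word": None, "Line": None, "Paragraph": None},
}

def interpret(utterance, tree=COMMAND_TREE):
    """Match as many leading words of the utterance against the tree as
    possible. Returns (matched_labels, remaining_menu): remaining_menu
    is None for a complete command, otherwise it holds the choices to
    display as the next prompt."""
    words = utterance.split()
    node, matched, i = tree, [], 0
    while node is not None and i < len(words):
        # try the longest label first ("Make Bold" before a shorter one)
        for label in sorted(node, key=len, reverse=True):
            lw = label.split()
            if words[i:i + len(lw)] == lw:
                matched.append(label)
                node = node[label]
                i += len(lw)
                break
        else:
            break   # unrecognized word: stop and show current choices
    return matched, node

matched, menu = interpret("Make Bold")
print(matched, "->", sorted(menu))   # partial: prompt with next choices

matched, menu = interpret("Make Bold Back Paragraph")
print(matched, menu)                 # complete command: menu is None
```

Note that the same call handles a user who drills down one word at a time and a user who speaks the whole command in one breath - exactly the "optional menu" behavior described above.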
The voice command bar plus a labeled button bar provides a natural learning curve for speech recognition users. They can "get their feet wet" with simple commands on the button bar, then move on to saying the words on the voice command bar, building increasingly complex commands if they desire. As they navigate the voice command bar to build commands, they will automatically be teaching themselves the full commands. After a few tries with the voice command bar, they will naturally remember commands such as "Make bold back two paragraphs" without having to use the command bar at all.

The mouse-oriented GUI has regions and pictures that change, giving more detailed information and more options. Instead of pictures, the speech-centered interface of the future will have words that the user can speak. The words will change, taking up varying amounts of screen real estate, to give more options appropriate to the task. People will be able to speak commands freely, unencumbered by the limits of human memory.
Dan Newman (newman@SayICan.com) is president of the speech recognition consulting firm Say I Can, Inc., specializing in usability testing and interface design. His free e-mail newsletter is available online at www.SayICan.com.
¹ Danis, Catalina, Comerford, Liam, Janke, Eric, Davies, Ken, DeVries, Jackie, and Bertran, Alex, "StoryWriter: A speech oriented editor," Proc. CHI '94: Human Factors in Computing Systems: Conference Companion, ACM, New York (1994), 277-278; cited in Shneiderman, Ben, Designing the User Interface, Third Edition (Addison-Wesley, 1998).