WYSIWYS: What You See is What You Say

What does the term "speech-aware" mean in an application? I first heard the term in 1994 while developing and teaching speech programming classes at IBM. A "speech-enabled" application was an existing application that used keystroke macros attached to a voice keyword to perform an action. This was the equivalent of a macro in WordPerfect or Brief, but triggered by voice. A "speech-aware" application, on the other hand, was written to access the speech engine APIs directly, so it was "aware" that speech was available.

But to me, "speech-aware" meant a lot more. In my view, an application is "speech-aware" when it is designed from the beginning to take into account how a user would "talk" to it. This is what I taught in my API classes and practiced in my application designs.

So, what is a real "speech-aware" application? This is a very basic design question. Over the years, I have looked at how a user would want to talk to a computer, my vision being a computer that responds much like the one in the current Star Trek TV series: you ask a question and receive some sort of audio response. With this in mind, I will explore what "speech-aware" should be.

The development of speech in some ways mirrors the development of the mouse. Before the mouse became a part of the computer, it was an accessory. Most programs then used menu options on a screen that took the user to more screens with more options or data fields. This approach is still in use on mainframe terminals. When computers became powerful enough to handle a mouse pointer and a GUI, applications started changing from these "menu"-driven types to a "point and click" type. This made applications more "user-friendly" and easier to operate.

Added Power = Added Complexity
But the entire process quickly became much more complex. In a modern word processor, a user can click on a menu item, select a sub-menu, then another, and soon be at a complete loss as to how to do something. "Format" no longer means just formatting a paragraph. It now includes such things as drawing objects. A user can also get confused searching through a help window, where what he wants to find is hidden by terminology that changes from one manufacturer to another. It can take days to read a manual or go through a boring tutorial just to learn how to type and print a simple letter. Adding power for the user has added complexity. The interface was standardized in the beginning so all users could click on "File" and know that's where to save a file. Now you also print from a "File" menu.

A speech-aware application has to start from the beginning. The user, when he starts the application, should be able to say whatever words appear in front of him and be productive without a tutorial. I call this "WYSIWYS" (what-you-see-is-what-you-say). The application, since it is driven by speech, should not need any pull-down menus. If the user had to say "File", then "Save" from the pull-down, why should he use speech? A mouse would be much faster.

The same was said about the mouse when it first came into widespread use. Many people predicted that the keyboard would be dead once true mouse GUIs were written. But it turned out to be extremely difficult to type by clicking on one letter at a time with a mouse. So, the keyboard stayed. A standard was developed that allowed the user to use the keyboard if he didn't have a mouse, but this was very awkward. A speech-aware application must likewise remain accessible through the standard input devices, even though the process can sometimes be clumsy. Speech has many of the same advantages that the mouse and keyboard had over their predecessors.
In some instances, a mouse is preferred over speech; in others, it is not. The history of applications shows that GUIs didn't appear overnight in their current form, and that application programmers changed their thinking as the technology changed. A speech application has to be approached in a similar manner. As there are no common user access (CUA) standards for a speech application, I will put forth my suggested standards for developing speech-aware applications.

Let's start with a basic word processor. We'll call it Jerryword. The normal screen would contain a title bar, a menu bar, maybe a toolbar, and of course the client area where you would type. A menu item all programs have is "File". In the normal course of things you would click on "File" and then on "Save" to get the file-save dialog box. But Jerryword doesn't have a pull-down for "File", just the menu item. Jerryword also has two status lines: one to show the command recognized by the computer, and another to show the status of the processing of the command.

So, here you would say "file" and the word "File" would appear in the first status line. The word "Working...." would appear in the other status line. And, best of all, a nice voice would come through the speakers stating, "Thank you. I am ready." Then the file dialog box would appear with the standard drop-down combo boxes, but with the following buttons: Save, Delete, Close, and Help.

Then, if you wanted to save the file to another drive, you might say "Drive" to highlight the drive box, then "D" to select drive D. To select a folder, you might say "Open Folder" to highlight the folder box, then "My Folder" to open the folder. If you didn't want to say the folder name, you could say "Down" to move down one item, "Down 2" to move down two items in the list, "Page down" to move down the equivalent of clicking on the scroll bar once, or "Bottom" to reach the last entry in the list.
Navigating up the list would work the same way: "Up" moves up one item, "Up 2" moves up two items, "Page Up" moves up the equivalent of one scroll bar click, and "Top", "Home", "Begin", "Beginning", etc. reach the top of the list.

If you wanted to change the name, you might say "Name" to highlight the name box, then "J", "E", "R", "R", "Y", "dot doc" to produce the file name JERRY.DOC. Then you would say "save" to save the file and close the dialog box.

Close, Cancel or Quit
If you decided you didn't want to save the file, you could have said "Close", or maybe the old standard "Cancel", or even "Quit" to close the dialog.

In the above example, note how three menu items, "Save", "Save as", and "Delete", were handled in the same common dialog. Also, you see that navigation between the fields in the box was as easy as clicking a mouse on them. List box navigation is as simple as a mouse click, and better in some cases. And you could even spell into the entry fields if you were unable to type. Since this is a speech-aware application, shortcuts are also available and should be used extensively.

Dictation navigation is more difficult. In Jerryword, you have just finished dictating a letter. The cursor is at the end. How do you get it to the beginning of the paragraph? You say "Begin paragraph". Jerryword is even designed to allow you to add your own commands for an action. Say, for "Cancel", the new command could also be "Kill".

So what we have in a speech-aware application is not just the ability to use a new type of technology, but a way for the programmer to put his personality into the application and communicate with the user in a more personal way, using standard, everyday, conversational words, without requiring a manual. And the user feels he or she is beginning to have a conversation with the application.
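As a rough illustration of the list navigation commands and user-defined command words described above, here is a minimal Python sketch. It is not Jerryword's actual implementation: the names `interpret`, `add_alias`, `ALIASES`, and `PAGE_SIZE` are hypothetical, the page size of 10 rows is an assumption, and the phrases are assumed to arrive as plain text from whatever speech engine is in use.

```python
import re

# Hypothetical table of spoken phrases and the canonical command each maps to.
# Several phrases may share one command, as with "Top", "Home" and "Begin".
ALIASES = {
    "top": "top", "home": "top", "begin": "top", "beginning": "top",
    "bottom": "bottom",
    "close": "close", "cancel": "close", "quit": "close",
}

PAGE_SIZE = 10  # assumed number of list rows moved by one "page"


def add_alias(phrase, command):
    """Let the user attach a personal phrase to an existing command."""
    ALIASES[phrase.strip().lower()] = command


def interpret(phrase, index, length):
    """Return (command, new_index) for a recognized phrase.

    `index` is the currently highlighted list item and `length` is the
    number of items. Unrecognized phrases return ("unknown", index).
    """
    text = phrase.strip().lower()
    last = max(length - 1, 0)

    # "up", "down", "up 2", "down 2": move by an optional item count.
    match = re.fullmatch(r"(up|down)(?: (\d+))?", text)
    if match:
        step = int(match.group(2) or 1)
        if match.group(1) == "up":
            step = -step
        return "move", min(max(index + step, 0), last)

    # "page up" / "page down": move by one page of rows.
    match = re.fullmatch(r"page (up|down)", text)
    if match:
        step = PAGE_SIZE if match.group(1) == "down" else -PAGE_SIZE
        return "move", min(max(index + step, 0), last)

    # Everything else is looked up in the alias table.
    command = ALIASES.get(text)
    if command == "top":
        return "move", 0
    if command == "bottom":
        return "move", last
    if command is not None:
        return command, index
    return "unknown", index
```

After a call such as `add_alias("kill", "close")`, the phrase "Kill" resolves to the same `close` command as "Cancel" or "Quit". Keeping every synonym in one table is the design choice that lets the user extend the vocabulary without touching the navigation logic.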

Jerry McKinney is president of DelRey Software, 9128 Cumberland Drive, Irving, Tex. 75063, and can be reached at 972 401-3336 or by e-mail at jhmckin@ibm.net.
