Rating Speech as a Human-Computer Interface

As computers have become more pervasive, it has become clear that many people have difficulty understanding and communicating with them. Users feel they should be able to simply state what they want done, and they are frustrated at having to learn non-intuitive procedures in order to accomplish anything useful. Furthermore, such communication often takes place through slow, hard-to-use devices such as mice and keyboards. An easier, faster, and more intuitive method of communicating with computers is needed.

One proposed method is the combination of speech recognition and natural language processing software. Speech recognition (SR) software detects human speech in an audio signal and parses it to generate a string of words, sounds, or phonemes representing what the person said. Natural language processing (NLP) software processes the output of the speech recognition software and works out what the user meant. The NLP software can then translate what it believes to be the user's command into an actual machine command and execute it.

SR and NLP systems are tremendously complex pieces of software. While a variety of algorithms are used to implement them, there seems to be something of a standard understanding of the fundamental methods. Speech recognition works by breaking sound down into atomic units and then piecing them back together into distinct words, while NLP attempts to translate words into ideas by examining context, patterns, phrases, and so on. A series of phonemes makes up a syllable, syllables make up words, and words make up sentences, which in turn represent ideas and commands. Generally, a phoneme can be thought of as the sound made by one or more letters in sequence with other letters.
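The step of piecing phonemes back together into words can be pictured as a nearest-match lookup. The following is only a minimal sketch under stated assumptions, not a description of how any particular SR product works: the phoneme symbols, the tiny `LEXICON`, and the names `edit_distance` and `best_guess` are all hypothetical, and plain edit distance stands in for whatever scoring a real system would use.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical pronunciation lexicon: the phoneme symbols are made up
# for illustration and do not follow any standard phoneme set.
LEXICON: Dict[str, Tuple[str, ...]] = {
    "gravel": ("g", "r", "ae", "v", "ah", "l"),
    "travel": ("t", "r", "ae", "v", "ah", "l"),
    "memo": ("m", "eh", "m", "ow"),
    "pudding": ("p", "uh", "d", "ih", "ng"),
}


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # drop a phoneme from a
                            curr[j - 1] + 1,      # insert a phoneme from b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def best_guess(phonemes: Sequence[str]) -> str:
    """Return the lexicon word whose phonemes are closest to the input."""
    return min(LEXICON, key=lambda word: edit_distance(phonemes, LEXICON[word]))
```

With this toy lexicon, `best_guess(("g", "r", "ae", "v", "l"))` returns `"gravel"` even though one sound was dropped, because "gravel" is one edit away while "travel" is two.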
When the SR software has broken sounds into phonemes and syllables, a "best guess" algorithm is used to map the phonemes and syllables to actual words. Once the SR software has translated sound into words, the NLP software takes over. NLP software parses strings of words into logical units based on context, speech patterns, and more "best guess" algorithms. These logical units of speech are then analyzed and finally translated, using the same principles, into actual commands the computer can understand.

Optimally, the two software packages can work with each other to facilitate comprehension. For example, an SR package could ask an NLP package whether it thinks the "tue" sound means "to", "two", or "too", or whether it is part of a larger word such as "tutelage." The NLP system could make a suggestion to the SR system by analyzing what seems to make the most sense given the context of what the user has previously said.

It could work the other way around as well. For example, an NLP system could query an SR system to see whether a user seemed to emphasize a certain word or phrase in a given sentence. If the NLP system knows which words the user emphasized, it may be able to determine more accurately what the user wants (e.g., the sentence "I don't like that!" means something subtly, yet importantly, different depending on which word is stressed). SR systems may be able to determine which sounds or words were emphasized by analyzing the volume, tone, and speed of the phonemes spoken by the user, and then report that information back to the NLP system.

Problems

So why aren't speech recognition and natural language processing more widely used? Thus far, SR has been plagued by the difficulties of understanding different types of voices (e.g., male vs. female voices), parsing sounds when people speak different dialects or with different accents, and distinguishing between background noise and commands issued to the computer. Moreover, if SR is to work in real time, the software must have access to a large, fast database of known words and the ability to add more words. The problems facing NLP software are even more difficult to overcome. NLP must be able to understand sentences peppered with verbal artifacts, slang, synonyms, ambiguities, and colloquialisms.

Differences in pronunciation, enunciation, and speech patterns are a long-standing problem for SR software. For example, the way a child with a high-pitched voice and a Southern drawl pronounces "gravel" may differ significantly from the way a deep-voiced man from the Northeast pronounces the same word, yet adept SR software should be able to determine that both people are speaking the same word. This can be accomplished by allowing several different patterns of phonemes to map to a given word. Of course, doing so increases the size of the database needed to map phonemes to words, but this is becoming less of a problem as computers become faster and cheaper. Additionally, some SR algorithms use fuzzy logic to help determine what the user has said. Indeed, these technologies are becoming robust enough to allow computerized telephone services to gather information from users. (Admittedly, the vocabulary of these systems is extremely limited; e.g., a computer will ask a user some simple questions and ask the user to respond with only the words "yes" or "no.")

The problem of distinguishing speech directed at the computer from background noise has not been dealt with as successfully.
Currently, users of SR packages often must either work in an environment with minimal background noise or wear a headset with a microphone inches from their mouths. Certainly, this is not the most desirable user interface. It is inefficient, taxing, and not "user-centered."

Verbal Artifacts

Verbal artifacts represent a different type of problem. Verbal artifacts are words or phrases that are spoken but add little, if any, content to a sentence. For example, the sentence "Golly, I sure love pudding!" contains two verbal artifacts: "golly" and "sure." NLP software must be able to identify these types of words for what they are and react appropriately.

NLP also needs to recognize that human beings can convey a single idea in many synonymous ways. For instance, an NLP system must be able to understand that a user saying "Take a memo." is conveying virtually the same idea as a user saying "Why don't you record this for me?" The first sentence is direct and rather unambiguous, but the second comes in the form of a question, even though it is really a request. It is easy to see how confusion could arise in such a situation.

Researchers exploring NLP have not yet been able to develop systems robust enough to handle these dilemmas. Currently these problems are dealt with by simply "hard-coding" certain words and phrases that are synonymous. Ultimately, NLP systems will need to recognize and react to such synonyms from the context in which they occur, the user's habits (e.g., does the user normally make requests in the form of a question, or is the user actually asking a question?), and so on. Moreover, NLP systems must have an extraordinary understanding of grammatical rules, practices, and structures. Truly adept NLP systems would also need to identify and react appropriately to sarcasm, humor, rhetorical questions, and the like.

Benefits

Why then should we try to implement speech recognition and natural language processing if they are so hard to do? Simply put, SR and NLP could revolutionize the entire field of human-computer interaction like nothing before. SR and NLP can greatly abstract human-computer interaction, eliminating the need to understand anything about the computer's internal workings or how to accomplish certain tasks.

What are some specific benefits of SR and NLP interfaces? SR and NLP could allow real-time language translation. If a computer can figure out what words one utters and understand what one actually means, translating that idea from one language to another becomes a much easier task. Computers with capable natural language processing abilities will begin to act on the ideas that their users have, not the commands explicitly given to them. Indeed, one should be able to say to a computer, "Do what I meant, not what I said."

SR and NLP technologies could also conceivably eliminate the need to physically interact with computers. This means no more having to sit down in front of the computer and manually manipulate a keyboard or mouse. Instead, we'll have the freedom to interact with a computer from anywhere within earshot of it. For example, one could instruct a computer to find a recipe for Chicken Kiev while hanging wallpaper down the hall. More importantly, people with certain types of disabilities may be able to interact with a computer more effectively. For example, a person with a broken arm would be able to work with computers easily, whereas now a broken arm would almost certainly impair one's ability to operate traditional interface devices such as a keyboard or a mouse.

SR and NLP have the added benefit of being much faster than many other types of interfaces. Most people can speak much faster than they can type. If a user can convey in four seconds an idea that would otherwise take 20 or 30 seconds to type, productivity could be greatly improved.
Clearly, this would be highly desirable.

Perhaps the greatest benefit of NLP will not be directly in the field of computer input and output, but in the ability of a computer to understand a user's desires so profoundly that it can act autonomously on its user's behalf. For example, one will be able to tell a computer to search the Internet for information on recreational activities in the Yellowstone Park area. An electronic, intelligent "agent" could be given a command from the NLP system instructing it to search for this data. Because the NLP system will have understood the meaning of what the user requested, the agent could be instructed to search for a variety of activities: camping, snowmobiling, cross-country skiing, etc. The key idea is that NLP will be able to transform words and phrases into ideas. Once a computer can accurately understand human ideas, true artificial intelligence won't be too far behind.

Mike Machowski is a software engineer for VR·1, a game company located in Boulder, Colorado. His e-mail address is machowsk@cs.colorado.edu. He wrote this article for a course in user interfaces while pursuing a BS in Computer Science at the University of Colorado, Boulder.
