Good Vibrations

There may be no greater enemy to speech recognition than noise. Directly or indirectly, it causes most speech recognition errors, and the errors it causes can be the most persistent and the most difficult for the recognizer to eliminate. While most dictation software can eventually learn the idiosyncrasies of a particular speaker and learn to produce the right word, errors caused by noise are often too irregular to be countered. They can so distort the final response as to render it beyond the understanding of the most sympathetic "best guess."
The growth of speech recognition depends on eliminating noise. Noise even shapes the speech recognition market: vertical markets where there is little noise have found speech recognition far easier to implement than industries that function at higher decibel levels. Medical dictation systems, for example, have to date fared better in radiology labs than in emergency rooms.
A speech recognizer cannot simply pick up what the human brain recognizes easily - that the voice of another human is different from, say, a teakettle or a passing car. For that reason, speech scientists have devoted a great deal of time and energy to isolating speech and setting it apart from the surrounding noise - separating the wheat from the chaff, as the saying goes.
All sound is created by causing a medium to vibrate. Sound travels through the medium as adjacent particles vibrate: one air particle "bumps" the one next to it and moves it farther out, that particle "bumps" another, and so on. The easiest way to visualize this is to picture a rock dropped into a pond. Ripples form in the water and lose their intensity as they fan out in a sinusoidal curve, the same shape as a sound wave.
Sound systems amplify sound by converting sound waves into electrical energy, boosting that energy, and then converting it back into sound. Signal processors change one or more aspects of the audio signal.
A microphone is considered an input transducer because it turns sound into an electrical signal that represents the sound. The speaker of a headphone is an example of an output transducer, which converts the amplified audio signal back into sound.
Getting the sound right is critical if speech recognition is to be effective, and the best recognition accuracy is achieved with the best possible signal-to-noise ratio. Many factors affect speech recognition, including processor speed, memory and the sound card. But if the microphone cannot capture a good sound signal and feed it to the sound card for the speech engine, accuracy suffers.
Active Noise Cancellation
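To make the signal-to-noise point concrete, here is a minimal Python sketch that measures the ratio in decibels and then applies the invert-and-add idea behind the two-microphone scheme described under Active Noise Cancellation. Everything in it is an illustrative assumption rather than any vendor's implementation: the 8 kHz sample rate, the 200 Hz test tone standing in for a voice, the random noise, the 0.9 mismatch factor between the two microphones, and the function names `power`, `snr_db` and `cancel`.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a list of samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(power(signal) / power(noise))


def cancel(inside, outside):
    """Invert the outside-microphone signal and add it to the inside signal."""
    return [i + (-o) for i, o in zip(inside, outside)]


SAMPLE_RATE = 8000  # assumed sample rate, in samples per second
n = SAMPLE_RATE     # one second of audio
rng = random.Random(0)

# Stand-ins for real audio: a 200 Hz tone as the "voice", random values as noise.
voice = [0.5 * math.sin(2 * math.pi * 200 * t / SAMPLE_RATE) for t in range(n)]
noise = [rng.uniform(-0.5, 0.5) for _ in range(n)]

# Inside microphone: the voice plus the room noise.
inside = [v + x for v, x in zip(voice, noise)]
# Outside microphone: noise only, picked up at a slightly different level
# (the 0.9 factor is an assumed mismatch, since real microphones never match).
outside = [0.9 * x for x in noise]

cleaned = cancel(inside, outside)
residual = [c - v for c, v in zip(cleaned, voice)]  # noise left after cancelling

snr_before = snr_db(voice, noise)
snr_after = snr_db(voice, residual)
print(f"SNR before: {snr_before:.1f} dB, after: {snr_after:.1f} dB")
```

Because the outside microphone in this sketch captures 90 percent of the noise amplitude, one-tenth of the amplitude (one-hundredth of the power) remains after cancellation, which is a 20 dB improvement. A real headset would also have to cope with delay and frequency differences between the two microphones, which this sketch ignores.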
Active noise cancellation (ANC) removes certain frequencies, both from outside sources and from the caller, limiting the amount of the sound sample that goes to the speech engine while preserving all of the caller's speech. ANC circuitry usually includes two microphones in the boom: one on the outside, listening to any source coming to the boom from the outside, and one on the inside, listening to the user as well as to outside sources. The ANC circuitry inverts the signal from the outside microphone and adds it to the signal from the inside microphone, cancelling out the undesirable frequencies and providing a good signal to the speech engine.
Different microphones are right for different tasks, and of course different people find their needs vary considerably. Microphones built into the computer or monitor, or even into the keyboard, have caused some problems with speech recognition because they are often too far away and capture too much noise. Desktop microphones that rest on a microphone stand are also available. Until the recent development of array technology, desktop microphones had to be kept pointed toward the user and could be used only in a quiet room. Handheld microphones are also effective, but many users find it difficult to hold the microphone steadily at the proper distance from the mouth.
Close-talk microphones and ear-piece microphones solve this problem by positioning the microphone as close to the mouth as possible, and for that reason they became a preferred method for speech recognition use as the technology developed. Headsets that position close-talk microphones at the exact point where they can be most effective had long been regarded as a critical enabling technology for speech recognition. But in the mid '90s some developers saw close-talk microphones as a constraint on the growth of speech.
The feeling was that the systems required users to go through too many "lifestyle changes," like wearing a headset. Advances were made to make the headsets incredibly light and even more attractive. A growing number of workers are on the phone for hours and will certainly continue to use headsets, but there are also users who want the advantages of speech without wearing a headset or being tethered to the desktop. Much like the ripples of water from a rock dropped in a pond, speech is moving farther and farther away from its original center, the desktop computers where it first made its mark.
Far Field
Far field microphones enable the user to achieve effective speech recognition simply by being within hearing range of the desktop PC. Among the first "array microphones" (so named because in reality they are a cluster of microphones, not just one) are the Andrea DA-400 Desktop Array, the Labtec LVA-7280 ClearVoice Digital Microphone, the GN Netcom Voice Array Microphone and the Telex Aria Desktop M-60. With an array approach, a digital signal processor compares sound signals from four microphones, using the differences between them to "lock in" on the sound of the user's voice. This gives a consistent speech signal to the computer. The array approach is already recognized as an improvement on old desktop microphones, which generally work badly for speech recognition. Typically, the three types of microphones mentioned above are getting results quite comparable to headsets.
Array technology uses a group of microphones to eliminate unwanted noise while providing a natural language interface. One vendor has compared the technology to a camcorder, which can zoom in and out on an intended subject. The technology has numerous applications beyond desktop computing and can be used in automobile PCs, video conferencing and home automation systems.
The most recent ripple in the speech recognition pond comes from Clarity, a Troy, Mich.-based company that produces proprietary software that extracts voice from noisy environments and allows voice-activated devices to operate more consistently. Clarity's Clear Voice Capture (CVC) mimics the human ear and brain by using multiple microphones to separate out one or more signals. CVC is an audio extraction system: rather than suppressing noise, it separates out the signal of interest. The company's vice president of business development, Fred Nussbaum, reports that Clarity's software solution is achieving performance comparable to, and in some cases superior to, desktop microphones and even headset microphones.
As Clarity sees it, three issues drive the need for effective noise cancellation: 1) a growing need to go mobile, which from a speech recognition perspective means interference can come from anywhere; 2) growing use of voice to control PDAs, Internet appliances, cell phones and PCs; and 3) a world that becomes a little noisier every day. As speech recognition continues to spread out into the wider economy beyond its original desktop PC base, issues of noise cancellation will become more important than ever.
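For readers curious how a signal processor might use the differences between several microphones, as described under Far Field, the following Python sketch shows delay-and-sum beamforming. This is a common textbook technique swapped in for illustration; the article does not say which method any of the named products uses. The four arrival delays, the test tone, the noise level and the function names are all hypothetical.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a list of samples."""
    return sum(s * s for s in samples) / len(samples)


def delay_and_sum(channels, delays):
    """Shift each channel forward by its delay (in samples), then average.

    The voice lines up across channels and is preserved, while noise that
    differs from microphone to microphone partly averages away.
    """
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(length)
    ]


rng = random.Random(1)
n = 4000
voice = [math.sin(2 * math.pi * t / 40) for t in range(n)]  # stand-in for speech

# Hypothetical arrival delays: the voice reaches each microphone a few
# samples later than the one before it. A real array would have to estimate these.
delays = [0, 3, 6, 9]

channels = []
for d in delays:
    # Each microphone hears the delayed voice plus its own independent noise.
    channel = [
        (voice[t - d] if t >= d else 0.0) + rng.gauss(0.0, 1.0)
        for t in range(n)
    ]
    channels.append(channel)

aligned = delay_and_sum(channels, delays)

# Compare the noise in a single microphone with the noise left after combining.
single_noise = [channels[0][t] - voice[t] for t in range(len(aligned))]
residual = [aligned[t] - voice[t] for t in range(len(aligned))]
print("noise power, one microphone:", round(power(single_noise), 3))
print("noise power, four combined: ", round(power(residual), 3))
```

Averaging four channels whose noise is independent and equally strong cuts the noise power to roughly one quarter, about 6 dB, while leaving the aligned voice untouched. Sound arriving from other directions would not line up under these delays, which is the sense in which an array can "zoom in" on one talker.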
Brian Lewis is the former executive editor of Speech Technology magazine and can be reached at 203-438-3581 or BrianL3581@aol.com.