The Importance of Adaptation in Automatic Speech Recognition
Many factors play into how well a speech recognition engine adapts to the characteristics of each speaker.
Posted Jul 16, 2010
  

There are a variety of strategies employed to improve accuracy in automatic speech recognition (ASR) and related applications. A central approach in most speech recognition systems is the targeting of the application to its intended use (and user) via a process known as adaptation.

For ASR applications where adaptation is not practical, other targeting approaches can be used. For example, a “name dialer” IVR application could use real telephony speech from a variety of sources to build its models. For specialized and user-oriented ASR applications, such as medical transcription, or even embedded speech recognition, such as in automobiles, augmenting the original model by adapting it to the individual speakers who dictate offers dramatic accuracy improvements, based on the audio that is entered into the recognition system. Such applications involve two main steps: training/adaptation and testing/evaluation.

Adaptation is key to how efficiently an ASR engine processes a speaker’s voice input and how accurately it converts that input to words or text. Adaptation enables the engine to adjust to nuances in the speech spectrum that depend on different kinds of parameters, such as:

  • Person: The speaker’s pitch, tone, and so on (speech features are usually extracted in the cepstral and spectral domains), which differentiate male and female voices, languages, utterance speeds, accents, and dialects;
  • Devices used: The type of microphone and the processing engines; and
  • Environment: The ambience or environmental conditions, including different kinds of noise inputs along with the speaker’s audio.
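
As a rough illustration of one person-level feature from the list above, the sketch below estimates pitch from a short frame of audio using autocorrelation. It is a minimal sketch under stated assumptions, not how any particular engine extracts features: the function name `estimate_pitch_hz`, the 50-400 Hz search range, and the synthetic 125 Hz tone standing in for voiced speech are all invented for the example.

```python
import math


def estimate_pitch_hz(samples, sample_rate, min_hz=50.0, max_hz=400.0):
    """Return the pitch whose lag maximizes the frame's autocorrelation."""
    min_lag = int(sample_rate / max_hz)   # shortest period considered
    max_lag = int(sample_rate / min_hz)   # longest period considered
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself delayed by `lag` samples.
        score = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Hypothetical input: a pure 125 Hz tone, 0.1 s long, sampled at 8 kHz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 125 * n / SAMPLE_RATE) for n in range(800)]
pitch = estimate_pitch_hz(tone, SAMPLE_RATE)
print(f"estimated pitch: {pitch:.1f} Hz")
```

Real speech is far less regular than a pure tone, so a value like this would be only one of many features an engine adapts to.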

A speaker’s manner of utterance plays a crucial role in both adaptation and recognition. Different kinds of engines have different mechanisms for adapting to the speaker’s utterance. For instance, in a phonetic engine, the adaptation will look for a particular phoneme that a speaker is going to utter. In the case of a large vocabulary continuous speech recognizer (LVCSR), the model will look for a particular word and the speed at which it is uttered. Speed of utterance is a very important factor, particularly with individuals such as physicians dictating about their patients. Pitch and other audio features also play a key role, because the spectral features captured for recognition differ between male and female voices. Accents and dialects matter as well, since people from different regions can pronounce the same words differently.

Devices, such as high-quality microphones, and adequate processing power in the devices or machines are also essential for adaptation and recognition. A speaker’s voice captured by the microphone on a PDA is going to sound different from the same voice captured by the microphone on a desk telephone or one tethered to a PC. Adapting to the speaker’s voice as captured on all such devices is crucial, so that the user can dictate on any of them and be recognized seamlessly.

Similarly, the processing power of the machines needs to be high enough for the engines to adapt to these voices in real time without losing audio packets.

One of the most important factors in a good adaptation process is the environment in which speakers train or adapt their voices on the recognition engines. For instance, adaptation performed in a clean room environment is always going to perform better than adaptation performed while the speaker is driving a car with the windows rolled down, or walking on a busy street with a lot of wind or other street and car noise during the voice capture. It is also important to clean or filter this audio, so that recognition is based on an adaptation that used filtered audio.

Other factors that play an important role in adaptation are online versus offline adaptation, supervised versus unsupervised learning/adaptation, and the length of audio used to adapt.

Adaptation to a speaker’s input, for better ASR efficiency and accuracy, can be performed online (incremental) or offline (batch). Online adaptation happens while the speaker is providing input to the ASR engine for recognition: the engine learns and adapts to the user’s speech as more of the speaker’s audio is recognized. In online adaptation, the first few minutes of recognized words are usually less accurate than they would be if the acoustic and language models had already been saturated with training/adaptation data. Offline adaptation trains the ASR engine’s models on a corpus of the speaker’s audio (typically as a batch) before the user starts using the engine for recognition or transcription. The offline training audio usually includes words the speaker uses regularly. For instance, a physician who specializes in a certain practice, such as oncology, will use key medical terms relating to tumors and cancers during training, so that the engine learns how that physician pronounces the medical vocabulary.
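
The difference between the two modes can be sketched with a toy model. The article does not say which algorithm an engine uses; the example below assumes a MAP-style update of a single scalar feature mean as a stand-in, and the class `AdaptedMean`, the function `adapt_offline`, the prior weight, and the frame values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AdaptedMean:
    """MAP-style estimate of one scalar feature mean (illustrative only)."""
    prior_mean: float       # speaker-independent mean from the baseline model
    prior_weight: float     # how many frames of evidence the prior is worth
    frame_count: int = 0
    frame_sum: float = 0.0

    @property
    def value(self) -> float:
        # Weighted blend of the baseline mean and the speaker's own frames.
        return (self.prior_weight * self.prior_mean + self.frame_sum) / (
            self.prior_weight + self.frame_count
        )

    def update(self, frame: float) -> float:
        """Online step: fold in one frame, return the current estimate."""
        self.frame_count += 1
        self.frame_sum += frame
        return self.value


def adapt_offline(prior_mean, prior_weight, frames):
    """Batch step: use all enrollment frames at once."""
    return (prior_weight * prior_mean + sum(frames)) / (prior_weight + len(frames))


frames = [2.0, 2.0, 2.0, 2.0]            # made-up speaker feature values
online = AdaptedMean(prior_mean=0.0, prior_weight=4.0)
trajectory = [online.update(f) for f in frames]
batch = adapt_offline(0.0, 4.0, frames)
print(trajectory, batch)
```

In this toy setting the online estimate starts close to the baseline and moves toward the speaker with each frame, ending where the batch estimate starts. That mirrors the point above that early online recognition is less accurate than recognition with models that have already been adapted.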

Another significant factor in improving the acoustic model is whether adaptation is supervised or unsupervised. In supervised adaptation, there is usually an existing baseline transcript on which the acoustic models are trained. In unsupervised adaptation, there is no transcript, and usually the more new words that are spoken, the better the models adapt. The engine aligns the spoken words with the baseline model and uses its own transcribed words for adaptation. Of course, if recognition accuracy is poor, unsupervised adaptation is usually not preferred. A confidence measure is used to decide whether the transcribed words can be used for adaptation. Most Hidden Markov Model (HMM)-based recognition engines use unsupervised adaptation or a mixture of supervised and unsupervised schemes.
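
The confidence check described above can be sketched as a simple filter. This is only an illustration: the `RecognizedWord` type, the 0-to-1 confidence scale, the 0.8 threshold, and the sample hypothesis are assumptions, not values taken from any engine.

```python
from typing import NamedTuple


class RecognizedWord(NamedTuple):
    text: str
    confidence: float   # assumed to lie in [0, 1]


def select_for_adaptation(words, threshold=0.8):
    """Keep only words the engine is confident it transcribed correctly."""
    return [w for w in words if w.confidence >= threshold]


def usable_fraction(words, threshold=0.8):
    """Share of the hypothesis that passes the confidence check."""
    if not words:
        return 0.0
    return len(select_for_adaptation(words, threshold)) / len(words)


# Hypothetical recognizer output for one dictated phrase.
hypothesis = [
    RecognizedWord("patient", 0.95),
    RecognizedWord("presents", 0.91),
    RecognizedWord("with", 0.88),
    RecognizedWord("carcinoma", 0.42),
]
kept = select_for_adaptation(hypothesis)
print([w.text for w in kept], usable_fraction(hypothesis))
```

A low usable fraction is one possible signal that recognition accuracy is too poor for unsupervised adaptation to help.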

The length of audio used for training is vital to a good adaptation scheme. A short duration of audio will be insufficient for the model to adapt to all the features in the speaker’s audio. On the other hand, several hours of audio might be overkill. Typically, ASR engines are fully adapted with 10-30 minutes of audio input.
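
As a small sketch of how that guideline might be applied, the helper below classifies the total duration of a set of enrollment recordings against the 10-30 minute range given above. The function name, the status labels, and the recording lengths are hypothetical, and a real engine may well use different limits.

```python
def adaptation_audio_status(total_seconds, min_minutes=10, max_minutes=30):
    """Classify enrollment audio length against a target range in minutes."""
    minutes = total_seconds / 60
    if minutes < min_minutes:
        return "insufficient"
    if minutes <= max_minutes:
        return "adequate"
    return "more than needed"


# Hypothetical enrollment recordings, in seconds.
recording_seconds = [240, 310, 185]
total_seconds = sum(recording_seconds)
status = adaptation_audio_status(total_seconds)
print(f"{total_seconds / 60:.2f} minutes: {status}")
```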

Given all of these factors, one can understand the reluctance of providers of ASR engines and variants, such as analytics and biometrics, to perform an adaptation process when the user audio needs special processing, such as compression, filtering, or other signal processing algorithms. In fact, the gains achieved by using good signal processing algorithms, such as de-noising filters, outweigh the time and effort spent training or adapting the models with the processed audio.

Depending on the application (such as transcription, analytics, or biometrics) and the kinds of engines used, various approaches can improve the quality of the recognition engines by creating new models or adapting existing ones to the processed audio. Also, because processing alters the spectral features of the audio, it is always better to keep adaptation and recognition in the same domain: unprocessed audio for both adaptation and recognition, or processed audio for both. Very few examples have shown benefits from crossing the two domains, that is, adapting on processed audio and recognizing unprocessed audio, or vice versa.
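
One way to guard against mixing domains is to record the processing applied to the adaptation audio and compare it with the processing applied at recognition time. The sketch below assumes a hypothetical `AudioDomain` record and made-up step names such as "denoise"; it illustrates the idea and is not a feature of any specific engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioDomain:
    """Ordered processing steps applied to audio; empty means unprocessed."""
    processing: tuple = ()


def domains_match(adaptation: AudioDomain, recognition: AudioDomain) -> bool:
    """True when both stages saw audio processed in the same way."""
    return adaptation.processing == recognition.processing


raw = AudioDomain()
denoised = AudioDomain(processing=("denoise",))

print(domains_match(raw, raw))             # both unprocessed
print(domains_match(denoised, denoised))   # both processed the same way
print(domains_match(denoised, raw))        # crossed domains
```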


Veeru Ramaswamy, Ph.D., is chief technology officer at Vianix. He can be reached at veeru@vianix.com.



