Your Voiceprint Will Be Your Key
Soon there will be no need to carry money, keys, credit cards or identification cards. Devices that automatically identify a person from speech patterns will become ubiquitous. Your voice will let you enter secure buildings, log on to networks, and use secure phone applications such as bank-by-phone and long distance service lines. It will not be surprising to see a jogger pay for bottled water with a "cardless credit card." No personal identification number (PIN) or wallet will be needed; just your voice. Much has been written recently about advances in speech recognition and its many uses, and voice identification is a major facet of that field. Speech is the most natural human biometric and the most recognizable. Babies are able to recognize their mother's voice at birth, apparently learning their mother's "voiceprints" while in the womb. A multitude of speech tones and tempos combine to identify an individual uniquely. The advantage is that speech identifies individuals without contact; you do not need to see or touch a person to recognize him. This will drive growth in the fast-expanding speech processing market. Because knowing who a person is has such value for security and convenience, speaker recognition will be widely used before speech recognition.
Related Terms
The terms speech recognition, speaker recognition, speaker verification, and speaker identification may all seem like the same thing. These terms are related, but they do differ. Speech recognition can be defined as a system that recognizes the words and phrases that are spoken. Voice identification has been derived from the basic principles of speech recognition. Speaker recognition focuses on recognizing the speaker, and is accomplished either by speaker verification or by speaker identification. Speaker verification is a means of accepting or rejecting the claimed identity of a speaker. Speaker identification is the process of determining which speaker is present based solely on the speaker's utterance. A speaker identification application compares the input against models stored in a database to determine the speaker's identity.
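The difference between verification and identification can be made concrete with a toy sketch. Everything in it is an illustrative assumption rather than a description of any deployed system: the speaker models are short lists of numbers, the names `score`, `verify` and `identify` are hypothetical, and similarity is simply a negative Euclidean distance.

```python
import math

def score(model, features):
    """Toy similarity: negative Euclidean distance (higher means more similar)."""
    return -math.dist(model, features)

def verify(claimed_id, features, models, threshold):
    """Speaker verification: accept or reject ONE claimed identity."""
    return score(models[claimed_id], features) >= threshold

def identify(features, models):
    """Speaker identification: pick the best match among ALL stored models."""
    return max(models, key=lambda speaker: score(models[speaker], features))

# Hypothetical enrolled models and one incoming speech sample.
models = {"alice": [1.0, 2.0, 3.0], "bob": [4.0, 0.5, 1.0]}
sample = [1.1, 2.1, 2.9]

accepted_as_alice = verify("alice", sample, models, threshold=-0.5)
accepted_as_bob = verify("bob", sample, models, threshold=-0.5)
best_match = identify(sample, models)
print(accepted_as_alice, accepted_as_bob, best_match)
```

Verification looks at a single stored model, while identification has to search every model in the database, so its cost grows with the number of enrolled speakers.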
Voice identification is already in widespread use. Voice ID systems are increasingly prevalent in specialized sectors such as correctional institutions, financial institutions and telephone companies. Voice identification plays an important role in corrections, for example. Inmates are granted the right to make phone calls from prisons, but this right is frequently abused. T-NETIX provides an automatic inmate telephone calling service that controls calls based on Voice ID and reduces the possibility of fraudulent activity.
Military and law enforcement will also benefit from the technology. Voiceprints can be used to identify criminals, kidnappers and terrorists, even if they speak over the phone. Police could identify suspects in real time by having the suspect's speech compared against voiceprints stored in a database.
Call center applications, credit card purchases, banks, and automatic teller machines can all take advantage of voice identification. Call centers are beginning to use Voice ID to authenticate users before allowing the transfer of funds and stock information over the phone. Another application is combating credit card fraud. Current systems cannot answer the questions "Is the person using the credit card the person with the name embossed on the front of the card?" or "Was the person who activated the card the true owner of the credit card?" With Voice ID these questions can be answered quickly and easily. Automatic teller machines (ATMs) are another category where voice identification can be very useful, as it reduces the opportunity for fraud. Voice ID allows only authorized personnel to access your records, and eliminates the need for secret PINs. Users would just speak a password for direct, secure access.
Security
Personal computers, home banking, electronic and Internet banking are diverse technologies, but they all have one major issue in common - security. Voice identification services offer the ability to verify account balances, transfer funds, purchase traveler's checks, handle account maintenance and pay bills. Most importantly, with Voice ID, the bank can verify the person, not the PIN. Electronic commerce, Internet banking and LAN security will all likely soon require some form of Voice ID for security. Voice ID will also secure information that is too sensitive to pass over phone lines.
Wireless telephone companies are combating cellular phone fraud with Voice ID in a system currently deployed by GTE. The system targets fraud that occurs from "roaming" calls made outside the home calling area by verifying the roaming user with a voice password over a wireless phone network.
Voice Activated Device Control is another area where voice identification can easily be applied. Imagine opening the door to your office, car or house with just your voice. Voice ID will give added security to safes and bank lock boxes. There will be no need for keys and no need to worry about misplacing them - your voiceprint is your key.
What is a Voiceprint?
When you speak a word, it is made up of sub-word sounds such as syllables, phonemes, or triphones, or whatever we call the smallest unit of sound. Each of these units of sound has several dominant frequencies that remain relatively constant over that sub-word, or segment. The figure below shows the audio wave form composed of three segments:
The next figure shows that each segment has three or four dominant tones, which appear as peaks in the Linear Prediction (LP) spectrum. Linear Prediction is a technique for estimating a frequency spectrum. This table of tones is sometimes referred to as a voiceprint.
Each person has a different voiceprint, even when they say the same word. The difference becomes more pronounced as the speech sample grows to more words. The voiceprint differences also increase when people say different words. The number of possible voiceprints is large enough to uniquely identify every person in the world. Voice identification systems take advantage of this diversity. Spoken words provide a rich pool of unique identifiers for individuals. The voiceprint is basically a table of numbers stored in a computer, where the presence of each dominant frequency in each segment is expressed as a binary entry in the table's columns. This point is illustrated as follows:
Since all table entries are either 1's or 0's, each column can be read from top to bottom as a long binary code. This column for the entire word forms a unique code word that identifies the person speaking the password. When a person speaks the password, we extract his code word and compare it to the stored code word for that person. The theoretical probability of error of the voiceprint construct can be estimated. Assuming that the probability of error for one segment is 10% and that segment errors are independent, the probability of error for one word (10 segments) is $P_e = (0.1)^{10} = 10^{-10}$.
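The code-word comparison and the error estimate can be sketched in a few lines. The table contents and the function names below are hypothetical, chosen only to illustrate the idea: rows stand for frequency bands, columns for segments, and a 1 marks a dominant frequency present in that segment. The error estimate also assumes that segment errors are independent and that a word is in error only if every segment is.

```python
def code_word(table):
    """Read each column top to bottom and join the columns into one bit string."""
    rows, cols = len(table), len(table[0])
    return "".join(str(table[r][c]) for c in range(cols) for r in range(rows))

def word_error_probability(segment_error, segments):
    """Chance that all segments are in error at once, assuming independence."""
    return segment_error ** segments

# Hypothetical stored voiceprint: 4 frequency bands (rows) x 3 segments (columns).
enrolled = [
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
]
# Hypothetical attempt that differs in one entry of the last segment.
attempt = [
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
]

stored = code_word(enrolled)
same_speaker = code_word(enrolled) == stored
other_speaker = code_word(attempt) == stored
p_error = word_error_probability(0.1, 10)
print(stored, same_speaker, other_speaker, p_error)
```

An exact string comparison is the simplest possible matching rule; a practical system would more likely tolerate a few differing bits, at the cost of a higher false accept rate.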
This probability can be further reduced by adding more words. This is a distinct advantage over fingerprint recognition, iris scans and other biometrics, where only a finite number of body parts are available for measurement.
Formants
Dominant speech tones, called formants, are only a small part of the speech signal. We focus on these because they are fairly easy to estimate under various conditions. In addition, they remain fixed or invariant under many extreme conditions. As a result, a good voiceprint is obtainable even under adverse conditions. Access is granted only if the code word is calculated as a match. The challenge for speech engineers is to come up with a good, fast method to find these frequencies and compare them to the stored voiceprint.
How Does Voice Identification Work?
Voiceprint identification is really a form of pattern recognition. Pattern recognition performs a two-step process to match the given speech sample, as shown:
When the Voice ID system receives the speech input, it cleans and processes the sound. This process is called feature extraction. During feature extraction, unwanted and unimportant parts of the speech signal are removed. Background noise, channel distortion, and the volume or mood of the speaker are all examples of what is removed.
The second stage of pattern recognition is called pattern matching or pattern classification. At this point, the system evaluates the extracted features against a specific model or set of models. The result of this evaluation is a decision regarding the authenticity of the speaker in the case of speaker verification, or the identity of the speaker in the case of speaker identification. Since speaker verification only requires a decision regarding a specific model, it lends itself to numerous real-time applications.
Although pattern classification provides a solid foundation for speaker recognition applications, advances within the feature extraction phase are still necessary to extend deployment opportunities. With techniques such as robust speaker recognition, the resulting features have become much cleaner and much more accurate. This advanced technique has had a very positive effect on speech recognition and has gone a long way toward settling the issue of security that often plagues Voice ID systems.
Robust Speech Techniques
Robust speech techniques have played a critical role in advancing voice identification as "state of the art" technology. Robust speech techniques can be applied to both phases of pattern recognition, namely feature extraction and classification. For feature extraction, spectral subtraction techniques have been investigated for de-emphasizing unwanted frequency distortions which may be due to the transmission medium, such as a telephone line or cellular communications channel. As discussed earlier, the fact that speech information is based primarily on the location of its own resonant frequencies, known as formants, allows for natural robustness to noise and other distortions.
Classification Methods
Classification methods have generally used statistical classifiers to model data from the feature extraction phase. Statistical classifiers seek to determine the degree of similarity between the features extracted during training and testing. Two popular modeling approaches are the Gaussian mixture model (GMM), which has been used in text-independent applications, and the hidden Markov model (HMM), which has been used in text-dependent applications. Other modeling approaches that have been used for speaker recognition include template-based approaches, such as dynamic time warping (DTW), and neural networks, such as multi-layer perceptrons (MLPs) and neural tree networks (NTNs). Three of these methods are described below.
Hidden Markov Model (HMM) - one of the most popular modeling methods used in speaker recognition, the HMM is based on a statistical modeling of the different sounds that a user may have in their password. The HMM models not only the probability that determines how similar a new sound may be to its model, but also the probability of how one sound may make a transition to the next sound in sequence. The HMM has been evaluated extensively for speech recognition and has also more recently been applied to speaker recognition.
Dynamic Time Warping (DTW) - uses a template matching approach and provides a distance measure between the features obtained during training and testing. The DTW method is simple in concept and implementation, but it tends to have a higher error rate than some of the more modern modeling approaches.
Neural Tree Networks (NTN) - a classifier that combines the properties of neural networks and decision trees. The NTN provides a sequential decision structure that allows for efficient searches to classify vectors.
With the proper combination of several modeling methods, a composite model can be constructed which provides superior performance to the methods used individually.
Threshold Selection
Threshold selection also poses a challenging problem for speaker verification applications. Essentially, a threshold represents a decision point: a score exceeding the threshold corresponds to speaker acceptance, and a score below the threshold corresponds to speaker rejection. The threshold is an adjustable parameter allowing for tradeoffs between speaker acceptance and impostor rejection.
A speaker verification system can commit two types of errors. It may fail to recognize a true customer, leading to a false reject (FR) or "Type I" error. Or, it may wrongfully allow access to an impostor, leading to a false accept (FA) or "Type II" error. By varying the decision threshold, different operating points of the system can be obtained.
The performance of speaker verification systems is typically reported in terms of the Equal Error Rate (EER). The EER corresponds to the point where the false accept and false reject rates are equal. For example, a system that has a false accept rate of 5% and a false reject rate of 5% has an EER of 5%. Advances in speaker verification have resulted in lower EERs and are increasing the commercial viability of the technology.
A graphical plot that shows all the possible operating points of a system is called a Receiver Operating Characteristic (ROC) curve. On this curve, the operating point where the probability of false reject (PFR) equals the probability of false accept (PFA) is called the equal error point, and the corresponding error rate is the EER. A typical ROC curve is shown below:
Some Potential Problems
In speaker recognition, frequency of tone is very important. Only non-linear processes will alter the frequency of a tone. Fortunately, tape recorders and loudspeakers are intrinsically non-linear devices. Therefore, a tape recorder will move the formant frequencies of the spoken password, and a good voiceprint system should reject the recording. Typical hand-held tape recorders provide low quality recordings which are subject to these non-linear distortions. However, recordings from high quality record/playback systems, which do not introduce significant distortions, may be accepted by a voice recognition system.
Some immunity from recording devices can be gained by asking the user to enroll into the system with multiple passwords such as their mother's maiden name and place of birth. To beat the system, the impostor will have to play back the multiple passwords in the correct order and in a very short time. Enrollment using random, multiple passwords is less user-friendly; however, some users may be willing to go this route for the added level of system security.
What if I have a cold?
When a person has a cold, a Voice ID system will continue to work. Distortion, loudness, softness or harshness of the voice will not affect matching against the recorded voiceprint. It has been tested and proven that only changes in the physical vocal tract will affect the model of a voiceprint.
Another frequently asked question is about identical twins, whose vocal tracts are almost the same. "Can identical twins break into a voice verification system as each other?" The answer is "not usually," because discriminative training aids in distinguishing between you and the other person. The Voice ID system can also be fine tuned, or have its thresholds set, so that an "evil twin" cannot successfully break in as the "good twin."
Next Steps
Where will the technology go next? We have already seen the development of consumer-based voice verification for several large banks. Also, investment institutions are in testing stages for credit card, ATM banking by phone, and Internet applications. Internet stock trading is also being tested. One of the largest automobile manufacturers in the world is testing door and ignition control with the use of Voice ID. A large computer software company is testing applications for network and Internet security. A credit card point-of-sale service provider, a network vendor and a large telephone company are considering the "cardless credit card" concept. A phone-based time and attendance firm has planned to add speaker verification to assure that remote employees report to their workplace properly. A leading home automation firm wants speaker verification to assure that only authorized persons can access home functions and home security.
Voice mail can now be private and protected. Only the designated person, verified by his or her voice password, is allowed to hear his or her voice mail. Personal computers can also be kept private, and sensitive information protected, with the use of Voice ID. A voice lock program, embedded in a screen saver or screen lock, will allow only the person with the matching voiceprint to unlock the PC.
Large companies, day care centers, and health clubs are considering using Voice ID for ease and security reasons. Voice ID can help to identify employees, paid members, authorized caregivers, and more. The technology has been proven reliable and has been embraced by users in their day-to-day lives. Many large industry leaders are seeing the advantages of voice identification. As research continues to uncover new answers, integration will continue in many areas.
This process can only expand voice identification applications. Unlike social security cards, driver's licenses and other important forms of personal identification, your voice will never be stolen. Your voice, which no one can copy or steal, and which you can never forget, will be the only thing you need.
Richard J. Mammone, Ph.D., works for T-NETIX, Inc., 371 Hoes Lane, Piscataway, NJ 08854. For more information readers can contact him or Edward J. Devinney Jr. Ph.D., director of technology transfer at T-NETIX at 732-981-1960 or by fax at 733-981-1966.