Exclusive Book Excerpt: Designing Better Speaker Verification
Though automated and semi-automated speech analysis and identification technologies have massive potential within law enforcement, forensics, and intelligence, adoption has been slow and sporadic. This is partly due to poor experiences with previous generations of voice biometric technologies and a cultural misperception that voice biometrics can be “spoofed.”
Voice biometric technology vendors have also contributed to the problem by producing products that fail to address the challenges implementers face. There are critical problems that only voice biometrics can solve, but positioning the solutions well requires a deep understanding of the nature of government implementations, one that seems to escape the grasp of too many vendors.
The voice biometric industry faces a challenge that is unique to voice. Fingerprint and iris analysis, for example, were developed primarily to identify individuals. As automated biometric technologies were developed, analysis moved from human observable details to minutiae that required specialized equipment. Even biometric technologies, such as vein scanners and DNA analysis, work on minutiae that, with the right equipment, essentially become observable. Voice biometrics is not visually observable. There can be visual representations, but you can’t look at a person’s voice—even with the most powerful electron microscope.
Fingerprints and DNA are accepted for use in forensics because there is a belief that you can observe the exact physical attribute, and with enough time, an expert could manually view a set of samples and see what matches and what doesn’t. This level of perceived certainty earns these technologies admission in court under the Daubert standards (which allow attorneys to file a motion to exclude the presentation of unqualified evidence to the jury). However, because voice can never be “seen,” voice biometrics will require a significant effort to meet the Daubert standards and become admissible in court.
Even outside of court proceedings, though, there are significant uses for voice biometrics in the investigatory and forensic communities. But, if we’re going to get these groups to accept voice biometrics, we need to better understand how the technologies will ultimately be used. In a broad sense, we can set a number of categories in which voice biometrics can be useful: access control, surveillance, target identification, and forensics.
Access control is a simple application and one of the most common use cases for voice biometrics. In the majority of applications, conditions favor a very high rate of correct acceptance, with few false accepts and false rejects. The primary reason is the psychology of the interaction. When someone is registering a voice for a telebanking application, she wants to provide a good voiceprint to make authentications more successful. With access control applications, the caller has a very specific desire to be understood by the system so she can quickly get what she wants.
Surveillance is one of the key use cases where voice biometrics can be extremely successful. Barring some radical technology change, most law enforcement and intelligence agencies have already made their choices regarding the sort of probes they are going to use, what their biometric database structure is, and what kind of administrative tools they are going to allow users to access. For this reason, the voice biometric engine simply needs to be a plug-in for these existing surveillance systems.
The benefit of using voice biometrics is twofold. First, because the system is fully automated, it can monitor multiple T1/E1 lines without regular human intervention. This means that mass monitoring can be performed. Instead of having to pick specific phone numbers for wiretaps, investigators can run the calls of a cell tower, neighborhood, or community through a voice biometric surveillance system that searches for calls where there are voice matches.
The second benefit is that, by its nature, voice biometrics removes the need for humans to listen to the audio, a practice that can be an affront to civil liberties. Aside from the sheer manpower required to listen to an entire community’s phone calls, civil libertarians can bring clear civil (and possibly criminal) complaints against indiscriminate monitoring. The voice biometric system simply collects calls and determines if there is a suspected voice match. If there is, further action can be taken: the call can be recorded, a live listener can be alerted, or the phone number can be flagged for a wiretap.
Because surveillance is typically planned, the probe types can be well defined before surveillance begins, allowing a solution to be well calibrated before it starts processing live audio. Uncalibrated systems or probe types are typically very large contributors to unsuccessful voice biometric implementations.
For decades, police departments have been photographing suspected criminals and collecting fingerprints in the police station. This type of controlled collection has allowed police unparalleled access to biometric data on known and suspected criminals. In many countries, such as Mexico and Spain, criminals are not only photographed and fingerprinted, but controlled voice samples are also collected as part of the intake process.
By building large databases of voices, similar to DNA and fingerprint databases, law enforcement and intelligence agencies can attempt to identify individuals primarily based on their voices, then use other investigatory methods to determine if the individual is a fit. This is helpful when dealing with issues such as phoned-in bomb threats or terrorist videos posted on the Web.
From a link analysis perspective, the use of voice biometrics can become key when working on issues ranging from organized crime to terrorism. When a suspect places a call, target identification can confirm that the suspect is the one placing the call and attempt to determine who the called party is. Even if the called party cannot be identified, his voice can be used to check other recorded calls for a match. So, known person A talks to unknown person B. Unknown person B talks to unknown person C. Unknown person C is linked to a violent crime. Now, investigators can go to person A to get access to person B, who can link the investigator to person C.
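The A-to-B-to-C chain described above is, in effect, a shortest-path search over a graph of voices that have shared a call. The following is a minimal sketch of that idea, not any vendor's implementation; the function name link_chain and the sample call list are hypothetical, and it assumes each recorded call has already been reduced to a pair of speaker labels by the voice biometric engine.

```python
from collections import deque

def link_chain(calls, start, target):
    """Return the shortest chain of speakers linking start to target, or None."""
    # Build an undirected contact graph: an edge means two voices shared a call.
    graph = {}
    for a, b in calls:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    # Breadth-first search finds the chain with the fewest intermediaries.
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

# Hypothetical call records: known person A, unknown voices B, C, and D.
calls = [("A", "B"), ("B", "C"), ("A", "D")]
print(link_chain(calls, "A", "C"))
```

Here the search returns the chain A, B, C, which is the route an investigator would follow from the known suspect to the person linked to the crime.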
In the U.S., working in forensic labs to create testimony that can be used in court is not possible because no single voice biometric technology has passed the Daubert standards. However, in many countries around the world, the use of voice biometrics by a trained acoustic forensic scientist is permitted as court testimony.
In these cases, the voice biometric engine is not used in an automated manner. Instead, the engine is run manually, where scientists perform inter-session and intra-session variability checks, taking multiple voice samples of a suspect and determining how the voice changes within a recording and between recordings. These voice samples are then compared to a population of individuals considered by the forensic scientist to be similar in nature to the suspect—typically based on nationality, gender, age, relative health, regional dialect, education level, financial status, etc. Building the appropriate reference populations can take weeks or months.
Outside court and Daubert-sanctioned voice biometrics is the world of pre-forensics or forensic investigation. Here, forensic technicians and scientists use forensic techniques to determine whether someone is a suspect. Speed is of the essence. Where forensic scientists in a court case might have months to prepare evidence, a forensic investigator might have minutes to take a voice sample and confirm it’s the suspect.
Loquendo has taken an innovative approach by building a series of standard targeted reference populations for its pre-forensic tool (Loquendo Voice Investigation System–Pre-Forensic). The system comes preloaded with more than 60 reference populations broken down by language, gender, country/region, and probe type. If a forensic investigator has the voice of a Mexican male, collected by phone, speaking Spanish, then instead of having to find 50 similar individuals to build a reference population, the investigator can select the Mexican-male-Spanish-telephony reference population and produce a good likelihood ratio. So, where the forensic scientist might arrive at a 150,000-to-1 likelihood of a match for court testimony, the forensic investigator can say to a judge issuing a warrant, “Given that this is a Mexican male, we are comfortable saying there is a 120,000-to-1 likelihood that this is the suspect.” This can be done in minutes and can provide results that can be used to get a court order for an arrest.
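To make the role of a reference population concrete, here is a rough sketch of one common way to turn a raw comparison score into a likelihood ratio: model the scores the suspect produces against his own voiceprint, model the scores the reference population produces against that voiceprint, and divide the two densities at the observed score. This is an illustration under simple Gaussian assumptions, not a description of how the Loquendo product works. The lookup key, the function name, and every score below are invented for the example.

```python
from statistics import NormalDist

# Hypothetical preloaded reference populations, keyed by
# (language, gender, country/region, probe type). Scores are invented.
REFERENCE_POPULATIONS = {
    ("es", "male", "MX", "telephony"): [0.5, 1.8, 1.1, 2.4, 0.9, 1.6],
}

def likelihood_ratio(score, same_speaker_scores, reference_scores):
    """Density of the score under 'same speaker' over density under 'someone
    else from the reference population', with each modeled as a Gaussian."""
    same = NormalDist.from_samples(same_speaker_scores)
    other = NormalDist.from_samples(reference_scores)
    return same.pdf(score) / other.pdf(score)

# Invented scores from comparing the suspect's own samples to his voiceprint.
suspect_scores = [2.6, 3.4, 3.0, 3.8, 2.9]
reference = REFERENCE_POPULATIONS[("es", "male", "MX", "telephony")]

lr = likelihood_ratio(3.1, suspect_scores, reference)
print(f"likelihood ratio: {lr:.1f} to 1")
```

A ratio above 1 favors the suspect and a ratio below 1 favors someone else in the population. Choosing a narrower or broader reference population changes the denominator, which is why the same recording can yield different ratios for the forensic scientist and the forensic investigator.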
Instead of discussing the specifics of any given voice biometric algorithm, it makes more sense to discuss where we, as an industry, need to start focusing to better handle all of these use cases. Loquendo has been actively working to improve both the core biometric engine and these other ancillary features:
Calibration Testing: Operators need to be able to submit audio samples collected using a new probe or method, and the system needs to be able to determine whether it is well calibrated for this acquisition method. If not, it needs to prescribe a course of action to recalibrate the system.
Channel Handling: Channel mismatch occurs when a sample recorded on a microphone is compared to a sample recorded using an air probe, VHF probe, or hands-free speakerphone. It can be addressed in the core voice biometric algorithms or with updated normalization strategies.
Speed: To better handle massive surveillance and identification projects, speed becomes critical. If the system is slow, it becomes functionally unusable in many cases.
Language Independence: When individuals are recorded at a police station, they may speak one language, but when their voice is intercepted during surveillance, they might be speaking another.
Voice Change: Being able to determine when the voices change in a stream or an audio file is critical. Detecting that the phone has been passed to a third party is an important trigger for deciding whether the call needs to be listened to.
Key Word Spotting: Of course, implementing key word spotting in a vacuum can be an issue. “I will detonate this bomb” and “My girlfriend is the bomb” both include the word bomb, but the semantics are different. Semantic processing of key words, potentially combined with emotional analysis, becomes a key technology that needs to be developed.
Language Identification: By identifying the language being spoken, the correct live listener can be brought into the call. It doesn’t make sense to have a Pashto translator listening to a call in Urdu. It also can be used for surveillance profiling. If you are monitoring Chinese drug gangs and the system detects Mexican Spanish being spoken, that might introduce a person of interest to monitor. Also, gender identification can be similarly helpful.
Anti-Spoofing: Determining if a voice is live, synthetic, or somehow manipulated has traditionally been the function of the access control space. However, in the surveillance space, a synthetic or digitally modified voice might also raise concern. If an automated surveillance system notices two people speaking with obvious digital manipulation, that could be a signal to start monitoring. Different types of analog and digital manipulation can also carry signatures. Though the original voice pattern might not be able to be extracted, it could be possible to determine what kind of manipulation method was used.
Signature Detection: Every device that transforms or records a voice leaves a signature. Determining what types of phones/microphones are being used can be helpful for forensic investigators to identify a suspect.
Better End-Pointing: Unless you’re taking a call in a sterile room without background noise, there are going to be aberrations in any audio sample. Vendors need to improve their capabilities to identify in an audio sample which parts are speech-related and which parts are environmental.
Real Time: For many use cases, audio can’t be buffered or recorded. Loquendo’s sliding-window method avoids the need to buffer a live stream to perform voice biometric analysis.
Support Field Devices: Military, law enforcement, and intelligence agencies want access to self-contained voice biometric systems that can be used on a handheld device without network connectivity. Algorithms and capabilities that were designed for high-powered server environments need to be redesigned to work on mobile platforms.
Shorter Utterances: Sometimes only a small amount of audio can be provided for analysis. Vendors need to look at how to reduce the amount of audio needed to build a comprehensive voiceprint or perform an analysis. This can also help in very noisy environments where the signal-to-noise ratio might be bad but small snippets of audio are clean enough to be processed.
Likelihood Ratios: Voice biometric vendors typically report their results as a confidence score or percentage. These numbers are only useful in a relative sense. Is 3.1 good or bad? It’s better than 2.0 and not as good as 4.0. Vendors need to start using a Bayesian approach to likelihood ratios, reporting values such as “a 20,000-to-1 likelihood that the voice is a match for the target versus a random sampling of other voices from around the world” or “a 50,000-to-1 likelihood that the voice is a match for the target versus a random sampling of other Mexican males.”
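One reason a likelihood ratio is more useful than a raw score is that it plugs directly into Bayes' rule: posterior odds equal prior odds multiplied by the likelihood ratio. The short sketch below shows the arithmetic. The function name and the two prior probabilities are assumptions chosen purely for illustration; in practice the prior comes from the rest of the investigation, not from the biometric engine.

```python
def posterior_probability(likelihood_ratio, prior_probability):
    """Apply Bayes' rule in odds form and return the posterior probability."""
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# The same 20,000-to-1 likelihood ratio under two assumed priors.
# One voice in a million (for example, an anonymous caller in a large city):
wide_net = posterior_probability(20_000, 1 / 1_000_000)
# One suspect in a hundred (other evidence has already narrowed the field):
narrow_net = posterior_probability(20_000, 1 / 100)

print(f"prior 1 in 1,000,000 -> posterior {wide_net:.3f}")
print(f"prior 1 in 100       -> posterior {narrow_net:.3f}")
```

The same ratio yields a posterior of roughly 2 percent in the first case and above 99 percent in the second, which is why the ratio should be reported alongside the population it was measured against, and why it supports rather than replaces other investigatory methods.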