Exclusive Book Excerpt: Designing Better Speaker Verification
Posted Sep 1, 2011

Though automated and semi-automated speech analysis and identification technologies have massive potential within law enforcement, forensics, and intelligence, adoption has been slow and sporadic. This is partly due to poor experiences with previous generations of voice biometric technologies and a cultural misperception that voice biometrics can be “spoofed.”

Voice biometric technology vendors have also contributed to the problem by producing products that fail to address what implementers actually need. There are critical problems that only voice biometrics can solve, but positioning the solutions well requires a deep understanding of the nature of government implementations, one that escapes too many vendors.

The voice biometric industry faces a challenge that is unique to voice. Fingerprint and iris analysis, for example, were developed primarily to identify individuals. As automated biometric technologies were developed, analysis moved from human observable details to minutiae that required specialized equipment. Even biometric technologies, such as vein scanners and DNA analysis, work on minutiae that, with the right equipment, essentially become observable. Voice biometrics is not visually observable. There can be visual representations, but you can’t look at a person’s voice—even with the most powerful electron microscope.

Fingerprints and DNA are accepted for use in forensics because there is a belief that you can observe the exact physical attribute and that, with enough time, an expert could manually view a set of samples and see what matches and what doesn’t. This level of perceived certainty allows these technologies to be used in courts under the Daubert standards (which allow attorneys to file a motion to exclude the presentation of unqualified evidence to the jury). However, because voice can never be “seen,” it will take significant effort for voice biometrics to meet the Daubert standards and become admissible in court.

Even outside of court proceedings, though, there are significant uses for voice biometrics in the investigatory and forensic communities. But, if we’re going to get these groups to accept voice biometrics, we need to better understand how the technologies will ultimately be used. In a broad sense, we can set a number of categories in which voice biometrics can be useful: access control, surveillance, target identification, and forensics. 

Access Control

Access control is a very simple application and one of the most common use cases for voice biometrics. In the majority of applications, a number of conditions combine to produce a high rate of correct acceptance with few false accepts and false rejects. The primary reason is the psychology of the interaction. When someone registers a voice for a telebanking application, she wants to provide a good voiceprint to make authentications more successful. With access control applications, the caller has a specific desire to be understood by the system so she can quickly get what she wants.

Surveillance

Surveillance is one of the key use cases where voice biometrics can be extremely successful. Voice biometrics simply becomes a plug-in for existing surveillance solutions. Barring some radical technology change, most law enforcement and intelligence agencies have already made their choices regarding the sort of probes they are going to use, what their biometric database structure is, and what kind of administrative tools they are going to allow users to access. To this point, the voice biometric engine needs to simply be a plug-in to these existing surveillance systems.

The benefit of using voice biometrics is twofold. First, because the system is fully automated, it can monitor multiple T1/E1 lines without regular human intervention. This means that mass monitoring can be performed. Instead of having to pick specific phone numbers for wiretaps, a cell tower, neighborhood, or community can have its calls run through a voice biometric surveillance system that searches for voice matches.

The second benefit is that, by its nature, voice biometrics removes the need for humans to listen to the audio, a practice that can be an affront to civil liberties. Aside from the sheer manpower required to listen to an entire community’s phone calls, civil libertarians can bring clear civil (and possibly criminal) complaints against indiscriminate monitoring. The voice biometric system simply collects calls and determines whether there is a suspected voice match. If there is, further action can be taken: the call could be recorded, a live listener could be alerted, or the phone number could be flagged for a wiretap.

Because surveillance is typically planned, the probe types can be well defined before surveillance begins, allowing a solution to be well calibrated before it starts processing live audio. Uncalibrated systems or probe types are typically very large contributors to unsuccessful voice biometric implementations.

Target Identification

For decades, police departments have been photographing suspected criminals and collecting fingerprints in the police station. This type of controlled collection has allowed police unparalleled access to biometric data on known and suspected criminals. In many countries, such as Mexico and Spain, criminals are not only photographed and fingerprinted, but controlled voice samples are also collected as part of the intake process. 

By building large databases of voices, similar to DNA and fingerprint databases, law enforcement and intelligence agencies can attempt to identify individuals primarily based on their voices, then use other investigatory methods to determine if the individual is a fit. This is helpful when dealing with issues such as phoned-in bomb threats or terrorist videos posted on the Web.

From a link analysis perspective, the use of voice biometrics can become key when working on issues ranging from organized crime to terrorism. When a suspect places a call, target identification can confirm that the suspect is the one placing the call and attempt to determine who the called party is. Even if the called party cannot be identified, his voice can be used to check other recorded calls for a match. So, known person A talks to unknown person B. Unknown person B talks to unknown person C. Unknown person C is linked to a violent crime. Now, investigators can go to person A to get access to person B, who can link the investigator to person C.
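
The A-to-B-to-C chain described above can be sketched as a search over a graph of voice-matched calls. This is a minimal illustration, not any vendor's method: the function name `find_chain`, the speaker labels, and the assumption that each call has already been reduced to a pair of matched speaker identities are all hypothetical.

```python
from collections import deque

def find_chain(calls, start, goal):
    """Return the shortest chain of speakers linking start to goal, or None."""
    # Build an undirected graph: each call links its two parties.
    graph = {}
    for a, b in calls:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    # Breadth-first search finds the chain with the fewest intermediaries.
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

# Hypothetical call records: (caller, called party) after voice matching.
calls = [("A", "B"), ("B", "C"), ("A", "D")]
print(find_chain(calls, "A", "C"))
```

Here known person A reaches person C through person B, which mirrors the investigative path in the paragraph above. In practice each link would carry a match score rather than a certain identity.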

Forensics

In the U.S., working in forensic labs to create testimony that can be used in court is not possible because no single voice biometric technology has passed the Daubert standards. However, in many countries around the world, the use of voice biometrics by a trained acoustic forensic scientist is permitted as court testimony. 

In these cases, the voice biometric engine is not used in an automated manner. Instead, the engine is run manually, where scientists perform inter-session and intra-session variability checks, taking multiple voice samples of a suspect and determining how the voice changes within a recording and between recordings. These voice samples are then compared to a population of individuals considered by the forensic scientist to be similar in nature to the suspect—typically based on nationality, gender, age, relative health, regional dialect, education level, financial status, etc. Building the appropriate reference populations can take weeks or months.

Outside court and Daubert-sanctioned voice biometrics is the world of pre-forensics or forensic investigation. Here, forensic technicians and scientists use forensic techniques to determine whether someone is a suspect. Speed is of the essence. Where forensic scientists in a court case might have months to prepare evidence, a forensic investigator might have minutes to take a voice sample and confirm it’s the suspect. 

Loquendo has taken an innovative approach of building a series of standard targeted reference populations for their pre-forensic tool (Loquendo Voice Investigation System–Pre-Forensic). The system comes preloaded with more than 60 reference populations broken down by language, gender, country/region, and probe type. If a forensic investigator has the voice of a Mexican male, collected by phone, speaking Spanish, instead of having to find 50 similar individuals to build a reference population, the investigator can select the Mexican-male-Spanish-telephony reference population and produce a good likelihood ratio. So, where the forensic scientist can drive down to the 150,000-to-1 likelihood that this is a match as court testimony, the forensic investigator can say to a judge issuing a warrant “Given that this is a Mexican male, we are comfortable saying there is a 120,000-to-1 likelihood that this is the suspect.” This can be done in minutes and can provide feedback that can be used to get a court order for an arrest.

Moving Forward

Instead of discussing the specifics of any given voice biometric algorithm, it makes more sense to discuss where we, as an industry, need to start focusing to better handle all of these use cases. Loquendo has been actively working to improve both the core biometric engine as well as these other ancillary features:

Calibration Testing: Operators need to be able to submit audio samples collected using a new probe or method, and the system needs to determine whether it is well calibrated for that acquisition method. If it is not, the system needs to prescribe a course of action for recalibration.

Channel Handling: Channel mismatch occurs when a sample recorded on a microphone is compared to a sample recorded using an air probe, VHF probe, or hands-free speakerphone. It can be addressed in the core voice biometric algorithms or with updated normalization strategies.

Speed: To better handle massive surveillance and identification projects, speed becomes critical. If the system is slow, it becomes functionally unusable in many cases. 

Language Independence: When individuals are recorded at a police station, they may speak one language, but when their voice is intercepted during surveillance, they might be speaking another. 

Voice Change: Being able to determine where in a stream or an audio file the voice changes is critical. Detecting that the phone has been passed to a third party is an important trigger for deciding whether the call needs to be listened to.

Key Word Spotting: Of course, implementing key word spotting in a vacuum can be an issue. “I will detonate this bomb” and “My girlfriend is the bomb” both include the word bomb, but the semantics are different. Semantic processing of key words, potentially combined with emotional analysis, becomes a key technology that needs to be developed.

Language Identification: By identifying the language being spoken, the correct live listener can be brought into the call. It doesn’t make sense to have a Pashto translator listening to a call in Urdu. It also can be used for surveillance profiling. If you are monitoring Chinese drug gangs and the system detects Mexican Spanish being spoken, that might introduce a person of interest to monitor. Also, gender identification can be similarly helpful.

Anti-Spoofing: Determining if a voice is live, synthetic, or somehow manipulated has traditionally been the function of the access control space. However, in the surveillance space, a synthetic or digitally modified voice might also raise concern. If an automated surveillance system notices two people speaking with obvious digital manipulation, that could be a signal to start monitoring. Different types of analog and digital manipulation can also carry signatures. Though the original voice pattern might not be able to be extracted, it could be possible to determine what kind of manipulation method was used.

Signature Detection: Every device that transforms or records a voice leaves a signature. Determining what types of phones/microphones are being used can be helpful for forensic investigators to identify a suspect.

Better End-Pointing: Unless you’re taking a call in a sterile room without background noise, there are going to be aberrations in any audio sample. Vendors need to improve their capabilities to identify in an audio sample which parts are speech-related and which parts are environmental.

Real Time: For many use cases, audio can’t be buffered or recorded. Loquendo’s sliding-window method avoids the need to buffer a live stream to perform voice biometric analysis.

Support Field Devices: Military, law enforcement, and intelligence agencies want access to self-contained voice biometric systems that can be used on a handheld device without network connectivity. Algorithms and capabilities that were designed for high-powered server environments need to be redesigned to work on mobile platforms.

Shorter Utterances: Sometimes only a small amount of audio can be provided for analysis. Vendors need to look at how to reduce the amount of audio needed to build a comprehensive voiceprint or perform an analysis. This can also help in very noisy environments where the signal-to-noise ratio might be bad but small snippets of audio are clean enough to be processed.

Likelihood Ratios: Voice biometric vendors typically report their results as a confidence score or percentage. These numbers are only useful in a relative sense. Is 3.1 good or bad? It’s better than 2.0 and not as good as 4.0. Vendors need to start using a Bayesian approach to likelihood ratios, reporting values such as “a 20,000-to-1 likelihood that the voice is a match for the target versus a random sampling of other voices from around the world” or “a 50,000-to-1 likelihood that the voice is a match for the target versus a random sampling of other Mexican males.”
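
To make the likelihood-ratio idea concrete, here is a minimal sketch of one common way to turn a raw score into a ratio: divide how probable the score is if the speaker is the target by how probable it is if the speaker comes from a reference population. It assumes, purely for illustration, that both score distributions are Gaussian. The function names and every mean and standard deviation below are invented, not taken from any real system.

```python
import math

def gaussian_pdf(x, mean, std):
    """Probability density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def likelihood_ratio(score, same_mean, same_std, ref_mean, ref_std):
    """p(score | same speaker) divided by p(score | reference population)."""
    return gaussian_pdf(score, same_mean, same_std) / gaussian_pdf(score, ref_mean, ref_std)

score = 3.1  # raw engine score, meaningless on its own

# Hypothetical score statistics: scores the target produces against himself,
# and scores that two different reference populations produce against him.
lr_world = likelihood_ratio(score, same_mean=3.0, same_std=1.0, ref_mean=0.0, ref_std=1.0)
lr_similar = likelihood_ratio(score, same_mean=3.0, same_std=1.0, ref_mean=1.0, ref_std=1.0)

print(f"versus a broad population:   {lr_world:.1f} to 1")
print(f"versus a similar population: {lr_similar:.1f} to 1")
```

The same raw score yields a different ratio depending on the reference population it is compared against, which is why the reported figure should always name that population.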

Building the Short List

Though not a technology requirement, one key element of success is understanding how to create a target list from a set of results. Four strategies are typically used:

• Single Threshold: any response over a threshold is considered a possible match, and any response under that threshold is considered not a likely match;

• Dual Threshold: two thresholds are set, where anything above the top threshold is a likely match, anything below the lower threshold is not a likely match, and anything in the middle is a possible match;

• N-Best: an implementer wants the N closest matches to the target voice; and

• Full List: when all results are returned to an operator.

Each strategy has pros and cons. Threshold-based systems require calibration against live test audio to determine where the thresholds should be set. N-Best lists can typically provide short lists that include the target speaker in a high number of cases, but they can be thrown off if there is no match in the system or if many targets score similarly (for example, if N is set to 10 but 30 results all score very similarly). Full lists give a trained operator more granular control but can be unwieldy if there are more than 20 possible targets to test against.
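
The four short-list strategies can be sketched in a few lines each. This is an illustrative sketch only: it assumes the engine returns a mapping from candidate name to score where higher means a closer match, and the function names, sample scores, and threshold values are hypothetical.

```python
def single_threshold(scores, threshold):
    """Candidates at or above the threshold are possible matches."""
    # Treating a score equal to the threshold as a match is an assumption.
    return [name for name, s in scores.items() if s >= threshold]

def dual_threshold(scores, low, high):
    """Split candidates into likely, possible, and unlikely matches."""
    result = {"likely": [], "possible": [], "unlikely": []}
    for name, s in scores.items():
        if s > high:
            result["likely"].append(name)
        elif s < low:
            result["unlikely"].append(name)
        else:
            result["possible"].append(name)
    return result

def n_best(scores, n):
    """The n highest-scoring candidates, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]

def full_list(scores):
    """Every candidate with its score, best first, for a trained operator."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical engine output for one probe.
scores = {"spk1": 4.2, "spk2": 3.1, "spk3": 1.0, "spk4": 3.0}

print(single_threshold(scores, 3.05))
print(dual_threshold(scores, low=2.0, high=4.0))
print(n_best(scores, 2))
print(full_list(scores))
```

Note that `n_best` always returns n names even when no candidate is a real match, and it cuts arbitrarily through a cluster of near-identical scores. Those are the two weaknesses described above.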

It is important to use logic when setting thresholds or N-Best lists. If a target is of very high importance, you might tolerate false matches if doing so guards against false rejects. Conversely, for a low-priority target, occasionally failing to identify the speaker may be preferable to tracking down possible false matches.

For integrated solutions, other pieces of information can be fed into a decision engine. For example, law enforcement and intelligence services typically have full dossiers on suspects that list known whereabouts. If you get a voice in Chicago that matches the voice of someone known to be in federal lockdown in Miami, it’s probably safe to exclude that as a match. In a full solution, it can be much more valuable to send all results to an application that can apply these types of rules instead of fixing a threshold in the biometric engine.
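
A rule of this kind might look like the following sketch, which passes every engine result to a downstream filter instead of fixing a threshold inside the engine. The `Match` record, its `confined_at` field, and the function name are hypothetical stand-ins for whatever dossier data an agency actually holds.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Match:
    speaker: str
    score: float
    # Hypothetical dossier field: the city where the person is known to be
    # held, or None if the person's whereabouts are not constrained.
    confined_at: Optional[str] = None

def apply_dossier_rules(matches: List[Match], capture_city: str) -> List[Match]:
    """Drop candidates who are known to be confined somewhere else."""
    kept = []
    for m in matches:
        if m.confined_at is not None and m.confined_at != capture_city:
            continue  # cannot plausibly be the speaker on this call
        kept.append(m)
    return kept

matches = [
    Match("suspect_1", 4.1, confined_at="Miami"),
    Match("suspect_2", 3.8),
]
remaining = apply_dossier_rules(matches, capture_city="Chicago")
print([m.speaker for m in remaining])
```

Keeping the rule outside the biometric engine means the higher-scoring but impossible candidate is removed without having to raise a threshold that might also discard the plausible one.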

If the strategy for rolling out the technology is flawed, even a technically successful deployment could be considered a failure. Though books have been written about how to successfully manage complex technology deployments, it’s worth keeping a simple acronym in mind: SPEC. 

Scope: Fully understand how a solution will be used and what will make it successful. It’s almost impossible to spend too much time scoping out a project. Will the voice biometric solution work independently or be integrated into a multi-biometric searching system? What sort of probes will be used, and what could cause new probes to be introduced? What is the overall charter of the project? It’s important to level-set the implementer regarding what can be achieved with the technology and to ensure that they understand what it will take to properly integrate and deploy a voice biometric solution.

Scoping a project is not passive. As much as vendors need to extract information from the implementers, it’s also important for vendors to relay the best way for the technology to be implemented. Often, what seems like an unreasonable technical goal might be a misstatement of a reasonable operational goal. 

Prototype: Never deploy a system without building a prototype to evaluate the technology’s performance in a series of near-live environments. Depending on the complexity of the final integrated solution and if it will be deployed in a classified environment, sometimes two or more prototypes are necessary. 

Execute: Building a proper rollout plan is critical for success. Does the system need to be calibrated using live audio? How many units should be deployed as a beta before certifying that everything is working properly? Will you have access to live performance data to ensure the system is working properly? In many use cases, based on clearance levels, the vendor might never be able to gain access to live audio samples or get final specifications on the deployment hardware. In these cases, you need to determine if it is necessary to build an execution plan that includes a pre-deployment simulation for calibration or system training.

Control: When scoping a project, it is important to build the success criteria not solely based on technology goals, but also on operational system goals. Once the system has been deployed, it is important to determine a point in time where the solution will be evaluated to determine if goals have been met. 

Worldwide, the case for voice biometrics in investigatory, forensic, and judicial processes has been made. As international precedents have been set, what holds back implementations in the United States is the voice biometric community itself. We speak to customers with the guarded voice of a researcher, not the confident voice of a vendor. Though voice biometrics is not infallible, no statistical identification method is. Fingerprints, DNA, and iris scanning all have acceptable levels of tolerance for errors. These levels are set not by the technologists but by the implementers. As an industry, we no longer can confuse voice biometric accuracy with speaker identification’s utility.

(This excerpt was lightly edited for space reasons.)

ABOUT THE BOOK:

Forensic Speaker Recognition: Law Enforcement and Counter-Terrorism (released September 1) is an anthology of the research findings of 35 speaker recognition experts from around the world. The volume provides a multidimensional view of the science involved in determining whether a suspect’s voice matches forensic speech samples, collected by law enforcement and counter-terrorism agencies, that are associated with the commission of a terrorist act or other crimes. The challenges of forensic casework are explored, along with such issues as handling speech signal degradation, analyzing features of speaker recognition to optimize voice verification system performance, and designing voice applications that meet the practical needs of law enforcement and counter-terrorism agencies. A running theme is how the rigors of forensic utility are demanding new levels of excellence in all aspects of speaker recognition. The contributors are scientists in speech engineering and signal processing, and their work represents such diverse countries as Switzerland, Sweden, Italy, France, Japan, India, and the United States. 

The above chapter was written by Avery Glasser, managing partner at Flecture, a provider of management consulting, solution design, and vendor representation for clients with a specialization in surveillance and identification for law enforcement and intelligence agencies.