Biometric in Less Than 40 Bytes

A speaker verification biometric must satisfy three key technical requirements if it is to prove successful in the real world. It must be robust, discriminative and secure.

Robust in the sense that it characterizes the speaker at all times regardless of how much the voice varies or how severe the background noise is. (It must recognize the voice of John Doe under all conditions.)

Discriminative enough to allow each speaker to be uniquely identified. (It cannot mistake John Doe for Jim Smith.)

Secure enough to offer immunity from theft, rendering it useless to criminals even if stolen or compromised. (It should be impossible for Jim Smith to get into John Doe's bank account with it.)

Many conventional approaches fall short of meeting one or more of these requirements. Despite considerable investment, many current solutions commonly exhibit a variety of shortfalls, including: large demands on memory and computational power; the need for time alignment; lengthy registration and training procedures; susceptibility to benign traumas that distort speaker diction; vulnerability to background noise and communication-channel impairments; slow verification response times; and architectural inflexibility.

This article describes a different approach, a new Massively-Parallel Network Architecture (MPNA) that has the potential to provide high levels of robustness, speaker discrimination and security using a speaker biometric of less than 40 bytes. Being so small, this configuration opens up the real prospect of allowing speaker verification to be performed using simple, low-memory "dumb" cards (for example, magnetic-stripe cards).

The proposed strategy dispenses with the need for over-complicated training procedures and offers an interrogation response in under one second. The architecture also minimizes system complexity, obviating the need for large communication infrastructures and centralized databases currently demanded by conventional world-wide card-based transaction systems.

The application of Time Encoded Signal Processing and Recognition (TESPAR) waveform coding procedures to multiple orthogonal Fast Artificial Neural Networks (FANNs) as a means of producing text-dependent speaker verification has already been shown to be highly effective. (See Figures 1 and 2 below.)

The main aim of the trial under discussion here, which required 16 man-months of effort, was to produce a reliable biometric for Portable Secure Objects, such as high performance state-of-the-art Smart Cards.

Using supervised registration procedures, the verification performance obtained from processing a 218-speaker database comprising 150 males and 68 females, in which each speaker said the phrase "Sir Winston Churchill" on 20 separate occasions, was as follows:

… 0 x False reject errors out of 4360 interrogations.
… 4 x False accept errors out of 2616 interrogations.

An embodiment of this verification system, using a small set of biometric networks, can be readily implemented on a single chip or card in 1-2 kbytes (see Figure 3). This performance compares favorably with that offered by competing methods.

As a result, TESPAR/FANN technology is being used to provide the biometric capability required in the European Union CASCADE Esprit Smart Card project, the objective of which is to develop a 32-bit RISC processor 20 square mm in area for a new generation of Smart Card and secure Pocket Intelligent Device applications (see Figure 4).

TESPAR/FANN Technology

TESPAR is a new simplified digital language, first proposed for coding speech, although the process is equally amenable to all band-limited signals. TESPAR is based on a precise mathematical description of all waveforms, involving polynomial theory, which shows how a band-limited signal can be completely described in terms of the locations of its real and complex zeros.

Given the real and complex zero locations of the signal, a vector quantization procedure has been developed to code these data into a small series of discrete numerical descriptors, typically around 30 (the TESPAR symbol alphabet).

The TESPAR code generates a simple numerical symbol stream that may be converted into a variety of progressively more informative matrix data structures. For example, the single-dimension vector S-matrix is a histogram recording the frequency with which each TESPAR-coded symbol occurs in the data stream. A more discriminating data set is the two-dimensional A-matrix, which is formed from the frequencies of symbol pairs that need not be adjacent. Typical S and A matrices are shown in Figure 1.
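The two matrix structures can be sketched as follows. This is an illustrative construction, not DDL's implementation: the alphabet size of 30, the toy symbol stream, and the pair separation ("lag") are assumptions for demonstration only.

```python
import numpy as np

ALPHABET = 30  # typical TESPAR symbol alphabet size (~30 descriptors)

def s_matrix(symbols):
    """One-dimensional S-matrix: a histogram of symbol frequencies."""
    s = np.zeros(ALPHABET, dtype=int)
    for sym in symbols:
        s[sym] += 1
    return s

def a_matrix(symbols, lag=1):
    """Two-dimensional A-matrix: frequencies of symbol pairs separated
    by `lag` positions in the stream (pairs need not be adjacent)."""
    a = np.zeros((ALPHABET, ALPHABET), dtype=int)
    for first, second in zip(symbols, symbols[lag:]):
        a[first, second] += 1
    return a

stream = [3, 7, 3, 12, 7, 3]      # toy TESPAR symbol stream
print(s_matrix(stream)[3])        # symbol 3 occurs 3 times
print(a_matrix(stream)[7, 3])     # pair (7, 3) occurs twice at lag 1
```

Increasing the lag beyond 1 yields further A-matrix variants that capture longer-range symbol structure, which is what makes the A-matrix more discriminating than the simple histogram.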


The new technique of Massively Parallel Network Architectures (MPNA) is the product of research conducted by Domain Dynamics Limited (DDL) at Cranfield University, UK. MPNA embodies the immense power of multiple parallel artificial neural networks and data-fusion decision making to achieve the performance associated with a large number of trained networks in parallel.

In this technique, an ordered set of unique networks is trained in non-real time using speech material from a large number of arbitrary speakers who do not appear in the user population. Each network is unique in the sense that it is trained using a subset of speakers not used by any other of the N-1 networks. This gives rise to the requirement for a training database.

Each single network is assigned one input neuron for every cell in the working TESPAR matrix, and one separate output neuron for every speaker trained on, giving I input neurons and S output neurons per network. Typical system settings for an MPNA using S-matrices are N = 100, I = 29 and S = 8.

Figure 2 depicts a single network, with six hidden neurons, that forms part of this arrangement.
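A single network of this shape can be sketched as below. The weights here are random placeholders purely to make the sketch runnable; in the real system each of the N networks is pre-trained on its own subset of eight speakers. The tanh activation is an assumption, as the article does not specify one.

```python
import numpy as np

rng = np.random.default_rng(0)
I, H, S = 29, 6, 8   # inputs (S-matrix cells), hidden neurons, output speakers
W1, b1 = rng.standard_normal((H, I)), rng.standard_normal(H)
W2, b2 = rng.standard_normal((S, H)), rng.standard_normal(S)

def winner(s_vec):
    """Interrogate the network with an S-matrix vector; return the
    index of the highest-scoring output neuron (the 'winner')."""
    h = np.tanh(W1 @ s_vec + b1)   # hidden layer
    out = W2 @ h + b2              # one score per trained speaker
    return int(np.argmax(out))     # an index in 0..7

print(winner(np.ones(I)))          # some winner index in 0..7
```

Because each winner is one of eight speakers, its identity fits in 3 bits, which is what makes the compact biometric described next possible.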

During registration, users of the system utter a common phrase several times. Each utterance is converted into a TESPAR matrix and compared against each of the 100 single networks in turn. Each network responds with a winner (the highest-scoring output). This, in a sense, is the network's way of saying, "Which of the eight speakers on which I was trained most closely matches the person who is currently registering?" The identity of the winner is then encoded into 3 bits.

After all 100 networks have been interrogated, the net result is a series of 100 winners who most closely match the registering speaker, encapsulated as a pattern of 100 x 3-bit codes, roughly 38 eight-bit bytes. These data form the speaker biometric because they describe, to a very high probability, the numerical output profile likely to be generated by the registered user's voice input on subsequent interrogations.
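The arithmetic behind "less than 40 bytes" can be made concrete: 100 winners at 3 bits each is 300 bits, which packs into 38 bytes. The packing scheme below is an assumed one for illustration; the article does not specify the exact bit layout.

```python
def pack_winners(winners):
    """Pack a list of winner indices (each 0-7, i.e. 3 bits) into bytes."""
    bits = 0
    for w in winners:
        bits = (bits << 3) | w            # append 3 bits per winner
    n_bytes = (3 * len(winners) + 7) // 8  # 300 bits -> 38 bytes
    return bits.to_bytes(n_bytes, "big")

biometric = pack_winners([i % 8 for i in range(100)])
print(len(biometric))   # 38
```

At 38 bytes, the biometric fits comfortably within the capacity of a magnetic-stripe track, which is what enables the dumb-card deployment described later.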

During verification, users of the system are prompted to say the common phrase. The speech undergoes the same process of TESPAR matrix conversion and MPNA interrogation used during registration. Statistical comparisons are made between the resulting profile of winners and the valid user's biometric, after which a final verification decision is given.
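The article does not detail the statistical comparison, so the sketch below substitutes the simplest plausible decision rule: count how many of the 100 networks produce the same winner as the stored biometric, and accept above an assumed threshold. Both the rule and the threshold value are illustrative assumptions.

```python
def verify(stored_winners, fresh_winners, threshold=90):
    """Accept the claimant if enough of the 100 networks agree with
    the winner profile recorded in the card-based biometric."""
    matches = sum(a == b for a, b in zip(stored_winners, fresh_winners))
    return matches >= threshold

stored = [i % 8 for i in range(100)]
fresh = stored[:95] + [(w + 1) % 8 for w in stored[95:]]  # 95/100 agree
print(verify(stored, fresh))   # True: 95 >= 90
```

A real deployment would tune the threshold against the false-accept and false-reject rates reported earlier, trading one off against the other.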

The MPNA may be embodied in silicon as a general-purpose, low-cost, low-power biometric engine suitable for installation in terminals world-wide. The underlying TESPAR coding process is already available, both as a software algorithm and in a low-power ASIC design.

Conventional Approaches

Remote speaker verification requires the use of a remote central biometric database. Speech submitted for verification is collected at a local terminal and identity is claimed using a simple dumb card. These data must be transmitted to the remote site because this is where the verification is performed.

The disadvantages are heavy congestion and delays caused by many terminals seeking simultaneous access to a remote site, the susceptibility of transferred data to criminal interception and the requirement for complex network administration.

By installing a biometric database at every terminal, verification can be performed locally. This solves many of the problems associated with remote verification, but the limited capacity of each local biometric store restricts the number of users it can accommodate.

The use of smart cards, which store each speaker biometric and on which the verification itself can be performed, is perhaps the ideal solution to all these problems. But this still leaves the simple dumb card without a speaker verification capability.

The new TESPAR/FANN MPNA configuration encodes a reliable speaker biometric into less than 40 bytes, allowing it to fit comfortably onto either a simple dumb card or a smart card. Because each speaker biometric is card-based, the number of users that can be accommodated by the system is unlimited.

Verification in TESPAR/FANN MPNA systems is performed using a common pre-trained MPNA verification engine that requires no further modification during registration. This makes it possible to store the MPNA chip verification engine inexpensively in ROM. The use of a ubiquitous MPNA chip means world-wide verification coverage can now be supported from local terminals.

This architecture has several advantages:

… The configuration is applicable for use with simple dumb cards and smart cards.
… Coverage can be offered on a personal, local or world-wide basis.
… The deployment of a common world-wide silicon biometric engine significantly minimizes system installation and running costs.
… The need for a costly centralized management system and massive data storage requirements is eliminated.

A new MPNA has been presented that has the potential to provide significant levels of robustness and speaker discrimination using a speaker biometric of less than 40 bytes without degrading effectiveness. This MPNA offers several advantages over the conventional approaches.

T.C. Phipps and R.A. King work at Heaviside Laboratories 12, Cranfield University, Swindon, SN SLA, United Kingdom and can be reached by email: ddl@rmcs.cranfield.ac.uk.


