By Judith Markowitz - Posted Jan 14, 2003
In November, an international stir followed Al-Jazeera’s release of a tape purported to be of Osama bin Laden. Al-Jazeera is the satellite television news network based in the Persian Gulf nation of Qatar. The measures required to authenticate a recording as politically charged and as important to international security as the bin Laden tape are significantly different from those used in most everyday access-security systems.
Most access-security applications have multiple users whose voices must be processed within seconds after they present themselves for authentication. Their voices are transmitted over a single channel (e.g., wireline or wireless telephone), and the users generally know they are interacting with an automated system. The system may restrict what they say for authentication (e.g., a password), invoke backup measures (e.g., secondary passwords), and test for “liveness” (as opposed to tape recordings).
In contrast, the bin Laden tape contained the recorded voice of an individual speaking freely to a human audience. The recording itself was of poor quality. It had apparently been transmitted by phone at one point and re-recorded many times before it was played by Al-Jazeera. Consequently, the tape required slow, careful analysis by human experts assisted by computers.
Content And Quality
The speaker in the bin Laden tape made references to recent events, including the bombing in Bali, the hostage siege in Moscow, the killing of a U.S. soldier in Kuwait, the assassination of an American diplomat in Jordan, and the bomb attack against a French oil tanker off the coast of Yemen. The analysis, therefore, included attempts to determine whether the words and phrases concerning those events had been interpolated into older bin Laden recordings. Linguistic analysis of such a recording includes verifying that the speaker is using the correct Arabic dialect, employing a bin Laden style of oratory, and exhibiting acoustic patterns that match other bin Laden recordings.
Stylistic elements include preference for certain words, speed of articulation, dynamics, idiosyncratic articulation and/or intonation patterns, and even characteristic fillers (e.g., “uh,” “see”). A speaker may, for example, routinely pronounce the word “didn’t” as “dint,” “didint,” or “din.” Some speakers frequently end sentences on a rising pitch, making statements sound like questions. Such patterns are compared with authenticated recordings of bin Laden.
Good mimics can imitate the style of an individual, but they don’t have the physiology of that person. Consequently, authentication, whether by human experts or automated tools, examines acoustic patterns that contain information about the size and shape of the speaker’s throat, mouth, nose, etc. The use of such features makes it difficult for professional mimics to fool speaker-authentication systems. The noise and distortion of the bin Laden tape made analysis difficult because they affected those features. The challenge in such cases is to eliminate as much noise as possible without removing or further distorting the acoustic patterns needed for authentication.
Live or TTS
Could the bin Laden tape have been created using concatenated text-to-speech synthesis (TTS) or voice conversion technology? Voice conversion transforms the voice of one person into someone else’s voice. For example, it would make Judith Markowitz’s voice sound like the voice of Humphrey Bogart. Today, conversions produced by such systems may be recognizable as the target speaker’s voice, but they often sound stilted and unnatural. “They sound artificial,” says Dr. Caroline Henton, president of Talknowledgy (see “The State of TTS,” this issue). “The problem is that many so-called voice conversion systems are based on the same limited rules as parametric TTS systems such as DECtalk use.”
Bin Laden would get better results using commercial concatenative TTS.
In order to generate flexible, natural-sounding TTS, though, he’d have to spend a minimum of ten hours in a professional recording studio providing high-quality samples of his speech. The recorded material would be segmented into labeled units and stored in a large database. It might be possible to use existing tapes of bin Laden’s voice for this purpose, but they would lack necessary acoustic variants. They also wouldn’t have sufficient consistency in quality, volume, and the other factors necessary to produce units that, when concatenated, sound as if they were spoken naturally and at the same time.
According to Henton, mismatches of this sort could be covered up. “You could hide any acoustic artifacts of the concatenation process by having a sufficiently noisy-enough channel, which is typical of bin Laden’s speeches.” Unfortunately for a would-be forger, the resulting speech would fail to reproduce the emotional nature of bin Laden’s speeches, which are designed to stir followers into taking violent action. “Current high-end TTS systems are good, but I haven’t heard any synthetic speech system that could reproduce the hectoring and invective in his speeches,” says Henton. “Besides that, human intervention is needed to tweak the occasional artificial-sounding bubble in a synthetic utterance.”
It’s unlikely that these technologies were used in the November 2002 bin Laden tape, but we shouldn’t eliminate them from consideration in the future.
Dr. Judith Markowitz is the associate editor of Speech Technology Magazine and a leading independent analyst in the speech technology and voice biometric fields. She can be reached at (773) 769-9243 or jmarkowitz@pobox.com.
