April 1, 2005
By Judith Markowitz Principal - J. Markowitz, Consultants
Q & A

James L. Flanagan, Recipient of the 2005 IEEE Medal of Honor

IEEE, the world's largest technical professional society, named James L. Flanagan, a pioneer in the areas of speech analysis, speech transmission, and acoustics, recipient of the 2005 IEEE Medal of Honor. The award celebrates Flanagan's sustained leadership and outstanding contributions in speech technology. Flanagan was one of the first researchers to see the potential of speech as a means for human-machine communication. He made seminal contributions to current techniques for automatic speech recognition (ASR), text-to-speech (TTS), and signal coding algorithms for telecommunications and voicemail systems, including voicemail storage, voice dialing and call routing. He created auto-directive microphone arrays for high-quality sound capture in teleconferencing and pioneered the use of digital computers for acoustic signal processing. He led researchers to a greater understanding of how the human ear processes signals and was responsible for the development of advanced hearing aids and improved voice communications systems. His work included the development of an electronic artificial larynx, playback recording systems for the visually impaired and automatic speech recognition to help the motor impaired. He was the director of the Information Principles Research Laboratory at Bell Laboratories and vice president for research and director of the Center for Advanced Information Processing at Rutgers University.

JM: How does it feel to win the IEEE Medal of Honor?
It's really thrilling. This award represents the approbation of the cream of the electrical engineering profession. I've devoted my career to this profession and the award is a stamp of approval of my work. It's a precious thing to have.

JM: When you began working on things like basic speech and speaker recognition, did you ever envision a time when they would come together as they have today?
Not when we first started, which for speech and talker recognition was around 1974. My first concern was with packet transmission and the goal was to get as much efficiency and economy for digital transmission as one could get - to utilize the network capacity to the fullest. The other technologies were still in such an embryonic form that they weren't commanding much research time.

JM: You not only worked on speech and speaker recognition, you also developed early VoIP.
Yes. We got a patent on packet transmission of voice in 1978, which was way before the Internet so it was a long way before its time. For that, we used a bit rate coding that varied in accordance with the load on the network. If it was congested we used lower encoding rates. It included silence detection, so that you didn't burn up transmission capacity just to send silence.

JM: Do you see VoIP and speech recognition working well together?
I see no real bar to that. The key is what kind of encoding one uses for the packet voice. As long as the encoding rates remain high enough to give an accurate representation of the signal spectrum, you can expect that VoIP will work just fine with speech recognition and with talker recognition as well. Quite a few people have demonstrated that you'll get degradation if you try to run a low-bit-rate coder with ASR, which means you're liable to encounter some problems.

JM: More recently, you worked on a project called Voice Mimic. Was that designed to convert one person's voice into another voice?
Actually, we were trying to understand what constituted the natural quality of human voice. Why did it sound human? What do you have to capture in describing it in order to preserve a personal identity? We were synthesizing speech from what we called "first principles." That is, the human physiology of speech - as best we could determine because there are remarkably few quantitative measurements in medical books. The trick was to get accurate quantitative data on what was realistic for human physiology.

We built a computer program, the inputs for which were the subglottal pressure (which is essentially the lung pressure), an approximation of the position of the vocal cords, and the tension of the cords (which controlled the fundamental frequency); plus a set of area functions that included the cross-sectional area of the mouth and the side branch that is the nasal cavity and radiation from the nostrils. We were able to do remarkably human-sounding text-to-speech synthesis with that system. So, we learned a few things about the major components of naturalness in speech synthesis. It's still a very complex formulation that requires a huge amount of computation. No one uses it as a synthesizer at this point. But, it has some possibilities.

There was an interesting spin-off of this. A Japanese otolaryngologist was interested in using the system to predict the effects of treatments to the larynx. For example, if someone had a paralyzed vocal cord, what would the quality of the sound be? You could use the model to simulate those kinds of things. I think there are some very interesting diagnostic things that can come of work in that direction. Places like the National Institutes of Health may see fit to support research in this direction.

JM: Talk a bit about your work on microphones and hearing aids.
We did a good bit of work on auto-directive microphones for teleconferencing. On the hearing side, we had to establish the acuity of human hearing for certain features that we were using to transmit the signal. We wanted to know how accurately we had to represent a particular feature before a listener would hear something that didn't sound like the original. We did a lot of psychoacoustic testing and some binaural testing to look at techniques like release from masking, which is important for spatial perception of sound. For example, if you're trying to transmit spatial realism using a stereo or some complex system, you need to understand how the spatial separation affects signal-to-noise ratio. We did a computer simulation of the basilar membrane motion in the cochlea based on Georg von Békésy's Nobel Prize-winning physiological measurements. The basilar membrane is sort of a mechanical frequency analyzer. It's influential in actuating the neural signals about sound to the brain. The pointed end of the cochlea is more responsive to low frequencies and the basal end is responsive to high frequencies. Our program was able to show how complex signals were represented in terms of the basilar membrane motion. That gave us some guidelines about perception.

One of the researchers in our organization designed a computer-fitted hearing aid. He followed the concept into development and, because it wasn't mainstream telecom, a medical equipment company licensed the technique and commercialized it.

JM: You were also instrumental in establishing a groundbreaking international research exchange between AT&T and Nippon Telegraph and Telephone of Japan. Please talk about that.
That came about after several of us from Bell Labs gave papers at the International Congress on Acoustics in Tokyo. At that meeting and subsequent visits, we met several Japanese scientists, including Dr. Fumitada Itakura, who was working at NTT at that point. By the way, he is also receiving a major IEEE award at this time, the Jack S. Kilby Medal in Signal Processing.

Itakura gave a paper on low bit rate coding that we thought was quite remarkable. We thought we ought to put our heads together on basic research of that type. When I came back I proposed we set up an exchange program at the basic research level with NTT where they might send an engineer to work with us for a year or two and we would send one to work with them. After a good bit of legal back-and-forth, the companies agreed to the ground rules for trading personnel at the basic research level.

Itakura was the first NTT researcher to join us at Murray Hill and he stayed two years. He worked on a system for booking airline reservations using speech recognition that ran on a Data General Nova computer. It attracted a lot of interest.

The exchange went on for a number of years. It was a tremendous program because companies are very protective of their intellectual property, and rightly so because that's the business. But, you can collaborate at a fundamental scientific level, which is sometimes called "pre-competitive." At that level, you can advance understanding that can benefit a great segment of the research community. We went on to have exchanges in talker verification and in techniques for parsimonious description of speech signals.

JM: What do you see as an important research frontier that exists today?
One thing I keep beating the drum on is the challenge of creating a multi-modal interface that will enable people in remote locations to collaborate and communicate as naturally as possible. It would be an interface where you can not only exchange information by voice, but where the spoken word is supplemented by gesture, eye movement, facial expression, etc.

The question is how do you create that kind of environment and how do you make the system context-aware so that it knows how to fuse all of the separate sensory channels? That's a long-range challenge that will involve linguists, computer scientists, and engineers.

The rest of this interview will appear in the July/August issue of Speech Technology Magazine.

Dr. Judith Markowitz is technology editor of Speech Technology Magazine and heads her own analyst firm specializing in the speech-processing industry.
