Speech Technology Magazine
An Interview with James L. Flanagan

By Judith Markowitz - Posted Jun 20, 2005

Dr. James Flanagan's groundbreaking research in all facets of speech processing and related areas earned him the prestigious IEEE Medal of Honor. This year, he's retiring from Rutgers University. I had an opportunity to interview him about his work.

JM: Of all the things you've worked on, what was the most exciting for you?
JF: Virtually everything I've been involved in has been interesting to me.

About the time I joined Bell Labs, three things came together that had an enormous impact on research. The first was a deeper understanding of sampled-data theory, that is, how to process time-discrete samples. The second was computation based on binary digital computers. The third was the invention of the transistor. Solid-state circuitry brought the promise of microelectronics - that is, assembling large amounts of circuitry in small packages. The coming together of those three things made it possible to use a digital computer to simulate an entire transmission system just by writing a program, without having to spend hours or days designing and assembling custom hardware to run a new algorithm. It provided an enormous capability for testing new ideas and designs quickly.

Low bit-rate coding is a centerpiece of efficient digital transmission. The traditional communication channel carries 64 kilobits per second (kbps), and the goal of our work on adaptive differential PCM (ADPCM) was to double the capacity of ordinary pulse code modulation. We ran the ADPCM at 32 kbps. A little later, my colleagues and I developed sub-band coding, which allowed us to do fairly good quality digital transmission at 16 kbps. This was important because it made the first AT&T voicemail system commercially possible. For the first time, you could store voluminous amounts of voice messages economically; before that, storing voice at 64 kbps simply required too much storage to be affordable.
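The storage arithmetic behind that economy is easy to check. A minimal sketch, using only the bit rates Flanagan cites (the per-minute figures are derived, not from the interview):

```python
def storage_bytes(bit_rate_bps: int, seconds: int) -> int:
    """Bytes needed to store `seconds` of audio coded at `bit_rate_bps`."""
    return bit_rate_bps * seconds // 8  # 8 bits per byte

# One minute of speech at each of the rates mentioned above.
for label, rate in [("PCM", 64_000), ("ADPCM", 32_000), ("sub-band", 16_000)]:
    kb = storage_bytes(rate, 60) / 1000
    print(f"{label:>8} at {rate // 1000} kbps: {kb:.0f} kB per minute")
```

Sub-band coding at 16 kbps thus needs a quarter of the storage of 64 kbps PCM for the same minute of speech, which is what made bulk message storage economical.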

In 1978, we got a patent on packet transmission of voice. That was well before the Internet, so it was a long way ahead of its time. We used silence coding and variable bit-rate coding based on the load on the network.
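The silence-coding idea can be sketched in a few lines: frames of speech are transmitted only when their energy exceeds a threshold, so the average bit rate drops during pauses. This is a hypothetical illustration of the general technique; the frame size and threshold are my assumptions, not values from the 1978 patent.

```python
from typing import List

FRAME_SIZE = 160            # samples per frame (20 ms at 8 kHz) - illustrative
SILENCE_THRESHOLD = 100.0   # mean-square energy cutoff - illustrative

def frames_to_send(samples: List[float]) -> List[List[float]]:
    """Split samples into frames and keep only the non-silent ones."""
    kept = []
    for i in range(0, len(samples) - FRAME_SIZE + 1, FRAME_SIZE):
        frame = samples[i:i + FRAME_SIZE]
        energy = sum(s * s for s in frame) / FRAME_SIZE  # mean-square energy
        if energy >= SILENCE_THRESHOLD:
            kept.append(frame)
    return kept
```

In a conversation, each party is silent roughly half the time, so dropping silent frames alone can nearly halve the average transmitted rate before any variable-rate coding is applied.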

I also believe we were the first to have speech recognition on mobile cellular communications. We thought it would be nice to simply say the name or number of the person you wanted to connect to rather than take your hands off the steering wheel to punch the buttons. In 1986, we added a speaker-dependent speech recognizer (we called it a "repertory dialer") to a mobile cell phone. It could store around 40 names or numbers. We got the ASR working with a $1,600 transmitter in the trunk. The speech card cost only about $200, so we were really proud of that. It differentiated AT&T's technology from everyone else's.

JM: What are the research frontiers you see today?
JF: One big challenge is to create a user environment that has the naturalness of a face-to-face conversation, no matter where the user is or over what distance that user is trying to communicate. Speech is going to be central to this, certainly, because speech carries a large part of the burden of information exchange, but speech will be supplemented and supported by gesture, eye movement, facial expression, and so on.

We have a relatively well-developed, quantitative framework for language. We know what the phonemes of our language are and how to put them together to produce words. We have lexicons and grammars for combining words into meaningful sentences. We even have some semantic rules to interpret meaning. There's no comparable framework for multimodal communication. What are the phonemes of multimodal communication? And how do we make such a system context-aware, so that it knows how to fuse the separate sensory channels?

For example, if I'm sitting at a screen with a map display, I could point to an icon and then at a location and say, "Move this there," and the system would do it. That's a long-range challenge that will involve linguists, computer scientists, and engineers.

JM: What are the plans for your retirement?
JF: I haven't looked for anything special at this point.  I'm trying to honor existing commitments, which include membership on four boards. I've started on some writing I wanted to do.  Otherwise, I'm going to see how it plays out. 



Judith Markowitz is the technology editor of Speech Technology Magazine and is a leading independent analyst in the speech technology and voice biometric fields. She can be reached at (773) 769-9243 or jmarkowitz@pobox.com.
