May 1, 2012
By Roberto Pieraccini, Chief Technology Officer, SpeechCycle, Inc.

The Great Divide

In his book, The Voice in the Machine: Building Computers That Understand Speech (The MIT Press, 2012), Roberto Pieraccini explores the maturation of speech technology and the industry's quest to make it as humanlike as possible. Will speech technology ever fully interact in the same way as humans? Perhaps, but to do so, one must fully understand the complexities of human speech—something we have yet to accomplish. This exclusive excerpt from Pieraccini's book reveals what speech technologists are up against when creating techniques for speech understanding, and offers a very helpful perspective on what to expect.

Language is complex, and we know very little about how we humans manage this complexity. Although we understand the physiology of the organs that produce and perceive speech, as we move toward the brain activities that handle language, our understanding becomes vaguer and vaguer. We have theories, but we don't know exactly what mechanisms our brains use to represent concepts and ideas and transform them into syntactic structures and sequences of words, and how these structures and sequences are ultimately transformed into motor nerve stimuli that activate our articulator muscles when we speak. We know that the cochlea in the inner ear generates a complex signal that represents the spectral information of the speech sounds we hear, but we don't know exactly how the brain uses that signal to extract the individual phonemes, arrange them into words, perform a syntactic analysis, and then come to a semantic representation of the concepts and ideas that makes sense in the immediate context we find ourselves in at a particular moment in time. We don't even know whether the brain disentangles the complex signal on the different linguistic levels—semantic, syntactic, lexical, morphological, phonetic, and acoustic—or skips that by making direct associations between sequences of sounds and concepts or ideas. We can make educated guesses based on the experimental evidence from perceptual tests and studies of anomalous behavior caused by disease or accident, but we don't know for sure. We humans are the masters of a language of incomparable complexity among all living species, but we don't understand, at least not yet, exactly how it works. And, nevertheless, we want to build machines that speak and understand speech.

Will having a better understanding of how we humans handle language help us build more-intelligent talking machines? And should machines that imitate human behavior be based on the same mechanisms we know or presume humans use? Kempelen's speaking machine was indeed an attempt to imitate the physiology of human vocal organs. Say you wanted to build, today, a modern computerized version of Kempelen's machine. You could indeed create a computerized machine to simulate the vocal tract; define a reasonable mathematical mechanical model of the tongue, lips, jaw, and velum; and compute the acoustic characteristics of the resulting sounds by solving the equations that govern the mechanics and motion of the air in a resonating cavity of a given shape and size. You could then link each individual speech sound to the corresponding articulation sequences and build a computer program that, for any utterance, would compute the actual movements of the simulated vocal tract, solve the equations, and generate synthetic speech. And, in fact, many researchers have been taking just this approach for decades.
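As a rough illustration of what computing the acoustic characteristics of a resonating cavity involves, the simplest textbook idealization treats the vocal tract as a uniform tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of a quarter-wavelength frequency. The sketch below covers only that idealization, not the articulatory models described above; the tube length, the speed of sound, and the function name are assumptions chosen for illustration.

```python
SPEED_OF_SOUND_M_S = 340.0   # assumed speed of sound in air
TRACT_LENGTH_M = 0.17        # assumed adult vocal tract length

def tube_resonances(length_m, count=3, c=SPEED_OF_SOUND_M_S):
    """Resonances of a uniform tube closed at one end and open at the other:
    F_n = (2n - 1) * c / (4 * L)."""
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, count + 1)]

formants = tube_resonances(TRACT_LENGTH_M)
for n, f in enumerate(formants, start=1):
    print(f"F{n} = {f:.0f} Hz")
```

With these assumed values the resonances land near 500, 1,500, and 2,500 Hz, which roughly matches a neutral vowel. A real vocal tract changes shape continuously, and that is where the mechanical and fluid equations mentioned above become expensive.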

But attempts at exactly simulating the highly complex mechanics of speech production haven't yet been able to produce natural-sounding synthetic speech, at least not as well as other approaches have. The problem is that our lack of knowledge about the fine articulatory details of speech production and the inherent complexity of creating an accurate mechanical model of the vocal tract inevitably give rise to a computationally expensive and still crude approximation of human speech. And even if you had a more accurate mechanical model of speech articulation, the computation needed for solving all the motion and fluid equations in real time—while the computer was speaking—would be beyond the capacity of today's commercial computers.

Consider instead a simpler and more brute-force approach, one that would abstract essential elements of speech rather than precisely imitate vocal tract physiology. For instance, you could create an inventory of audio clips of speech segments, like phonemes, syllables, or words, recorded by a human speaker. With a complete inventory of these sound clips, you could then synthesize any utterance by splicing them in the correct order. Thus, rather than building a machine that functions as humans do, you could build a fast, sophisticated, but intrinsically simpler machine that worked by assembling sounds from a catalog of recorded sounds: a smart reproducer of recorded speech samples. If you used the right technology for splicing sounds in a seamless way, with a large inventory of elemental speech units and a lot of engineering, the effect could be compelling: a machine that spoke without simulating any moving parts, a machine that used a fundamentally different concept than the one used by humans, yet sounded very much like a human. Of course, the brute-force approach has many limitations when compared to a powerful mathematical mechanical model of the vocal tract. But we can't yet build a talking machine based on a mathematical mechanical model, whereas, despite all the limitations of the brute-force approach, we have built talking machines using it, machines that have worked for many years now with compelling results.
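The splicing idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any production synthesizer: the unit names, the sample rate, and the linear crossfade at each seam are all hypothetical choices, and sine tones stand in for clips that would really be recorded by a human speaker.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)

def tone(freq_hz, duration_s):
    """Stand-in for a recorded clip: a pure sine tone."""
    n = round(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

# Hypothetical inventory; a real one would hold recorded speech units.
INVENTORY = {
    "HH": tone(200, 0.05),
    "EH": tone(300, 0.10),
    "L": tone(250, 0.08),
    "OW": tone(350, 0.12),
}

def crossfade(left, right, overlap):
    """Join two clips, blending `overlap` samples linearly at the seam."""
    overlap = min(overlap, len(left), len(right))
    if overlap == 0:
        return left + right
    start = len(left) - overlap
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the incoming clip
        blended.append(left[start + i] * (1 - w) + right[i] * w)
    return left[:start] + blended + right[overlap:]

def synthesize(units, overlap=40):
    """Look up each unit in the inventory and splice the clips in order."""
    out = []
    for name in units:
        clip = INVENTORY[name]
        out = crossfade(out, clip, overlap) if out else list(clip)
    return out

utterance = synthesize(["HH", "EH", "L", "OW"])
print(len(utterance), "samples")
```

Nothing here models a tongue or a jaw. The quality of such a system depends entirely on the size of the inventory and on how well the seams are hidden, which is the engineering effort the paragraph above refers to.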

However enticing it may seem to build a machine that uses humanlike mechanisms for accomplishing what humans do, such an approach has proven to be impractical not only in understanding and generating speech, but in many other areas of machine intelligence. It's a known and widely accepted fact today that machines don't have to replicate the same mechanisms used by humans, or whatever we believe these mechanisms to be, to exhibit intelligent behavior. Machines that produce and perceive speech, recognize images and handwriting, play chess, and translate languages without imitating humans stand as proof of this fact.

To appreciate why replicating human mechanisms is not always the best way of having machines exhibit human intelligence, you need to understand that the computational characteristics of the human brain are fundamentally different from those of modern computers. Whereas digital computers derive their power from a relatively small number of fast computational elements—processors—that are sparsely connected and share data through a limited number of channels, the human brain is an abundantly interconnected assemblage of an immense number of computational elements—neurons—that are both far slower and far less powerful than a computer chip. By way of comparison, a commercial top-of-the-line 2010 Intel personal computer chip—microprocessor—is capable of performing more than 100 billion operations per second, whereas the response of a human neuron is a mere 1,000 impulses per second. But the brain is composed of some 100 billion neurons, whereas a commercial PC has typically only one or a couple of computer chips. A back-of-the-envelope estimate of the computational power of the human brain based on those numbers is around 100 trillion operations per second, roughly 1,000 times more powerful than a commercial computer chip today. After all, 1,000 is not that big a number. At the current rate of increase in computer speed, we may see, in a decade or two, chips that match or exceed the computational power of the human brain.
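The back-of-the-envelope estimate can be reproduced directly from the figures quoted above. The only number in this sketch that does not come from the text is the doubling period for chip performance, which is an assumption used to see whether "a decade or two" is plausible.

```python
import math

NEURONS = 100 * 10**9               # "some 100 billion neurons"
IMPULSES_PER_SECOND = 1_000         # per neuron
CHIP_OPS_PER_SECOND = 100 * 10**9   # 2010 chip, operations per second

brain_ops_per_second = NEURONS * IMPULSES_PER_SECOND
ratio = brain_ops_per_second / CHIP_OPS_PER_SECOND

# Assumption (not from the text): chip performance doubles
# every 1.5 to 2 years.
ASSUMED_DOUBLING_YEARS = (1.5, 2.0)
doublings_needed = math.log2(ratio)
years_to_parity = [doublings_needed * y for y in ASSUMED_DOUBLING_YEARS]

print(f"brain estimate: {brain_ops_per_second:.0e} operations per second")
print(f"brain / chip:   {ratio:.0f}x")
print(f"doublings:      {doublings_needed:.1f}")
print(f"years:          {years_to_parity[0]:.0f} to {years_to_parity[1]:.0f}")
```

A factor of 1,000 is about ten doublings, so under the assumed doubling period the gap closes in roughly 15 to 20 years, consistent with the estimate of a decade or two.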

But computational speed or power is not the only fundamental difference between a digital computer and the human brain. Data access and the type of processing achieved are also key differentiators. A computer chip generally includes a single processing unit, the central processing unit (CPU), which performs operations in a sequential way—one operation at a time. Even when you run simultaneous tasks on a home computer, for instance, printing a document while browsing the Web and receiving an e-mail, or chatting with several remote users at the same time, the CPU is still running in a sequential fashion. In fact, each fraction of a second, the CPU dedicates a small amount of time to each one of the tasks, giving you the illusion that you're doing things in parallel on the same computer. Since the birth of the electronic computer, we've developed effective ways of programming single CPUs that mostly rely on procedural mechanisms: first do A, then do B, then do C, and so on. Moving from a single CPU to a massive assemblage of computational elements—parallel computing—is indeed a hot topic in computer science today. Blue Gene, the high-performance computer built by IBM in 2004 for solving genetic analysis problems, could assemble as many as 65,000 individual processors to reach computational powers in the hundreds of trillions of operations per second. Its successor, IBM Roadrunner, built in 2008, has roughly twice as much power. A computer like that—not your typical home computer—is very close to matching and even exceeding the raw computational power of a human brain. If that's true, why can't we use a machine like Blue Gene or Roadrunner to replicate the human mechanism of speech production, hearing, understanding, and intelligent behavior in general?
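The illusion of parallelism on a single CPU can be sketched with a toy round-robin scheduler. This is a simplified illustration, not how any particular operating system is implemented: the task names and step counts are hypothetical, and each Python generator stands in for a program that gives up the processor after one small slice of work.

```python
from collections import deque

def task(name, steps):
    """A task that yields control back to the scheduler after each step."""
    for i in range(1, steps + 1):
        yield f"{name}: step {i}"

def run_round_robin(tasks):
    """Give each task one slice in turn until every task has finished."""
    queue = deque(tasks)
    log = []
    while queue:
        current = queue.popleft()
        try:
            log.append(next(current))  # run one slice of this task
            queue.append(current)      # then send it to the back of the line
        except StopIteration:
            pass                       # task finished; drop it
    return log

log = run_round_robin([task("print", 2), task("browse", 3), task("email", 1)])
for entry in log:
    print(entry)
```

Only one step ever runs at a time, yet the log interleaves the three tasks. When the slices are short enough, that interleaving is what a user perceives as things happening simultaneously.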

The problem, beyond the cost of such a supercomputer, is that parallel supercomputers still have severe limitations in data access when compared with the human brain. The 100 billion neurons of a human brain are linked together by incoming and outgoing information channels that are connected by specialized junctions called "synapses." The neurons, with all their specialized connections, form a huge network that acts as both the program and the data of our brains. New connections are created and others are destroyed every time we learn or memorize something. It's estimated that a human brain includes some 100 trillion connections—synapses—which provide storage for an estimated 100 terabytes (100 million megabytes) of data and programs. The brain apparently knows how to access all this data in a rapid and efficient way through its immense network of neuron connections for performing activities that seem natural to us, like speaking, walking, or seeing. Today's parallel supercomputers are nowhere close to that. Moreover, a massive supercomputer like IBM's is programmed in much the same way as a single-chip home computer is, using step-by-step procedures. The brain, with its huge parallel computational architecture, is most likely not programmed with step-by-step procedures. The phenomenal performance of the brain in activities such as speaking and playing chess must come from different programming paradigms, which, as of today, we simply don't understand. And, of course, even if we did know how to program a brainlike machine, we wouldn't know what program to write. We know so little about how our brains transform concepts and ideas into speech, and speech into concepts and ideas, that we wouldn't be able to replicate it in a brainlike machine.

But if I had to name just one of the features that differentiate the human brain from any machine we might build, it would probably be its ability to learn continuously. You're born with a brain that initially knows little more than the brains of other animals do. It knows how to control your internal functions: how your heart beats, how you breathe, and how, as an infant, you suck milk for your survival. As a newborn, your ability to control and coordinate your limbs through your senses of touch and sight, what we call "hand-eye coordination," is rather limited, and you certainly don't know how to speak or understand speech. But evidently, your brain is endowed with the power to let you learn. Indeed, before too many years, you learn how to use your hands, to walk, to speak, and to eat with a knife and fork. And you go on to learn math, history, science, and music. Your ability to learn continuously is exactly what digital computers currently lack, though they're starting to acquire it to a very limited degree, thanks to scientific disciplines such as machine learning. But, still, most of what computers do today has been painstakingly programmed, step by step, by humans. If we want to build machines that replicate sophisticated human activities, like speaking and seeing, we need to give them, first and foremost, the ability to learn.

Thus, there is a deep divide between how the human brain functions and how a modern computer does, and this divide can't be bridged by today's technology. Asking whether we can teach machines to behave like humans by replicating human mechanisms in them is an ill-posed question, especially given that, by and large, we don't know how those mechanisms work. If you want to build a talking machine, you need to exploit the computational power of today's computers using today's programming models. Airplanes don't flap their wings, yet they can fly faster and farther than eagles. And though few airplanes can land on the top of a mountain, and none can dive down to catch its prey, reproduce, and care for its babies, all airplanes can carry humans and things from one place to the other in the fastest and safest way, which is what they're built to do, and which is definitely not what eagles do. Airplanes and eagles thus have different purposes or goals, so it makes little sense to compare what each can and can't do. Today's talking machines may not have mouths, tongues, and ears, but they can reasonably produce and understand speech, and they'll do it better and better with each passing year. But you shouldn't expect them to do what humans do in all, or even most, situations. Just as it makes little sense to compare eagles and airplanes, it makes even less sense to compare talking machines and humans. Like laptop computers, TVs, cars, and airplanes, talking machines are devices created to expand the capabilities of human beings, not to imitate them.

Excerpted from The Voice in the Machine: Building Computers That Understand Speech, by Roberto Pieraccini, © Massachusetts Institute of Technology, 2012. All rights reserved. Excerpt has been edited for space and style.
