Speech Recognition Is Still Hard Work—and Reliant on People
With a son deeply involved in making autonomous vehicles a reality, I rarely get through a week without someone telling me how quickly humanity is headed toward the Singularity and the end of all human independence from technology. My usual reply is something along the lines of wishing that technology would just do what we ask it to do, at least for a start.
Then my work comes up, and everyone jumps in with their latest complaints about (fill in the blank) speech recognition. We end up agreeing that humanity has little to worry about from computers when the two sides seem to work together about as well as politicians of different parties.
Artificial intelligence (AI) seems to be the latest bright and shiny discussion topic and sales pitch. AI is presented under many different names, and the term “neural network” is woven in more often than not. At the risk of being a buzzkill, here is today’s speech recognition reality in the form of a real-life case study.
Aviation is a highly regulated activity, with virtually every action dictated by a written procedure and a supporting regulation with a penalty/fine behind it. Frank Sinatra wouldn’t have liked being a pilot because there is virtually no wiggle room for you to do anything “My Way.”
Within this construct, Adacel is the leading supplier of air traffic control (ATC) simulators for training, research, and modeling of air traffic procedures, as well as voice-activated cockpit interfaces. In the ATC training simulators that Adacel supplies to the U.S. Air Force, speech recognition is used to convert controllers’ communications to simulated pilots into commands that the simulator system can process. Adacel is contractually obligated to provide a minimum of 98 percent word accuracy.
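To make the 98 percent figure concrete, here is a minimal sketch of how word accuracy is commonly scored. It assumes the usual definition, one minus the word error rate, where errors are counted as the word-level edit distance between what was said and what was recognized. The article does not say how Adacel or its customer actually measures accuracy, so the function names and the sample transmission below are purely illustrative.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level edit distance: substitutions + deletions + insertions."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] = edits needed to turn the ref words seen so far into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped (deletion)
                curr[j - 1] + 1,     # extra word recognized (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - (word errors / words actually spoken)."""
    spoken = len(reference.split())
    if spoken == 0:
        raise ValueError("reference transcript is empty")
    return 1.0 - word_errors(reference, hypothesis) / spoken


# Hypothetical controller transmission and recognizer output.
said = "delta four five six turn left heading two seven zero"
heard = "delta four five six turn left heading two seven niner"
print(f"{word_accuracy(said, heard):.0%}")  # one wrong word out of ten
```

Under this definition, a single misrecognized word in a ten-word transmission already drops accuracy to 90 percent. A 98 percent floor leaves room for roughly one error in every fifty words spoken.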
An analysis of the official ATC manual, FAA Order 7110.65, shows a requirement to support fewer than 75 controller phrases. Since air traffic control communications are part of a well-defined approach to flight safety, it would be reasonable to assume that, allowing for regional variations, support for 150 phrases would suffice. However, the unpredictability and creativity of verbal communication, even in the disciplined domain of ATC, often result in the use of unexpected phrases. Today that support for 150 phrases in the Adacel system has evolved into a speech system that recognizes more than 150,000 phrases and an almost infinite number of possible permutations. Even with support for such an extensive grammar, every week brings at least one report of a new phrase that is “unrecognized.”
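The jump from a few phrases to an almost infinite number of permutations is easier to see with a toy example. The sketch below treats one instruction as a template of slots and counts every utterance it can produce. The airlines, wording variants, and slot layout are invented for illustration and are not taken from Adacel's grammar or from the FAA manual.

```python
from itertools import product
from math import prod

DIGITS = "zero one two three four five six seven eight niner".split()


def spoken_digits(number: int, width: int) -> str:
    """Spell a number digit by digit, e.g. 270 -> 'two seven zero'."""
    return " ".join(DIGITS[int(ch)] for ch in f"{number:0{width}d}")


# Hypothetical template for a single heading instruction. Each inner list is
# one slot; an utterance picks exactly one option from every slot.
HEADING_TEMPLATE = [
    ["delta", "united", "american"],                               # airline
    DIGITS, DIGITS, DIGITS,                                        # flight number
    ["turn left heading", "turn right heading", "fly heading"],    # wording
    [spoken_digits(h, 3) for h in range(1, 361)],                  # 001 to 360
]


def count_utterances(template) -> int:
    """Number of distinct utterances: the product of the slot sizes."""
    return prod(len(slot) for slot in template)


def generate_utterances(template):
    """Lazily yield every utterance the template can produce."""
    for choice in product(*template):
        yield " ".join(choice)


print(f"{count_utterances(HEADING_TEMPLATE):,}")   # 3 * 10 * 10 * 10 * 3 * 360
print(next(generate_utterances(HEADING_TEMPLATE)))
```

One template with three airlines and three wordings already yields more than three million distinct utterances, and that is before anyone says "come left" or "make it" instead of the expected wording. Every such variant is another phrase the grammar has to be taught.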
Given the hard reality of extemporaneous speech (especially in English, with its 200,000-plus words, many of which have multiple meanings), how do we approach the next phase of speech recognition, beyond the recent substantial acoustical improvements?
Someday deep neural networks, supported by advanced analytical techniques and tools yet to be imagined, may make progress to the point of matching human understanding. That may be 10 or perhaps 20 years out. In the meantime, there is plenty of evidence that humans are changing how we speak to our technology to make it recognize our intent. How will learning technology adjust to our increasingly stilted speech? Humans are adapting to technology; will we be able to make technology adapt to us? Perhaps such systems will need to be set to always listen, much like an infant transitioning away from baby talk by overhearing adult conversations. Privacy concerns are coming to the forefront!
As someone who is “Inside Speech,” my recommendation is to stay tuned to progress but not to buy into the bright and shiny claims floating around about AI or deep learning technologies just yet.
Given my experience with both, I am betting that autonomous vehicles will beat AI/deep learning speech recognition in the race to be part of our daily lives. But at this point I am not sizing up the purchase of either a driverless car or a neural network–driven speech recognition system. Let the buyer beware.
Kevin Brown is managing director at VoxPeritus, where he specializes in speech solutions and caller experience consulting. He has more than 25 years of experience designing and delivering speech-enabled solutions. You can reach him at firstname.lastname@example.org.