Speech Recognition Has Finally Come of Age. Now What?


Those who have long been involved with speech technology remember the annual mantra: “Speech recognition will come of age next year!” The line became so trite that it was applied to other emerging technologies, even as it remained a running joke about speech recognition itself. Yet here we are, with speech recognition part of everyday life, embraced by the masses as quickly as the electric light bulb became a fixture in every household.

Recently, several technology pundits have dismissed new speech-enabled solutions because “pretty much everything is just based on a Google or Alexa API.” Rejecting a solution as unoriginal because it is not built entirely by a single vendor is misguided. Just as with electric light, today’s speech-enabled technologies don’t owe their benefits to any one vendor; parties throughout a solution’s development apply expertise that adds value. These sometimes complex arrangements do, however, present challenges for buyers and providers alike.

During the long period in which speech recognition was perpetually about to become the next great thing, recognition focused on speech-to-text via statistical analysis of phonemes (please forgive the shorthand description). It seems quaint today, but back in 1986 IBM announced that its Tangora could predict upcoming phonemes using the Hidden Markov Model, which prompted predictions that ubiquitous speech recognition was right around the corner.

Today’s acceptance of speech recognition is the result of adding contextual understanding, backed by highly tuned statistical modeling, to overcome the obstacles in language that confound even humans. A quick scan of any social media feed will demonstrate that homophones such as “there,” “their,” and “they’re” apparently require high-powered cloud computing to get right.

With the rise of Amazon’s Alexa, Google Assistant, Apple’s Siri (which is based on Nuance’s speech recognition), and Microsoft’s Cortana, contextual understanding leapt forward, trained on billions of utterances constrained within well-bounded domains: maps and directions, computer commands (e.g., “open Word” or “send text message”), automotive commands, and so on. With the massive power of cloud computing enabling contextual understanding, speech recognition has become quick, convenient, and helpful.

To get back to the opening point, with speech recognition’s sudden, actual ubiquity, what is a buyer or provider of speech-enabled technology to do? How can you protect yourself from rapid changes in alliances or terms and conditions with these global behemoths? And what about the potential for abuse of personally identifiable information (PII)?

All the previously mentioned vendors either already offer enterprise integration or are planning to, and several traditional enterprise software vendors are joining them. For the most part their contracts follow those of most enterprise cloud computing offerings, but enough holes are conveniently left that common sense should be your guiding factor in protecting yourself. For example, if you are a buyer of the end solution, no PII should be shared outside your organization; and if you are building a speech-enabled solution on these third parties, you should erect an additional Chinese wall to protect end users in case your customers haven’t fully disassociated PII from speech processing.
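As a minimal sketch of that principle, the fragment below scrubs a few common PII patterns from a transcript before it leaves your organization for cloud processing or logging. The pattern set, names, and coverage here are illustrative assumptions, not any vendor’s API; a production system would need far more thorough detection (names, addresses, account numbers) plus auditing.

```python
import re

# Illustrative sketch only: scrub a few common PII patterns from a transcript
# before it crosses the organizational boundary. Real deployments need much
# broader detection than three regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact_pii(transcript: str) -> str:
    """Replace recognized PII spans with bracketed placeholders."""
    for label, pattern in PII_PATTERNS.items():
        transcript = pattern.sub(f"[{label}]", transcript)
    return transcript

print(redact_pii("Call me at 555-123-4567 or email kevin@example.com"))
# → Call me at [PHONE] or email [EMAIL]
```

The point of the design is that redaction happens on your side of the wall, before any third-party speech or analytics service sees the text, so a vendor contract gap cannot expose what was never sent.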

Furthermore, given where North American sentiment on data privacy is heading, using the European Union’s General Data Protection Regulation (GDPR) as a guide can add a healthy dose of future-proofing to any solution you build or buy that relies on cloud speech processing.

On a more positive front, speech technologies are now changing at a rate never thought possible, so keep your eye on opportunities to apply new functionality few would have dreamed of even a year or two ago. And always keep in mind author Douglas Adams’s admonition: “We are stuck with technology when what we really want is just stuff that works.” 

Kevin Brown is customer experience application architect at Banner Health, where he specializes in voice and web customer experience solutions. He has more than 25 years of experience designing and delivering speech-enabled solutions. You can reach him at kevin.brown@voxperitus.com.
