If the story of speech engines to date has been about perfecting the artificial ears that listen and the artificial mouths that speak, the story of speech technology going forward is about perfecting the artificial brains that will make sense of it all.
2016 seems to have been a tipping point for speech engines. The long-standing challenges of speech recognition speed and accuracy have largely been solved, and both clients and providers are now increasingly focused on the conversational technologies that can put speech to work in powerful, meaningful ways. Interest in both intelligent assistants and speech analytics is on the rise, and product development is growing to meet demand.
MarketsandMarkets projects that the overall speech market will reach nearly $10 billion by 2022, rapidly expanding from its $3.7 billion valuation in 2015. As that happens, more generalized tech companies have started to make serious forays into the speech market, and companies that have specialized in speech technology for years are looking to make their products more accurate, robust, intelligent, and scalable.
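Those figures imply a compound annual growth rate of roughly 15 percent. A quick back-of-the-envelope check, assuming a round $10 billion endpoint and the seven-year span from 2015 to 2022:

```python
# Implied compound annual growth rate (CAGR) from $3.7B (2015) to ~$10B (2022).
start, end, years = 3.7, 10.0, 7

# CAGR is the constant yearly growth factor that turns `start` into `end`.
cagr = (end / start) ** (1 / years) - 1

print(f"{cagr:.1%}")  # roughly 15 percent per year
```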
Nuance Communications, for example, launched its Nuance Transcription Engine in May; the engine boasts 88 percent accuracy at its slowest transcription speeds and can process audio at up to 10 times real time, delivering transcription that is both swift and near-human.
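For context, "10 times real time" means the engine can chew through an hour of audio in about six minutes, a relationship simple enough to sanity-check:

```python
# At an N-times-real-time processing speed, transcription time = audio length / N.
def transcription_minutes(audio_minutes: float, speed_multiple: float) -> float:
    """Minutes needed to transcribe audio at the given real-time multiple."""
    return audio_minutes / speed_multiple

print(transcription_minutes(60, 10))  # an hour of audio in 6 minutes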
As the wealth of utterances driving the machine learning for recognition reaches critical mass, the speech market will boom, with start-ups looking to add speech to their product functionality. Companies that offer strong cloud-based computing will have a distinct advantage, according to Deborah Dahl, principal at Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interactions Working Group.
“Amazon, Google, Microsoft, and IBM are all strong products technically,” she says. “Deep learning has made a huge difference in speech recognition accuracy over the past few years in cloud recognition.”
In last year’s spring issue, Speech Technology magazine suggested that 2016 might be the year when the proprietary barriers that have held speech engines back finally fall. That has happened to some extent, as expanding applications for consumer speech have led many vendors to diversify their footprints and expand their ecosystems.
Amazon, for example, offers out-of-the-box speech with the Alexa Voice Service (AVS), an implementable version of the speech-recognition and natural-language-understanding tools found in its intelligent assistant of the same name. As for the intelligent assistant itself, its ecosystem is growing rapidly. The platform launched with only 14 “skills” (applications, designed largely by third-party developers, that extend Alexa’s functionality); by the end of 2016, Amazon claimed roughly 6,000 skills tied to its Alexa platform. Additionally, Amazon has already sold more than 4 million Alexa-enabled devices, which include the Echo home speaker line and the Fire line of TVs and tablets. The utterances constantly fed into the Amazon cloud by this ever-growing user base will likely ensure continually improved machine learning, an ever-broadening lexicon, and greater speech recognition accuracy.
In addition, a recent collaboration between Amazon and the audio technology company Conexant Systems has resulted in the Conexant AudioSmart CX20921 Voice Input Processor, which is poised to greatly improve voice recognition performance in high-background-noise environments.
Google has also expanded its services, adding voice transcription and text-to-speech (TTS) as features in its own apps, such as Google Docs and Google Now, and through downloadable add-ons to its Chrome web browser. Devices running the Android operating system also get speech recognition and transcription, and third-party developers looking to integrate these offerings will find them available through the Speech API on the Google Cloud Platform.
IBM also continues as a pioneer in the speech market, merging speech and cognitive technologies in Watson, its suite of ever-evolving deep learning applications. Watson is capable of speech recognition and TTS, both of which are available as APIs or software-as-a-service. In addition, IBM launched Watson Virtual Agent in October, allowing businesses to build and deploy conversational agents in a market primed for automated customer service in chat channels.
Microsoft offers enterprise access to its speech engine through Microsoft Speech Platform, a set of APIs that allow TTS and speech recognition; Microsoft’s own products have these functions baked in. Its own intelligent assistant, Cortana, comes standard on every product running Windows 10 and most Windows-based mobile devices.
In perhaps the most exciting development for Microsoft speech, a team of researchers reportedly developed a speech recognition system whose accuracy rivals or surpasses that of human transcriptionists, with a word error rate of just 5.9 percent.
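The 5.9 percent figure is a word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the system's transcript into the reference transcript, divided by the reference length. A minimal sketch, assuming simple whitespace tokenization:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion over six words
```

Production toolkits compute the same metric, typically with punctuation and casing normalized before scoring.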