The 2022 State of Speech Development Platforms


The speech platform market appears to be nearing an important inflection point this year. Initial enthusiasm around the technology has been tempered. Deployments continue to rise, but more gradually than initially expected. As a result, suppliers are shifting their priorities in hopes that the changes will produce breakthroughs that nudge market adoption forward.

Speech application development platforms are the foundation upon which business and consumer voice applications are built. Creating an infrastructure that presents users with a voice interface is a massive undertaking, one requiring a number of building blocks. Some blocks are already in place, but many are still being developed.

Year in Review

Initially, vendor efforts centered on adding more languages and extending the reach of their development tools. Efforts continue in those areas.

In March, Microsoft added support for 11 languages to Azure Neural Text-to-Speech. The service now supports 60 languages, 142 neural voices, and a total of 219 voices.

In April, SoundHound, which supplies voice artificial intelligence and conversational intelligence technologies, expanded its Houndify Voice AI platform to 22 languages, enabling developers to add conversational intelligence to their products and services.

Developers like to work with certain tools, languages, and skills. In response, leading vendors also added new development aids to their product lines.

In July, Amazon unveiled its largest release of new tools to date. Developers can now build Featured Skill Cards that promote their skills in a home screen rotation.

Also, their skills are now suggested when Alexa responds to common requests, like “Alexa, tell me a story”; “Alexa, let’s play a game”; or “Alexa, I need a workout.” The personalized skill suggestions are based on customers’ use of similar skills. New contextual discovery mechanisms allow customers to use natural language and find the skills.

In the goody bag was a way for developers to create widgets for their skills. With them, customers interact with Echo Show or other Alexa devices via screen input as well as voice.

A Look Ahead

Increasing the number of languages and tools is helpful but does not address a primary market roadblock: Companies still have trouble building the business case for voice application deployments. “Very few corporations come to us to build a voice-only application,” explains John Earle, president and founder of Chant.

When voice interfaces were introduced about a decade ago, suppliers modeled their efforts after the mobile application development market, but voice has not caught on as quickly or become as ubiquitous as mobile apps.

Market leader Amazon’s experience helps to illustrate the evolution and identify where the industry is now. On the one hand, the vendor has been quite successful. More than 900,000 developers created more than 130,000 Alexa skills, which are used in a wide array of mainly consumer applications.

But after an initial rush, there has been a significant decline in skills development. In 2019, Alexa skill revenue in the first 10 months of the year was only $1.4 million, far short of Amazon’s $5.5 million target. Since then, Amazon has not released its skills numbers, developer revenue, or targets.

Why would interest ebb? “The initial voice skills were plentiful but in many cases not very useful, especially for businesses,” explains Deborah Dahl, principal at Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interactions Working Group. “They were developed quickly and in a number of instances without deep thought or care.”

Businesses have high expectations, though. For instance, they must put checks in place to protect customer data privacy, according to Balaji Raghavan, chief technology officer of Uniphore. He adds that the smallest of errors can result in significant business problems, like when a sales call interprets a caller’s intent as “go” instead of “no” or when a call center customer breaks down into tears over the loss of a relative, but the voice bot tries to quickly wrap up the call to minimize handling time.

Speech application development has lagged behind mobile application development for a few other reasons. One factor is that the mobile market matured over many years and built up a robust ecosystem with standards that make it easy for vendors and third parties to mix and match software. As a result, compliant software features high levels of interoperability and portability, allowing companies to spend more time adding needed functionality to their applications and less time trying to get the basic infrastructure pieces to work together.

Consequently, work is under way on a number of fronts to address these shortcomings. In certain cases, suppliers head up the spadework.

Amazon has been at the forefront of the Voice Interoperability Initiative (VII). Its objective is to develop common interfaces so that multiple voice agents can work simultaneously on a single device.

Amazon also developed the Multi-Agent Design Guide, which offers best practices for creating such solutions. The initiative gained support from more than 80 suppliers, a group that included consumer electronics brands, automotive manufacturers, telecommunications operators, hardware solutions providers, and systems integrators. Dolby, Facebook, Garmin, and Xiaomi all support the work.

The initiative has potential, but it is focused on development for Amazon’s own ecosystem. Alternatives are emerging that have a wider scope.

In June 2020, the Linux Foundation formed the Open Voice Network (OVN). The initiative grew out of work by the Massachusetts Institute of Technology (MIT) Auto-ID Laboratory, Capgemini Consulting, and Intel.

The consortium determined that voice agents need to cooperate and, sometimes, collaborate with each other. “The Open Voice Network (OVN) believes that interoperability should enable voice assistants to share dialogues, data, context, and controls,” says Jim Larson, vice president of Larson Technical Services and a senior adviser at the Open Voice Network.

The OVN outlined six voice agent interoperability features:

  1. Invoke remote voice agents. The goal is to give voice the same ubiquitous reach that data now has on the internet. A voice agent's address enables it to reach any network destination, regardless of platform or location.
  2. Support a voice registry system. On the internet, the Domain Name System (DNS) routes requests for specific websites to the named sites. A Voice Registry System (VRS) would similarly enable voice agent owners to register the unique names of their software so that users can connect to them directly.
  3. Switch among voice agents. Currently, voice agents operate in seclusion. The industry has to move to a model where users can invoke multiple voice agents.
  4. Process an implicit request. Right now, users have to ask questions in a direct manner. They should be able to make implicit requests.
  5. Share data and context among voice agents. Consumers do not want to answer the same questions from each voice agent. A voice agent needs to be able to share the user data it collects and put it into the right context.
  6. Extend companies' personas. A persona refers to the voice and characteristics presented by a voice agent. Rather than switching personas when changing between voice agents, developers can maintain the persona of the first voice agent when the user switches to a second.
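To make the registry idea concrete, here is a minimal hypothetical sketch of a DNS-style lookup mapping a voice agent's invocation name to a network endpoint. The OVN has not published an API; the `VoiceRegistry` class, its methods, and the example names and URLs below are all invented for illustration.

```python
class VoiceRegistry:
    """Hypothetical sketch of a Voice Registry System (VRS).

    Like DNS maps domain names to IP addresses, this maps a registered
    invocation name to a voice agent's endpoint. All names are invented;
    the OVN has not published an actual API.
    """

    def __init__(self):
        self._records = {}

    def register(self, invocation_name, endpoint):
        # Invocation names are unique, case-insensitive keys.
        name = invocation_name.lower()
        if name in self._records:
            raise ValueError(f"'{invocation_name}' is already registered")
        self._records[name] = endpoint

    def resolve(self, invocation_name):
        # Analogous to a DNS lookup: name -> network destination,
        # or None if no agent has registered that name.
        return self._records.get(invocation_name.lower())


registry = VoiceRegistry()
registry.register("Acme Weather", "https://voice.acme.example/agent")
print(registry.resolve("acme weather"))
```

A real VRS would also need the governance pieces the OVN lists elsewhere, such as dispute handling for name collisions, which is why the sketch treats duplicate registration as an error.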

The World Wide Web Consortium, the body responsible for the VoiceXML specification, meanwhile, has been working on a third option. The W3C voice interaction community group wants one voice application to pass information to a second. Areas they are examining include the following:

  • discovery of virtual assistants with specific expertise, for example, one that can supply weather information;
  • standard formats for statistical language models for speech recognizers;
  • standard representations for references to common concepts, such as time;
  • interoperability for conversational interfaces; and
  • common work on dialogue management or ‘workflow’ languages.

The end result is that work has begun on developing voice industry standards that would make it simpler for software vendors, third-party systems integrators and consulting companies, and businesses to integrate the technology into their applications. Right now, the work is being undertaken autonomously. “Technically, there is a lot of potential for coalescence,” Dahl says.

But hurdles remain, starting with alignment of the standards. “Licensing and IP (intellectual property) issues need to be worked out,” she adds.

The building blocks for voice development platforms continue to take shape. Vendors are extending their solutions. Voice agent interoperability projects are gaining traction. These possibilities are expected to vie for acceptance in the new year and make it easier for vendors, enterprises, and third parties to mix and match speech software. 

Paul Korzeniowski is a freelance writer who specializes in technology issues. He has been covering speech technology issues for more than two decades, is based in Sudbury, Mass., and can be reached at paulkorzen@aol.com or on Twitter @PaulKorzeniowski.
