The 2023 State of Speech Development Platforms
Like most people, speech application developers and platform providers are glad that 2022 is over. A sobering reality set in: The market ramp-up was taking longer than expected, and the business case, especially for enterprise users, was quite difficult to make. So in 2023, finding ways to ease deployment has become a top priority, one they hope will eventually increase adoption.
Year in Review
In 2022, vendors retrenched: their efforts to seed the market were costing too much and producing too little. Add to that a technology landscape plagued by worldwide economic uncertainty, ongoing supply chain issues, and a sometimes-hostile government regulatory environment. Meanwhile, the largest tech giants in the space, including Amazon, Google, and Microsoft, laid off large portions of their global workforces.
While speech has many potentially beneficial capabilities, creating these applications has proven to be quite challenging. “Building a voice application is not like designing a web application,” explains James Larson, vice president of Larson Technical Services. “Just about anyone can develop a web application with all of the drag-and-drop tools that are available. Speech relies largely on new technologies that many developers must learn how to use. Right now, there are not enough skills and tools available to simplify deployment.”
Input is another issue. With web apps, users enter information in a set way, usually by filling in fields that request specific items, such as personal information. Speech is freeform: users can express themselves in myriad ways. The application has to somehow put the words into context and recognize what the person means, a task it rarely does well enough.
Another problem is that application design has largely been monolithic. One supplier assembles the entire software infrastructure—from the lowest layer, the artificial intelligence speech recognition systems, to the top, the development tools that add voice interfaces to applications.
In the web world, such elements are broken into discrete pieces, so third parties focus on select issues, like mobile device portability. Speech solutions have an antithetical design. Here, the vendor builds the speech recognizer, speech analyzer, dialogue system, generation system, etc. Because of that approach, interoperability is largely nonexistent.
“Initially, vendors claimed that delivering speech applications would be simple and easy, but companies found that isn’t the case,” says Deborah Dahl, principal at Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interactions Working Group.
Enterprise developers and third parties find themselves working with a multiplicity of incompatible puzzle pieces. The products do not talk to one another and do not engage well with the existing technology infrastructure: customer care applications, websites, mobile apps, digital messaging platforms, or home devices. As a result, programmers spend a lot of time integrating the different software components, which leaves them less time to improve the front-end user-facing features.
Businesses spend time, money, and manpower on speech apps that often fall short of their goals, so making the business case to deploy voice has been challenging as well.
A Look Ahead
Speech developer platform suppliers have responded to these limitations in a few ways. One way has been to narrow their focus to select environments rather than every possible use.
For instance, Amazon made a series of announcements that largely extended the Echo home systems. The Alexa Voice Service (AVS) SDK 3.0 combined the Alexa Smart Screen SDK and the AVS Device SDK into a single solution. Offering one SDK with configurations and templates simplifies application development and updating.
In October at Ignite, Microsoft’s Azure Communication Services added automation capabilities designed to speed up call workflows and simplify the delivery of personalized customer interactions. Its REST APIs provide a programmable interface that abstracts the complexity of telephony systems. Developers use the software to program transactional calling workflows, like proactively notifying parents and students of school closures, or complex interactive workflows, such as resolving airline flight changes.
In addition, suppliers have been teaming up and linking their products. At the start of 2022, Nuance Communications expanded its partnership with Genesys, providing Genesys Cloud CX customers with access to Nuance’s Contact Center AI, which features interactive virtual assistants and biometric authentication.
As 2022 unfolded, the need for industry standards became pressing, and a number of different initiatives gained traction. In some cases, the groups are sharing information and working together, but in others, they are not.
Amazon is driving interoperability mainly among smart home devices. The vendor’s Voice Interoperability Initiative (VII) now supports multiple IVAs and multimodal customer experiences. Its Universal Device Commands (UDCs) and Agent Transfers (ATs) support multiple simultaneous wake words, so customers can interact with different voice services via one device. When a person speaks a command, the system automatically transfers requests it can’t fulfill to another service.
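The agent-transfer idea can be sketched in a few lines. This is a hedged illustration of the concept, not Amazon's actual VII interfaces: the class names, skill sets, and transfer logic are all invented for the example.

```python
# Sketch of agent transfer: a device hosts several voice agents; if the
# active agent cannot fulfill a request, it hands the request to one
# that can. Illustrative only, not the real VII API.

class VoiceAgent:
    def __init__(self, name, skills):
        self.name = name
        self.skills = set(skills)

    def can_handle(self, request: str) -> bool:
        return request in self.skills

class Device:
    def __init__(self, agents):
        self.agents = agents

    def handle(self, active: VoiceAgent, request: str) -> str:
        if active.can_handle(request):
            return f"{active.name} handled '{request}'"
        for other in self.agents:          # transfer to a capable agent
            if other is not active and other.can_handle(request):
                return f"{active.name} transferred '{request}' to {other.name}"
        return f"no agent could handle '{request}'"

alexa = VoiceAgent("Alexa", {"play music", "shopping"})
nav = VoiceAgent("CarNav", {"navigate home"})
device = Device([alexa, nav])
print(device.handle(alexa, "navigate home"))
# → Alexa transferred 'navigate home' to CarNav
```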
The World Wide Web Consortium’s Voice Interaction Community Group constructed a general architecture for personal agents and interfaces among these agents. The group is concentrating on linking intelligent personal assistants (IPAs), which are available in two configurations. The first runs largely on smartphones, for instance Apple’s Siri, Google Assistant, and Samsung’s Bixby. The second runs on smart speakers, like Amazon’s Echo or Google Home.
In June 2020, the Linux Foundation formed the Open Voice Network (OVN), which sprouted from work done by the Massachusetts Institute of Technology (MIT) Auto-ID Laboratory, Capgemini Consulting, and Intel. That organization has been crafting standards so voice applications share dialogues, data, context, and controls.
In 2022, its work centered on improving the following five areas:
Component interchangeability. This work enables developers to mix and match different speech components—replacing speech recognition for English with one supporting another language, or one AI data model with a more powerful one. OVN is working with the W3C Voice Interaction Community Group in this area.
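Interchangeability boils down to components sharing a common interface. The sketch below illustrates the idea with a generic recognizer interface; the classes and stub transcripts are invented stand-ins, not real OVN or W3C interfaces.

```python
# Sketch of component interchangeability: speech recognizers behind a
# common interface, so a deployment can swap the English engine for a
# Spanish one without touching the rest of the pipeline.

from abc import ABC, abstractmethod

class SpeechRecognizer(ABC):
    @abstractmethod
    def transcribe(self, audio: bytes) -> str: ...

class EnglishRecognizer(SpeechRecognizer):
    def transcribe(self, audio: bytes) -> str:
        return "hello world"          # stub result for illustration

class SpanishRecognizer(SpeechRecognizer):
    def transcribe(self, audio: bytes) -> str:
        return "hola mundo"           # stub result for illustration

def pipeline(recognizer: SpeechRecognizer, audio: bytes) -> str:
    # The rest of the application depends only on the interface,
    # so either recognizer can be dropped in.
    return recognizer.transcribe(audio).upper()

print(pipeline(EnglishRecognizer(), b""))   # → HELLO WORLD
print(pipeline(SpanishRecognizer(), b""))   # → HOLA MUNDO
```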
Mediation. Here one conversational assistant acts as a user of another conversational assistant. Just as users call on experts to help them with complex problems, a conversational assistant confers with other conversational assistants. Possible use cases include the following:
• instant replay—replaying the conversation just prior to an interruption to remind users what they were doing;
• dictionary lookup—asking for clarification of a term;
• thesaurus lookup—asking for a better or an alternative term; and
• just-in-time help—asking how to perform a current task.
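The mediation pattern in the dictionary-lookup case can be sketched as follows. The assistant classes, message formats, and sample definition are hypothetical; the point is only that one assistant queries another the way a human user would.

```python
# Sketch of mediation: a primary assistant consults a dictionary
# assistant as if it were a user. All names here are illustrative.

class DictionaryAssistant:
    DEFINITIONS = {"latency": "the delay before a transfer of data begins"}

    def ask(self, query: str) -> str:
        return self.DEFINITIONS.get(query, "no definition found")

class PrimaryAssistant:
    def __init__(self, dictionary: DictionaryAssistant):
        self.dictionary = dictionary

    def handle(self, utterance: str) -> str:
        if utterance.startswith("define "):
            term = utterance.removeprefix("define ")
            # Mediation: the primary assistant acts as a *user*
            # of the dictionary assistant.
            return self.dictionary.ask(term)
        return "I can answer general questions."

assistant = PrimaryAssistant(DictionaryAssistant())
print(assistant.handle("define latency"))
# → the delay before a transfer of data begins
```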
Channeling. In this case, one software module modifies the voice characteristics of a voice assistant. For example, a channeling assistant slows the audio rate for non-native speakers or cognitively impaired speakers or increases the audio volume for the hearing impaired.
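A channeling module is essentially a transform on the assistant's audio output. The sketch below shows the shape of that transform; `AudioClip` and the specific rate and gain factors are hypothetical stand-ins for a real audio buffer and tuning values.

```python
# Sketch of channeling: a wrapper adjusts the voice characteristics of
# an underlying assistant's output. AudioClip is a hypothetical stand-in
# for a real audio buffer; the multipliers are illustrative.

from dataclasses import dataclass

@dataclass
class AudioClip:
    rate: float = 1.0      # playback speed multiplier
    volume: float = 1.0    # gain multiplier

def channel(clip: AudioClip, *, slow_for_clarity=False,
            boost_for_hearing=False) -> AudioClip:
    # Slow the audio for non-native or cognitively impaired listeners;
    # raise the volume for hearing-impaired listeners.
    rate = clip.rate * (0.75 if slow_for_clarity else 1.0)
    volume = clip.volume * (1.5 if boost_for_hearing else 1.0)
    return AudioClip(rate=rate, volume=volume)

out = channel(AudioClip(), slow_for_clarity=True, boost_for_hearing=True)
print(out)   # → AudioClip(rate=0.75, volume=1.5)
```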
Data sharing. Often, the user is burdened with reentering the same information into multiple voice agents. It would be more convenient for agents to share data when the user switches among them.
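One way to realize data sharing is a context store that all agents read and write. The store and agent classes below are illustrative, not a published OVN interface.

```python
# Sketch of data sharing: a shared context store lets one agent reuse
# information the user already gave to another, instead of asking again.

class SharedContext:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

class Agent:
    def __init__(self, name, context: SharedContext):
        self.name, self.context = name, context

    def collect_zip(self, spoken_zip=None):
        zip_code = self.context.get("zip_code")
        if zip_code is None:           # only ask if no agent has it yet
            zip_code = spoken_zip
            self.context.put("zip_code", zip_code)
        return zip_code

ctx = SharedContext()
weather = Agent("weather", ctx)
traffic = Agent("traffic", ctx)
weather.collect_zip("01776")           # user says it once
print(traffic.collect_zip())           # → 01776  (reused, not re-asked)
```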
Service automation. One form of reusability is automation: gluing voice agent fragments together to produce new voice agents. In this case, OVN is relying on work done at Stanford University. One example of constructing new capabilities, developed at the Open Voice Assistant Laboratory, is an open-source voice interoperability model in which users access multiple independent agents.
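The fragment-gluing idea can be illustrated with simple function composition over shared dialogue state. The fragments and the composition helper below are invented for the example and are not drawn from the Stanford or OVN work.

```python
# Sketch of service automation: small voice agent "fragments" glued into
# a new composite agent. Each fragment transforms shared dialogue state.

def greet(state):
    state["reply"] = "Welcome back."
    return state

def check_order(state):
    state["reply"] += f" Order {state['order_id']} ships tomorrow."
    return state

def compose(*fragments):
    """Glue fragments into one agent by running them in sequence."""
    def agent(state):
        for fragment in fragments:
            state = fragment(state)
        return state
    return agent

order_bot = compose(greet, check_order)    # a new agent from reusable parts
print(order_bot({"order_id": 1234})["reply"])
# → Welcome back. Order 1234 ships tomorrow.
```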
In 2022, a formidable new player emerged: OpenAI, an AI research and deployment company. Backed by more than $1 billion in early funding from Elon Musk and others, the ambitious venture combined that investment with open-source development to create ChatGPT. This large language model handles natural language processing tasks in the following four areas:
Text generation. ChatGPT generates humanlike text responses to prompts. Use cases include creating chatbots for customer service, generating responses to questions in online forums, and creating personalized content for social media posts.
Language translation. In this case, the user provides the model with a text prompt in one language, specifies the target language, and the model translates the text.
Text summarization of long documents or articles. This feature provides a quick overview of an article so users do not have to read the entire document.
Sentiment analysis. This feature helps users understand the overall tone and emotion of a piece of writing and can be helpful in detecting customer sentiment to improve customer satisfaction.
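All four task types are driven through the same text interface: the difference is the prompt. The sketch below shows that framing; the prompt templates are invented for illustration, and the commented-out call shows the general shape of the 2023-era `openai` chat API, which requires a valid API key to run.

```python
# Four LLM task types, one interface: each is just a differently framed
# text prompt. Templates here are illustrative, not canonical.

PROMPTS = {
    "generation": "Write a friendly reply to a customer asking about store hours.",
    "translation": "Translate into Spanish: 'Where is the station?'",
    "summarization": "Summarize in one sentence: {document}",
    "sentiment": "Label the sentiment (positive/negative/neutral): {review}",
}

def build_prompt(task: str, **fields) -> str:
    return PROMPTS[task].format(**fields)

prompt = build_prompt("sentiment", review="The support team was wonderful.")
print(prompt)

# Sending the prompt (requires `pip install openai` and an API key):
# import openai
# resp = openai.ChatCompletion.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user", "content": prompt}],
# )
# print(resp["choices"][0]["message"]["content"])
```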
The industry focus is now on developing standards so development becomes less monolithic, more heterogeneous, and more open. The goal is to move away from siloed, walled development niches to an open, interoperable worldwide voice web. The work is complex and touches on a wide range of areas. A lot of tasks remain, including development of the framework, prototyping, testing, and eventually releasing a set of commercial products. More groundwork is expected to aid momentum in 2023, but a market with easy-to-deploy, plug-and-play solutions will probably need more time before it fully takes root.
Paul Korzeniowski is a freelance writer who specializes in technology issues. He has been covering speech technology issues for more than two decades, is based in Sudbury, Mass., and can be reached at firstname.lastname@example.org or on Twitter @PaulKorzeniowski.