The State of Speech Development Platforms
The speech application development market has been driven largely by consumer products. Recently, it has become easier for companies to build their own speech solutions, though this segment of the market is not yet as well defined as businesses would like.
To date, vendors have focused largely on creating consumer speech applications. “Alexa has more than 100,000 skills, but very few of them are for serious business use cases,” explains Deborah Dahl, principal at Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interactions Working Group. “Many skills are student projects and various types of experiments. Not as much energy has been put into constructing industrial-grade applications.”
Year in Review
Indeed, top vendors such as Amazon, Apple, and Google forged their businesses by creating consumer solutions. In 2020, an enterprise platform emerged. With Nuance Communications’ Mix, firms can now construct their own enterprise intelligent assistants.
The solution includes a number of components that enable organizations to create speech applications that integrate with enterprise software via application programming interfaces (APIs). With Mix.nlu, a custom natural language understanding (NLU) system, companies author speech models that are deployed from the Mix Project Dashboard. Mix’s automatic speech recognition (ASR) functions are powered by Krypton, a real-time speech-to-text engine for transcribing audio. Krypton uses domain language models and word sets to customize recognition for specific environments.
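At its core, an NLU component like Mix.nlu turns a free-form utterance into a structured intent plus entities that back-end enterprise software can act on. The following is a deliberately simplified, hypothetical sketch of that mapping using plain keyword rules; it is not Nuance’s API, and the intent names and patterns are invented for illustration only.

```python
# Toy illustration of what an NLU component does: free-form text in,
# a structured intent plus entities out. Hypothetical example only --
# real NLU systems use trained statistical models, not keyword rules.
import re

INTENT_PATTERNS = {
    "check_balance": re.compile(r"\b(balance|how much)\b", re.I),
    "transfer_funds": re.compile(r"\b(transfer|send|move)\b", re.I),
}

AMOUNT = re.compile(r"\$?(\d+(?:\.\d{2})?)")

def parse(utterance: str) -> dict:
    """Return the first matching intent and any dollar amount found."""
    intent = next(
        (name for name, pat in INTENT_PATTERNS.items() if pat.search(utterance)),
        "fallback",
    )
    entities = {}
    match = AMOUNT.search(utterance)
    if match:
        entities["amount"] = float(match.group(1))
    return {"intent": intent, "entities": entities}

print(parse("Please transfer $250 to savings"))
```

In a production platform, the keyword table above is replaced by a model trained on sample utterances, but the contract — utterance in, intent and entities out — is the same structure that downstream enterprise APIs consume.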
“Nuance Mix is really similar to Alexa Skills Kit in the capabilities that it offers third parties who want to build speech applications,” Dahl points out. Because the Nuance solution has only just begun shipping, it is well behind the larger, more established consumer platforms in the number and richness of available skills.
Most enterprise developers have worked with traditional text applications and need to become familiar with the functionality that is available in voice APIs. Nuance seems to be aware of that need. “Nuance created very polished and helpful training materials,” Dahl says. A video walks newbies through the development process, and best practices and tips provide additional guidance.
Suppliers also continued to tune their speech engines. In October, Artificial Solutions, for example, updated its Teneo language, which features the Teneo NLU Ontology and Semantic Network and maps language to sounds.
Teneo also now applies syntactical conditions, such as understanding when a word is used as a noun or verb in a sentence. Additional conversational modules deliver prebuilt solutions with back-end integration for common dialogues, such as a live chat handover or booking a meeting room.
The product continues the conversation even when users have gone silent; maintains a personality that aligns with companies’ brand values; and keeps momentum going even when users go off topic.
Another development platform provider, Voiceitt, first built its speech recognition algorithms and voice database by working with people who have atypical speech patterns. In December, the vendor announced an integration that makes Alexa accessible to people with speech disabilities. The Voiceitt mobile app applies machine learning and speech recognition technologies to help people with speech disabilities stemming from strokes, degenerative diseases, or developmental disorders communicate.
The announcement followed a successful pilot with Inglis House, a long-term-care wheelchair community. The two developed an application to help participants with cerebral palsy use their voices to independently perform common tasks, such as controlling channels on their TVs or playing music.
In general, though, speech solutions have largely evolved in isolation, even though enterprises would like to interconnect them.
In August, Genesys enhanced Engage, its cloud contact center speech solution, so it runs in multicloud deployments.
Engage’s containerized architecture supports private, public, or hybrid cloud deployments. The product works with leading infrastructure-as-a-service providers, such as Amazon Web Services (AWS), Google Cloud, and Microsoft Azure. With it, organizations can move their software among different clouds or use multiple providers to address varying system requirements, geographic needs, or data sovereignty compliance regulations.
A Look Ahead
Until now, speech development platforms have been largely proprietary solutions promoted by large, successful technology companies. A lot of products are available, but they have different goals, design foundations, and interfaces, and they rarely work outside of their tightly knit ecosystems.
Consequently, businesses cannot easily carry work done on one platform over to a similar application running on another. Slowly, the industry is moving toward standard, open systems in a few different ways.
One area of emphasis is on open-source solutions. Rasa, a startup that raised $40 million in venture capital, developed an open-source speech development platform. The vendor provides the infrastructure and tools that developers use to create chatbots, voice applications, and conversational services.
Rasa offers three products in its conversational AI suite. Released in 2019, Rasa Open Source is a framework for building conversational AI software. Rasa X is a free toolset that helps developers build intelligent voice assistants on top of Rasa Open Source. Rasa Enterprise provides an enterprise-grade IVA development platform.
Open-source software has strong and weak points, according to Dahl. Price is always a deployment consideration, and open-source solutions are almost always available for free. These products are also pliable; people can adapt them however they like. And they are responsive: with do-it-yourself toolkits, businesses can change software immediately rather than waiting for their suppliers to add needed functionality.
But open-source solutions have their limitations too. Core upgrades often take a while because they require consensus from the community, which might have widely different views on how to improve functionality. Usually, these systems are complex, and many businesses lack the expertise needed to deploy and maintain them. If problems arise, users usually cannot pick up the phone and get technical support, as they can with commercial systems.
Another emerging trend is a push toward industry standards, which also make it simpler for organizations to build and connect speech software. The Open Voice Network (OVN) emerged from research conducted by the Massachusetts Institute of Technology (MIT) Auto-ID Laboratory, Capgemini Consulting, and Intel in the summer of 2016. “We recognized that voice user interfaces had the potential to reshape how humans interact with computer systems,” explains Jon Stine, executive director of the Open Voice Network.
OVN, which is a directed fund of the Linux Foundation, launched last spring. Currently, about a dozen enterprises and more than 150 designers, developers, and strategists have been examining ways to create a common speech software architecture. They have four goals: be secure; offer user, ecosystem, and architectural choice; be inclusive and accessible; and support open software and hardware while still enabling commercial differentiation.
The group plans to address one market limitation. “Currently, enterprises have no way to register their speech work,” Stine points out. “There is no DNS [Domain Name System] for speech.”
OVN has begun examining how to build a database and processes so companies can register items, like their names. With such a registry, consumers could differentiate between, say, the Delta Air Lines and Delta Dental speech skills.
To date, consumer applications have driven the speech development market. New enterprise-focused platforms have begun to arrive, and the movement toward open-source and standards-based systems has the potential to make it easier for companies to create business-quality speech applications in 2021.
Paul Korzeniowski is a freelance writer who specializes in technology issues. He has been covering speech technology issues for more than two decades, is based in Sudbury, Mass., and can be reached at firstname.lastname@example.org or on Twitter @PaulKorzeniowski.