The State of Speech Developer Platforms
Speech vendors are forging new paths to extend the use of their development tools. Enhanced modeling accuracy, improved back-end integrations, extensions to other interfaces, and more open systems were areas of emphasis in 2019. On the docket for 2020 are the development of standard interfaces and increased portability.
The Year in Review
In building their solutions, speech vendors focused on providing users with an intuitive interface to company applications. Amazon’s Alexa Presentation Language, which was announced in June, is one such toolset for user interface design. It includes a skill personalization feature that lets developers tailor skills using voice profiles captured by Alexa applications. Skills can use the voice profiles to apply preferences, remember settings, and differentiate among users.
Improving speech recognition is an area of continued interest. In October, Amazon added a trio of new tools to its Alexa Skills Kit, a development toolkit that helps companies build self-service applications. Two of the features, the Natural Language Understanding (NLU) Evaluation Tool and Utterance Conflict Detection, are designed to enhance voice model accuracy. The former tests batches of utterances and compares how voice applications’ natural language processing (NLP) models interpret them against the expected results. To improve result quality, the NLU Evaluation Tool relies on the commands that consumers typically say instead of sample utterances built by an interaction model. As a result, the tool pinpoints where a model needs more training by identifying problematic utterances. It also supports regression testing, allowing developers to create and run evaluations after adding new features to their voice apps.
The NLU Evaluation Tool can also run its measurements against anonymized, frequently spoken live utterances drawn from production data, an approach designed to help developers check the accuracy of any changes made to the voice model.
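The batch comparison described above can be illustrated with a minimal Python sketch. It is not Amazon’s tool or API: the names UtteranceCase, evaluate, and toy_model are hypothetical, and the keyword-based model is only a stand-in for a real NLU model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UtteranceCase:
    """One test utterance and the intent a developer expects (hypothetical type)."""
    utterance: str
    expected_intent: str


@dataclass
class Failure:
    """An utterance whose resolved intent differed from the expectation."""
    utterance: str
    expected_intent: str
    actual_intent: str


def evaluate(model: Callable[[str], str], cases: List[UtteranceCase]) -> List[Failure]:
    """Run every case through the model and collect the mismatches."""
    failures = []
    for case in cases:
        actual = model(case.utterance)
        if actual != case.expected_intent:
            failures.append(Failure(case.utterance, case.expected_intent, actual))
    return failures


def toy_model(utterance: str) -> str:
    """Assumed keyword matcher standing in for a real NLU model."""
    text = utterance.lower()
    if "weather" in text:
        return "GetWeatherIntent"
    if "play" in text:
        return "PlayMusicIntent"
    return "FallbackIntent"


cases = [
    UtteranceCase("what's the weather today", "GetWeatherIntent"),
    UtteranceCase("play some jazz", "PlayMusicIntent"),
    UtteranceCase("put on some jazz", "PlayMusicIntent"),
]

failures = evaluate(toy_model, cases)
accuracy = 1 - len(failures) / len(cases)
for f in failures:
    print(f"{f.utterance!r}: expected {f.expected_intent}, got {f.actual_intent}")
```

Rerunning the same batch after each model change gives a simple form of regression testing: a new entry in the failure list points to an utterance the latest change broke.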
The Utterance Conflict Detection feature detects utterances that are accidentally mapped to multiple intents, one factor that can reduce NLP model accuracy. The feature automatically runs when each model is built and can be used either prior to publishing the first version of an app or as intents are added over time.
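The idea behind conflict detection can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Amazon’s implementation: the interaction model is represented as a plain dictionary of intent names to sample utterances, and find_conflicts and normalize are hypothetical helpers that only catch exact matches after case and whitespace normalization.

```python
from collections import defaultdict
from typing import Dict, List, Set


def normalize(utterance: str) -> str:
    """Lowercase and collapse whitespace so trivially different samples match."""
    return " ".join(utterance.lower().split())


def find_conflicts(model: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Return each normalized utterance that is mapped to more than one intent."""
    intents_by_utterance: Dict[str, Set[str]] = defaultdict(set)
    for intent, samples in model.items():
        for sample in samples:
            intents_by_utterance[normalize(sample)].add(intent)
    return {
        utterance: intents
        for utterance, intents in intents_by_utterance.items()
        if len(intents) > 1
    }


# Hypothetical interaction model with one accidental overlap.
interaction_model = {
    "OrderPizzaIntent": ["order a pizza", "get me food"],
    "OrderTaxiIntent": ["order a taxi", "Get me  food"],
}

conflicts = find_conflicts(interaction_model)
for utterance, intents in conflicts.items():
    print(f"{utterance!r} is mapped to: {sorted(intents)}")
```

Running a check like this on every model build surfaces overlaps as soon as a new intent introduces them.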
In addition to front-end development, integration with back-office business applications was highlighted throughout the past few months. In September, Nuance Communications extended the capabilities of the Nuance Intelligent Engagement Platform, which adds speech capabilities to marketing business processes. The development environment now has interfaces that connect to:
• Messaging Services, so companies can automate and improve human-assisted customer engagements across multiple channels;
• Agent AI Services, designed to provide agents and supervisors with relevant, real-time customer information;
• Security and Biometrics Services, to improve authentication and prevent fraud; and
• back-end integration, so the platform works with third-party cognitive engines and data sources that deliver needed information.
Data analytics has been another area of interest among third-party developers. Amazon added a Get Metrics API that works with third-party data aggregation platforms and allows developers to evaluate various metrics, like unique customers. It also supports the creation of monitors, alarms, and dashboards that spotlight changes that could impact customer engagement.
Speech development platforms have traditionally had varying levels of openness. Apple, for example, has largely focused on tying its systems to its own solutions and made it challenging for developers to use alternatives. In October, the vendor opened its system by allowing Siri to use third-party apps. Users can invoke third-party apps, such as WhatsApp, in place of Apple solutions, like its own Messages application. The third parties, however, will need to add that capability to their software.
A Look Ahead
Portability has been a long-standing challenge for speech developers. “Voice application developers find that they have to rewrite large portions of their software whenever they move [it] from one speech engine to another,” notes Deborah Dahl, principal at Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group.
Portability touches on many issues. Legacy systems were built to run on servers in data centers, while many new systems have a cloud-first design. Moving software from one to the other is a complex undertaking. Nuance’s Intelligent Engagement Platform includes cloud-agnostic flexibility, allowing organizations to deploy the same solutions across Nuance’s hosted cloud as well as public and private clouds.
Device support is another area of emphasis. “We’ll continue to see voice become the new interface with more and more devices becoming voice-enabled,” says Tony Lorentzen, senior vice president of omnichannel solutions at Nuance.
These voice-enabled devices must then be integrated with legacy solutions. Amazon’s Alexa Presentation Language enables developers to create Alexa skills for devices with screens, such as desktops and laptops.
Another portability issue is migrating software from one system to another. Historically, the market lacked standard interfaces, so enterprises and third parties had to complete common work, such as allocating storage, each time they worked with a different speech engine.
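To illustrate why a shared interface reduces this rework, here is a minimal Python sketch. Everything in it is hypothetical: SpeechEngine, EngineA, EngineB, and handle_request are invented for illustration and do not correspond to any vendor’s SDK or to a published standard. The point is only that application code written against one common interface does not change when the engine behind it does.

```python
from abc import ABC, abstractmethod


class SpeechEngine(ABC):
    """Hypothetical common interface that application code depends on."""

    @abstractmethod
    def transcribe(self, audio: bytes) -> str:
        """Turn raw audio into text."""


class EngineA(SpeechEngine):
    """Stand-in adapter for one vendor's engine (no real SDK is called)."""

    def transcribe(self, audio: bytes) -> str:
        return f"[engine-a] {len(audio)} bytes"


class EngineB(SpeechEngine):
    """Stand-in adapter for a second vendor's engine."""

    def transcribe(self, audio: bytes) -> str:
        return f"[engine-b] {len(audio)} bytes"


def handle_request(engine: SpeechEngine, audio: bytes) -> str:
    """Application logic written once against the shared interface."""
    return engine.transcribe(audio).upper()


# Swapping engines requires no change to handle_request.
print(handle_request(EngineA(), b"abc"))
print(handle_request(EngineB(), b"abc"))
```

Without an agreed interface, each vendor-specific call would sit directly inside the application logic, which is the rewriting burden described above.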
In October, Nvidia unveiled Jarvis, a multimodal AI software development kit that combines speech, vision, and other sensors in one system. The tool supports workflows for building, training, and deploying GPU-accelerated artificial intelligence systems that can combine visual cues, such as gestures and eye movement, along with speech to establish context.
In September, Amazon led the formation of the Voice Interoperability Initiative, an ad hoc consortium to create standard voice development interfaces. The group set the following four goals:
• developing voice services that work with other solutions, while protecting the privacy and security of customers;
• building voice-enabled devices that promote choice and flexibility through multiple, simultaneous wake words;
• releasing technologies and solutions that make it easier to integrate multiple voice services on a single product; and
• accelerating machine learning and conversational AI research to improve the breadth, quality, and interoperability of voice services.
More than 30 companies, including Baidu, Microsoft, Salesforce.com, and Verizon, support the effort. Apple and Google are notable absentees. The group’s first results are expected to arrive in 2020.
Paul Korzeniowski is a freelance writer who specializes in technology issues. He has been covering speech technology issues for more than two decades, is based in Sudbury, Mass., and can be reached at firstname.lastname@example.org or on Twitter @PaulKorzeniowski.