Microsoft Advances Speech APIs
Microsoft has announced the general availability of its Custom Speech Service and the Bing Speech API, moving both out of limited public preview and beta testing.
Both are part of Microsoft’s Cognitive Services portfolio of artificial intelligence offerings.
“Microsoft is interested in moving technology into the hands of our customers; speech recognition is part of that pursuit,” says Xuedong Huang, chief speech scientist and technical fellow in the Microsoft AI and Research group. “It’s very easy to use.”
According to Microsoft, more than 424,000 developers across 60 countries have already tried Cognitive Services. The company now offers 25 Cognitive Services, up from just four nearly two years ago.
These tools will enable developers to add the same machine intelligence that powers Microsoft’s Skype Translator, Bing search engine, and Cortana virtual assistant into third-party applications that people use every day.
“Cognitive Services is about taking all of the machine learning and AI smarts that we have in this company and exposing them to developers through easy-to-use APIs so that they don’t have to invent the technology themselves,” said Mike Seltzer, a principal researcher in the Speech and Dialog Research Group at Microsoft’s research lab in Redmond, Wash., in a statement. “In most cases, it takes a ton of time, a ton of data, a ton of expertise, and a ton of computing to build a state-of-the-art machine-learned model.”
Microsoft has spent more than a decade developing speech recognition technology that performs robustly in noisy environments and with the jargon, dialects, and accents of specific user groups. That technology is now available to developers of third-party applications through the Custom Speech Service.
In a blog post, Microsoft engineers explained that the Custom Speech Service lets users customize Microsoft’s speech-to-text engine. By uploading text and/or speech data to the Custom Speech Service, users can create custom models that can be combined with Microsoft’s speech models and deployed to a custom speech-to-text endpoint, accessible from any device.
For applications containing particular vocabulary items, such as product names or jargon that rarely occur in typical speech, users can improve recognition accuracy by customizing the language model. For example, an app to assist automotive mechanics might feature terms like “powertrain,” “catalytic converter,” or “limited slip differential” far more often than typical voice applications do. Customizing the language model teaches the system to expect these terms.
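To make the idea concrete, a language-model adaptation corpus for the mechanics example might simply be a plain-text file of representative domain utterances. This is an illustrative sketch only — the filename is invented, and the exact file format the Custom Speech Service accepts is defined in its documentation:

```python
# Illustrative: collect representative in-domain sentences so the
# customized language model sees "powertrain"-style vocabulary far
# more often than a general-purpose model would.
domain_sentences = [
    "Check the powertrain control module.",
    "Replace the catalytic converter.",
    "Inspect the limited slip differential.",
]

# Write one utterance per line to a UTF-8 text file for upload
# (the filename here is hypothetical).
with open("automotive_corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(domain_sentences) + "\n")
```

A developer would then upload such text data through the Custom Speech Service to adapt the language model before deployment.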
Similarly, customizing the acoustic model can enable the system to do a better job recognizing speech in particular environments or from particular user populations. A voice-enabled app for use in a warehouse or factory, for example, might require a custom acoustic model that can better isolate speech from all of the background noise.
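Once a custom model is deployed, the resulting speech-to-text endpoint is reached over HTTPS. The sketch below is not Microsoft’s official SDK: the endpoint URL is a placeholder issued per deployment, and the header names follow the general Cognitive Services REST pattern (a subscription key in an `Ocp-Apim-Subscription-Key` header, WAV audio in the request body), which should be checked against the service documentation:

```python
import urllib.request

def build_recognition_request(endpoint_url: str, subscription_key: str,
                              wav_bytes: bytes) -> urllib.request.Request:
    """Build a POST request carrying WAV audio to a deployed custom
    speech-to-text endpoint (illustrative, not an official SDK)."""
    headers = {
        # Cognitive Services authenticate with a subscription key header.
        "Ocp-Apim-Subscription-Key": subscription_key,
        # 16 kHz, 16-bit mono PCM is a commonly accepted input format.
        "Content-Type": "audio/wav; codec=audio/pcm; samplerate=16000",
    }
    return urllib.request.Request(endpoint_url, data=wav_bytes,
                                  headers=headers, method="POST")

# Usage sketch — the URL and key below are placeholders; real values
# are issued when the custom model is deployed:
req = build_recognition_request(
    "https://example-custom-endpoint.api.cognitive.microsoft.com/speech",
    "YOUR_SUBSCRIPTION_KEY",
    b"...wav bytes...")
# urllib.request.urlopen(req) would then return the transcription.
```

Because the endpoint is plain HTTPS, it is accessible from any device, as the blog post notes.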
The Bing Speech API converts audio into text, understands intent, and converts text back to speech.
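For the text-to-speech direction, such services typically accept input wrapped in SSML, the W3C Speech Synthesis Markup Language. The helper below is a minimal sketch of building that payload; the voice name is a placeholder, since real voice names are listed in the service documentation:

```python
def build_ssml(text: str, voice: str, lang: str = "en-US") -> str:
    """Wrap plain text in a minimal SSML envelope (SSML itself is a
    W3C standard; this is an illustrative payload, not an SDK call)."""
    return (
        f"<speak version='1.0' xml:lang='{lang}'>"
        f"<voice xml:lang='{lang}' name='{voice}'>{text}</voice>"
        "</speak>"
    )

# "ExampleVoiceName" is hypothetical; substitute a documented voice.
ssml = build_ssml("Your order has shipped.", "ExampleVoiceName")
```

The SSML string would be posted to the text-to-speech endpoint, which returns synthesized audio.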
The entire collection of Cognitive Services stems from a drive within Microsoft to make its artificial intelligence and machine learning expertise widely accessible to developers to create delightful and empowering experiences for end users, said Andrew Shuman, corporate vice president of products for Microsoft’s AI and research organization, in a statement.
“Being able to have software now that observes people, listens, reacts, and is knowledgeable about the physical world around them provides an excellent breakthrough in terms of making interfaces more human, more natural, more easy to understand, and thus far more impactful in lots of different scenarios.”