Speech Technology Magazine

 

Google Aims to Improve Call and Video Transcription with New Cloud Speech-to-Text

Google introduces new voice and video transcription models.
Posted Apr 9, 2018

According to a statement from Dan Aharon, Product Manager for Cloud AI at Google, the company is “announcing the largest overhaul of Cloud Speech-to-Text (formerly known as Cloud Speech API) since it was introduced two years ago.”

Google introduced the Cloud Speech API in 2016, and Aharon says that, after nearly a year on the market, usage is more than doubling every six months. “Today, with the opening of NAB and SpeechTek conferences, we’re introducing new features and updates that we think will make Speech-to-Text much more useful for business, including phone-call and video transcription.”

Cloud Speech-to-Text now supports:

1. A selection of pre-built models for improved transcription accuracy from phone calls and video
2. Automatic punctuation, to improve readability of transcribed long-form audio
3. A new mechanism (recognition metadata) to tag and group your transcription workloads, and provide feedback to the Google team
4. A standard service level agreement (SLA) with a commitment to 99.9% availability

New Transcription Models

In this version of Cloud Speech-to-Text, Google has added models tailored to specific use cases, such as transcribing phone calls and audio from video. For phone calls, for example, Google routes incoming US English requests to a model that is optimized for phone audio and that many customers consider best-in-class in the industry. Now it is giving customers the power to explicitly choose the model they prefer rather than rely on automatic model selection.

Aharon says, “Most major cloud providers use speech data from incoming requests to improve their products. Here at Google Cloud, we’ve avoided this practice, but customers routinely request that we use real data that is representative of theirs, to improve our models. We want to meet this need, while being thoughtful about privacy and adhering to our data protection policies. That’s why today, we’re putting forth one of the industry’s first opt-in programs for data logging, and introducing a first model based on this data: enhanced phone_call.”

Google developed the enhanced phone_call model using data from customers who volunteered to share their data with Cloud Speech-to-Text for model enhancement purposes. Customers who choose to participate in the program going forward will gain access to this and other enhanced models that result from customer data. According to Google, the enhanced phone_call model has 54% fewer errors than its basic phone_call model on the company's phone call test set.

Additionally, Google is unveiling the video model, which has been optimized to process audio from videos and/or audio with multiple speakers. The video model uses machine learning technology similar to that used by YouTube captioning, and Google says it shows a 64% reduction in errors compared to the default model on a video test set.

Both the enhanced phone_call and premium-priced video models are now available for en-US transcription and will soon be available for additional languages. Google also continues to offer its existing models for voice command_and_search, as well as its default model for longform transcription.
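
As a rough illustration of what explicit model selection might look like in practice, the sketch below assembles a recognition request body as a plain Python dictionary. The field names (model, useEnhanced, enableAutomaticPunctuation, languageCode) are assumed to match the service's REST configuration message, and the helper function and storage path are hypothetical. Only the model names come from the announcement, and the sketch does not contact the service.

```python
import json

# Model identifiers named in the article.
KNOWN_MODELS = {"default", "command_and_search", "phone_call", "video"}


def build_recognize_request(audio_uri, model="default", use_enhanced=False,
                            punctuation=False, language_code="en-US"):
    """Build a JSON-serializable request body that names a model explicitly.

    Hypothetical helper: the field names are assumed to follow the REST
    configuration message and should be checked against the API reference.
    """
    if model not in KNOWN_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    config = {
        "languageCode": language_code,
        "model": model,  # explicit choice instead of automatic selection
    }
    if use_enhanced:
        # Assumed flag for the opt-in enhanced variant of a model.
        config["useEnhanced"] = True
    if punctuation:
        # Assumed flag for automatic punctuation of the transcript.
        config["enableAutomaticPunctuation"] = True
    return {"config": config, "audio": {"uri": audio_uri}}


# Hypothetical Cloud Storage path, used only for illustration.
request = build_recognize_request(
    "gs://example-bucket/support-call.wav",
    model="phone_call",
    use_enhanced=True,
    punctuation=True,
)
print(json.dumps(request, indent=2))
```

Leaving the model argument at "default" corresponds to the general longform transcription model described above, while "video" would select the premium-priced video model.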

A demo on the product website lets users upload an audio file and see transcription results from each of these models.
