ASAPP Intros Pre-Trained Speech Models for Conversational AI

ASAPP has introduced pre-trained models for speech that it says dramatically improve the performance of conversational artificial intelligence systems that use them.

The free pre-trained models, available on Hugging Face, achieve a 13.5 percent reduction in word error rate and run at nearly twice the speed of Wav2vec in conversational AI systems that depend on automatic speech recognition, speaker identification, intent classification, and emotion recognition, the company claims.
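To make the word error rate figure concrete, the sketch below shows the arithmetic, assuming the 13.5 percent is a relative reduction (the drop in WER divided by the baseline WER). The function name and the baseline and improved WER values are hypothetical, chosen only for illustration; they are not results reported by ASAPP.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a percentage of the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# Hypothetical values for illustration only (not from ASAPP's results).
baseline = 10.0   # baseline WER, in percent
improved = 8.65   # WER after the improvement, in percent
reduction = relative_wer_reduction(baseline, improved)
print(f"{reduction:.1f} percent relative reduction")
```

Under this reading, a system that starts at a 10 percent WER would land at 8.65 percent, not at a WER 13.5 points lower.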

Wav2vec 2.0 (W2V2), arguably the most popular approach for self-supervised training in speech, contains many sub-optimal design choices in the model architecture that make it relatively inefficient, according to officials at ASAPP.

ASAPP instead proposes what it calls Squeezed and Efficient Wav2vec (SEW) and SEW-D (SEW with Disentangled attention). The larger SEW-D-base+ model needs only a quarter of the training epochs to outperform W2V2, significantly reducing pre-training costs, according to the company.

SEW differs from conventional W2V2 models in the following ways:

  • It introduces a compact waveform feature extractor that allocates computation more evenly across layers, making the model faster without sacrificing performance.
  • It adds a squeeze context network that downsamples the audio sequence, reducing computation and memory usage and allowing for a larger model without sacrificing inference speed.
  • It includes MLP predictor heads during pre-training that improve performance at no cost to downstream applications, because the heads are discarded after pre-training.
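The squeezing idea in the list above can be illustrated with a toy sketch. It is a simplified stand-in, not ASAPP's implementation: average pooling and frame repetition are assumptions made for illustration, the function names are hypothetical, and the features are a one-dimensional toy sequence. The point it shows is that self-attention cost grows with the square of the sequence length, so halving the number of frames cuts that cost to a quarter.

```python
def squeeze(frames, factor=2):
    """Average-pool consecutive frames to shorten the sequence."""
    return [
        sum(frames[i:i + factor]) / len(frames[i:i + factor])
        for i in range(0, len(frames), factor)
    ]


def unsqueeze(frames, factor, length):
    """Repeat each frame to restore the original sequence length."""
    restored = [value for value in frames for _ in range(factor)]
    return restored[:length]


def attention_cost(seq_len):
    """Self-attention compares every pair of frames, so cost grows as n^2."""
    return seq_len * seq_len


frames = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]  # toy 1-D features
squeezed = squeeze(frames, factor=2)          # half as many frames
restored = unsqueeze(squeezed, factor=2, length=len(frames))
cost_ratio = attention_cost(len(squeezed)) / attention_cost(len(frames))
print(len(frames), len(squeezed), cost_ratio)
```

In a real model the heavy context layers would run on the shortened sequence, which is where the savings in computation and memory would come from.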

SEW-D further replaces the normal self-attention with disentangled self-attention, which ASAPP says achieves better performance with half the number of parameters and a significant reduction in both inference time and memory footprint.

Conversational AI systems using the SEW pre-trained models will be better able to detect what consumers are saying, who is saying what, and how they feel, and will respond faster, ASAPP also asserts.

"The SEW speech models by ASAPP are faster and require less memory without sacrificing recognition quality," said Anton Lozhkov, machine learning engineer at Hugging Face, in a statement. "The architecture improvements proposed by the team are very easy to apply to other existing Wav2Vec-based models, essentially granting performance gains for free in applications such as automatic speech recognition, speaker identification, intent classification, and emotion recognition."
