Bloomberg AI Researchers Refine OpenAI's Whisper

Researchers from Bloomberg's AI Engineering group, working with researchers from the WeNet Open Source Community, have refined OpenAI's Whisper speech-to-text AI system to handle live audio streams with minimal accuracy degradation.

Since Whisper is designed to work on whole recordings rather than live, real-time audio, it usually struggles to transcribe meetings or phone calls as they happen, the researchers found. To overcome this, they added a second, quick-listen system (a CTC decoder) that produces fast, partial transcripts in real time, then used Whisper's original careful-listen system to clean them up whenever a pause is detected. They also gave the quick-listen component a smaller set of word pieces, which made it faster and better at guessing unusual words.
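To make that two-pass loop concrete, here is a minimal Python sketch: a fast first pass emits a draft after every audio chunk, and a pause (detected here by a simple energy threshold) triggers the slower cleanup pass. Every name below, and both stubbed decoders, are illustrative assumptions, not the researchers' actual code.

from dataclasses import dataclass, field
import random


@dataclass
class StreamingTranscriber:
    """Buffers audio chunks, emits fast first-pass drafts, and finalizes
    a segment with the slower second pass when a pause is detected."""
    silence_threshold: float = 0.01  # RMS energy below this counts as silence
    pause_chunks: int = 3            # consecutive silent chunks that end a segment
    _buffer: list = field(default_factory=list)
    _silent_run: int = 0

    def ctc_partial(self, chunk):
        # First pass: a lightweight, causal CTC decoder would run here,
        # returning a draft transcript for the audio seen so far.
        return "<ctc draft>"

    def attention_rescore(self, segment):
        # Second pass: Whisper's attention decoder would rescore the CTC
        # hypotheses for the finished segment.
        return "<final transcript for %d chunks>" % len(segment)

    def _rms(self, chunk):
        return (sum(x * x for x in chunk) / len(chunk)) ** 0.5

    def feed(self, chunk):
        """Feed one audio chunk; yields ('partial' | 'final', text) events."""
        self._buffer.append(chunk)
        yield "partial", self.ctc_partial(chunk)
        if self._rms(chunk) < self.silence_threshold:
            self._silent_run += 1
        else:
            self._silent_run = 0
        if self._silent_run >= self.pause_chunks:
            segment, self._buffer = self._buffer, []
            self._silent_run = 0
            yield "final", self.attention_rescore(segment)


# Toy driver: six "speech" chunks followed by silence, which trips the
# pause detector and triggers the finalizing second pass.
transcriber = StreamingTranscriber()
for i in range(10):
    chunk = [random.uniform(-0.5, 0.5) if i < 6 else 0.0 for _ in range(160)]
    for kind, text in transcriber.feed(chunk):
        print(kind, text)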

Testing on company earnings calls and public speech datasets demonstrated that the new version can keep up with speech in real time, even on regular CPUs, while still delivering accurate, well-formatted transcripts.

Haoran Zhou, lead researcher on the project, said, "The result is close to Whisper's transcription quality but now runs in real time on CPUs with a clear, tunable delay.

"Our work pushes [automatic speech recognition] forward by taking Whisper...and converting it into a true streaming model that delivers near-offline accuracy with low, predictable delay using standard CPUs. We do this by embedding Whisper in a Unified Two-Pass (U2) framework, in which a lightweight, causally masked CTC decoder emits draft transcripts as audio arrives, and the original attention decoder then rescores them for high quality. We further introduce a “hybrid” tokenizer that shrinks the CTC token set for data-efficient fine-tuning while retaining Whisper’s full vocabulary for reranking. This is the first work that turns Whisper into a true streaming model," he added.