NVIDIA Releases Open Dataset, Models for Multilingual Speech AI

NVIDIA has released a dataset and models that support the development of high-quality speech recognition and translation artificial intelligence for 25 European languages, including several with limited available data, like Croatian, Estonian, and Maltese.

These tools will enable developers to scale AI applications to support global users with speech technology for use cases such as multilingual chatbots, customer service voice agents, and near-real-time translation services. They include the following:

  • Granary, a massive, open-source corpus of multilingual speech datasets that contains around a million hours of audio, including nearly 650,000 hours for speech recognition and more than 350,000 hours for speech translation.
  • NVIDIA Canary-1b-v2, a billion-parameter model trained on Granary for transcription of European languages, plus translation between English and two dozen supported languages.
  • NVIDIA Parakeet-tdt-0.6b-v3, a streamlined, 600-million-parameter model designed for real-time or large-volume transcription of Granary’s supported languages.

To develop the Granary dataset, the NVIDIA speech AI team collaborated with researchers from Carnegie Mellon University and Fondazione Bruno Kessler. The team passed unlabeled audio through an innovative processing pipeline, powered by the NVIDIA NeMo Speech Data Processor toolkit, that turned it into structured, high-quality data. This pipeline allowed the researchers to convert public speech data into a format usable for AI training.

With Granary's clean, ready-to-use data, developers can get a head start building models that tackle transcription and translation tasks in nearly all of the European Union's 24 official languages, plus Russian and Ukrainian. For European languages underrepresented in human-annotated datasets, Granary provides a critical resource to develop more inclusive speech technologies.

The new Canary and Parakeet models offer examples of the kinds of models developers can build with Granary, customized to their target applications. Canary-1b-v2 is optimized for accuracy on complex tasks, while Parakeet-tdt-0.6b-v3 is designed for high-speed, low-latency tasks. Canary-1b-v2, available under a permissive license, expands the Canary family's supported languages from four to 25. It offers transcription and translation quality comparable to models three times larger while running inference up to 10 times faster.

Parakeet-tdt-0.6b-v3 prioritizes high throughput and is capable of transcribing 24-minute audio segments in a single inference pass. The model automatically detects the input audio language and transcribes without additional prompting steps.
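For recordings longer than 24 minutes, one simple approach is to cut the audio into segments that each fit in a single pass. The article does not say how longer audio should be handled, so the Python sketch below is only an illustration: the function name `split_into_segments` and the fixed-length splitting strategy are assumptions, and the code computes time offsets without calling any speech model.

```python
from typing import List, Tuple

# 24-minute single-pass limit quoted in the article, in seconds.
MAX_SEGMENT_SECONDS = 24 * 60


def split_into_segments(
    total_seconds: float, max_seconds: float = MAX_SEGMENT_SECONDS
) -> List[Tuple[float, float]]:
    """Return (start, end) offsets in seconds that cover the whole recording.

    Hypothetical helper: every segment is at most `max_seconds` long.
    """
    if total_seconds < 0:
        raise ValueError("total_seconds must be non-negative")
    if max_seconds <= 0:
        raise ValueError("max_seconds must be positive")

    segments: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_seconds:
        # Clamp the final segment to the end of the recording.
        end = min(start + max_seconds, total_seconds)
        segments.append((start, end))
        start = end
    return segments


if __name__ == "__main__":
    # A hypothetical 60-minute recording is covered by three segments.
    for seg_start, seg_end in split_into_segments(60 * 60):
        print(f"{seg_start:7.1f}s -> {seg_end:7.1f}s")
```

Fixed-length cuts can land in the middle of a word, so a production pipeline would more likely split on silence or overlap neighboring segments; this sketch leaves that out.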

Both Canary and Parakeet models provide accurate punctuation, capitalization, and word-level timestamps in their outputs.

NVIDIA NeMo, a modular software suite for managing the AI agent lifecycle, accelerated speech AI model development. NeMo Curator, part of the software suite, enabled the team to filter out synthetic examples from the source data so that only high-quality samples were used for model training. The team also harnessed the NeMo Speech Data Processor toolkit for tasks like aligning transcripts with audio files and converting data into the required formats.
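The filter-then-convert step described above can be pictured with a small, library-free Python sketch. It does not use NeMo Curator or the Speech Data Processor, and it is not the team's actual pipeline: the `is_synthetic` flag, the `min_duration` threshold, and the function name are hypothetical. The output is a JSON-lines manifest with `audio_filepath`, `duration`, and `text` fields, a layout commonly used by NeMo speech tooling; the article itself does not specify which formats were required.

```python
import json
from typing import Dict, Iterable, List


def build_manifest_lines(
    samples: Iterable[Dict], min_duration: float = 0.1
) -> List[str]:
    """Drop unwanted samples and serialize the rest as JSON lines.

    Assumptions (not from the article): each sample is a dict with
    'audio_filepath', 'duration', 'text', and an optional, hypothetical
    'is_synthetic' flag.
    """
    lines: List[str] = []
    for sample in samples:
        # Skip samples flagged as synthetic (hypothetical flag).
        if sample.get("is_synthetic", False):
            continue
        text = sample.get("text", "").strip()
        # Skip samples with no transcript or with very short audio.
        if not text or sample["duration"] < min_duration:
            continue
        record = {
            "audio_filepath": sample["audio_filepath"],
            "duration": sample["duration"],
            "text": text,
        }
        # ensure_ascii=False keeps non-ASCII characters readable.
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


if __name__ == "__main__":
    demo = [
        {"audio_filepath": "a.wav", "duration": 3.2, "text": "Dobar dan."},
        {"audio_filepath": "b.wav", "duration": 2.0, "text": "Tere!", "is_synthetic": True},
        {"audio_filepath": "c.wav", "duration": 1.5, "text": "   "},
    ]
    print("\n".join(build_manifest_lines(demo)))
```

Only the first demo sample survives: the second is flagged as synthetic and the third has an empty transcript.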