
Nari Labs Launches Dia TTS Model


Nari Labs, a two-person speech startup, has released Dia, a 1.6-billion-parameter open-source text-to-speech model for high-fidelity speech synthesis.

Dia generates multi-character conversations from text, using a transformer-based architecture that balances expressive prosody modeling with computational efficiency. It supports zero-shot voice cloning, replicating a speaker's voice from a short reference audio clip, and can synthesize non-verbal vocalizations such as coughing, throat clearing, and laughter. Optimized inference pipelines also allow real-time synthesis.

Emotional tone, speaker tagging, and non-verbal audio cues are all driven from the plain-text input.
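As a rough illustration of what a tagged plain-text script might look like, the Python sketch below assembles a two-speaker dialogue. The tag conventions used here (`[S1]` and `[S2]` for speakers, parenthesized cues such as `(laughs)` for non-verbal sounds) are assumptions not stated in this article, and `build_script` is a hypothetical helper, not part of Dia. Check the project's repository for the exact input format.

```python
from typing import Iterable, Tuple


def build_script(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker_number, text) turns into one tagged plain-text script.

    Assumed conventions (not confirmed by the article): speakers are marked
    "[S1]", "[S2]", ... and non-verbal cues are written inline in parentheses.
    """
    parts = []
    for speaker, text in turns:
        if speaker < 1:
            raise ValueError("speaker numbers start at 1")
        # Prefix each turn with its speaker tag and trim stray whitespace.
        parts.append(f"[S{speaker}] {text.strip()}")
    return " ".join(parts)


script = build_script([
    (1, "Did you hear the model is open source?"),
    (2, "I did. (laughs) I already cloned the repo."),
])
print(script)
```

The resulting string is what would be handed to the model as input text. Keeping the script as a single plain string means dialogue can be written or generated without any special markup tooling.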

Dia is available under an open-source Apache 2.0 license for commercial and academic use. Developers can fine-tune the model, adapt its outputs, or integrate it into larger voice-based systems. The training and inference pipeline is written in Python and integrates with standard audio processing libraries.

The model weights are available via Hugging Face and GitHub, and the repository documents a setup process for inference, including examples of text-to-audio generation and voice cloning. Dia currently supports only English.
