OpenAI Launches TTS Models and APIs

Article Featured Image

At its first-ever DevDay event this week, OpenAI launched Audio API, a text-to-speech application programming interface that offers six preset voices (Alloy, Echo, Fable, Onyx, Nova, and Shimmer) and two generative AI model variants.

The Audio API provides a text-to-speech endpoint and can be used to narrate written blog posts, produce spoken audio in multiple languages, and give real-time audio output using streaming.

The speech endpoint takes in three key inputs: the model name, the text that should be turned into audio, and the voice to be used for the audio generation. By default, the endpoint will output an MP3 file of the spoken audio, but it can also be configured to output any of the other supported formats, like Opus (for internet streaming and communication), AAC, for digital audio compression (preferred by YouTube, Android, and iOS), and FLAC (For Lossless Audio Compression).

The Speech API also now supports real-time audio streaming using chunk transfer encoding. This means that the audio can be played before the full file has been generated and made accessible.

The new API "is much more natural than anything else we've heard out there, which can make apps more natural to interact with and more accessible," OpenAI CEO Sam Altman said at the conference. "It also unlocks a lot of use cases, like language learning and voice assistance."

OpenAI also used the event to launch Whisper large-v3, the next version of its open-source automatic speech recognition model, which reportedly offers improved performance across languages.

Another API introduced at the event was DALL-E 3, OpenAI's latest text-to-image model. This API offers new format, quality, and resolution options.

And then there was the introduction of the new, upgraded GPT-4 Turbo, with context that has been expanded to April of this year. Prior versions were cut off at January 2022.

GPT-4 Turbo also accepts up to 128K of text input, up from the roughly 3,000 words that could be accepted at one time by previous GPT models. And it supports the new DALL-E 3 and TTS models.

GPT-4 Turbo can accept images as inputs in the Chat Completions API, enabling use cases such as generating captions, analyzing real world images in detail, and reading documents with figures.

And finally, OpenAI also launched Assistants API, which it calls "the first step toward helping developers build agent-like experiences within their own applications."

The new Assistants API provides new capabilities, such as Code Interpreter, Retrieval, and function calling to handle a lot of the heavy lifting that users previously had to do themselves.

This API is designed for flexibility, according to OpenAI, which said that use cases "could range from a natural language-based data analysis app, a coding assistant, an AI-powered vacation planner, a voice-controlled DJ, a smart visual canvas—the list goes on."

SpeechTek Covers
for qualified subscribers
Subscribe Now Current Issue Past Issues