2023 Speech Industry Award Winner: Microsoft’s VALL-E Breaks the Mold in AI Training
Text-to-speech models traditionally have required very long training samples, but Microsoft turned that on its head earlier this year with its release of VALL-E.
VALL-E, one of Microsoft’s latest forays into artificial intelligence, is a transformer-based text-to-speech model that can re-create any voice from just a three-second sample clip. The sample can come from a phone call recording, an in-person recording, or even a podcast, and the model can then synthesize that voice saying anything.
The ground-breaking innovation not only shrinks the amount of sample data needed to generate a new voice; it also produces a much more natural-sounding synthetic voice than other models, preserving the unique qualities of the original sample, including accent, intonation, pitch, speaking style, and even emotional tone. This capability, known as zero-shot TTS, had been a long-standing problem for the speech world until now.
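At a high level, VALL-E treats speech synthesis as language modeling over discrete audio codec tokens: the short enrollment clip becomes an acoustic prompt, and the model continues that token sequence conditioned on the target text, so the output inherits the speaker's identity. The sketch below is a toy illustration of that data flow only; the function names, token values, and frame rate are stand-ins invented for this example, not Microsoft's actual implementation.

```python
# Toy sketch of zero-shot TTS prompting in the style of VALL-E.
# A real system uses a neural audio codec and large transformer models;
# these stand-ins show only the shape of the data flow, not real synthesis.

def encode_audio(clip_seconds: float, frames_per_second: int = 75) -> list[int]:
    """Toy codec encoder: map an audio clip to a sequence of discrete tokens."""
    # A real codec emits learned token IDs; we emit placeholder zeros.
    return [0] * int(clip_seconds * frames_per_second)

def encode_text(text: str) -> list[str]:
    """Toy phonemizer: map text to phoneme-like symbols (here, characters)."""
    return list(text.lower().replace(" ", ""))

def zero_shot_tts(enrollment_seconds: float, text: str) -> list[int]:
    """Condition generation on a short acoustic prompt plus the target text.

    Because the model continues the codec-token sequence begun by the
    3-second prompt, the generated speech keeps the speaker's voice.
    """
    acoustic_prompt = encode_audio(enrollment_seconds)
    phonemes = encode_text(text)
    # Stand-in for autoregressive decoding: emit ten placeholder codec
    # frames per phoneme (a real model predicts a variable-length sequence).
    generated = [1] * (len(phonemes) * 10)
    return acoustic_prompt + generated

tokens = zero_shot_tts(3.0, "Hello world")
print(len(tokens))  # → 325 (225 prompt frames + 100 generated frames)
```

The key design point the sketch mirrors is that nothing is retrained per speaker: the same model serves every voice, and the three-second prompt alone carries the speaker identity.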
Microsoft contends that all this brings performance far exceeding previous synthetic voice models, to the point where people can find it difficult to distinguish the real voice from the synthetic one.
To create VALL-E, Microsoft trained the model on more than 60,000 hours of English speech, drawn largely from audiobook recordings in the LibriLight corpus.
Analysts at the time said VALL-E would greatly democratize speech synthesis, extending its availability and viability to organizations that lack the large speech datasets previously needed for good performance.
VALL-E set off a wave of responses from competitors like Meta and Google, which have both tried to one-up Microsoft. Meta this summer launched Voicebox, an advanced AI tool for generating speech from text and for related tasks, such as editing, sampling, and stylizing. Google at the same time introduced AudioPaLM, its own large language model for speech understanding and generation tasks.
Meta claimed that Voicebox could produce audio clips using just a two-second audio sample, and both Google and Meta touted their respective technologies as superior to VALL-E. Neither can deny, however, that Microsoft was the originator.
VALL-E also formed the basis for Microsoft’s SpeechX, a versatile, robust, and extensible speech generation model that is still in the research and development stages.
SpeechX is built on VALL-E and combines a neural codec language model, autoregressive and non-autoregressive transformer models, and task-based prompting, through which a single model acquires knowledge of diverse tasks. That design makes the speech generation process both versatile and highly extensible. SpeechX can also preserve background sounds during speech editing and can leverage reference transcriptions for noise suppression and target speaker extraction.
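The task-based prompting described above can be pictured as prefixing the model's input stream with a token that names the job to perform. The sketch below is an assumption-laden illustration of that idea; the task names, token markup, and function are invented for this example and are not SpeechX's actual interface.

```python
# Illustrative sketch of SpeechX-style task-based prompting.
# One shared model serves many tasks; a task token at the front of the
# prompt tells it which behavior to produce. All names here are hypothetical.

TASKS = {"tts", "noise_suppression", "speech_editing",
         "target_speaker_extraction"}

def build_prompt(task, audio_tokens, text_tokens=None):
    """Assemble [task token] + audio tokens (+ optional transcription)."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    prompt = [f"<{task}>"] + list(audio_tokens)
    if text_tokens:
        # Tasks such as noise suppression can leverage a reference
        # transcription of the noisy input as extra conditioning.
        prompt += ["<text>"] + list(text_tokens)
    return prompt

p = build_prompt("noise_suppression", [7, 7, 7], text_tokens=list("hi"))
print(p)  # → ['<noise_suppression>', 7, 7, 7, '<text>', 'h', 'i']
```

The extensibility claim follows from this layout: adding a new capability means teaching the shared model a new task token rather than building a separate model per task.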
But that’s not Microsoft’s only AI speech innovation from the past year. The company is a big investor in OpenAI, creator of ChatGPT. Microsoft originally contributed $1 billion in funding to OpenAI back in 2019, and earlier this year it kicked in an additional $10 billion in funds to help in its development efforts. Microsoft is looking to incorporate some of OpenAI’s technology into its own offerings, including its Bing search engine.
Microsoft also built its own speech technologies into several of its other products. It gave Bing Chat on desktop both ears and a voice: a speech-to-text capability lets users ask the chatbot questions aloud, and text-to-speech lets it respond in a voice of its own. Currently, the chatbot supports English, Japanese, French, German, and Mandarin Chinese and comes with four voice options, two male and two female.
And amid the falling popularity of its Skype communications and collaboration platform, Microsoft added a real-time “TruVoice” translation service for video calls on the platform. By combining Skype Translator with speech recognition and natural language processing AI, Skype can now deliver personal voice translation: the TruVoice component re-creates the speaker’s voice, so the listener hears the translation in the speaker’s own voice.
To do this, the software samples words users speak and tunes the translation response to sound the same. The feature currently supports English, French, Chinese, German, Spanish, and other languages. It is available for one-to-one calls but is expected to come to group calls in the coming months.