MLCommons Releases Two Big Open-Source Speech Datasets

MLCommons, a nonprofit artificial intelligence consortium, has released two large speech datasets as open-source resources to improve speech recognition and voice technology.

The People's Speech Dataset offers more than 30,000 hours of supervised conversational data provided by companies and researchers, including Harvard University, Factored, NVIDIA, Intel, Baidu, and Landing AI, under a Creative Commons license.
"In short, the People's Speech provides a solid jumping-off point for other companies and individuals to innovate and experiment" with machine learning as it relates to speech technology, the organization said on its website.

The other dataset, called the Multilingual Spoken Words Corpus (MSWC), contains more than 23.4 million examples of 340,000 keywords in 50 languages, adding up to over 6,000 hours of speech. The organization constructed this dataset by applying open-source tools to extract individual words from crowd-sourced sentences donated to the Common Voice project. The extracted words can then be used to train keyword-spotting models for voice assistants across a diverse array of languages.

Contributors to the MSWC included researchers from Coqui, Factored, Google, Harvard University, Intel, Landing AI, NVIDIA, and the University of Michigan.
