MagicHub Releases MagicData-RAMC Mandarin Speech Dataset
MagicHub, an open-source community for artificial intelligence, has released MagicData-RAMC, a free 180-hour conversational speech dataset in Mandarin. The release adds to the pool of open-source speech corpora and is intended to support the development of spoken language processing technology and conversational AI.
MagicData-RAMC is a collection of annotated training data comprising 351 sets of multi-turn Mandarin conversations, totaling 180 hours, recorded indoors on smartphones. The collection process was designed to balance gender and geographic distribution and to cover a diverse range of topics. The dataset has 663 speakers in total: 368 male and 295 female, with 334 from the north and 329 from the south.
Each conversation is annotated with transcribed text, voice activity timestamps, speaker information, recording information, and topic information. The speaker information covers gender, age, and geography; the recording information covers environment and device.
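To make the annotation layout concrete, the fields listed above could be modeled roughly as follows. This is only an illustrative sketch: the class names, field names, types, and sample values are all hypothetical and do not reflect the actual file format or schema distributed with MagicData-RAMC.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Speaker:
    # Hypothetical fields mirroring "gender, age, and geography".
    speaker_id: str
    gender: str
    age: int      # assumption: the real data may store age differently
    region: str


@dataclass
class Recording:
    # Hypothetical fields mirroring "environment and device".
    environment: str
    device: str


@dataclass
class Segment:
    # One voice-activity span with its transcription (times in seconds).
    start: float
    end: float
    speaker_id: str
    text: str


@dataclass
class Conversation:
    topic: str
    recording: Recording
    speakers: List[Speaker]
    segments: List[Segment] = field(default_factory=list)

    def speech_seconds(self, speaker_id: str) -> float:
        """Total annotated speech time for one speaker."""
        return sum(
            s.end - s.start for s in self.segments if s.speaker_id == speaker_id
        )


def example_conversation() -> Conversation:
    # All values below are made up for illustration only.
    return Conversation(
        topic="example-topic",
        recording=Recording(environment="indoor", device="smartphone"),
        speakers=[
            Speaker("A", "male", 30, "north"),
            Speaker("B", "female", 28, "south"),
        ],
        segments=[
            Segment(0.0, 2.0, "A", "<transcribed text>"),
            Segment(2.0, 3.0, "B", "<transcribed text>"),
            Segment(3.0, 4.5, "A", "<transcribed text>"),
        ],
    )
```

Keeping timestamps and transcriptions together in per-segment records, as assumed here, makes it straightforward to compute per-speaker statistics such as `example_conversation().speech_seconds("A")`.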