• November 16, 2022

Carnegie Mellon ASR Pipeline Seeks to Recognize 1,900 Languages Without Audio


Most modern speech recognition models depend on large amounts of supervised training data, which is readily available for popular languages like English, Spanish, and Chinese. For most of the thousands of languages worldwide considered low-resource, however, such data is hard to come by.

To address this problem, a research team from Carnegie Mellon University in Pittsburgh created a speech recognition pipeline that does not need any audio for the target language.

The model assumes access only to raw text datasets or a set of n-gram statistics. The speech pipeline is composed of acoustic, pronunciation, and language models. The acoustic model recognizes the target language's phonemes. The pronunciation model, a grapheme-to-phoneme (G2P) model, predicts a phoneme sequence for a given grapheme (spelling) sequence. Both the acoustic and pronunciation models are multilingual and require no target-language supervision: they are first trained on supervised datasets from high-resource languages and then apply those learned linguistic skills to low-resource languages.
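How these three stages fit together can be sketched in miniature. The code below is purely illustrative and does not come from the ASR2K paper: the lexicon, function names, and greedy matching are stand-ins for the real neural acoustic model, G2P model, and WFST decoder.

```python
# Toy pronunciation lexicon: in the real pipeline, a G2P model would
# predict these phoneme sequences from spelling alone, with no
# target-language audio required.
LEXICON = {
    "hello": ("h", "e", "l", "o"),
    "world": ("w", "o", "r", "l", "d"),
}

def acoustic_model(audio):
    """Stand-in for the multilingual acoustic model: audio -> phonemes."""
    # The real model is a neural network trained on high-resource
    # languages; here the input is already a phoneme list.
    return audio

def decode(phonemes):
    """Greedy stand-in for the WFST decoder: match lexicon entries
    against the phoneme stream, longest entries first."""
    words, i = [], 0
    while i < len(phonemes):
        for word, pron in sorted(LEXICON.items(), key=lambda kv: -len(kv[1])):
            if tuple(phonemes[i:i + len(pron)]) == pron:
                words.append(word)
                i += len(pron)
                break
        else:
            i += 1  # skip a phoneme no lexicon entry explains
    return words

phonemes = acoustic_model(["h", "e", "l", "o", "w", "o", "r", "l", "d"])
print(decode(phonemes))  # -> ['hello', 'world']
```

The real decoder scores many competing hypotheses with the language model rather than matching greedily, but the division of labor is the same: acoustics to phonemes, phonemes to words via pronunciations.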

The language model is built from the raw text dataset or the n-gram statistics. Using the pronunciation model, the approximate pronunciation of each word is encoded into a lexical graph; from the text dataset, the model can also estimate a traditional n-gram language model by counting n-gram statistics. This language model is then combined with the pronunciation model to create a weighted finite-state transducer (WFST) decoder. Tested with 10,000 raw text utterances from Carnegie Mellon's own Wilderness dataset, the ASR2K pipeline recognized speech in more than 1,900 languages without any audio for the target language, yielding a 45 percent character error rate (CER) and a 69 percent word error rate (WER). That improves on the previous results on the Wilderness dataset of 50 percent CER and 74 percent WER.
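Estimating an n-gram language model from raw text, as described above, amounts to counting and normalizing. The sketch below shows maximum-likelihood bigram estimation on a toy corpus; the function names and corpus are illustrative, not taken from ASR2K, and a production model would add smoothing for unseen n-grams.

```python
from collections import Counter

def ngram_counts(sentences, n=2):
    """Count n-grams (with sentence-boundary markers) from raw text lines."""
    counts = Counter()
    for line in sentences:
        tokens = ["<s>"] + line.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def bigram_prob(bigrams, unigrams, w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1) from counted statistics."""
    return bigrams[(w1, w2)] / unigrams[(w1,)]

text = ["the cat sat", "the cat ran"]
bi = ngram_counts(text, n=2)
uni = ngram_counts(text, n=1)
print(bigram_prob(bi, uni, "the", "cat"))  # -> 1.0
print(bigram_prob(bi, uni, "cat", "sat"))  # -> 0.5
```

These conditional probabilities are exactly what the WFST decoder uses to prefer word sequences that look like the target language's text.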
