What Does Scaling Mean for the Speech-to-Text Industry?

The speech-to-text industry has accelerated at a staggering pace during the past few years. The near-perfect transcript, once a pipe dream, is getting closer and closer to becoming reality. There are many reasons for this progress, but it is in part due to the development of deep learning techniques that are unlocking a new generation of incredibly sophisticated models, providing previously unimagined levels of accuracy and understanding.

Another key factor is scale: increasing the amount of training data, the size of the model, and the amount of compute yields significant performance gains.

This is not unique to speech-to-text; it holds across the wider reinforcement learning, computer vision, and large language model fields as well.

The symbiotic relationship between self-supervised learning (SSL) and data lends itself perfectly to scaling. In short, SSL is a machine learning method that has increased available data resources and reduced the manual process of labelling. It takes vast amounts of unlabelled data and uses some parts of it to construct a supervised task, without the need for expensive data labelling by humans. In real terms, this means engines can be trained on data straight from the gigantic pool of audio files taken directly from the internet, thus delivering a far more comprehensive representation of all voices and dramatically reducing artificial intelligence bias and errors in speech recognition.
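To make "uses some parts of it to construct a supervised task" concrete, here is a minimal sketch of one common way to do it: hide some frames of an unlabelled sequence and treat the hidden values as the targets to predict. The function name `make_masked_task`, the `MASK` placeholder, the masking probability, and the toy frame values are all assumptions made for illustration; real speech systems work on learned audio features and mask spans rather than single values.

```python
import random

MASK = None  # placeholder marking a hidden frame (an assumption for this sketch)


def make_masked_task(frames, mask_prob=0.3, seed=0):
    """Turn an unlabelled sequence into (inputs, targets) for masked prediction.

    inputs  : copy of frames with some positions replaced by MASK
    targets : dict mapping each masked position to the original value
    """
    rng = random.Random(seed)
    inputs, targets = [], {}
    for i, frame in enumerate(frames):
        if rng.random() < mask_prob:
            inputs.append(MASK)   # hide this frame from the model
            targets[i] = frame    # the data itself supplies the "label"
        else:
            inputs.append(frame)
    return inputs, targets


# Toy stand-in for unlabelled audio features; no human labelling involved.
frames = [0.12, 0.45, 0.33, 0.90, 0.51, 0.27, 0.64, 0.08]
inputs, targets = make_masked_task(frames)
```

A model trained to fill in `targets` from `inputs` learns the structure of the audio without anyone having transcribed it, which is why the approach scales so readily with raw data.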

Naturally, scaling a neural network has to be done carefully and correctly. Thanks to advancements in graphics processing units (GPUs), the hardware allows the training of models with orders of magnitude more parameters compared to a few years ago. Efficiently distributing the data and model across many GPUs allows you to supercharge machine learning capabilities.
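As a rough illustration of distributing data across devices, the sketch below simulates data parallelism in a single process: a batch is split into shards, each simulated worker computes a gradient on its own shard, and the results are combined into the same gradient the full batch would give. Everything here is hypothetical (the one-parameter model `y = w * x`, the helper names, the toy numbers); production systems use a framework's distributed training support on real GPUs rather than code like this.

```python
def shard(batch, num_workers):
    """Split a batch into num_workers contiguous, near-equal shards."""
    size, rem = divmod(len(batch), num_workers)
    shards, start = [], 0
    for w in range(num_workers):
        end = start + size + (1 if w < rem else 0)
        shards.append(batch[start:end])
        start = end
    return shards


def grad_mse(w, examples):
    """Gradient of mean squared error for the toy model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in examples) / len(examples)


def data_parallel_grad(w, batch, num_workers):
    """Average per-shard gradients, weighted by shard size."""
    shards = [s for s in shard(batch, num_workers) if s]  # skip empty shards
    total = sum(len(s) * grad_mse(w, s) for s in shards)
    return total / len(batch)


# Toy (x, y) pairs standing in for a training batch.
batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8), (6.0, 12.1)]
w = 1.5
full = grad_mse(w, batch)                                # single-device gradient
parallel = data_parallel_grad(w, batch, num_workers=4)   # simulated 4 workers
```

Because the combined gradient matches the single-device one up to floating-point rounding, adding workers changes how fast a batch is processed, not what the model learns from it.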

The current hype around language models and AI-powered chatbots, such as ChatGPT, shows the technology working at scale. These tools can complete tasks without having been trained on them.

The release of GPT-4 presented a step-change for the AI community; this technology can understand and respond to both text and images. Like GPT-4, both DeepMind's Flamingo and Microsoft's Kosmos-1 have multimodal capabilities, allowing them to reason about and understand both text and images. These multimodal language models can perceive images (as well as text), take IQ tests, and perform image captioning. As a user, you can interact with such models in natural language, but they are limited to replying in text; they can't reply with a meme or GIF just yet.

However, we cannot ignore the fact that scaling, and any plans for AI advancements, require a wide-ranging compute strategy. When tackling the biggest AI problems, you need and will continue to need massive compute resources. This is the only way for AI companies to achieve scale. AI is hugely dependent on compute power; much like electricity, rail travel, and the internet, it's part of the infrastructure of modern life.

Scaling the dataset, model, and computing power brings us increasingly closer to near-perfect transcripts. Recent strides in speech recognition technology, such as contextual understanding, entity formatting, real-time captioning, and translation, have highlighted the industry's innovativeness. Now what's needed is for speech-to-text providers to hold themselves to a higher standard and take a much broader view of what makes a transcript genuinely accurate and useful.

Scaling neural networks enables models to learn better representations of speech from more data, which in turn allows increased contextual and pronunciation understanding. It is through this that the perfect transcript will be within our grasp.

With captioning, readability and punctuation are essential, whereas in contact centers, speed and knowing when the speaker changes are critical. By tailoring transcripts to what each use case actually needs, we can make speech-to-text services more inclusive and develop a host of new use cases, which will only stand to benefit the wider industry.
