It’s Not Magic: The Importance of Data in Your Machine Learning and AI Algorithms

Machine learning is not black magic. A simple definition is the application of algorithms to data to uncover useful aspects of the input. There are clearly two parts to this process, though: the algorithms, together with the data used to train them, and the input data that is fed in and processed.

Quite simply, algorithms cannot work well when they are trained on too little data, and a deficit leaves the system undernourished and hungry for more. With more data to consume, the system can be trained better and the outcomes are stronger. An ample helping of data is needed to produce the best outcomes. What is crucial, though, is that the data collected for training is representative of the tasks you intend to perform.

The Difference in Data

When using machine learning (ML) specifically for automatic speech recognition (ASR) technology, two kinds of data have to be considered:

  1. Data that is used for training models to build the actual product
  2. Data that is fed through the model by the customers that use it

The two are related but should be treated differently, as they are used at different times and do not need to be one and the same. Data used for training models is sourced data: it is data that enhances the model for customers to use. Data that customers feed into the model when using the system should not have to be added to the training data to enhance the baseline product model. Additionally, the security requirements for that data are down to the customer, not the ASR provider.

A Data Surplus 

For most of the history of ML, data has been a precious commodity. By necessity, the field has had to spend time developing techniques that made optimal use of small amounts of data.

Of late, however, large amounts of data are becoming increasingly available. On a global scale, there are commentators talking about data following Moore’s law, with the amount of raw data doubling roughly every two years. This is great news, right? With the explosion in the use of deep learning – which is even more data-hungry than more traditional machine learning methods – more data will help us learn better and develop more nuanced models. Well, that is true, but only up to a point.  

Out With the Bad, In With the Good 

It is also important to establish that more data input doesn’t always mean better output. The implicit assumption when it comes to ML and artificial intelligence (AI) is that more data leads to better models and AI systems. However, that is not always the case. If your data is of poor quality, adding more of it may actually harm performance, as your model will learn irrelevant or even incorrect associations.
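The effect of poor-quality data can be illustrated with a toy example. The sketch below is not drawn from any real ASR system: the classifier, the numbers, and all names (fit_threshold, accuracy, mislabeled_extra) are invented for illustration. It fits a one-dimensional classifier on a small clean training set, then on the same set plus extra mislabeled examples, and compares accuracy on a held-out test set.

```python
from statistics import mean

def fit_threshold(samples):
    """Fit a toy 1-D classifier: the threshold is the midpoint of the two class means."""
    mean0 = mean(x for x, label in samples if label == 0)
    mean1 = mean(x for x, label in samples if label == 1)
    return (mean0 + mean1) / 2

def accuracy(threshold, samples):
    """Fraction of samples where 'x > threshold' agrees with the label."""
    correct = sum((x > threshold) == (label == 1) for x, label in samples)
    return correct / len(samples)

# Hypothetical data: (feature value, class label).
clean_train = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1)]
# Extra "poor-quality" data: high values wrongly labeled as class 0.
mislabeled_extra = [(8, 0), (9, 0), (10, 0)]
test_set = [(1.5, 0), (2.5, 0), (4, 0), (6, 1), (6.5, 1), (8.5, 1)]

clean_threshold = fit_threshold(clean_train)
noisy_threshold = fit_threshold(clean_train + mislabeled_extra)

clean_accuracy = accuracy(clean_threshold, test_set)
noisy_accuracy = accuracy(noisy_threshold, test_set)

print(f"clean data only:      threshold={clean_threshold}, accuracy={clean_accuracy:.2f}")
print(f"with mislabeled data: threshold={noisy_threshold}, accuracy={noisy_accuracy:.2f}")
```

In this contrived case the larger training set is the worse one: the mislabeled examples drag the decision threshold upward, and two test points that were classified correctly are now misclassified.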

The Real World 

So how does all of that translate in the real world?

People are impatient and constantly looking for ways to cope with being time-poor. Living in the noisy digital world we now occupy, how do businesses ensure they are using ML and AI solutions responsibly whilst improving customer experiences?  

For example, when people call a contact center, they want a speedy resolution. ASR can help by surfacing knowledge-base articles, giving agents real-time insights, getting the right answers the first time, and letting agents see a full call history so they are not covering previous ground. All of these things improve efficiency. These capabilities sound really useful, but businesses are also required to ensure they are collecting and using the recorded voice data responsibly.

Data security is becoming more prominent in the ever-evolving digital world, and new regulations across industries require businesses to ensure that they and their providers have the correct data security in place.

The New Kid On the Block

The AI and ML industry is endlessly developing, and the latest topic is continuous intelligence. To deliver continuous intelligence, a system obviously has to actually be continuous.

In other words, it needs to listen to everything all the time, which heightens the security concern. What does that “everything” encompass? And what exactly is it being used for? Who has access, and how long has the technology been installed?

For example, in a contact center, continuous intelligence for voice capture can open up a whole host of insights to improve customer experience and business workflows. Once the business has the insight, though, it might no longer need to keep the original data. Should it be kept, what can it be used for if it is, and how do you help customers understand this? 

These days, we are increasingly concerned about the security of our data. This comes as no surprise, given almost daily news stories about Amazon’s Alexa listening to us in our own homes even without the wake word. So, the question then becomes, how do we satisfy an impatient society whilst also ensuring the security of people’s data? 

So How Important is Data Anyway? 

To ultimately satiate an ASR system, enough data must be provided for training so that good systems can be built, but without obliging consumers to give away data they consider private in order to achieve these results.

ML algorithms are in a constant state of evolution, and techniques are now available that allow smaller data sets to be used to bias systems already trained on big data, enabling the use of data within protected silos where needed. In some cases, smaller amounts of data can achieve “good enough” accuracy through the application of clever techniques and data use. The overall issue of data acquisition is not removed, but sometimes less data can provide solutions.
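One long-established way to bias a system trained on big data with a small data set is to interpolate a general language model with a small in-domain one. The sketch below is a minimal illustration of that idea using word-frequency (unigram) models. It is an assumption about what such a technique can look like, not a description of any particular vendor's method, and the example sentences, the function names, and the 0.3 weight are all hypothetical.

```python
from collections import Counter

def unigram_model(words):
    """Estimate word probabilities from raw counts."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def interpolate(general, domain, domain_weight):
    """Linearly mix two word distributions; domain_weight is the share given to the small model."""
    vocabulary = set(general) | set(domain)
    return {
        word: (1 - domain_weight) * general.get(word, 0.0)
        + domain_weight * domain.get(word, 0.0)
        for word in vocabulary
    }

# Stand-in for a model trained on big, general-purpose data.
general_text = "the call was about the weather and the news".split()
# Stand-in for a small customer data set that stays in its own silo.
domain_text = "refund the invoice refund".split()

general = unigram_model(general_text)
domain = unigram_model(domain_text)
adapted = interpolate(general, domain, domain_weight=0.3)

# "refund" never appears in the general data but becomes likely after biasing.
print(f"general P(refund) = {general.get('refund', 0.0):.3f}")
print(f"adapted P(refund) = {adapted['refund']:.3f}")
```

The small data set only nudges the general model toward domain vocabulary, so the general model does not need to be retrained and the customer's data does not need to join the baseline training set.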

It is the responsibility of both the ASR provider and the business solution provider to ensure people’s privacy isn’t being compromised when it comes to capturing voice. If businesses begin to turn to continuous intelligence within their workflows, then the industry must ensure all data is kept secure at all times. 

We are already fighting an uphill battle with a lack of trust in voice technology, so the industry must maintain its search for ways to make the technology work better without people’s privacy being compromised.
