Speech Recognition - The Gaming Industry's Answer to Avoiding Fines


Spend any time on any social media platform these days and it is highly likely you will either experience abuse directly or watch the pile-on as others are targeted. Some 47 percent of internet users have experienced harassment or abuse, and new laws introduced by the U.K. government at the end of last year mean that firms, including the gaming industry, could face fines of up to 10 percent of their turnover or the blocking of their sites.

Faceless communities facilitate abuse, as it's all too easy to hide behind an online profile and be unaware of the visceral impact of rudeness, insults, and worse. In recent years, some public figures have made a career out of nastiness—their deliberately rude and insulting behavior attracting likes and attention, and no doubt copycats.

Under the new rules, the government reserves the power to hold senior managers liable. Online sites, such as social media platforms and gaming platforms where user-generated content is shared or where people can talk to one another online, will need to remove or limit the spread of illegal content, such as child abuse material, terrorist material, and content promoting suicide. Tech firms must also protect children from being exposed to harmful content or activity, such as grooming, bullying, and pornography.

High-profile cases have resulted in suicides among young people, and, as the gaming industry tends to attract millennials and Gen Z-ers, gaming companies need all the help they can get to detect and deter online abuse.

Traditionally, gaming culture is known for toxic language and behavior, where any online game will generate aggression, hostility, and trash talk, along with slurs and offensive terms. As is the case with many online spaces, the hostility disproportionately targets women, people of color, and those from LGBTQIA+ communities.

Technology that attempts to moderate the worst toxic and hate speech is already in place. Jigsaw, a Google subsidiary, has tools that flag toxic comments, but because its machine learning algorithms are trained on white-aligned English, they disproportionately flag Black-authored speech, making them an imperfect solution.

In addition, the tools are for text only, whereas the huge growth in voice platforms, such as Discord, indicates the necessity of tech that can identify abuse in speech.

Machine learning-based large vocabulary continuous speech recognition (LVCSR) has become much more accurate and robust and is better able to cope with the sheer variety of languages and accents. In lab settings, claims have been made that the technology matches human speech recognition accuracy.

LVCSR systems don't always work when faced with fast, conversational, and overlapping speech, typical in the online world. Having said that, simple keyword spotting systems can be used to identify language that might be offensive and flag it.
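A keyword spotter of the kind described can be sketched in a few lines: tokenize the LVCSR transcript and match tokens against a blocklist. The blocklist entries and function name here are hypothetical placeholders; a real deployment would use a curated, regularly updated lexicon and, as the next paragraph notes, context-aware follow-up review.

```python
import re

# Hypothetical blocklist; real systems maintain curated, evolving lexicons.
BLOCKLIST = {"insult", "slurword", "trashtalk"}

def flag_keywords(transcript: str) -> list[str]:
    """Return blocklisted words found in an LVCSR transcript."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    return [t for t in tokens if t in BLOCKLIST]

# Flagged transcripts still need human or context-aware review:
hits = flag_keywords("that was a cheap insult, mate")
# hits == ["insult"]
```

Exact token matching is what makes this approach cheap, and also what produces the flood of false positives discussed next: it has no notion of who is speaking to whom, or in what spirit.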

Context is all. Keyword spotting can result in a flood of false positives. Then there are the costs to consider. Any gaming platform boss is usually faced with using an API provided by one of the three big cloud-based LVCSR providers: Amazon, Google, or Microsoft. Prices for untrained speech recognition start at roughly $1 per hour of audio in Microsoft's case. Google will undercut this slightly if you allow it to use your audio to train its models further, handing over yet more oh-so-precious data to a company that already has far too much of it.

Unsurprisingly, lockdowns all over the world led to a rise in gaming, reaching about 8.5 hours online per week for each gamer. A conservative estimate has a small games developer paying $35 a week just to transcribe users' speech, never mind putting systems in place to analyze it. While the larger companies can push for deeper discounts, the per-hour pricing model is clearly broken in high-volume environments.
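The arithmetic behind that claim is simple: per-hour pricing scales linearly with voice traffic. The sketch below assumes the article's figures of roughly 8.5 hours per gamer per week and about $1 per transcribed hour; the user count is a hypothetical input.

```python
def weekly_transcription_cost(active_users: int,
                              hours_per_user: float = 8.5,
                              price_per_hour: float = 1.0) -> float:
    """Weekly cloud transcription bill under linear per-hour pricing."""
    return active_users * hours_per_user * price_per_hour

# Even a modest community quickly outgrows a small studio's budget:
cost = weekly_transcription_cost(1000)  # 1,000 gamers -> $8,500 per week
```

Because cost grows in lockstep with engagement, the better a voice platform does, the worse the economics get, which is the sense in which per-hour pricing is "broken" at volume.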

Speech recognition is processor intensive, but a well-engineered, highly tuned GPU-powered system can bring the price down significantly and allows the gaming company to train the system itself to cope with whatever slang its customers use.

Behavioral models trained on the real output of LVCSR systems, rather than on traditional clean-text corpora, are also on the rise. This means the corruptions and mistranscriptions these systems produce can be baked into the model-building process, making the models more robust.
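One common way to bake transcription errors into training (a generic technique, not necessarily any particular vendor's method) is to augment clean labeled text with simulated ASR noise before it reaches the classifier. The function and error model below are illustrative assumptions.

```python
import random

def simulate_asr_noise(text: str, error_rate: float = 0.1,
                       seed=None) -> str:
    """Randomly drop or truncate words to mimic LVCSR mistranscriptions."""
    rng = random.Random(seed)
    noisy = []
    for word in text.split():
        r = rng.random()
        if r < error_rate / 2:
            continue                # simulate a deleted word
        if r < error_rate:
            word = word[:-1] or word  # simulate a garbled/truncated word
        noisy.append(word)
    return " ".join(noisy)

# Augmenting clean labeled text this way lets a toxicity classifier
# see the same kinds of corruption that real transcripts contain.
```

Training on noisy variants of the same labeled examples teaches the model that "you're trash" and a mangled transcript of it should receive the same label.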

Intelligent Voice's work in behavioral analytics is making great strides in addressing this issue. Too many automated systems appear to forget fun. Keeping these spaces safe should not come at the expense of freedom of speech or the enjoyment of gaming; systems need to be context-aware and take all elements of speech into account, keeping users safe without degrading the user experience.

Nigel Cannings is chief technology officer of Intelligent Voice. He has more than 25 years' experience in both law and technology and is a regular speaker at industry events.
