Why do AI models struggle with online hate speech detection?

Published 18 June 2026 · world

As the United Nations marks the International Day for Countering Hate Speech, Al Jazeera examines how AI handles it – and where it falls short

Hate speech that once circulated in person now travels farther and faster via anonymous online accounts. As the United Nations marks the International Day for Countering Hate Speech on June 18, UN Secretary-General Antonio Guterres has warned that social platforms are amplifying the threat. With artificial intelligence (AI) increasingly tasked with detecting and removing hate speech online, Al Jazeera looks at where these systems fall short compared with human judgement.

How is hate speech defined?

According to the UN, hate speech covers any communication – spoken, written or behavioural – that discriminates against or incites violence towards a person or group. The UN states that hate speech targets a person’s actual or perceived identity, including race, ethnicity, religion, gender, sexual orientation or disability. And it isn’t limited to words, with the UN noting it can also take the form of images, cartoons, gestures and even objects.

How many people encounter hate speech online?

According to a 2023 joint survey of 8,000 people in 16 countries conducted by polling company Ipsos and the UN Educational, Scientific and Cultural Organization (UNESCO), more than two-thirds of internet users had encountered hate speech online.

The survey also found that 33 percent of respondents thought LGBTQI people experienced the most cases of hate speech, followed by ethnic and racial minorities (28 percent) and women (18 percent).

Meta, which owns Facebook and Instagram, has removed fewer hateful posts since 2023. In the last quarter of 2025, the company removed 1.3 million posts from Instagram and 1.3 million from Facebook, compared with 7.4 million removed from Instagram and 5.8 million from Facebook in the fourth quarter of 2024. This came as the company shifted away from proactive detection of hate speech and relied more on users to report encounters. TikTok, by contrast, said it removed 96.3 percent of all hate speech and content in the fourth quarter of 2025 before it was reported.

AI models detect hate speech differently

To detect and combat the spread of hate speech online, social media companies have increasingly turned to AI, using content moderation systems powered by large language models (LLMs) that promise to automate content filtering across huge volumes of messages. In general, these systems use labeled datasets and pretrained language models to detect abusive language. They then apply rules or score thresholds to decide whether content is hateful or violates company policies.

A 2025 study by researchers at the University of Pennsylvania found that these models vary widely in how they identify and classify hate speech, with significant inconsistencies across systems and demographic groups, raising concerns about bias and unequal protection online.
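The score-and-threshold step described above can be illustrated with a minimal Python sketch. It does not reflect any platform's actual system: the function names, the two thresholds and the fixed-score `placeholder_scorer` are all hypothetical, standing in for a trained classifier. It shows only how the same model score could lead to different outcomes under different policy thresholds, which is one simple way two systems might diverge and not a finding of the study.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Decision:
    score: float
    action: str  # "remove", "human_review" or "allow"


def moderate(text: str, scorer: Callable[[str], float],
             remove_at: float = 0.9, review_at: float = 0.5) -> Decision:
    """Map a model score in [0, 1] to a moderation action using two thresholds."""
    if not 0.0 <= review_at <= remove_at <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= review_at <= remove_at <= 1")
    score = scorer(text)
    if not 0.0 <= score <= 1.0:
        raise ValueError("scorer must return a value between 0 and 1")
    if score >= remove_at:
        action = "remove"
    elif score >= review_at:
        action = "human_review"
    else:
        action = "allow"
    return Decision(score, action)


def placeholder_scorer(text: str) -> float:
    """Hypothetical stand-in for a trained classifier; always returns 0.7."""
    return 0.7


if __name__ == "__main__":
    post = "example post"
    # Assumed thresholds for two imaginary policies, chosen only for illustration.
    strict = moderate(post, placeholder_scorer, remove_at=0.6, review_at=0.3)
    lenient = moderate(post, placeholder_scorer, remove_at=0.9, review_at=0.5)
    print(strict.action)   # remove
    print(lenient.action)  # human_review
```

In this sketch the classifier's score is identical in both calls; only the thresholds differ, so the same post is removed under one policy and sent to human review under the other.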