
AI platforms display considerable variance in judging hate speech, Annenberg School study finds


Artificial intelligence-powered large language models differ in how they classify and respond to hate speech, according to recent findings by Annenberg School for Communication researchers.

In a study titled “Model-Dependent Moderation: Inconsistencies in Hate Speech Detection Across LLM-based Systems,” fourth-year Ph.D. candidate Neil Fasching and Communication professor Yphtach Lelkes tested seven models, including those developed by OpenAI, DeepSeek, and Google. They found that the models display inconsistencies when identifying and evaluating hate speech.

In the article, they wrote, “we find that these systems show substantial variation in their classification of identical content — what one flags as harmful, another might deem acceptable. The disparities reflect fundamental differences in how each model conceptualizes unacceptable speech, going beyond technical variations in architecture or training.”

The researchers analyzed more than 1.3 million sentences referencing 125 demographic groups spanning categories such as economic class, disability, immigration status, race, personal interest, and sexual orientation. The study used phrases with “weak incitement, strong incitement, or specific calls to action” to vary the intensity of the prompts.

“Every group, hate speech phrase, and incitement option was paired uniformly to ensure complete factorial coverage across this dataset,” the researchers wrote. “Sentence examples included variations such as ‘All [group] are [hate speech phrase]’ or ‘Some [group] are [hate speech phrase], and [additional incitement phrase].’”
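The templated, full-factorial design described in that passage can be illustrated with a short sketch. The group names, placeholder phrases, incitement wordings, and the mapping of “All” versus “Some” to incitement levels below are illustrative assumptions based only on the templates quoted above; none of the study’s actual materials are reproduced.

```python
from itertools import product

# Illustrative placeholders only -- the study's actual lists of 125 groups and
# its hate speech phrases are not reproduced here.
groups = ["group A", "group B", "group C"]
phrases = ["<hate speech phrase 1>", "<hate speech phrase 2>"]

# Assumed incitement options: a no-incitement baseline plus the three levels
# named in the study ("weak incitement, strong incitement, or specific calls
# to action"). The exact wording is hypothetical.
incitements = {
    "none": "",
    "weak": ", and <weak incitement phrase>",
    "strong": ", and <strong incitement phrase>",
    "call_to_action": ", and <specific call to action>",
}

def build_sentences():
    """Pair every group, phrase, and incitement option for full factorial coverage."""
    rows = []
    for group, phrase, (level, suffix) in product(groups, phrases, incitements.items()):
        # The choice of "All" vs. "Some" per incitement level is an assumption
        # based on the example templates quoted in the article.
        quantifier = "All" if level == "none" else "Some"
        rows.append({
            "group": group,
            "phrase": phrase,
            "incitement": level,
            "text": f"{quantifier} {group} are {phrase}{suffix}.",
        })
    return rows

if __name__ == "__main__":
    for row in build_sentences()[:4]:
        print(row["text"])
```

Crossing every group with every phrase and every incitement option in this way is what produces a dataset on the scale of 1.3 million sentences from a comparatively small set of building blocks.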

They found significant differences in how the models responded to the prompts. Mistral’s Moderation Endpoint had the highest “average hate value,” while OpenAI’s GPT-4o and Google’s Perspective API displayed the lowest average hate values, according to the study.

“The content moderation systems vary widely in their assessment of hateful material, as demonstrated by the significant differences in average hate speech values across models,” the researchers added.

Fasching and Lelkes also found inconsistencies within individual models’ evaluations of hate speech. GPT-4o and Perspective API exhibited consistent decision-making patterns, while OpenAI’s Moderation Endpoint showed greater variability.

“These differences highlight the challenge of balancing detection accuracy with avoiding over-moderation,” the researchers wrote.
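To make the “average hate value” and consistency comparisons concrete, here is a minimal sketch that assumes each moderation system returns a numeric hate score between 0 and 1 for every sentence. The system names and scores are placeholders, not the study’s data.

```python
import statistics

# Hypothetical scores in [0, 1] from three illustrative systems on the same
# four sentences; neither the system names nor the numbers come from the study.
scores = {
    "system_a": [0.91, 0.88, 0.93, 0.90],
    "system_b": [0.64, 0.12, 0.85, 0.33],
    "system_c": [0.05, 0.04, 0.06, 0.05],
}

for system, values in scores.items():
    avg = statistics.mean(values)       # per-system "average hate value"
    spread = statistics.stdev(values)   # within-system variability across similar prompts
    print(f"{system}: average hate value = {avg:.2f}, variability = {spread:.2f}")
```

In this toy example, system_a and system_c would both look internally consistent despite scoring the same material very differently, while system_b would stand out for its variability, which is the kind of contrast the researchers describe.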

The study also noted that certain groups receive more consistent evaluations from the LLMs. Prompts related to sexual orientation and gender produced the most consistent classifications, while prompts involving education level, personal interest, and economic class were evaluated less consistently.

“These inconsistencies are especially pronounced for specific demographic groups, leaving some communities more vulnerable to online harm than others,” Fasching told Penn Today.

According to the study, these inconsistencies suggest “that systems generally recognize hate speech targeting traditional protected classes more readily than content targeting other groups.”