Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations

11d ago · Global · primary source: export.arxiv.org

Multi-source synthesis by The Embedding Report from 2 sources. Every numeric and quoted claim traces to a cited source body (see methodology).

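Two new papers examine hate speech annotation. One introduces interaction SSD, a method for modeling how the relationship between semantic meaning and an outcome varies across moderators, and applies it to a case study on annotator racial identity. The other diagnoses how closely LLM judgments align with human hate speech annotations.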

The interaction SSD method estimates a main semantic gradient, an interaction gradient, and conditional gradients, which makes moderated meaning-outcome relationships statistically testable and interpretable[1]. In a case study on the UC Berkeley Measuring Hate Speech corpus, annotator racial identity moderated hate-speech judgments of comments targeting people of color: the study detected a significant moderation effect, with the shared gradient contrasting dehumanizing hostility with counter-speech[1].

A separate study of large language models (LLMs) and hate speech annotation found that LLMs align well with human judgments on behaviorally explicit dimensions, but that evaluative dimensions are systematically inverted in LLMs[2]. A confidence-weighted Ridge regression reconstructed continuous hate speech scores with an R^2 of up to 0.71[2]. Both studies were submitted on 26 May 2026[1][2].
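As a rough illustration of what a confidence-weighted Ridge regression can look like, the sketch below fits a ridge model in which each training example's squared error is scaled by a confidence value. The summary does not say which features or weights the cited study used, so everything here is an assumption: the synthetic data, the idea that per-attribute ratings are the features, the use of a confidence score as the sample weight, and the names `fit_weighted_ridge` and `weighted_r2` are all hypothetical.

```python
import numpy as np


def fit_weighted_ridge(X, y, weights, alpha=1.0):
    """Closed-form ridge regression with per-sample weights.

    Minimizes sum_i w_i * (y_i - b - x_i @ coef)^2 + alpha * ||coef||^2,
    leaving the intercept b unpenalized.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Center with weighted means so the intercept drops out of the penalty.
    x_mean = np.average(X, axis=0, weights=w)
    y_mean = np.average(y, weights=w)
    Xc = X - x_mean
    yc = y - y_mean
    # Weighted normal equations: (Xc^T W Xc + alpha * I) coef = Xc^T W yc
    XtW = Xc.T * w
    coef = np.linalg.solve(XtW @ Xc + alpha * np.eye(X.shape[1]), XtW @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


def weighted_r2(y_true, y_pred, weights):
    """Weighted coefficient of determination."""
    y_true = np.asarray(y_true, dtype=float)
    w = np.asarray(weights, dtype=float)
    residual = np.sum(w * (y_true - y_pred) ** 2)
    total = np.sum(w * (y_true - np.average(y_true, weights=w)) ** 2)
    return 1.0 - residual / total


# Synthetic stand-in data (assumption): 5 attribute ratings per comment,
# a continuous target score, and a confidence value in [0.5, 1.0].
rng = np.random.default_rng(0)
n_samples, n_attributes = 200, 5
X = rng.normal(size=(n_samples, n_attributes))
true_coef = np.array([1.0, -0.5, 0.8, 0.0, 0.3])
y = X @ true_coef + rng.normal(scale=0.1, size=n_samples)
confidence = rng.uniform(0.5, 1.0, size=n_samples)

coef, intercept = fit_weighted_ridge(X, y, confidence, alpha=1.0)
r2 = weighted_r2(y, X @ coef + intercept, confidence)
print("coefficients:", np.round(coef, 3))
print("weighted R^2:", round(float(r2), 3))
```

Low-confidence examples pull less on the fitted coefficients, which is the usual motivation for weighting. The R^2 printed here describes only the synthetic data and is not comparable to the 0.71 figure reported in [2].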

safety-research · research-paper · commentary

Sources cited (2)

  1. arxiv.org
  2. arxiv.org