arXiv cs.CL · 3 June 2026

The Ghost Annotator: a Framework to Explore Human Label Variation in Content Moderation through Conformal Prediction


A framework that combines conformal prediction with a collaborative-filtering-style representation of annotators to analyze how LLMs behave relative to human annotators in content moderation. It introduces the Ghost Prediction metric to quantify divergences between model and human judgments. An evaluation across 4 LLMs and 4 datasets shows that larger models are more confident on texts where no human annotator aligns with the model, revealing a structural demographic bias.
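The summary does not specify the paper's nonconformity score or how Ghost Prediction is computed. As a rough illustration of the conformal-prediction component only, here is a minimal sketch of standard split conformal prediction for classification, assuming the LLM exposes per-label probabilities and each calibration text has a single reference label. The helper names (`conformal_threshold`, `prediction_sets`), the choice of score, and the toy data are hypothetical assumptions, not the authors' implementation; the sketch does not model multiple annotators or compute the Ghost Prediction metric.

```python
import numpy as np

# Hypothetical helpers sketching standard split conformal prediction; not the paper's code.


def conformal_threshold(cal_probs: np.ndarray, cal_labels: np.ndarray, alpha: float = 0.1) -> float:
    """Split-conformal threshold q_hat computed on a held-out calibration set.

    Nonconformity score: 1 minus the probability the model assigns to the
    reference label (a common score for classification; assumed here).
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected quantile level ceil((n + 1)(1 - alpha)) / n, capped at 1
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(scores, level, method="higher"))


def prediction_sets(test_probs: np.ndarray, q_hat: float) -> np.ndarray:
    """Boolean mask of shape (n_test, n_classes): label k is in the set if 1 - p_k <= q_hat."""
    return (1.0 - test_probs) <= q_hat


# Toy binary moderation task (hypothetical data): 0 = acceptable, 1 = toxic
cal_probs = np.array([
    [0.90, 0.10], [0.80, 0.20], [0.30, 0.70], [0.60, 0.40], [0.20, 0.80],
    [0.95, 0.05], [0.40, 0.60], [0.70, 0.30], [0.10, 0.90], [0.55, 0.45],
])
cal_labels = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 1])
test_probs = np.array([[0.85, 0.15], [0.50, 0.50], [0.30, 0.70]])

q_hat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
sets = prediction_sets(test_probs, q_hat)
set_sizes = sets.sum(axis=1)  # size 1: model commits to one label; size 2: uncertain
print(q_hat, set_sizes)
```

Prediction-set size is one way to read model confidence under this sketch: a singleton set means the model commits to a single label at the chosen coverage level, while a set containing both labels signals uncertainty.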


Tags: Evals, AI safety, Alignment, Benchmarks

Summary generated by Claude — human-verified
