arXiv cs.CL

The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment

Signal: 82
Hype: 15
In three lines
A geometric study showing that inter-LLM agreement on subjective evaluations does not reflect human alignment. Across 41 LLM judges and 8 Indic languages, models use only 30-50% of the human score range, and their evaluation axis is nearly orthogonal to that of humans (87-89° vs 78-81°). LLM-LLM agreement (r≈0.35) exceeds LLM-human agreement (r≈0.27-0.32). Only post-hoc calibration improves all rubrics.
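To make the quantities in the summary concrete, here is a minimal sketch of how score-range usage, Pearson r, and a simple post-hoc calibration could be computed for one judge. Everything in it is an assumption for illustration: the toy scores are invented, the helper names (`range_fraction`, `fit_affine`) are hypothetical, and the affine least-squares fit is only one simple form of calibration, not necessarily the one used in the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def range_fraction(judge, human):
    """Share of the human score range that the judge's scores span."""
    return (max(judge) - min(judge)) / (max(human) - min(human))


def fit_affine(judge, human):
    """Least-squares (a, b) so that a * judge + b approximates human."""
    n = len(judge)
    mj, mh = sum(judge) / n, sum(human) / n
    var = sum((j - mj) ** 2 for j in judge)
    a = sum((j - mj) * (h - mh) for j, h in zip(judge, human)) / var
    return a, mh - a * mj


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Toy data (invented): human ratings on a 1-5 scale and a judge
# whose scores are compressed into a narrow band.
human = [1, 2, 3, 4, 5, 2, 4]
judge = [3.0, 3.2, 3.5, 3.9, 4.2, 3.6, 3.4]

a, b = fit_affine(judge, human)
calibrated = [a * j + b for j in judge]

print("range used:", range_fraction(judge, human))
print("r (judge vs human):", pearson(judge, human))
print("MSE before:", mse(judge, human), "after:", mse(calibrated, human))
```

An affine rescaling with positive slope leaves Pearson r unchanged, so a calibration of this kind addresses range compression and offset rather than disagreement about ranking.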
Evals, Alignment, Benchmarks, Papers

Summary generated by Claude — human-verified