The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment
Signal: 82 | Hype: 15
In three lines
A geometric study showing that inter-LLM agreement on subjective evaluations does not reflect human alignment. Across 41 LLM judges and 8 Indic languages, models use only 30-50% of the human score range, and their evaluation axis is nearly orthogonal to the human one (87-89° vs 78-81°). LLM-LLM agreement (r≈0.35) exceeds LLM-human agreement (r≈0.27-0.32). Only post-hoc calibration improves all rubrics.
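To make the quantities in this summary concrete, here is a minimal Python sketch. It is not the study's method: it assumes r is a Pearson correlation, measures range compression as the ratio of score spreads, and uses a least-squares affine map as one simple form of post-hoc calibration (the summary does not say which calibration the study used). All function names and the toy scores are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def range_ratio(judge, human):
    """Fraction of the human score range that the judge actually uses."""
    return (max(judge) - min(judge)) / (max(human) - min(human))


def fit_affine_calibration(judge, human):
    """Least-squares fit of human ~= a * judge + b (assumed calibration form)."""
    n = len(judge)
    mj, mh = sum(judge) / n, sum(human) / n
    var_j = sum((j - mj) ** 2 for j in judge)
    a = sum((j - mj) * (h - mh) for j, h in zip(judge, human)) / var_j
    b = mh - a * mj
    return a, b


# Made-up scores on a 1-5 rubric, for illustration only.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
judge = [2.8, 3.4, 3.2, 4.0, 4.4]  # compressed into the upper-middle of the scale

a, b = fit_affine_calibration(judge, human)
calibrated = [a * j + b for j in judge]

print("range ratio before:", range_ratio(judge, human))
print("range ratio after: ", range_ratio(calibrated, human))
print("Pearson r before:  ", pearson(judge, human))
print("Pearson r after:   ", pearson(calibrated, human))
```

An affine map with positive slope leaves Pearson r unchanged, so calibration of this kind corrects only scale and offset. It cannot repair disagreement about which items deserve higher scores.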
Summary generated by Claude — human-verified