arXiv cs.CL·3 June 2026

The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment

In three lines

A geometric study showing that inter-LLM agreement on subjective evaluations does not reflect human alignment. Across 41 LLM judges and 8 Indic languages, models use only 30-50% of the human score range, and their evaluation axis is nearly orthogonal to the human one (87-89° vs 78-81°). LLM-LLM agreement (r≈0.35) exceeds LLM-human agreement (r≈0.27-0.32). Only post-hoc calibration improves all rubrics.
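To make the quantities in the summary concrete, here is a minimal sketch of how score-range usage, Pearson agreement, and a simple post-hoc calibration could be computed for one judge. Everything in it is an assumption for illustration: the function names, the toy scores, and the choice of a least-squares linear map as the calibration step are not taken from the paper, whose exact procedure is not described here.

```python
import numpy as np


def range_usage(judge_scores, human_scores):
    """Fraction of the human score range that the judge's scores span."""
    judge_span = np.max(judge_scores) - np.min(judge_scores)
    human_span = np.max(human_scores) - np.min(human_scores)
    return float(judge_span / human_span)


def pearson_r(a, b):
    """Pearson correlation between two score vectors."""
    return float(np.corrcoef(a, b)[0, 1])


def fit_linear_calibration(judge_scores, human_scores):
    """Least-squares slope and intercept mapping judge scores to the human scale.

    Hypothetical calibration step: a plain linear fit, assumed for illustration.
    """
    slope, intercept = np.polyfit(judge_scores, human_scores, deg=1)
    return float(slope), float(intercept)


def mean_squared_error(a, b):
    """Mean squared difference between two score vectors."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))


# Toy data (invented): human ratings on a 1-5 scale, and a judge that
# compresses its scores into the upper-middle part of that scale.
human = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
judge = np.array([2.5, 3.0, 3.2, 4.0, 4.5])

slope, intercept = fit_linear_calibration(judge, human)
calibrated = slope * judge + intercept

usage = range_usage(judge, human)        # share of the human range used
r_before = pearson_r(judge, human)       # agreement before calibration
r_after = pearson_r(calibrated, human)   # agreement after calibration
mse_before = mean_squared_error(judge, human)
mse_after = mean_squared_error(calibrated, human)
```

One property of this particular sketch is worth noting: a linear map with positive slope stretches the compressed range and lowers squared error, but it leaves Pearson r unchanged, so it addresses scale compression rather than a misaligned evaluation axis.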

Evals Alignment Benchmarks Papers

Summary generated by Claude — human-verified
