Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?
Signal: 75
Hype: 15
In three lines: An arXiv paper proposes a formal statistical framework for combining LLM and human evaluations. It uses a doubly robust estimator, treating the missing human ratings as a missing-data problem, to determine how many human ratings are needed for benchmark validation. The required sample size depends on the predictability of LLM judgments.
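To make the doubly robust idea concrete, here is a minimal sketch of a generic augmented inverse-probability-weighted (AIPW) estimate of the mean human rating. It is not taken from the paper: the function name `doubly_robust_mean`, the use of the raw LLM score as the outcome prediction, and the single known review probability `sample_prob` are all assumptions made for illustration.

```python
def doubly_robust_mean(llm_scores, human_ratings, sample_prob):
    """Generic AIPW-style estimate of the mean human rating (illustrative only).

    llm_scores:    LLM judge score for every item, used here as the prediction
                   of the human rating (an assumption of this sketch).
    human_ratings: human rating per item, or None where no human reviewed it.
    sample_prob:   assumed known probability that an item was sent for review.
    """
    total = 0.0
    for pred, rating in zip(llm_scores, human_ratings):
        # Start from the LLM prediction, which is available for every item.
        correction = 0.0
        if rating is not None:
            # Reweight the observed human-minus-LLM residual by 1 / sample_prob
            # so the reviewed items stand in for the unreviewed ones.
            correction = (rating - pred) / sample_prob
        total += pred + correction
    return total / len(llm_scores)


# Four items, two of them reviewed by a human, each with probability 0.5.
estimate = doubly_robust_mean(
    llm_scores=[1.0, 2.0, 3.0, 4.0],
    human_ratings=[2.0, None, 3.0, None],
    sample_prob=0.5,
)
print(estimate)  # 3.0
```

In this kind of estimator, the closer the LLM scores track the human ratings, the smaller the residuals and the less the estimate varies with which items happened to be reviewed, which is the intuition behind needing fewer human reviews when LLM judgments are predictable.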
Summary generated by Claude — human-verified