arXiv cs.LG · 19 May 2026

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

In three lines: This arXiv paper proposes a formal statistical framework for combining LLM and human evaluations. It uses a doubly robust estimator, treating unreviewed items as a missing-data problem, to determine the optimal number of human ratings needed for benchmark validation, based on LLM judgment predictability.
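To make the idea concrete, here is a minimal sketch of a textbook doubly robust (augmented inverse-probability-weighted) mean estimator. It is not taken from the paper: the function name `doubly_robust_mean`, the use of the raw LLM score as the outcome model, and the assumption that items are sent to human reviewers uniformly at random with a known probability are all illustrative assumptions.

```python
from statistics import mean


def doubly_robust_mean(llm_scores, human_scores, sampling_prob):
    """Estimate the mean human rating from LLM scores plus a human-rated subset.

    llm_scores:    LLM judge score for every item.
    human_scores:  human rating per item, or None where no human reviewed it.
    sampling_prob: assumed known probability that any item is human-reviewed
                   (hypothetical setup: uniform random selection).
    """
    if len(llm_scores) != len(human_scores):
        raise ValueError("llm_scores and human_scores must have the same length")
    if not 0 < sampling_prob <= 1:
        raise ValueError("sampling_prob must be in (0, 1]")

    terms = []
    for llm, human in zip(llm_scores, human_scores):
        # Start from the LLM score; where a human rating exists, add the
        # inverse-probability-weighted residual (human - llm) as a correction.
        correction = 0.0 if human is None else (human - llm) / sampling_prob
        terms.append(llm + correction)
    return mean(terms)


# Usage: four items, two of them human-reviewed at an assumed 50% sampling rate.
estimate = doubly_robust_mean(
    llm_scores=[0.5, 0.5, 0.5, 0.5],
    human_scores=[1.0, None, 0.0, None],
    sampling_prob=0.5,
)
print(estimate)
```

In an estimator of this form, the correction term depends only on the residuals between human and LLM scores. The closer the LLM judge tracks human ratings, the smaller those residuals, and the fewer human reviews are needed to reach a given precision.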

Evals Papers AI safety

Summary generated by Claude — human-verified
