arXiv cs.CL · 19 May 2026

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?


In three lines

An arXiv paper proposes a formal framework for combining LLM and human evaluations. It uses a doubly robust estimator from the missing-data literature to determine the optimal number of human reviews needed. In a two-stage sampling design, the LLM's role shifts from substitutive to auxiliary.
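To make the idea concrete, here is a minimal sketch of a generic doubly robust (augmented inverse-probability-weighted) mean estimator from the missing-data literature. It is not taken from the paper and may differ from the authors' estimator. The function name `doubly_robust_mean`, its parameters, and the assumption that every item is sent to human review with the same known probability are all hypothetical choices made for illustration.

```python
def doubly_robust_mean(llm_scores, human_scores, sampling_prob):
    """Estimate the mean human score from LLM scores plus a human-reviewed subset.

    llm_scores: LLM judge score for every item (stage 1).
    human_scores: dict mapping item index -> human score for reviewed items (stage 2).
    sampling_prob: assumed known, uniform probability of an item being human-reviewed.
    """
    n = len(llm_scores)
    # Plug-in term: average LLM score over all items.
    plug_in = sum(llm_scores) / n
    # Correction term: inverse-probability-weighted human-minus-LLM residuals,
    # which offsets systematic bias in the LLM judge.
    correction = sum(
        (human - llm_scores[i]) / sampling_prob
        for i, human in human_scores.items()
    ) / n
    return plug_in + correction


# Toy data (made up): four items, two of them reviewed by humans.
estimate = doubly_robust_mean([0.8, 0.6, 0.9, 0.4], {0: 1.0, 2: 0.5}, 0.5)
```

In this form the LLM scores act as an auxiliary signal: they reduce variance when they track human judgments well, while the human residuals keep the estimate anchored to human ratings.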


Evals, Benchmarks, AI safety, Alignment

Summary generated by Claude — human-verified
