arXiv cs.AI·29 May 2026

Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth

In three lines: A large-scale literature search study finds that a Deep Research pipeline raises recall from below 20% to above 80% on RollingEval-Jun25, a 250-paper benchmark. It also critically examines human reference lists as ground truth: only 51% of human-cited references were judged moderately relevant, versus 86-88% for the best AI re-rankers. Humans cite direct collaborators 2.5x more often.

RAG Evals Benchmarks

Summary generated by Claude — human-verified
