Back to feed
arXiv cs.AI·

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

Signal
78
Hype
25
In three linesDeepWeb-Bench is a deep research benchmark evaluating 9 frontier models on tasks requiring massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. Errors stem primarily from derivation and calibration (>70%), not retrieval (12-14%). Strong and weak models fail differently: incomplete derivation vs hallucinated precision.
Read source
Your take?
BenchmarksReasoningAI AgentsRAG

Summary generated by Claude — human-verified