arXiv cs.AI·22 May 2026

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

Signal

Hype

In three linesDeepWeb-Bench is a deep research benchmark evaluating 9 frontier models on tasks requiring massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. Errors stem primarily from derivation and calibration (>70%), not retrieval (12-14%). Strong and weak models fail differently: incomplete derivation vs hallucinated precision.

Read source

Your take?

Benchmarks Reasoning AI Agents RAG

Summary generated by Claude — human-verified

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

Other angles on this story