arXiv cs.CL·28 May 2026

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

In three lines: VibeSearchBench evaluates LLM agents on collaborative multi-turn search in real-world contexts. The benchmark comprises 200 bilingual (Chinese/English) tasks across 20 domains with schema-free knowledge graphs. The best of the seven frontier models tested reaches an F1 of 30.30, exposing gaps in long-context reasoning and proactive intent elicitation.

Benchmarks · AI Agents · Reasoning

Summary generated by Claude — human-verified
