VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild
Signal
72
Hype
25
In three lines
VibeSearchBench evaluates LLM agents on collaborative, multi-turn search in real-world contexts. The benchmark comprises 200 bilingual (Chinese/English) tasks across 20 domains, built on schema-free knowledge graphs. Of the seven frontier models tested, the best reaches an F1 of only 30.30, exposing gaps in long-context reasoning and proactive intent elicitation.
Summary generated by Claude — human-verified