arXiv cs.AI·27 May 2026

It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

In three lines: A study of 432 experiments across 6 models (4 capability tiers) tests whether higher-capability models need less structural guidance. The results refute a monotone relationship: Gemini 2.5 Flash performance drops by 29-38 percentage points as harness verbosity increases, while Qwen3.5-122B (reasoning) achieves 91.7% VTSR with a strict harness. The paper also introduces a six-label failure taxonomy.

AI Agents Evals Reasoning

Summary generated by Claude — human-verified
