It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers
Signal: 78
Hype: 15
In three lines
A study of 432 experiments across 6 models in 4 capability tiers tests whether higher-capability models need less structural guidance. The results refute a monotone relationship: Gemini 2.5 Flash performance drops 29-38 pp as harness verbosity increases. Qwen3.5-122B (reasoning) achieves 91.7% VTSR with a strict harness. The study also introduces a six-label failure taxonomy.
Summary generated by Claude — human-verified