arXiv cs.AI

It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

Signal: 78
Hype: 15
In three lines: A study of 432 experiments across 6 models in 4 capability tiers tests whether higher-capability models need less structural guidance. The results refute a monotone relationship: Gemini 2.5 Flash performance drops by 29 to 38 percentage points (pp) as harness verbosity increases, while Qwen3.5-122B (reasoning) reaches 91.7% VTSR with a strict harness. The paper also introduces a six-label failure taxonomy.
AI Agents, Evals, Reasoning

Summary generated by Claude — human-verified