ContractBench: Can LLM Agents Preserve Observation Contracts?
Signal: 82
Hype: 15
In three lines
ContractBench benchmarks LLM agents' ability to preserve observation contracts (temporally valid, byte-level intact artifacts) in API calls. Of the 38 models tested, none exceeds 80%; Claude-Opus-4.6 leads at 77.8%. The results show that integrity and validity failures are uncorrelated with model size, and that the GPT-5 family regresses non-monotonically despite larger scale.
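To make the two properties concrete, here is a minimal sketch of what checking an observation contract could look like. It is not ContractBench's harness: the `Observation` class, the `ttl_seconds` field, and `check_contract` are hypothetical names, and it assumes that "byte-level intact" means the forwarded bytes match the observed bytes exactly and that "temporally valid" means the artifact is used within a fixed time window.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """An artifact returned by an earlier API call (hypothetical structure)."""
    payload: bytes       # raw bytes exactly as observed
    observed_at: float   # time of observation, in seconds
    ttl_seconds: float   # assumed validity window


def check_contract(obs: Observation, forwarded: bytes, now: float) -> dict:
    """Report which of the two assumed contract properties hold."""
    # Integrity: the forwarded bytes must hash identically to the observed bytes.
    intact = hashlib.sha256(forwarded).digest() == hashlib.sha256(obs.payload).digest()
    # Validity: the artifact must be used before its assumed window closes.
    valid = (now - obs.observed_at) <= obs.ttl_seconds
    return {"integrity": intact, "validity": valid}


obs = Observation(payload=b"tok_abc123\n", observed_at=100.0, ttl_seconds=30.0)

# Forwarded unchanged and in time: both properties hold.
print(check_contract(obs, b"tok_abc123\n", now=110.0))

# Trailing newline dropped and window exceeded: both properties fail.
print(check_contract(obs, b"tok_abc123", now=140.0))
```

The second call illustrates the kind of failure the summary describes: an agent that "tidies" an artifact by dropping a trailing newline breaks byte-level integrity even though the value looks the same to a human reader.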
Summary generated by Claude — human-verified