arXiv cs.AI·19 May 2026

ContractBench: Can LLM Agents Preserve Observation Contracts?

In three lines: ContractBench benchmarks whether LLM agents preserve observation contracts (temporally valid, byte-level intact artifacts) in API calls. None of the 38 models tested exceeds 80%; Claude-Opus-4.6 leads at 77.8%. Integrity and validity failures are uncorrelated with model size, and results within the GPT-5 family are non-monotonic, with regressions despite larger scale.

AI Agents Benchmarks Claude GPT Evals

Summary generated by Claude — human-verified
