arXiv cs.CL · 25 May 2026

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

In three lines

An audit of 11 long-context reasoning benchmarks finds that none jointly controls task position, filler content, and context length. An evaluation of 9 LLMs using Context Rot Evaluation (CRE) reveals sharp accuracy drops when the target task moves from the end of the context to the middle (e.g., Mimo-v2-Flash: -88pp at 64K). Newer model releases show reduced positional vulnerability.
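To make the three controlled variables concrete, here is a minimal Python sketch of how a single test prompt could be assembled with task position, filler content, and context length set independently. It is an illustration only, not the paper's CRE implementation: the function names `build_prompt` and `positional_drop_pp` are hypothetical, and whitespace-separated word counts stand in for a real tokenizer.

```python
def build_prompt(task: str, filler_units: list[str], position: float, budget: int) -> str:
    """Embed `task` among filler passages at a relative position.

    position: 0.0 puts the task first, 0.5 in the middle, 1.0 last.
    budget:   rough context length, counted in whitespace-separated words
              (an assumption; a real harness would count model tokens).
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")

    task_len = len(task.split())
    chosen: list[str] = []
    used = 0
    # Take filler passages in order until the budget left after the task is spent.
    for unit in filler_units:
        n = len(unit.split())
        if used + n > budget - task_len:
            break
        chosen.append(unit)
        used += n

    # Insert the task at the requested relative position among the filler.
    cut = round(position * len(chosen))
    return "\n".join(chosen[:cut] + [task] + chosen[cut:])


def positional_drop_pp(acc_end: float, acc_middle: float) -> float:
    """Accuracy change in percentage points when the task moves from end to middle."""
    return (acc_middle - acc_end) * 100.0
```

Holding `filler_units` and `budget` fixed while varying only `position` isolates the positional effect, so a negative value from `positional_drop_pp` is read the same way as the drops quoted in the summary.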

Benchmarks Reasoning Evals

Summary generated by Claude — human-verified
