Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks
Signal: 78
Hype: 15
In three lines
An audit of 11 long-context reasoning benchmarks finds that none jointly controls task position, filler content, and context length.
An evaluation of 9 LLMs using Context Rot Evaluation (CRE) reveals sharp accuracy drops when the target task moves from the end of the context to the middle (e.g., Mimo-v2-Flash: -88 pp at 64K).
Newer model releases show reduced positional vulnerability.
Summary generated by Claude — human-verified