arXiv cs.AI · 19 May 2026

Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs

In three lines: A new metric decomposes the token efficiency of reasoning LLMs. The paper introduces a trace-optional evaluation protocol that separates completion rate, conditional correctness, and generated length. It evaluates 14 open-weight models on CogniLoad, GSM8K, ProofWriter, and ZebraLogic, and identifies three distinct failure modes: logic-limited, context-limited, and verbosity-limited.
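The three quantities named in the summary can be illustrated with a minimal sketch. This is not the paper's definition: the `Run` record, the `decompose` function, and the "correct answers per 1k tokens" ratio are all hypothetical names and choices made here for illustration. The only part that holds by construction is the identity accuracy = completion rate × conditional correctness, which follows whenever an unfinished run counts as incorrect.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    """One model attempt (hypothetical record, not the paper's schema)."""
    completed: bool          # a final answer was produced within the budget
    correct: Optional[bool]  # None when the run did not complete
    tokens: int              # number of generated tokens


def decompose(runs: list[Run]) -> dict[str, float]:
    """Split overall accuracy into completion rate and conditional
    correctness, and report mean generated length alongside them."""
    if not runs:
        raise ValueError("need at least one run")

    n = len(runs)
    done = [r for r in runs if r.completed]

    completion_rate = len(done) / n
    # Correctness measured only over runs that actually finished.
    conditional_correctness = (
        sum(1 for r in done if r.correct) / len(done) if done else 0.0
    )
    mean_tokens = sum(r.tokens for r in runs) / n

    # Unfinished runs count as wrong, so accuracy factorises exactly.
    accuracy = completion_rate * conditional_correctness

    # Assumed efficiency ratio: expected correct answers per 1000 tokens.
    per_1k = 1000 * accuracy / mean_tokens if mean_tokens > 0 else 0.0

    return {
        "completion_rate": completion_rate,
        "conditional_correctness": conditional_correctness,
        "mean_tokens": mean_tokens,
        "accuracy": accuracy,
        "correct_per_1k_tokens": per_1k,
    }


if __name__ == "__main__":
    sample = [
        Run(True, True, 100),
        Run(True, False, 300),
        Run(False, None, 600),  # ran out of budget before answering
        Run(True, True, 200),
    ]
    for name, value in decompose(sample).items():
        print(f"{name}: {value:.4f}")
```

Read this way, a low completion rate would point toward a context-limited model, low conditional correctness toward a logic-limited one, and a high mean token count with otherwise good numbers toward a verbosity-limited one. That mapping is an interpretation of the summary's labels, not a claim about how the paper assigns them.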

Reasoning Evals Benchmarks

Summary generated by Claude — human-verified
