arXiv cs.CL·19 May 2026

Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs

In three lines: This arXiv paper introduces a trace-optional evaluation protocol that decomposes the token efficiency of reasoning LLMs. It analyzes 14 open-weight models on CogniLoad, GSM8K, ProofWriter, and ZebraLogic by separating completion rate, conditional correctness, and generated length. It identifies three failure modes: logic-limited, context-limited, and verbosity-limited.
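The three quantities named above can be illustrated with a minimal sketch. The paper's exact definitions are not given in this summary, so everything here is an assumption: the `Run` record, the `decompose` function, and the formulas are hypothetical. The sketch treats an incomplete response as incorrect, so accuracy factors into completion rate times correctness among completed responses, and it charges all generated tokens (including those from failed runs) to the correct answers.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Run:
    """One model response to one benchmark item (hypothetical record layout)."""
    completed: bool          # a final answer was produced within the budget
    correct: Optional[bool]  # None when no final answer was produced
    tokens: int              # number of generated tokens


def decompose(runs: Sequence[Run]) -> dict:
    """Split accuracy and token cost into separate factors (assumed formulas)."""
    n = len(runs)
    if n == 0:
        raise ValueError("need at least one run")

    done = [r for r in runs if r.completed]
    n_correct = sum(1 for r in done if r.correct)
    total_tokens = sum(r.tokens for r in runs)

    # Fraction of items for which the model produced a final answer.
    completion_rate = len(done) / n
    # Fraction of completed responses that are correct.
    conditional_correctness = n_correct / len(done) if done else 0.0

    return {
        "completion_rate": completion_rate,
        "conditional_correctness": conditional_correctness,
        # Incomplete counts as incorrect, so accuracy is the product.
        "accuracy": completion_rate * conditional_correctness,
        "mean_tokens": total_tokens / n,
        # All tokens spent, divided by the number of correct answers obtained.
        "tokens_per_correct": total_tokens / n_correct if n_correct else float("inf"),
    }
```

Read this way, two models with the same accuracy can differ in why they are inefficient: one may fail to finish, another may finish but answer wrongly, and a third may answer correctly at a much higher token cost. How these factors map onto the paper's logic-limited, context-limited, and verbosity-limited categories is not specified in this summary.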

Reasoning Evals Benchmarks

Summary generated by Claude — human-verified
