Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs
Signal: 78
Hype: 15
In three lines: The paper proposes a new metric that decomposes the token efficiency of reasoning LLMs, using a trace-optional evaluation protocol that separates completion rate, conditional correctness, and generated length. It evaluates 14 open-weight models on CogniLoad, GSM8K, ProofWriter, and ZebraLogic. It identifies three distinct failure modes: logic-limited, context-limited, and verbosity-limited.
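To make the three-way split concrete, here is a minimal sketch of how completion rate, conditional correctness, and generated length could be computed from per-problem results. The `Run` record, the `decompose` function, and the "correct answers per 1,000 tokens" ratio are illustrative assumptions, not the paper's actual definitions. The only identity relied on is that overall accuracy equals completion rate times correctness among completed runs, when an incomplete run counts as incorrect.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One model attempt at one problem (hypothetical record layout)."""
    completed: bool  # the model emitted a final answer within its budget
    correct: bool    # the final answer matched the reference
    tokens: int      # number of generated tokens


def decompose(runs):
    """Split accuracy and length into separately reportable factors."""
    n = len(runs)
    if n == 0:
        raise ValueError("need at least one run")

    done = [r for r in runs if r.completed]
    completion_rate = len(done) / n
    # Correctness is measured only over runs that produced an answer.
    conditional_correctness = (
        sum(r.correct for r in done) / len(done) if done else 0.0
    )
    mean_tokens = sum(r.tokens for r in runs) / n

    # Incomplete runs count as wrong, so accuracy factors exactly.
    accuracy = completion_rate * conditional_correctness
    # Assumed efficiency ratio: expected correct answers per 1,000 tokens.
    correct_per_1k = 1000 * accuracy / mean_tokens if mean_tokens else 0.0

    return {
        "completion_rate": completion_rate,
        "conditional_correctness": conditional_correctness,
        "mean_tokens": mean_tokens,
        "accuracy": accuracy,
        "correct_per_1k_tokens": correct_per_1k,
    }


# Made-up example data, for illustration only.
runs = [
    Run(completed=True, correct=True, tokens=100),
    Run(completed=True, correct=False, tokens=300),
    Run(completed=False, correct=False, tokens=600),
    Run(completed=True, correct=True, tokens=200),
]
result = decompose(runs)
```

Reporting the factors separately shows which one is pulling efficiency down: a low completion rate, low correctness among completed runs, or a high token count. This loosely parallels the context-limited, logic-limited, and verbosity-limited modes named in the summary, though that mapping is an interpretation rather than something the summary states.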
Summary generated by Claude — human-verified