Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs
Signal: 78
Hype: 15
In three lines

arXiv paper introducing a trace-optional evaluation protocol that decomposes the token efficiency of reasoning LLMs. Analyzes 14 open-weight models on CogniLoad, GSM8K, ProofWriter, and ZebraLogic by separating completion rate, conditional correctness, and generated length. Identifies three failure modes: logic-limited, context-limited, and verbosity-limited.
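
The three-way split named in the summary (completion rate, conditional correctness, generated length) can be illustrated with a small sketch. This is one plausible reading, not the paper's actual definitions: the `Run` record, the `decompose` function, and the "correct answers per 1k tokens" metric are all hypothetical names chosen for illustration, and the sample runs are made up.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One model attempt at one task (hypothetical record, not from the paper)."""
    completed: bool  # a final answer was produced within the token budget
    correct: bool    # the final answer matched the reference
    tokens: int      # number of generated tokens


def decompose(runs):
    """Split overall accuracy into completion rate times conditional correctness,
    then relate it to mean generated length (assumed efficiency metric)."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    done = [r for r in runs if r.completed]
    completion_rate = len(done) / n
    # Correctness measured only over runs that actually finished.
    conditional_correctness = (
        sum(r.correct for r in done) / len(done) if done else 0.0
    )
    accuracy = completion_rate * conditional_correctness
    mean_tokens = sum(r.tokens for r in runs) / n
    per_1k = 1000 * accuracy / mean_tokens if mean_tokens else 0.0
    return {
        "completion_rate": completion_rate,
        "conditional_correctness": conditional_correctness,
        "accuracy": accuracy,
        "mean_tokens": mean_tokens,
        "correct_per_1k_tokens": per_1k,
    }


# Made-up runs: three finish (two correctly), one runs out of budget.
runs = [
    Run(True, True, 400),
    Run(True, False, 600),
    Run(False, False, 1000),
    Run(True, True, 400),
]
metrics = decompose(runs)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under this reading, a low completion rate would point toward a context-limited model, low conditional correctness toward a logic-limited one, and high mean length with otherwise good scores toward a verbosity-limited one. That mapping is an interpretation of the summary, not a claim taken from the paper.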
Summary generated by Claude — human-verified