arXiv cs.AI

Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs

Signal: 78
Hype: 15
In three lines: The paper introduces a new metric that decomposes the token efficiency of reasoning LLMs. Its trace-optional evaluation protocol separates three factors: completion rate, conditional correctness, and generated length. It evaluates 14 open-weight models on CogniLoad, GSM8K, ProofWriter, and ZebraLogic, and identifies three distinct failure modes: logic-limited, context-limited, and verbosity-limited.
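To make the three factors concrete, here is a minimal sketch of how such a decomposition could be computed from per-problem results. It is not the paper's formula, which this summary does not give. The `Run` record, the `decompose` function, and the "correct answers per 1,000 generated tokens" ratio are all hypothetical names and choices for illustration. The only identity relied on is that overall accuracy equals completion rate times conditional correctness when an incomplete run counts as incorrect.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One model attempt at one problem (hypothetical record layout)."""
    completed: bool  # a final answer was produced within the token budget
    correct: bool    # the final answer matches the reference
    tokens: int      # number of generated tokens


def decompose(runs):
    """Split aggregate results into the three factors named in the summary.

    Assumption: a run that did not complete is never counted as correct.
    """
    if not runs:
        raise ValueError("need at least one run")

    n = len(runs)
    done = [r for r in runs if r.completed]

    # Factor 1: share of runs that produced any final answer.
    completion_rate = len(done) / n
    # Factor 2: share of completed runs whose answer is right.
    conditional_correctness = (
        sum(r.correct for r in done) / len(done) if done else 0.0
    )
    # Factor 3: average generated length over all runs.
    mean_tokens = sum(r.tokens for r in runs) / n

    # Accuracy follows from the first two factors.
    accuracy = completion_rate * conditional_correctness
    # Illustrative efficiency ratio (an assumption, not the paper's metric).
    correct_per_1k_tokens = 1000 * accuracy / mean_tokens if mean_tokens else 0.0

    return {
        "completion_rate": completion_rate,
        "conditional_correctness": conditional_correctness,
        "mean_tokens": mean_tokens,
        "accuracy": accuracy,
        "correct_per_1k_tokens": correct_per_1k_tokens,
    }


if __name__ == "__main__":
    sample = [
        Run(completed=True, correct=True, tokens=100),
        Run(completed=True, correct=False, tokens=300),
        Run(completed=False, correct=False, tokens=600),
        Run(completed=True, correct=True, tokens=200),
    ]
    for name, value in decompose(sample).items():
        print(f"{name}: {value:.3f}")
```

Read this way, the three failure modes would plausibly correspond to which factor drags the ratio down: low conditional correctness (logic-limited), low completion rate (context-limited), or high generated length (verbosity-limited). That mapping is an interpretation of the summary, not a statement from the paper.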
Reasoning, Evals, Benchmarks

Summary generated by Claude — human-verified