ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale
Signal: 82
Hype: 15
In three lines: ChaosBench-Logic v2 is a 40,886-question benchmark that evaluates the logical reasoning of 14 LLMs over 165 dynamical systems. Its CARE protocol exposes critical failures: regime-transition reasoning stays near random (MCC = 0.05), while first-order-logic (FOL) deduction reaches MCC = 0.52. Qwen 2.5-32B outperforms proprietary models on indicator diagnostics.
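To make the "near-random" reading of MCC concrete, here is a minimal sketch, assuming MCC in the summary denotes the Matthews correlation coefficient over binary (yes/no) answers. The function name `mcc` and the toy labels are illustrative assumptions, not data or code from the benchmark; the authors' evaluation pipeline may differ.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for two equal-length lists of booleans."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)              # true positives
    tn = sum((not t) and (not p) for t, p in pairs)  # true negatives
    fp = sum((not t) and p for t, p in pairs)        # false positives
    fn = sum(t and (not p) for t, p in pairs)        # false negatives
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy labels (NOT benchmark data): 80 "yes" answers, 20 "no" answers.
truth = [True] * 80 + [False] * 20
always_yes = [True] * 100              # a model that answers "yes" to everything
inverted = [not t for t in truth]      # a model that is always wrong

accuracy_always_yes = sum(t == p for t, p in zip(truth, always_yes)) / len(truth)

print(accuracy_always_yes)     # 0.8
print(mcc(truth, always_yes))  # 0.0
print(mcc(truth, truth))       # 1.0
print(mcc(truth, inverted))    # -1.0
```

In this toy case the always-"yes" model scores 80% accuracy but an MCC of 0.0, which is why an MCC of 0.05 reads as barely better than uninformed guessing, while 0.52 indicates a moderate positive correlation between predictions and ground truth.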
Summary generated by Claude — human-verified