arXiv cs.LG

ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale

Signal: 82 · Hype: 15
In three lines: ChaosBench-Logic v2 is a 40,886-question benchmark that evaluates the logical reasoning of 14 LLMs on 165 dynamical systems. The CARE protocol reveals critical failures: regime-transition reasoning remains near-random (MCC=0.05), while FOL deduction reaches MCC=0.52. Qwen 2.5-32B outperforms proprietary models on indicator diagnostics.
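For readers unfamiliar with the metric, MCC is the Matthews correlation coefficient, which ranges from -1 (every answer inverted) through 0 (no better than chance) to 1 (every answer correct). The sketch below is only an illustration of the standard binary formula, assuming true/false answers; the helper name `mcc` and the toy labels are hypothetical, and this is not the benchmark's own scoring code.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary (True/False) labels.

    Hypothetical helper for illustration; not taken from the benchmark.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when a row or column of the
        # confusion matrix is empty (e.g. a model that always says True).
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels, made up for this example.
labels = [True, True, False, False]
perfect = mcc(labels, labels)
inverted = mcc(labels, [not x for x in labels])
always_true = mcc(labels, [True] * len(labels))
half_right = mcc(labels, [True, False, True, False])
```

Because MCC uses all four cells of the confusion matrix, a model that answers "true" to everything scores 0 rather than the 50% accuracy it would get on balanced data, which is why a value like 0.05 reads as near-random.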
Benchmarks · Reasoning · Qwen · Evals

Summary generated by Claude — human-verified