ChaosBench-Logic v2 (arXiv:2605.24305) runs 40,886 questions across 165 dynamical systems and exposes a clear split in current reasoning capabilities: first-order logic deduction reaches MCC=0.52, but reasoning over regime transitions stays near-random at MCC=0.05. Models can chain formal inferences but fail as soon as they need to detect a qualitative behavioral shift in a system. Qwen 2.5-32B outperforms proprietary models on indicator diagnostics, which is worth tracking for teams benchmarking open-weight models on structured scientific tasks. The CARE evaluation protocol ships with the benchmark and is directly reusable for internal evals.
On the agentic side, QUEST (arXiv:2605.24218) demonstrates that competitive deep-research agents can be trained with just 8K synthetic tasks and RL, at model sizes from 2B to 35B. The models match or beat proprietary systems on 8 research benchmarks, particularly in citation grounding and report synthesis, which supports the hypothesis that training signal quality matters more than volume. LLM-AutoSciLab adds to this: 67.6% symbolic accuracy on ActiveSciBench (57 enzyme-kinetics tasks + 45 GRN tasks) with 2–5x fewer experimental calls than baselines. All three papers point to the same bottleneck: it is no longer raw generation capacity but reasoning structure and exploration efficiency.
Two more technical papers are worth flagging for product teams. Raon-Speech (9B, trained on 1.38M hours) outperforms Qwen2.5-Omni and Fun-Audio-Chat across 42 audio benchmarks in English and Korean, with a SpeechChat variant for real-time full-duplex conversation trained on 119K hours of dialogue. It is a useful reference if you're evaluating multilingual voice stacks. CSP-Atlas identifies 106 dedicated neural circuits in a sparse 8-layer transformer trained on Python code: up to 62.5% of the most active neurons at mid-to-late layers are specific to AST constructs. This isn't decorative interpretability: it opens a concrete path for auditing or constraining code model behavior at fine granularity.
ChaosBench-Logic v2 is a 40,886-question benchmark that evaluates the logical reasoning of 14 LLMs on 165 dynamical systems. The CARE protocol reveals critical failures: regime-transition reasoning remains near-random (MCC=0.05), while first-order logic (FOL) deduction reaches MCC=0.52. Qwen 2.5-32B outperforms proprietary models on indicator diagnostics.
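For readers less familiar with MCC (Matthews correlation coefficient), the sketch below shows how the two reported values read on a binary task. The confusion-matrix counts are hypothetical, chosen only to reproduce MCC values of 0.05 and 0.52 on a balanced set of 2,000 yes/no questions; they are not taken from the paper, and the `mcc` helper is an illustrative function rather than part of the CARE protocol.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is reported as 0 when a row or column is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts (assumption, not from the paper): 1,000 positives, 1,000 negatives.
regime_transition = mcc(tp=525, tn=525, fp=475, fn=475)  # 52.5% accuracy
fol_deduction = mcc(tp=760, tn=760, fp=240, fn=240)      # 76.0% accuracy

# A model that always answers "no" on a 10%-positive set is 90% accurate,
# yet its MCC is 0 because it carries no signal about the positive class.
always_no = mcc(tp=0, tn=900, fp=0, fn=100)

print(f"regime transitions: {regime_transition:.2f}")
print(f"FOL deduction:      {fol_deduction:.2f}")
print(f"always-no baseline: {always_no:.2f}")
```

On a balanced binary task like this one, MCC=0.05 corresponds to roughly 52.5% accuracy, which is barely better than a coin flip, while MCC=0.52 corresponds to about 76%. MCC is the more honest headline number when classes are imbalanced, since a trivial predictor scores 0 regardless of its accuracy.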
LLM-AutoSciLab proposes a closed-loop scientific discovery framework coupling hypothesis generation, hypothesis-conditioned experiment selection, and mechanism refinement. Evaluated on ActiveSciBench (57 enzyme-kinetics tasks, 45 gene-regulatory-network tasks), the system achieves 67.6% symbolic accuracy and 2-5x better sample efficiency than competing baselines.
Raon-Speech is a 9B multilingual speech language model (English/Korean) that understands and generates speech while preserving text capabilities. Trained on 1.38M hours of curated data, it outperforms 8 comparable audio models (including Qwen2.5-Omni and Fun-Audio-Chat) across 42 benchmarks. Raon-SpeechChat extends it with real-time full-duplex conversation trained on 119K hours of dialogue.
QUEST is a family of open-source models (2B to 35B) trained as deep research agents via a data synthesis pipeline and RL. With only 8K synthetic tasks, QUEST matches or exceeds proprietary systems across 8 research benchmarks and excels at citation grounding and report synthesis. Models, data, and training scripts are released.
The CSP-Atlas study identifies 106 dedicated neural circuits in a sparse 8-layer transformer trained on Python code. The circuits are organized by computational principles (atomicity, lexical ambiguity) rather than by semantics. Up to 62.5% of the most active neurons at mid-to-late layers are specific to AST constructs.