Topic

#Reasoning

In AI, reasoning refers to a model's ability to solve problems through multi-step logical thinking, beyond pattern matching. OpenAI's o3 is a key example: it breaks down a problem before producing an answer.

40 Articles

4 Sources

73 Avg. signal

Reddit r/LocalLLaMA·Jun 18

Quick thoughts on GLM-5.2 (Bonus: Censorship question answers)

GLM-5.2 shows excellent coherence over extremely long contexts and adaptive reasoning without excessive verbosity. The poster reports performance close to GPT-4.5 on heavy analysis and deep research, with faster inference than GLM-5.1. The model also has its own distinct conversational signature.

Qwen Reasoning Open source


arXiv cs.CL·Jun 18

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

arXiv study assessing LLM ability to interpret negation in figurative language. Researchers annotate an existing dataset and evaluate multiple models. Finding: negation combined with figurativeness presents a particular challenge, and performance depends heavily on prompt style.

Evals Prompt engineering Reasoning


arXiv cs.AI·Jun 18

CEO-Bench: Can Agents Play the Long Game?

CEO-Bench evaluates agents' ability to handle complex long-horizon tasks by simulating a 500-day startup operation. The agent manages pricing, marketing, and budgeting through a Python interface. Only Claude Opus 4.8 and GPT-5.5 exceed the $1M starting balance, and neither is consistently profitable.

AI Agents Benchmarks Reasoning


arXiv cs.AI·Jun 18

ForecastBench-Sim: A Simulated-World Forecasting Benchmark

ForecastBench-Sim is a forecasting benchmark built on Freeciv game simulations. Models receive a structured game state and predict hidden future states; the benchmark continues the simulation to score forecasts. Enables questions at arbitrary time horizons, counterfactual worlds, and rare events.

Benchmarks Reasoning Evals


arXiv cs.CL·Jun 18

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

JetFlow improves speculative decoding by combining parallel drafting efficiency with branch-wise causal conditioning. On H100 GPUs, it achieves 9.64x speedup on MATH-500 and 4.58x on open-ended conversations, outperforming existing tree-based methods on dense and MoE Qwen3 models.
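For readers new to the underlying idea, here is a minimal sketch of plain draft-then-verify speculative decoding with greedy acceptance. It is not JetFlow's parallel tree drafting: it drafts a single chain, and `target_model`, `draft_model`, and the acceptance rule are hypothetical toy stand-ins chosen only to make the mechanism visible.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token prefix to the next (greedy) token


def target_model(prefix: List[int]) -> int:
    # Hypothetical "large" model: a deterministic function of the prefix.
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_model(prefix: List[int]) -> int:
    # Hypothetical "small" model: agrees with the target except when the
    # prefix length is a multiple of 4, where it is deliberately wrong.
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    seq = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(seq) < goal:
        # 1. Draft k tokens sequentially with the cheap model.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. Verify. A real system scores all k positions in one batched
        #    target forward pass; here that pass is simulated position by
        #    position and counted once.
        verify_passes += 1
        for i in range(k):
            expected = target(seq + proposal[:i])
            if proposal[i] != expected:
                # First mismatch: keep the accepted prefix plus the
                # target's own token, discard the rest of the draft.
                seq.extend(proposal[:i] + [expected])
                break
        else:
            seq.extend(proposal)  # every drafted token was accepted
    return seq[:goal], verify_passes


prompt = [1, 2, 3]
spec_out, passes = speculative_decode(target_model, draft_model, prompt, n_new=12)
baseline = greedy_decode(target_model, prompt, n_new=12)
print(spec_out == baseline, passes)
```

With greedy acceptance the output matches ordinary greedy decoding of the target model token for token; the speedup comes from needing fewer target passes than generated tokens.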

Benchmarks Code generation Open source


arXiv cs.CL·Jun 18

LLM Parameters for Math Across Languages: Shared or Separate?

Mechanistic analysis of mathematical reasoning in multilingual LLMs. Math-associated parameters exhibit partial cross-lingual overlap, concentrated in intermediate layers. English produces the largest set of math-relevant parameters, while lower-resource languages reveal smaller parameter sets.

Reasoning Papers Benchmarks


arXiv cs.CL·Jun 18

Dual Dimensionality for Local and Global Attention

Researchers propose Distance-Adaptive Representation (DAR): reduce key/value dimensionality beyond a local window in decoder-only Transformers. Nearby tokens require full representations for next-token prediction, while distant tokens can use 1/4 original dimensionality without performance loss. Tested on 70M–410M models and 1B fine-tuning.
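As a rough back-of-the-envelope illustration of what the 1/4-dimensionality claim could mean for storage, the sketch below counts cached key and value scalars for one attention head. It assumes distant tokens are actually cached at the reduced dimension; the function name, the window size of 512, and the head dimension of 64 are hypothetical and not taken from the paper.

```python
from typing import Optional


def kv_cache_elements(seq_len: int, head_dim: int,
                      window: Optional[int] = None,
                      far_ratio: float = 0.25) -> int:
    """Count cached key + value scalars for a single attention head.

    window=None models a standard cache (every token at full dimension).
    Otherwise the most recent `window` tokens keep `head_dim` dimensions and
    older tokens keep `head_dim * far_ratio` dimensions (an assumption).
    """
    if window is None:
        return 2 * seq_len * head_dim
    near = min(seq_len, window)            # tokens inside the local window
    far = seq_len - near                   # tokens beyond it
    far_dim = int(head_dim * far_ratio)    # reduced dimensionality
    return 2 * (near * head_dim + far * far_dim)


full = kv_cache_elements(4096, 64)
reduced = kv_cache_elements(4096, 64, window=512)
ratio = reduced / full
print(full, reduced, ratio)
```

Under these assumed numbers the reduced cache holds about a third of the scalars of the full one; the saving approaches the 1/4 limit as the sequence grows relative to the window.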

Reasoning Infrastructure Benchmarks


arXiv cs.CL·Jun 18

BCL: Bayesian In-Context Learning Framework for Information Extraction

BCL is an optimization framework for information extraction using particle filtering and Bayesian updates to systematically refine label representations. It generalizes across sequence labeling and relation classification tasks, demonstrating consistent improvements over existing approaches across model scales.

Prompt engineering Reasoning Evals


arXiv cs.CL·Jun 18

TW-LegalBench: Measuring Taiwanese Legal Understanding

TW-LegalBench evaluates 13 LLMs on Taiwanese law using 16,000+ multiple-choice questions, 117 open-ended essays, and 14,000+ legal judgment prediction cases. Top models exceed the lawyer qualification threshold (11%) but fall short of the threshold for judges/prosecutors (1-2%). Models struggle to cite exact legal articles.

Benchmarks Evals Reasoning


arXiv cs.CL·Jun 18

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

ScholarSum introduces a hierarchical knowledge graph framework for abstractive scientific summarization. The system organizes documents into semantically coherent units, generates an initial draft, then refines it through iterative verification and rewriting to ensure logical coherence and factual faithfulness.

Papers RAG Reasoning


arXiv cs.CL·Jun 18

Approximate Structured Diffusion for Sequence Labelling

New approach combining diffusion and CRFs for sequence labelling in NLP. The method conditions a CRF on the full noisy label sequence, bypassing the span limitations of standard CRFs. Results: 16.5% error reduction on POS tagging.

Papers Reasoning Benchmarks


arXiv cs.CL·Jun 18

Improving Medical Communication using Rubric-Guided Counterfactual Recommendations

LM-guided counterfactual recommendation pipeline to improve medical communication in text-based telemedicine. System identifies interpretable features (tone, personalization, clarity, completeness) and recommends minimal communication changes predicted to increase positive feedback (+6.41% mean gain). Modifications preserve medical content and physician control.

Reasoning Evals RAG


arXiv cs.LG·Jun 18

Gaussian Mixture Attention: Linear-Time Sequence Mixing via Probabilistic Latent Routing

Gaussian Mixture Attention (GMA) replaces standard attention with probabilistic routing through K learned Gaussian mixture components. Queries and keys map to responsibility vectors in a shared latent space. GMA avoids explicit N×N matrix materialization, reducing memory complexity to O(NK) instead of O(N²). Competitive on long-context classification, but behind SDPA and Mamba on WikiText-103.
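The O(NK) claim can be made concrete with a small sketch of routing through K components. This is an illustrative reading, not the paper's exact formulation: the unit-variance, equal-weight Gaussians, the per-component mass normalization, and the non-causal setting are all assumptions, and `responsibilities`, `gma_mix`, and `explicit_mix` are hypothetical names. The explicit N×N version is included only to check that both give the same result.

```python
import math
import random
from typing import List

Vec = List[float]


def responsibilities(x: Vec, means: List[Vec]) -> Vec:
    # Posterior over K unit-variance, equal-weight Gaussian components,
    # i.e. a softmax over negative half squared distances (assumed form).
    logits = [-0.5 * sum((a - m) ** 2 for a, m in zip(x, mu)) for mu in means]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gma_mix(queries: List[Vec], keys: List[Vec], values: List[Vec],
            means: List[Vec]) -> List[Vec]:
    K, d_v = len(means), len(values[0])
    # Pass over keys: fold values into K per-component summaries.
    summary = [[0.0] * d_v for _ in range(K)]
    mass = [0.0] * K
    for key, val in zip(keys, values):
        r = responsibilities(key, means)
        for k in range(K):
            mass[k] += r[k]
            for c in range(d_v):
                summary[k][c] += r[k] * val[c]
    # Pass over queries: read from the K summaries. No N x N matrix exists.
    out = []
    for q in queries:
        r = responsibilities(q, means)
        out.append([sum(r[k] * summary[k][c] / mass[k] for k in range(K))
                    for c in range(d_v)])
    return out


def explicit_mix(queries: List[Vec], keys: List[Vec], values: List[Vec],
                 means: List[Vec]) -> List[Vec]:
    # Reference version that materializes the full N x N weight matrix.
    K = len(means)
    rq = [responsibilities(q, means) for q in queries]
    rk = [responsibilities(k, means) for k in keys]
    mass = [sum(r[k] for r in rk) for k in range(K)]
    out = []
    for i in range(len(queries)):
        w = [sum(rq[i][k] * rk[j][k] / mass[k] for k in range(K))
             for j in range(len(keys))]
        out.append([sum(wj * v[c] for wj, v in zip(w, values))
                    for c in range(len(values[0]))])
    return out


random.seed(0)
N, d, K = 6, 3, 2
rand_vecs = lambda n: [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
Q, Kmat, V, MU = rand_vecs(N), rand_vecs(N), rand_vecs(N), rand_vecs(K)
fast, slow = gma_mix(Q, Kmat, V, MU), explicit_mix(Q, Kmat, V, MU)
print(max(abs(a - b) for fa, sl in zip(fast, slow) for a, b in zip(fa, sl)))
```

In this sketch the only per-token state is a length-K responsibility vector, which is where the O(NK) figure comes from when K is much smaller than N.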

Reasoning Benchmarks Papers


arXiv cs.LG·Jun 18

Artemis: Anatomy-Resolved inTervention for Eliminating Multimodal NeuroImage confounderS

Artemis is a causal framework for graph neural networks addressing demographic confounders (age, sex) in multimodal brain imaging (fMRI + DTI). The method applies causal interventions at each brain region independently to learn invariant representations. Tested on ADNI, OASIS, and HCP benchmarks, it improves disease diagnosis and classification tasks.

Papers Reasoning Alignment


arXiv cs.LG·Jun 18

Ghost Attractor Networks: Basin-Structured Dynamical Decoders for Closed-Loop Sequential Generation

Ghost Attractor Networks introduce an efficient dynamical decoder for sequential generation in robotics. With 2.3M parameters, it matches the offline accuracy of a 1.07B-parameter Diffusion Transformer (462× fewer parameters, 32× lower latency). On LIBERO-10, phase conditioning improves success rate by 13.5 percentage points over MLP baseline.

Code generation Robotics Reasoning


arXiv cs.LG·Jun 18

Why SWAVE May Not Be All You Need: A Concept-Evolution Retrospective on Complex-Valued Recurrent Language Models

SWave is a complex-valued recurrent language model (169M parameters) trained on FineWeb-Edu. The paper documents its evolution across three phases, identifying structural failures (cos-domination collapse) and validating critical components (ComplexNorm, Wave Propagation Scan). Final PPL: 22.0 at step 89,861.

Papers Reasoning Benchmarks


arXiv cs.LG·Jun 18

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

A new LLM inference scheduler replaces explicit length prediction with lightweight statistical signals and dynamic priority boosting. It reduces P99 time-to-last-token (TTLT) by 35-50% versus SRPT with perfect length knowledge, and time-to-first-token (TTFT) by 34-47%, across production and open-source traces.

Benchmarks Infrastructure Reasoning


arXiv cs.LG·Jun 18

What Does the Weight Norm Control in Grokking? Logit-Scale Mediation under Cross-Entropy

Study on grokking (the delayed transition from memorization to generalization). The authors show that weight norm does not directly control grokking delay but acts through logit scale. Fixing the norm and varying the output temperature, they recover 85% of the delay by matching logit scale. The effect is loss-dependent (cross-entropy vs MSE). Logit scale and softmax saturation are the proximal variables.
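A toy calculation may help show why logit scale and output temperature are interchangeable under cross-entropy, and what softmax saturation does to the training signal. The logit direction and scales below are made-up values, and `ce_grad_norm` is a hypothetical helper; nothing here reproduces the paper's experiments.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [z / temperature for z in logits]
    top = max(scaled)                       # subtract max for stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ce_grad_norm(logits: List[float], label: int) -> float:
    # For cross-entropy, the gradient w.r.t. the logits is p - onehot(label).
    p = softmax(logits)
    return math.sqrt(sum((pi - (1.0 if i == label else 0.0)) ** 2
                         for i, pi in enumerate(p)))


direction = [1.0, 0.2, -0.5]    # fixed logit direction; class 0 is correct
scales = [1.0, 4.0, 16.0]

grad_norms = []
for s in scales:
    logits = [s * z for z in direction]
    # Multiplying logits by s gives the same probabilities as temperature 1/s.
    same = all(abs(a - b) < 1e-12
               for a, b in zip(softmax(logits), softmax(direction, 1.0 / s)))
    grad_norms.append(ce_grad_norm(logits, label=0))
    print(s, same, max(softmax(logits)), grad_norms[-1])
```

As the scale grows, the top probability approaches 1 and the cross-entropy gradient on an already-correct example shrinks toward zero, which is one simple way to picture saturation slowing further learning.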

Papers Reasoning Evals


arXiv cs.LG·Jun 18

Structured Representation Learning with Locally Linear Embeddings and Adaptive Feature Fusion

RL framework inspired by neuroscience that disentangles dynamics-specific and reward-specific features using locally linear embeddings (LLE) and adaptively fuses representations via attention mechanism. Improves learning efficiency on benchmark tasks compared to conventional RL approaches.

Reinforcement learning Reasoning Benchmarks


arXiv cs.AI·Jun 18

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

DeFAb is a benchmark of 372,648+ instances for evaluating defeasible abduction reasoning in language models. Best frontier models reach 65% under standard conditions but drop to 23.5% under rendering-robust evaluation, versus 100% for symbolic logic solvers. The benchmark includes three difficulty levels with polynomial-time verifiable gold standards.

Benchmarks Reasoning Evals


arXiv cs.AI·Jun 18

Optimizing Lithium Production Decisions under Geological, Demand, and Pricing Uncertainties: A POMDP Framework for Multi-Objective Decision Making

A POMDP framework optimizes lithium production decisions by incorporating geological, pricing, and demand uncertainties. POMDP solvers outperform human-inspired heuristics by dynamically adapting to price regimes (static, linear, exponential, stochastic) and optimally sequencing exploration, production, and technology choices.

Reasoning Reinforcement learning


arXiv cs.AI·Jun 18

What Must Generalist Agents Remember?

Theoretical paper on memory requirements for generalist agents. Proves that agents performing near-optimally across multiple domains must maintain distinct memory distributions at observational bottlenecks. Memory enables domain disambiguation, transition-model reconstruction, and planning.

AI Agents Reasoning Papers


arXiv cs.AI·Jun 18

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

WorldLines is a long-horizon embodied agent benchmark testing memory in dynamic household environments. The dataset includes temporally extended traces with dialogues, actions, and object/device state changes. ObsMem, an observer-grounded memory framework, maintains visibility-aware memories and action-native state trails for state-informed decisions.

AI Agents Benchmarks Reasoning


arXiv cs.AI·Jun 18

Generative-Model Predictive Planning for Navigation in Partially Observable Environments

BeliefDiffusion combines diffusion models and Model Predictive Control for navigation in partially observable environments. The framework generates multimodal belief distributions and plans efficient navigation strategies. Experiments on synthetic maps: outperforms RL and other generative approaches in success rate and path efficiency.

Reasoning Reinforcement learning Papers


arXiv cs.AI·Jun 18

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

RTSGameBench is a benchmark to evaluate strategic reasoning in Vision-Language Models (VLMs) using real-time strategy games. Built on Beyond All Reason, it offers multi-scenario evaluations, diagnostic mini-games targeting specific competencies, and a self-evolving generation framework. Current state-of-the-art VLMs fail at multi-agent coordination and complex task scaling.

Vision Reasoning Multi-agent


arXiv cs.AI·Jun 18

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

ThinkDeception introduces a progressive reinforcement learning framework for interpretable multimodal deception detection. Using MLLMs, it converts binary classification into explicit reasoning via Chain of Thought. VAC-GRPO with curriculum learning stratified into 4 difficulty tiers achieves SOTA on mainstream benchmarks.

Reasoning Reinforcement learning Vision


arXiv cs.AI·Jun 18

Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction

New formal theory (HACD-H) modeling emergence of social intelligence in long-term human-AI interaction. Unified framework integrating emotional adaptation, social memory, and personality consistency. Study on 14,700 conversation turns reveals negative correlation between social intelligence and social cognitive energy (r=-0.391, p<0.001), with developmental phase-transition patterns.

Reasoning AI Agents Papers


arXiv cs.AI·Jun 18

NeSyCat Torch: A Differentiable Tensor Implementation of Categorical Semantics for Neurosymbolic Learning

NeSyCat Torch unifies neurosymbolic semantics (classical, fuzzy, probabilistic, neural) under a single truth definition parametrized by monads. Implemented in PyTorch, JAX, and HaskTorch, the framework interprets computational symbols via neural networks. On MNIST addition, outperforms LTN and DeepProbLog in speed and accuracy.

Reasoning Reinforcement learning Papers


arXiv cs.CL·Jun 18

Continuous Audio Thinking for Large Audio Language Models

Continuous Audio Thinking (CoAT) adds a continuous latent workspace to large audio language models to preserve acoustic information (phonetics, prosody, affect, pitch) before text generation. Tested on Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo, CoAT improves performance on audio reasoning, music classification, and transcription with no additional decoding cost.

Reasoning Voice Qwen


arXiv cs.CL·Jun 18

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

PragReST is a self-supervised framework improving LLM pragmatic reasoning through counterfactual reasoning traces. Without human-labeled data, it combines supervised fine-tuning and reinforcement learning. On 4 benchmarks (PragMega, Ludwig, MetoQA, AltPrag), it gains +5.37% and +5.50% absolute for Qwen3-8B and Qwen3-14B.

Reasoning Reinforcement learning Fine-tuning


arXiv cs.CL·Jun 18

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

arXiv paper on improving long-context reasoning via data-centric approach rather than reward engineering. Data recipe targeting retrieval, multi-evidence synthesis, reasoning (~14K examples). Tests on Qwen3 (4B/8B/30B): +7.2/+3.2/+6.4 points across 7 long-context benchmarks, transfer to agentic tasks (+4.8 GAIA, +7.0 BrowseComp).

Reinforcement learning Reasoning AI Agents


arXiv cs.LG·Jun 18

A Link between Shock-wave Theory and Symmetry-reduced Stochastic Gradient Descent for Artificial Neural Networks

Mathematical link established between shock-wave theory and symmetry-quotiented stochastic gradient descent dynamics for neural networks. After quotienting parameter symmetries and entropy coarse-graining, effective dynamics satisfy a viscous Hamilton-Jacobi equation. Applied to MLPs, CNNs, Transformers, and mean-field networks.

Papers Reasoning Reinforcement learning


arXiv cs.LG·Jun 18

ThousandWorlds: A benchmark for climate emulation of potentially habitable exoplanets

ThousandWorlds is an ML benchmark for climate emulation of potentially habitable exoplanets. The dataset contains ~1800 simulations from 5 global climate models mapping 8 planetary parameters to 3D atmospheric fields. Three nested subsets and two evaluation protocols test 7 baselines; GP-based methods outperform standard deep learning.

Benchmarks Papers Reasoning


arXiv cs.LG·Jun 18

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

LLMZero uses LLM agents with tree search to discover adaptive RL training strategies. The system identifies that capacity parameters accumulate monotonically while regularization parameters oscillate. Across 4 GRPO tasks, discovered strategies outperform the base model by 9-140% and grid search by 6-15%.

Reinforcement learning AI Agents Reasoning


arXiv cs.LG·Jun 18

Task-Restricted Symmetries in Recurrent Weight Space

Study of functional redundancy in single-layer tanh RNNs using ordered real Schur coordinates. Authors identify nonnormal couplings removable with minimal loss on specific tasks (copy, flip-flop, sine generation), revealing task-dependent approximate functional invariances rather than universal weight-space symmetries.

Papers Reasoning


arXiv cs.LG·Jun 18

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

Study shows SFT overtraining can invert model rankings during RLVR fine-tuning. On Qwen2.5-Coder-3B, increasing SFT depth raises pre-RL pass@1 but reduces GRPO pass@10 from 0.806 to 0.481. Pre-RL entropy positively correlates with RLVR outcomes (ρ=+0.69). Two-stage entropy-based diagnostic identifies high-risk checkpoints.
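Since the finding hinges on pass@1 and pass@10 moving in opposite directions, a short sketch of how pass@k is commonly computed may be useful. It uses the widely used unbiased estimator from code-generation evaluation, 1 - C(n-c, k) / C(n, k) for n samples with c correct; whether the paper uses this exact estimator is an assumption, and the two "models" below are invented numbers, not the paper's data.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    # results holds one (n, c) pair per problem.
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical low-entropy model: always right on one problem, never on the other.
sharp = [(20, 20), (20, 0)]
# Hypothetical higher-entropy model: right a quarter of the time on both.
diverse = [(20, 5), (20, 5)]

for name, results in (("sharp", sharp), ("diverse", diverse)):
    print(name, mean_pass_at_k(results, 1), mean_pass_at_k(results, 10))
```

In this invented example the sharper model leads on pass@1 but trails on pass@10, which shows how a ranking can flip between the two metrics when sample diversity differs.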

Reinforcement learning Fine-tuning Reasoning


arXiv cs.LG·Jun 18

Beyond AHI: An Interpretable Causal-Discovery-Guided Framework for Sleep Recovery in Connected Health

Causal framework for sleep recovery scoring from multimodal polysomnography. Uses DAG learning on two cohorts (MESA n=1540, MrOS n=825) to identify five physiological domains (respiratory burden, hypoxia, fragmentation, architecture, autonomic regulation). Sleep Recovery Score (SRS) achieves 2.5× stronger alignment with perceived recovery than standard AHI.

Papers Reasoning Evals


arXiv cs.AI·Jun 18

CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework

CaVe-VLM-CoT is a modular agentic-RAG framework reducing VLM hallucinations through a five-stage closed-loop pipeline (Extractor, Retriever, Solver, Citation Injector, Verifier). Ungrounded claims trigger targeted re-retrieval. 23 component-wise metrics and CaVeScore measure citation faithfulness and cross-modal grounding. Results: 87.1% accuracy on ScienceQA, 55.2% on MMMU.

Vision RAG AI Agents


arXiv cs.AI·Jun 18

Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness

Xcientist is a research harness that externalizes research synthesis and experimental validation for AI scientists into inspectable, contract-governed processes. It organizes literature evidence, idea states, implementation plans, and repair traces as persistent research artifacts, eliminating claim drift where runnable artifacts no longer support the originally claimed mechanism.

AI Agents Reasoning Evals


arXiv cs.AI·Jun 18

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

Novel LLM personalization: store user facts as surgical edits in a hash-keyed memory table (Engram) instead of global LoRA. Reduces memory footprint by 33,000x, improves indirect-reasoning accuracy by 5.6x on average, and enables stacking multiple users without cross-contamination.

Fine-tuning Reasoning Papers
