Topic

#Reinforcement learning

Reinforcement learning is a method where an agent learns by receiving rewards or penalties based on its actions. DeepMind's AlphaGo used it to defeat the world's top Go players.
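
To make the reward-driven loop described above concrete, here is a minimal sketch using an epsilon-greedy agent on a two-armed bandit. The bandit, the function name `run_bandit`, and all parameter values are illustrative assumptions. This is not AlphaGo's method, which combined deep networks with tree search.

```python
import random

def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a hypothetical Bernoulli bandit.

    reward_probs[i] is the (assumed) chance that arm i pays a reward of 1.
    Returns the learned value estimates and how often each arm was pulled.
    """
    rng = random.Random(seed)
    values = [0.0] * len(reward_probs)  # running mean reward per arm
    counts = [0] * len(reward_probs)    # number of pulls per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick a random arm.
            action = rng.randrange(len(reward_probs))
        else:
            # Exploit: pick the arm with the highest estimated value.
            action = max(range(len(values)), key=values.__getitem__)
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Incremental mean update of the value estimate.
        values[action] += (reward - values[action]) / counts[action]
    return values, counts

values, counts = run_bandit([0.2, 0.8])
```

With these assumed payout probabilities, the agent should end up pulling the second arm far more often, because its estimated value rises with the rewards it receives.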

40 Articles

5 Sources

72 Avg. signal

arXiv cs.CL·Jun 18

Steerable Cultural Preference Optimization of Reward Models

Novel SCPO algorithm for training reward models that balance diverse cultural preferences across subcommunities. Achieves 7-point improvements for minority reward models on PRISM and GlobalOpinionQA (7 countries), with 280% better training data efficiency than full-finetuning.

Alignment Reinforcement learning Evals


arXiv cs.LG·Jun 18

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

PROPEL is a framework that trains task generators via RL to create optimally difficult problems for agent learning. A lightweight probe predicts the solver pass rate without repeated rollouts, reducing evaluation to a single forward pass. On code and SWE tasks, learnable-frontier generation increases from 10.1% to 20% (Qwen2.5-3B) and from 9.8% to 19.6% (Qwen3.5-27B).

Reinforcement learning AI Agents Code generation


arXiv cs.LG·Jun 18

Self-CTRL: Self-Consistency Training with Reinforcement Learning

Self-CTRL optimizes consistency between language models' self-explanations and behavior via reinforcement learning. On probabilistic reasoning tasks, the method improves R² correlation from 0.24 to 0.64. In constitutional AI, it increases refusal prediction from 36% to 92% and reduces HarmBench failure rate from 15.0% to 0.5%.

Reinforcement learning Alignment AI safety


arXiv cs.LG·Jun 18

TMR-GGNN: Credit Card Fraud Detection based on Time-Aware Multi-Relational Guided Graph Neural Network

TMR-GGNN, a time-aware multi-relational graph neural network, detects credit card fraud by modeling heterogeneous interactions between customers, merchants, devices, and IPs. The model combines temporal relational attention, contrastive learning, and a composite loss function (InfoNCE + Focal Loss) to handle imbalanced data and reduce false negatives.

Reinforcement learning


arXiv cs.LG·Jun 18

Structured Representation Learning with Locally Linear Embeddings and Adaptive Feature Fusion

A neuroscience-inspired RL framework that disentangles dynamics-specific and reward-specific features using locally linear embeddings (LLE) and adaptively fuses the representations via an attention mechanism. It improves learning efficiency on benchmark tasks compared to conventional RL approaches.

Reinforcement learning Reasoning Benchmarks


arXiv cs.LG·Jun 18

Quantum Annealing Enhanced Reinforcement Learning for Accurate Remaining Useful Lifetime Prediction

The QAQL framework couples quantum annealing with Q-learning for remaining useful life (RUL) prediction in predictive maintenance. Each Q-value update is encoded as a QUBO problem and solved on a D-Wave Advantage system. Validated on NASA C-MAPSS and fleet maintenance datasets, it shows statistically significant improvements over classical and quantum baselines.

Reinforcement learning Benchmarks Papers


arXiv cs.AI·Jun 18

Optimizing Lithium Production Decisions under Geological, Demand, and Pricing Uncertainties: A POMDP Framework for Multi-Objective Decision Making

A POMDP framework optimizes lithium production decisions by incorporating geological, pricing, and demand uncertainties. POMDP solvers outperform human-inspired heuristics by dynamically adapting to price regimes (static, linear, exponential, stochastic) and optimally sequencing exploration, production, and technology choices.

Reasoning Reinforcement learning


arXiv cs.AI·Jun 18

Generative-Model Predictive Planning for Navigation in Partially Observable Environments

BeliefDiffusion combines diffusion models and Model Predictive Control for navigation in partially observable environments. The framework generates multimodal belief distributions and plans efficient navigation strategies. In experiments on synthetic maps, it outperforms RL and other generative approaches in success rate and path efficiency.

Reasoning Reinforcement learning Papers


arXiv cs.AI·Jun 18

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

ThinkDeception introduces a progressive reinforcement learning framework for interpretable multimodal deception detection. Using MLLMs, it converts binary classification into explicit reasoning via Chain of Thought. VAC-GRPO with curriculum learning stratified into 4 difficulty tiers achieves SOTA on mainstream benchmarks.

Reasoning Reinforcement learning Vision


arXiv cs.AI·Jun 18

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

Safety Reflection Pretraining inserts short safety reflections into pretraining corpora to establish self-monitoring directly in language modeling. On 1.7B models pretrained on FineWeb-Edu, the method improves safety classification accuracy and substantially reduces success rates of inference-stage and finetuning attacks.

AI safety Alignment Reinforcement learning


arXiv cs.AI·Jun 18

NeSyCat Torch: A Differentiable Tensor Implementation of Categorical Semantics for Neurosymbolic Learning

NeSyCat Torch unifies neurosymbolic semantics (classical, fuzzy, probabilistic, neural) under a single truth definition parametrized by monads. Implemented in PyTorch, JAX, and HaskTorch, the framework interprets computational symbols via neural networks. On MNIST addition, outperforms LTN and DeepProbLog in speed and accuracy.

Reasoning Reinforcement learning Papers


arXiv cs.CL·Jun 18

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

PragReST is a self-supervised framework improving LLM pragmatic reasoning through counterfactual reasoning traces. Without human-labeled data, it combines supervised fine-tuning and reinforcement learning. On 4 benchmarks (PragMega, Ludwig, MetoQA, AltPrag), it gains +5.37% and +5.50% absolute for Qwen3-8B and Qwen3-14B.

Reasoning Reinforcement learning Fine-tuning


arXiv cs.CL·Jun 18

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

arXiv paper on improving long-context reasoning via a data-centric approach rather than reward engineering. The data recipe targets retrieval, multi-evidence synthesis, and reasoning (~14K examples). Tests on Qwen3 (4B/8B/30B) show +7.2/+3.2/+6.4 points across 7 long-context benchmarks, with transfer to agentic tasks (+4.8 GAIA, +7.0 BrowseComp).

Reinforcement learning Reasoning AI Agents


arXiv cs.LG·Jun 18

A Link between Shock-wave Theory and Symmetry-reduced Stochastic Gradient Descent for Artificial Neural Networks

Mathematical link established between shock-wave theory and symmetry-quotiented stochastic gradient descent dynamics for neural networks. After quotienting parameter symmetries and entropy coarse-graining, effective dynamics satisfy a viscous Hamilton-Jacobi equation. Applied to MLPs, CNNs, Transformers, and mean-field networks.

Papers Reasoning Reinforcement learning


arXiv cs.LG·Jun 18

DRIFT: Refining Instruction Data via On-Policy Data Attribution

DRIFT refines the SFT training data distribution using on-policy Influence Functions. The method uses model rollouts as validation targets to minimize the proximity gap and to correct gradient-norm bias. Experiments on 7B instruction and reasoning models show consistent improvements in the performance ceiling over existing curation baselines.

Fine-tuning Reinforcement learning Evals


arXiv cs.LG·Jun 18

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

LLMZero uses LLM agents with tree search to discover adaptive RL training strategies. The system identifies that capacity parameters accumulate monotonically while regularization parameters oscillate. Across 4 GRPO tasks, discovered strategies outperform the base model by 9-140% and grid search by 6-15%.

Reinforcement learning AI Agents Reasoning


arXiv cs.LG·Jun 18

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

Study shows SFT overtraining can invert model rankings during RLVR fine-tuning. On Qwen2.5-Coder-3B, increasing SFT depth raises pre-RL pass@1 but reduces GRPO pass@10 from 0.806 to 0.481. Pre-RL entropy positively correlates with RLVR outcomes (ρ=+0.69). Two-stage entropy-based diagnostic identifies high-risk checkpoints.

Reinforcement learning Fine-tuning Reasoning
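
The pass@1 and pass@10 figures in the summary above are pass@k metrics. The summary does not say how the paper computes them, so the following is only a sketch of the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The function name `pass_at_k` and the example numbers are assumptions for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct ones.

    Probability that at least one of k samples drawn without
    replacement from the n generated samples is correct.
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 20 samples, 2 of them correct.
p1 = pass_at_k(20, 2, 1)
p10 = pass_at_k(20, 2, 10)
```

Because pass@10 rewards having at least one correct sample among many, it depends on output diversity in a way that pass@1 does not, which is why the two metrics can move in opposite directions.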


arXiv cs.AI·Jun 18

R2D-RL: A RoboCup 2D Soccer Environment for Multi-Agent Reinforcement Learning

R2D-RL bridges RoboCup 2D Soccer Simulator (RCSS2D) to Python MARL workflows via shared-memory communication. The environment supports full-field and scenario-based training with discrete/hybrid action spaces, action masks, EPV-based reward shaping, and parallel execution. Includes 11-vs-11 full-field benchmarks and baseline results.

Multi-agent Reinforcement learning Benchmarks


arXiv cs.AI·Jun 18

Skill-Guided Continuation Distillation for GUI Agents

SGCD, an iterative self-improvement framework, addresses off-trajectory states in GUI agents. The system first runs a plain policy, then uses a skill-guided policy to generate successful continuations. On OSWorld-Verified, SGCD improves success rates of three base models from ~30% to over 50%.

AI Agents Reinforcement learning Papers


Reddit r/LocalLLaMA·Jun 17

i post-trained a model to reliably roll a die

A user post-trained a model to reliably simulate a die roll (each face ~1/6), exposing that frontier LLMs (Claude, GPT, Kimi) consistently answer '4'. Uses this toy problem to explore exploration vs. exploitation in RL and model behavior.

Reinforcement learning Claude GPT


Reddit r/LocalLLaMA·Jun 17

SIQ-1 Qwen3.6 for autoresearch and autonomous agency

SIQ-1 Qwen3.6: PPO fine-tuning of Qwen-35B-A3 outperforming GLM-5.2 and Qwen-350B on autoresearch (karpathy benchmark) and bullshit-bench. Model + GGUF available on HuggingFace with demo agent.

Qwen Reinforcement learning AI Agents


Reddit r/MachineLearning·Jun 17

Next-Latent Prediction Transformers [R]

Microsoft Research presents Next-Latent Prediction (NextLat), a self-supervised learning method where transformers predict their own next latent state. This improves the compression of history into compact belief states, improves data efficiency, and accelerates inference by up to 3.3x via recursive speculative decoding.

Reasoning Reinforcement learning Papers


arXiv cs.LG·Jun 17

Rethinking Groups in Critic-Free RLVR

arXiv paper on critic-free reinforcement learning for LLMs. Authors challenge the role of rollout groups in existing methods and propose negative token filtering to enable stable single-rollout training, improving performance on agentic tasks compared to group-based RL techniques.

Reinforcement learning Reasoning AI Agents


arXiv cs.CL·Jun 17

OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

OPD-Evolver is a slow-fast co-evolution framework that cultivates self-evolving agents through on-policy self-distillation. The system manages a four-level memory hierarchy to read, use, write, and maintain experience. Across multi-domain benchmarks, OPD-Evolver outperforms ReasoningBank (+11.5%) and Skill0 (+5.8%), with OPD-Evolver-9B rivaling Qwen3.5-397B and Step-3.5-Flash.

AI Agents Reasoning Reinforcement learning


arXiv cs.CL·Jun 17

From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

LLM-as-Environment-Engineer framework: the policy model analyzes failure trajectories and proposes modifications to the next-stage RL training environment configuration. MAPF-FrozenLake testbed with multi-dimensional configurations. Qwen3-4B outperforms GPT and Gemini on proposed benchmarks.

Reinforcement learning Multi-agent Reasoning


arXiv cs.CL·Jun 17

Environment-Grounded Automated Prompt Optimization for LLM Game Agents

Automated prompt optimization framework for LLM agents in interactive environments. Decomposes observation-to-action pipeline into descriptor and action-selection agents, iteratively refines via LLM-driven evolutionary loop guided by environment returns. On BabyAI/BALROG: improves from 0% to 72.5% success on PutNext without fine-tuning.

AI Agents Prompt engineering Reinforcement learning


arXiv cs.CL·Jun 17

Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models

RL-trained reasoning models often generate unnecessary reasoning after finding the correct answer (overthinking). This paper introduces Dynamic Rollout Editing (DRE), a training-time intervention during GRPO that edits successful trajectories continuing after answer emergence, preserving the verified prefix and weakening preference signals for unnecessary thinking.

Reinforcement learning Reasoning


arXiv cs.LG·Jun 17

PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

PowerOPD stabilizes on-policy distillation for LLMs by replacing unbounded log-ratio rewards with Box-Cox power transformation. On 6 mathematical reasoning benchmarks with Qwen3, achieves +6.37 Avg@8/+5.71 Pass@8 gains vs vanilla OPD, reduces wall-clock time by 59.2% and peak GPU memory by 23.1%.

Fine-tuning Reinforcement learning Benchmarks
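
For readers unfamiliar with the Box-Cox power transformation named in the PowerOPD summary above, here is a minimal sketch of the textbook definition, (x^λ - 1) / λ, which reduces to log x as λ approaches 0. How PowerOPD applies it (which ratio it transforms, and which λ it uses) is not stated in the summary, so the function name `box_cox` and the sample values are assumptions.

```python
import math

def box_cox(x, lam):
    """Textbook Box-Cox transform of a positive value x.

    Returns log(x) when lam == 0, otherwise (x**lam - 1) / lam.
    """
    if x <= 0:
        raise ValueError("Box-Cox is defined for positive x only")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

# Compare the plain log with a power transform on a hypothetical ratio.
ratio = 4.0
log_value = box_cox(ratio, 0)      # same as math.log(4.0)
power_value = box_cox(ratio, 0.5)  # (sqrt(4) - 1) / 0.5
```

For small positive λ the transform stays close to the logarithm, so it can be read as a tunable family of functions that contains the log-ratio as a limiting case.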


arXiv cs.AI·Jun 17

WallZero: Mastering the Game of WallGo with Strategic Analysis

WallZero, an AlphaZero-based agent, masters WallGo, a strategic board game popularized by Netflix's The Devil's Plan (2025). On a 7×7 board, the agent defeats professional Go players with 1.98x more territory on average. Authors analyze game fairness and identify key strategies.

Reinforcement learning Benchmarks Papers


arXiv cs.LG·Jun 17

Memory-Efficient Meta-Reinforcement Learning for Adaptive Safety-Critical Control in Adversarial Spacecraft Proximity Operations

Comparative study of three recurrent architectures (LSTM, GRU, Mamba) and two algorithms (PPO, SAC) for meta-reinforcement learning applied to input-constrained control barrier functions (ICCBF) in spacecraft proximity operations. Mamba + PPO outperforms other setups in safety, task completion, and fuel savings across cooperative and adversarial scenarios.

Reinforcement learning AI safety Robotics


arXiv cs.LG·Jun 17

Multi-Adapter PPO: A Cross-Attention Enhanced Wavelength Selection Framework for LIBS Quantitative Analysis

Multi-Adapter PPO framework for wavelength selection in LIBS quantitative analysis. Uses RL with cross-attention mechanisms and specialized adapters. Outperforms PSO by 28.4% in comprehensive score and 45.2% in prediction accuracy on steel and coal datasets. Code and dataset released.

Reinforcement learning Benchmarks


arXiv cs.LG·Jun 17

Online LLM Selection via Constrained Bandits with Time-Varying Demand

Online learning algorithm for dynamic LLM selection in edge-cloud systems under budget constraints (cost, latency). Formulated as constrained stochastic bandit with time-varying demand. Theoretical guarantees: sublinear regret and sublinear constraint violations.

AI Agents Reinforcement learning Benchmarks


arXiv cs.AI·Jun 17

Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation

Clinical decision support AI system using Digital Twins, Treatment Effect estimation, and Reinforcement Learning for adaptive real-time treatment recommendations. Validated on synthetic simulator and TCGA ovarian cancer dataset. Safety module with rule-based vital sign monitoring and clinician escalation for high-uncertainty cases.

Reinforcement learning Reasoning AI safety


arXiv cs.AI·Jun 17

Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs

E³RL, a reinforcement learning method, addresses error propagation in long-horizon reasoning of LLMs. Using autoregressive cross-entropy as an epistemic uncertainty signal, the model can locally correct logical defects and reuse KV cache. On AIME, 4B and 8B models outperform SOTA by 5.349% and 6.514%.

Reinforcement learning Reasoning Benchmarks


arXiv cs.LG·Jun 17

Informative Missingness to Generate Irregular Clinical Time Series

Diffusion-based approach to generate irregular clinical time series by jointly modeling laboratory values and observation patterns. Uses DACMI benchmark from MIMIC-III, extends TimeDiff framework to capture dependencies between patient physiology and clinician testing behavior under MNAR-like missingness.

Papers Benchmarks Reinforcement learning


arXiv cs.LG·Jun 17

Decision-Driven Geosteering Under Uncertainty: A Unified Framework for Sequential Decision Optimization

Sequential decision optimization framework for geosteering under geological uncertainty. Integrates particle filtering for probabilistic subsurface interpretation with value-based reinforcement learning. Compares three decision policies: Approximate Dynamic Programming, Deep Q-learning, and Dual DRL with dueling decomposition, validated on industrial simulator with realistic noise and drilling constraints.

Reinforcement learning Reasoning Evals


arXiv cs.AI·Jun 17

SkillChain-Gym: A Benchmark for Reskilling-Aware Production-Inventory Control under Disruptions

SkillChain-Gym is a benchmark for reskilling-aware production-inventory control. The environment models skill decay, certification lapses, training actions, and capacity constraints. Evaluation of production-only, reactive adaptive, and static-insurance policies over 60-shift horizons with operational and resilience metrics.

Benchmarks Reinforcement learning AI Agents


arXiv cs.AI·Jun 17

SEAGym: An Evaluation Environment for Self-Evolving LLM Agents

SEAGym is an evaluation environment for measuring self-evolving LLM agent harness updates (prompts, memory, tools, interaction loop). The study compares ACE, TF-GRPO, and AHE on Terminal-Bench 2.0 and HLE, showing frequent updates don't guarantee held-out performance gains and source diversity affects harness reliability.

AI Agents Reinforcement learning Evals


arXiv cs.AI·Jun 17

Using Cognitive Models to Improve Language Model Simulation of Human Persuasion Games

Researchers propose Equation-to-Behavior Prompting to guide LLMs to simulate diverse cognitive models (Bayesian, motivated reasoning, Grether's α-β model). Large models approximate these specifications via prompting, but small models fail. RL training reduces belief error by 26.5% and improves performance by 2.5–12% on legal persuasion games.

Reasoning Reinforcement learning Evals


arXiv cs.AI·Jun 17

StepGuard: Guarding Web Navigation via Single-Step Calibration

StepGuard improves web navigation for AI agents via Dynamic Dual-Policy Optimization (DDPO) to handle reward conflicts and Confidence-Guided Adaptive Navigation Reflection (CANR) to calibrate per-step errors. The framework achieves state-of-the-art performance on standard web navigation benchmarks.

AI Agents Reinforcement learning Vision


Reinforcement learning — AI news · Signal IA