Topic

#Reinforcement learning

Reinforcement learning is a method in which an agent learns from the rewards or penalties it receives for its actions. DeepMind's AlphaGo used it to defeat the world's top Go players.
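A minimal sketch of that reward-and-penalty loop, assuming a toy two-action environment (this is not AlphaGo's method). The reward values, the epsilon-greedy rule, and names such as `train_agent` are hypothetical and chosen only for illustration.

```python
import random

def train_agent(steps=200, epsilon=0.1, alpha=0.5, seed=0):
    """Learn action values from rewards (+1) and penalties (-1)."""
    rng = random.Random(seed)
    rewards = [-1.0, 1.0]  # hypothetical environment: action 0 is penalized, action 1 is rewarded
    values = [0.0, 0.0]    # the agent's running estimate of each action's value
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(2)           # explore: try a random action
        else:
            action = values.index(max(values))  # exploit: pick the best-looking action
        reward = rewards[action]
        # Move the estimate a fraction alpha toward the observed reward.
        values[action] += alpha * (reward - values[action])
    return values

values = train_agent()
best_action = values.index(max(values))
```

After training, `best_action` is the action the agent has learned to prefer; the penalized action ends with a non-positive value estimate.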

40 Articles
6 Sources
74 Avg. signal
arXiv cs.CL

SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering

SPADER is an RL framework for tool-augmented LLM agents in Multi-Answer QA. It introduces Step-wise Peer Advantage (SPA) for fine-grained credit assignment over long trajectories, and a diversity-aware exploration reward that promotes rare entity discovery. Evaluated on QAMPARI, Mintaka, WebQSP, and QUEST, it improves recall and F1 over prompting and supervised RL baselines.

AI Agents · Reinforcement learning · Reasoning
SIG 78 · HYP 00
arXiv cs.AI

Closed-Loop Neural Activation Control in Vision-Language-Action Models

CTRL-STEER introduces a closed-loop control framework for Vision-Language-Action (VLA) models. Instead of fixed steering coefficients, it adaptively adjusts intervention strength over time using PID or reinforcement learning controllers. Experiments on OpenVLA with LIBERO task suites demonstrate improved concept regulation stability and better steering-task success trade-offs without retraining the base model.

Vision · AI Agents · Reinforcement learning
SIG 72 · HYP 00
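The PID option mentioned in the CTRL-STEER summary above can be illustrated with a generic discrete PID loop that adjusts a steering coefficient over time. This is a sketch under stated assumptions, not the paper's implementation: the gains, the `toy_concept_activation` function standing in for a measured concept activation, and all names are hypothetical.

```python
class PIDController:
    """Textbook discrete PID controller."""

    def __init__(self, kp, ki, kd, dt=1.0):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, target, measured):
        error = target - measured
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def toy_concept_activation(coeff):
    # Hypothetical stand-in: the measured activation responds linearly to the coefficient.
    return 0.5 * coeff

def run_closed_loop(target=1.0, steps=100):
    pid = PIDController(kp=0.5, ki=0.5, kd=0.0)  # illustrative gains, not from the paper
    coeff = 0.0
    history = []
    for _ in range(steps):
        measured = toy_concept_activation(coeff)  # observe the current activation
        coeff = pid.update(target, measured)      # adapt the intervention strength
        history.append(measured)
    return coeff, history

final_coeff, history = run_closed_loop()
```

In this toy setup the measured activation settles at the target, whereas a fixed coefficient would need to be tuned by hand for each target.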
arXiv cs.LG

RAFT: Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting

RAFT is a two-stage domain fine-tuning method that mitigates catastrophic forgetting. It refines data via self-conditioned rewriting and answer fusion, then applies on-policy distillation where the original model provides soft targets on student-generated trajectories. Across five domains, RAFT improves domain accuracy by 23.2% over standard SFT and recovers 18.2% of degradation on MS-Bench.

Fine-tuning · Reinforcement learning · Papers
SIG 78 · HYP 00
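The distillation step in the RAFT summary above, where the original model supplies soft targets on student-generated trajectories, can be sketched as a per-token divergence between two distributions. The summary does not state the exact loss, so the KL direction (teacher to student), the temperature parameter, and the function names here are assumptions for illustration only.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean per-token KL(teacher || student) over one student-generated sequence."""
    total = 0.0
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        p = softmax(t_logits, temperature)  # soft targets from the original model
        q = softmax(s_logits, temperature)  # the student's current distribution
        total += sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
    return total / len(teacher_logits)

# Toy logits for a two-token trajectory over a three-word vocabulary (made-up values).
teacher = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
student = [[0.0, 2.0, 0.0], [0.0, 1.0, 0.0]]
loss_mismatch = distillation_loss(teacher, student)
loss_match = distillation_loss(teacher, teacher)
```

The loss is zero when the student matches the original model's distribution on every token and positive otherwise.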
arXiv cs.LG

AI-Guided Design and Optimization of Graphite-Based Anodes via Iterative Experimental Feedback

An iterative AI workflow optimizes graphite-based anodes through sequential learning and experimental feedback loops. The Citrine Platform generates surrogate models and refines manufacturing constraints. Reported results: fabrication reliability improved from frequent failures to 100% success, the share of cells reaching ≥350 mAh/g rose from 28.4% to 84.8%, and capacity retention rose from 42.1% to 97.3%.

Reinforcement learning · Benchmarks · Tools
SIG 75 · HYP 00
arXiv cs.LG

ARCA: Adapter-Residual Credit Assignment When Token Signals Degenerate

ARCA introduces a token-level credit assignment method for LLM reinforcement learning that addresses the degeneracy of intrinsic signals (surprisal, entropy reduction, policy divergence) under LoRA. It measures adapter salience directly via the L2 norm of hidden-state residuals instead of output-distribution shifts. Tested on MATH with Qwen3-1.7B and GRPO, ARCA avoids pathological weight concentration.

Reinforcement learning · Fine-tuning · Reasoning
SIG 75 · HYP 00
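The salience measure named in the ARCA summary above, an L2 norm of hidden-state residuals, can be sketched per token as follows. The summary does not say how salience is turned into credit, so the normalization into weights is an assumption, and the function names and toy vectors are hypothetical.

```python
import math

def adapter_salience(base_hidden, adapted_hidden):
    """Per-token L2 norm of the residual between adapted and base hidden states."""
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(adapted, base)))
        for base, adapted in zip(base_hidden, adapted_hidden)
    ]

def normalize(weights):
    # Assumed step: rescale salience into weights that sum to 1.
    total = sum(weights)
    if total == 0.0:
        return [1.0 / len(weights)] * len(weights)  # fall back to uniform credit
    return [w / total for w in weights]

# Toy hidden states for three tokens (made-up values).
base = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
adapted = [[3.0, 4.0], [1.0, 1.0], [2.0, 5.0]]

salience = adapter_salience(base, adapted)
token_weights = normalize(salience)
```

Tokens whose hidden state the adapter leaves unchanged get zero salience, regardless of how the output distribution looks.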
arXiv cs.AI

MindZero: Learning Online Mental Reasoning With Zero Annotations

MindZero is a self-supervised reinforcement learning framework that trains multimodal LLMs to infer human mental states without annotations. The model is rewarded for generating mental state hypotheses that maximize the likelihood of observed actions. After training, inference is a fast single pass and outperforms model-based methods in both accuracy and efficiency.

Reasoning · Reinforcement learning · AI Agents
SIG 72 · HYP 00
arXiv cs.AI

Capability Self-Assessment: Teaching LLMs to Know Their Limits

Modern LLMs systematically overestimate their competence and attempt unsolvable queries. Researchers propose Capability Self-Assessment (CSA), formulated as a policy-learning problem using reinforcement learning, to teach models to recognize their limits. RL significantly outperforms supervised fine-tuning, preserves original capabilities, and generalizes out-of-distribution.

Reinforcement learning · Alignment · Evals
SIG 78 · HYP 00
Reddit r/LocalLLaMA

I spent months inside verl (an RL post-training framework), forked it, then stopped. Wrote up the internals, the tooling a fork costs, and a nasty NCCL bug.

A researcher who spent months inside verl (ByteDance's RL post-training framework) documents its internals: RLHF loop orchestration, the single-controller pattern, its data structures (DataProto), and an NCCL bug found along the way. The fork was abandoned, but the write-up shares what was learned with the community.

Reinforcement learning · AI Agents · Open source
SIG 65 · HYP 00
arXiv cs.AI

HADT: A Heterogeneous Multi-Agent Differential Transformer for Autonomous Earth Observation Satellite Cluster

HADT is a transformer-based architecture for autonomous resource management in heterogeneous satellite clusters (optical and SAR). It uses model-free reinforcement learning for real-time decision-making in Earth Observation missions, and the authors report significant performance improvements and transferability across varying cluster sizes.

Multi-agent · Reinforcement learning · Reasoning
SIG 72 · HYP 00
arXiv cs.AI

When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL

A study of iterative refinement of LLM-generated reward functions for sparse structured RL. The authors identify two dominant failure modes (reward flooding and semantic misunderstanding) and propose diagnostic-driven refinement guided by a failure-mode taxonomy. Results: DoorKey-8x8 improves from 2.3% to 97.6% and KeyCorridor from 31.2% to 86.7%. Limitations: the method is restricted to PPO and sparse structured tasks.

Reinforcement learning · Llama · Prompt engineering
SIG 72 · HYP 00
Reinforcement learning — AI news · Signal IA