Archives

May 2026

3148 articles

arXiv cs.AI·

Surgical Post-Training: Proximal On-Policy Distillation for Reasoning with Knowledge Retention

SPOT (Surgical Post-Training) is an on-policy distillation framework that injects reasoning capabilities into LLMs while preserving prior knowledge. With only 4k rectified math pairs, it improves Qwen3-8B by 6.2% on average after 16 minutes of training on 8x H800 GPUs. The approach uses a KL-constrained reward formulation to mitigate catastrophic forgetting.

Fine-tuning · Reinforcement learning · Reasoning
SIG 78 · HYP 25
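For the SPOT entry above, one common way to fold a KL constraint into a reward is to subtract a weighted estimate of the divergence from a frozen reference model. The summary does not give SPOT's exact formulation, so the per-sequence penalty, the `beta` weight, and the function name below are assumptions for illustration only.

```python
from typing import Sequence


def kl_shaped_reward(
    task_reward: float,
    logp_policy: Sequence[float],
    logp_ref: Sequence[float],
    beta: float = 0.1,  # assumed penalty weight, not taken from the paper
) -> float:
    """Return the task reward minus a KL-style penalty.

    logp_policy / logp_ref hold per-token log-probabilities of the sampled
    response under the current policy and under the frozen reference model.
    Their summed difference is a single-sample estimate of the sequence KL.
    """
    if len(logp_policy) != len(logp_ref):
        raise ValueError("log-probability sequences must have equal length")
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_ref))
    return task_reward - beta * kl_estimate


# A response the policy prefers more strongly than the reference is penalized.
shaped = kl_shaped_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1)
```

A larger `beta` keeps the policy closer to the reference model, which is the usual lever for trading new-task gains against forgetting.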
arXiv cs.AI·

GIST: Targeted Data Selection for Instruction Tuning via Coupled Optimization Geometry

GIST introduces targeted data selection for instruction tuning by replacing axis-aligned scaling with robust subspace alignment via SVD. It recovers task-specific subspaces from validation gradients and scores examples by their alignment with target directions. GIST matches or outperforms state-of-the-art baselines using only 0.29% of the storage and 25% of the computational time.

Fine-tuning · Reinforcement learning · Papers
SIG 78 · HYP 15
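For the GIST entry above, the idea of recovering a subspace from validation gradients and scoring examples by alignment can be sketched with NumPy. This is a rough sketch, not the paper's scoring function: the choice of score (the fraction of a training gradient's norm that lies inside the subspace), the `rank` parameter, and both function names are assumptions.

```python
import numpy as np


def target_subspace(val_grads: np.ndarray, rank: int) -> np.ndarray:
    """Return an orthonormal basis (rank x dim) for the dominant directions
    of the validation gradients (one gradient per row)."""
    _, _, vt = np.linalg.svd(val_grads, full_matrices=False)
    return vt[:rank]


def alignment_scores(train_grads: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Score each training gradient by the share of its norm captured by
    the subspace: 1.0 means fully inside, 0.0 means orthogonal."""
    projected = train_grads @ basis.T  # coordinates in the subspace
    norms = np.linalg.norm(train_grads, axis=1)
    return np.linalg.norm(projected, axis=1) / np.maximum(norms, 1e-12)


# Toy data: validation gradients point along the first axis only.
val = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
train = np.array([[3.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
scores = alignment_scores(train, target_subspace(val, rank=1))
```

In the toy data the first training example lies entirely in the target direction and the second is orthogonal to it, so a selector would keep the first and drop the second.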
arXiv cs.LG·

A Structural Threshold in Decision Capacity Governs Collapse in Self-Play Reinforcement Learning

An arXiv study shows that a threshold in decision capacity determines collapse in self-play reinforcement learning. Eliminating all positive-reach contingent decisions causes rapid convergence to a deterministic exploitation attractor. Preserving even a single contingent decision point prevents collapse, confirming the mechanism is co-adaptation under constraint.

Reinforcement learning · Papers · Multi-agent
SIG 72 · HYP 15
arXiv cs.AI·

LaDi-RL: Latent Diffusion Reasoning Prevents Entropy Collapse in Reinforcement Learning

LaDi-RL optimizes LLM reasoning via RL in latent space using diffusion. Instead of optimizing token sequences, the method generates latent reasoning trajectories through iterative denoising. It addresses credit assignment (rewards are observed after decoding) via hierarchical latent-text rollouts. Reported pass@1 gains: +9.4% on code generation and +5.7% on math reasoning.

Reinforcement learning · Reasoning · Code generation
SIG 78 · HYP 25
arXiv cs.CL·

ToolMATH: A Diagnostic Benchmark for Long-Horizon Tool Use under Systematic Tool-Catalog Constraints

ToolMATH is a diagnostic benchmark for evaluating long-horizon tool use by language models. It converts math solutions into reusable Python tools with natural-language descriptions and typed schemas, then measures adaptability (success with replacement tools), robustness (stability under distractors), and tool connectivity (accuracy over long chains).

Benchmarks · AI Agents · Tools
SIG 72 · HYP 18
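For the ToolMATH entry above, a tool that pairs a natural-language description with a typed schema might look roughly like the following. The `Tool` class, its fields, and the two example tools are hypothetical and are not taken from the benchmark itself; they only illustrate how typed tools can be chained.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Tool:
    """Hypothetical tool record: description plus a simple typed schema."""
    name: str
    description: str
    params: Dict[str, type]
    returns: type
    fn: Callable[..., object]

    def call(self, **kwargs):
        # Check every declared argument against the schema before running.
        for key, expected in self.params.items():
            if not isinstance(kwargs.get(key), expected):
                raise TypeError(f"{self.name}: '{key}' must be {expected.__name__}")
        result = self.fn(**kwargs)
        if not isinstance(result, self.returns):
            raise TypeError(f"{self.name}: result must be {self.returns.__name__}")
        return result


gcd_tool = Tool(
    "gcd", "Greatest common divisor of two integers.",
    {"a": int, "b": int}, int, lambda a, b: math.gcd(a, b),
)
lcm_tool = Tool(
    "lcm_from_gcd", "Least common multiple, given two integers and their gcd.",
    {"a": int, "b": int, "g": int}, int, lambda a, b, g: a * b // g,
)

# A two-step chain: the second tool consumes the first tool's output.
g = gcd_tool.call(a=12, b=18)
lcm = lcm_tool.call(a=12, b=18, g=g)
```

Longer chains of this kind are where an early mistake propagates to every later step, which is the situation the connectivity measure in the summary targets.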
arXiv cs.AI·

SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking

SynCABEL uses LLMs to generate contextualized synthetic training examples to address the scarcity of annotated data in biomedical entity linking. The framework achieves state-of-the-art results on MedMentions (English), QUAERO (French), and SPACCC (Spanish), reaching full human supervision performance with 60% less annotated data. An LLM-as-a-judge protocol evaluates clinical validity.

Papers · Benchmarks · RAG
SIG 78 · HYP 15
arXiv cs.AI·

Tracking Drift: Variation-Aware Entropy Scheduling for Non-Stationary Reinforcement Learning

AES (Adaptive Entropy Scheduling) dynamically adjusts the entropy coefficient in non-stationary RL under environment drift. It proposes a square-root scaling rule based on an observable non-stationarity proxy. Evaluated across 4 algorithm variants, 12 tasks, and 4 drift modes, it reduces performance degradation from drift and accelerates recovery after abrupt changes.

Reinforcement learning · Reasoning
SIG 72 · HYP 18
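For the AES entry above, a square-root scaling rule could take a shape like the one below. The summary does not state the exact rule or the proxy, so the functional form, the clipping bounds, and all parameter names are assumptions.

```python
import math


def entropy_coef(
    base: float,
    drift_proxy: float,
    floor: float = 1e-4,  # assumed lower bound
    cap: float = 0.1,     # assumed upper bound
) -> float:
    """Scale the entropy coefficient with the square root of a non-negative
    drift proxy, then clip it to [floor, cap]."""
    if drift_proxy < 0:
        raise ValueError("drift_proxy must be non-negative")
    return min(cap, max(floor, base * math.sqrt(drift_proxy)))


# Quadrupling the measured drift doubles the exploration bonus.
calm = entropy_coef(0.01, 1.0)
drifting = entropy_coef(0.01, 4.0)
```

The square root makes the coefficient grow with drift but more slowly than linearly, so a brief spike in the proxy does not swamp the reward signal.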
arXiv cs.AI·

Mapping Human Anti-collusion Mechanisms to Multi-agent AI Systems

This arXiv paper develops a taxonomy of human anti-collusion mechanisms (sanctions, leniency & whistleblowing, monitoring & auditing, market design, governance) and maps them to multi-agent AI systems. It highlights open challenges: attribution of emergent coordination, agent identity fluidity, the boundary between beneficial cooperation and harmful collusion, and adversarial adaptation.

Multi-agent · AI Agents · AI safety
SIG 72 · HYP 18
arXiv cs.AI·

SignRoundV2: Toward Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs

SignRoundV2 is a post-training quantization framework for LLMs that maintains performance under extreme compression (2-4 bits). It combines an adaptive mixed-precision strategy guided by gradient information with lightweight stabilization techniques. Results show a ~1% performance gap at an average of 4.5 bits in mixed MXFP settings, with substantial improvements in challenging 2-bit weight-only quantization.

Fine-tuning · Benchmarks · Infrastructure
SIG 78 · HYP 18
arXiv cs.AI·

The Journal of Prompt-Engineered (Moral) Philosophy Or: Why AI-Assisted Ethics Research Requires Process Transparency

Paper on transparency requirements in AI-assisted ethics research. Authors argue output-only evaluation is insufficient; they propose a documentation-adequacy framework grounded in agent-integrity, comprising declaration, navigation, documentation account, process documentation, and development records. The paper itself demonstrates the framework with persistent archival.

Prompt engineering · AI safety · Alignment
SIG 72 · HYP 15
arXiv cs.CL·

Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate

SDRL (Self-Debate Reinforcement Learning) trains LLMs both to solve problems standalone and to benefit from multi-agent debate (MAD). The model samples multiple solutions, constructs a debate context with diverse reasoning paths, then jointly optimizes initial and debate-conditioned responses. Results: consistent MAD performance gains across benchmarks and agent configurations.

Reasoning · Reinforcement learning · Multi-agent
SIG 78 · HYP 22
arXiv cs.AI·

EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

EvolveR is a framework enabling LLM agents to learn from their own experiences through a closed-loop lifecycle. It combines offline self-distillation (synthesizing interaction trajectories into reusable strategic principles) and online interaction (actively retrieving distilled principles to guide decisions). Tested on complex multi-hop QA benchmarks, it outperforms existing agentic baselines.

AI Agents · Reinforcement learning · Reasoning
SIG 75 · HYP 25
arXiv cs.AI·

Extracting latent representations from X-ray spectra. Classification, regression, and accretion signatures of Chandra sources

A transformer autoencoder compresses Chandra X-ray spectra into an 8D latent representation. Classification of 8 astrophysical source types reaches ~40% balanced accuracy overall and ~69% on AGN/compact objects. Latent features correlate with spectral and temporal properties, capturing physical information as effectively as hand-engineered features.

Benchmarks · Vision
SIG 72 · HYP 15
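The Chandra entry above reports balanced accuracy, which is the mean of per-class recall and so is not inflated by a dominant class. A minimal sketch of the metric follows; the class labels and counts in the toy data are made up for illustration and are not from the paper.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Average the recall of every class that appears in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Made-up labels: a classifier that always predicts the majority class.
truth = ["AGN"] * 8 + ["star"] * 2
preds = ["AGN"] * 10
plain = sum(t == p for t, p in zip(truth, preds)) / len(truth)
balanced = balanced_accuracy(truth, preds)
```

Here plain accuracy is 0.8 while balanced accuracy is 0.5, which is why an 8-class result near 40% still sits well above the 12.5% expected from uniform guessing.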
arXiv cs.AI·

RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

RLBFF combines human feedback and verifiable rewards for reward model training. The method extracts binary principles from natural language feedback (e.g., accuracy, code readability) and uses them as entailment tasks. Models achieve 86.2% on RM-Bench and 81.4% on JudgeBench (#1 as of September 2025). Qwen3-32B aligned with RLBFF matches o3-mini and DeepSeek R1 at less than 5% of the inference cost.

Reinforcement learning · Evals · Alignment
SIG 82 · HYP 25
arXiv cs.AI·

FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation

FUNCanon breaks down long-horizon manipulation tasks into action sequences (actor-verb-object) and canonicalizes objects by functional affordances using VLM cues. FuncDiffuser, an object-centric and action-centric diffusion policy, learns on aligned data to generalize across object categories and enable cross-task behavior reuse.

Robotics · Vision · AI Agents
SIG 75 · HYP 25
arXiv cs.AI·

Individual utilities of life satisfaction reveal inequality aversion unrelated to political alignment

Experimental study with 300 UK participants measuring individual preferences for well-being and fairness. Utility functions estimated via Expected Utility Maximisation reveal inequality aversion unrelated to political alignment. Results challenge average life satisfaction as a policy metric and support nonlinear utility-based alternatives for value-aligned AI systems.

Alignment · AI safety
SIG 45 · HYP 25
arXiv cs.CL·

Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage

Colluding LLM agents manipulate victim beliefs by coordinating truthful evidence fragments through public channels without covert communication. The Generative Montage framework (Writer-Editor-Director) constructs deceptive narratives via adversarial debate. Attack success rates reach 74.4% on proprietary models and 70.6% on open-weight models across 14 LLM families. Advanced reasoning models show higher susceptibility.

AI Agents · Multi-agent · AI safety
SIG 78 · HYP 35
arXiv cs.AI·

Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity

Theoretical investigation of loss of plasticity (LoP) in deep learning under non-stationary environments. Authors identify two primary mechanisms: activation saturation and representational redundancy creating traps in parameter space. Paradox: properties promoting static generalization (low-rank representations) worsen LoP in continual learning.

Reinforcement learning · Papers · Alignment
SIG 75 · HYP 15
arXiv cs.CL·

SignRoundV2: Toward Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs

SignRoundV2 is a post-training quantization framework for LLMs that maintains performance under extreme compression (2-4 bits). It combines an adaptive mixed-precision strategy guided by gradient information with lightweight stabilization techniques. Results show a ~1% performance gap at an average of 4.5 bits in mixed MXFP settings, with substantial improvements in 2-bit weight-only quantization.

Fine-tuning · Benchmarks · Open source
SIG 78 · HYP 15
arXiv cs.AI·

OPERA: A Reinforcement Learning–Enhanced Orchestrated Planner-Executor Architecture for Reasoning-Oriented Multi-Hop Retrieval

OPERA is a retrieval-augmented generation (RAG) architecture coupling planning and execution via reinforcement learning. A Goal Planning Module decomposes complex questions into sub-goals, which are executed by a Reason-Execute Module with specialized components for reasoning and retrieval. Training uses MAPGRPO, a GRPO variant. The authors report superior results on complex multi-hop benchmarks.

RAG · Reinforcement learning · Reasoning
SIG 75 · HYP 25
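The OPERA entry above mentions MAPGRPO, described only as a GRPO variant. The sketch below shows the group-relative advantage step of plain GRPO, not MAPGRPO itself, whose modifications the summary does not describe. The use of population standard deviation and the `eps` value are assumptions.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize the rewards of several responses sampled for one prompt:
    subtract the group mean and divide by the group standard deviation.
    No learned value function is needed."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one question, two of them correct.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses that beat their own group's average get a positive advantage and the rest a negative one. If every response earns the same reward, all advantages are zero and that group contributes no gradient signal.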
arXiv cs.AI·

Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs

A new approach to generating formal theorem-proving challenges by leveraging theoretical computer science (TCS). The framework automatically synthesizes problem-proof pairs in Lean4 and Markdown across two domains: Busy Beaver and Mixed Boolean Arithmetic. DeepSeekProver-V2-671B achieves 57.5% on Busy Beaver but only 12% on Mixed Boolean Arithmetic, revealing major gaps in long-form proof generation.

Reasoning · Benchmarks · Papers
SIG 78 · HYP 15
arXiv cs.AI·

HTSC-2025: A Benchmark Dataset of Ambient-Pressure High-Temperature Superconductors for AI-Driven Critical Temperature Prediction

HTSC-2025 is an open-source benchmark of high-temperature superconducting materials discovered 2023-2025 (X₂YH₆ systems, MXH₃ perovskites, M₃XH₈, BCN-doped cage structures, 2D honeycomb). Addresses the lack of standardized datasets for fair comparison of AI algorithms predicting critical transition temperatures.

Benchmarks · Papers · Open source
SIG 75 · HYP 25
arXiv cs.AI·

Patients Speak, AI Listens: LLM-based Analysis of Online Reviews Uncovers Key Drivers for Urgent Care Satisfaction

arXiv study analyzing 10,000+ Google Maps reviews of urgent care facilities in the DMV (DC, Maryland, Virginia) area and Florida, using GPT prompt engineering for aspect-based sentiment extraction. Findings: interpersonal factors and operational efficiency are the strongest drivers of patient satisfaction; technical quality, finances, and facilities show no significant independent effects. Population density alone shows a modest correlation with ratings.

GPT · Prompt engineering · RAG
SIG 65 · HYP 25