Archives

May 2026

3148 articles

arXiv cs.AI·

AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment

AMR-SD introduces asymmetric meta-reflective self-distillation to improve token-level credit assignment in LLM reinforcement learning. The method compresses diagnostic signals into self-generated Socratic hints and uses Causal Information Gain with an asymmetric ReLU-gated threshold for sparse token-level advantage modulation, preventing late-stage training collapse.

Reinforcement learning · Reasoning · Alignment
SIG 72 · HYP 18
arXiv cs.AI·

OCCAM: Open-set Causal Concept explAnation and Ontology induction for black-box vision Models

OCCAM is a framework for explaining black-box image classifier decisions through causal visual concepts. It discovers concepts in an open-set manner, localizes them via text-guided segmentation, and measures their causal contribution through object-level interventions. OCCAM aggregates this interventional evidence to induce a structured ontology that reveals concept dependencies and systematic model biases.

Vision · Evals · Reasoning
SIG 75 · HYP 15
arXiv cs.AI·

QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi

QSTRBench is a benchmark that evaluates LLMs' ability to perform qualitative spatial and temporal reasoning (QSTR). It covers 9 calculi (Point Algebra, Allen's Interval Algebra, RCC-5/8/22, etc.) with composition tables, converse relations, and conceptual neighbourhoods. Tested models outperform guessing, but none answers all questions correctly. RCC-22 proves the most difficult.

Benchmarks · Reasoning · Evals
SIG 75 · HYP 15
arXiv cs.AI·

Scheduling That Speaks: An Interpretable Programmatic Reinforcement Learning Framework

ProRL is a programmatic reinforcement learning framework for combinatorial optimization (job shop scheduling). It generates interpretable policies as human-readable programs via a domain-specific language (DSL-S), exploring the program space through local search and Bayesian optimization. It outperforms classical heuristics and DRL baselines with minimal training episodes.

Reinforcement learning · Reasoning · Benchmarks
SIG 75 · HYP 15
arXiv cs.AI·

Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

An arXiv paper on the spatial limitations of MLLMs in multi-agent environments. Models suffer from a "Cartesian Illusion": they lack grounded 3D topological understanding. The authors propose an Epistemic Sensory Bottleneck module with Anchor-Based Embodied Spatial Decomposition CoT to improve second-order spatial inference (Theory of Mind). Zero-shot baseline: 42% accuracy.

Vision · Multi-agent · Reasoning
SIG 72 · HYP 28
arXiv cs.AI·

TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning

TaskGround is a task-inference framework for household agents operating on complete scenes. It structures reasoning in three steps: grounding (extracting relevant context), inference (executable structure), execution (action sequences). Evaluated on FullHome (400 tasks), it improves success rates and makes Qwen3.5-9B competitive with GPT-4 while reducing token costs by 18x.

AI Agents · Reasoning · Robotics
SIG 78 · HYP 25
arXiv cs.AI·

Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

ConsumerSimBench, a benchmark built from 1,553 Chinese social-media topics and 23,122 reaction criteria, evaluates whether LLMs can reconstruct real consumer reaction patterns. Gemini-3.1-Pro covers only 47.8% of criteria, revealing a major gap between technical performance and consumer intuition. A multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6%.

Benchmarks · Evals · Multi-agent
SIG 72 · HYP 25
arXiv cs.AI·

New Insight of Variance reduce in Zero-Order Hard-Thresholding: Mitigating Gradient Error and Expansivity Contradictions

New zeroth-order hard-thresholding algorithm with variance reduction for ℓ0-constrained optimization. Addresses SZOHT's limitation on random directions by mitigating conflict between ZO gradient deviation and hard-thresholding expansivity. Improved convergence rates validated on ridge regression and black-box adversarial attacks.

Reinforcement learning
SIG 72 · HYP 15
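The two ingredients named in the summary above, a zeroth-order (ZO) gradient estimate built from random directions and a hard-thresholding step that keeps only the k largest-magnitude coordinates, can be sketched as follows. This is a minimal sketch of a plain ZO hard-thresholding iteration, not the paper's variance-reduced algorithm; the function names, step size `lr`, smoothing radius `mu`, and number of directions are illustrative assumptions.

```python
import random


def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and set the rest to zero."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def zo_gradient(f, x, num_dirs=20, mu=1e-4, rng=random):
    """Forward-difference ZO gradient estimate averaged over Gaussian directions.

    Only function values of f are used, which is what makes the
    approach applicable to black-box objectives.
    """
    d = len(x)
    grad = [0.0] * d
    fx = f(x)
    for _ in range(num_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x_shift = [xi + mu * ui for xi, ui in zip(x, u)]
        scale = (f(x_shift) - fx) / mu  # directional derivative estimate
        for i in range(d):
            grad[i] += scale * u[i] / num_dirs
    return grad


def zo_hard_threshold_step(f, x, k, lr=0.1, **kw):
    """One iteration: ZO gradient step followed by projection onto k-sparse vectors."""
    g = zo_gradient(f, x, **kw)
    return hard_threshold([xi - lr * gi for xi, gi in zip(x, g)], k)
```

The estimate from random directions is noisy, and hard thresholding can amplify that noise by selecting the wrong support. That is the tension the summary refers to as the conflict between gradient deviation and expansivity.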
arXiv cs.AI·

Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery

LGBO (LLM-Guided Bayesian Optimization) embeds LLM semantic reasoning into every Bayesian Optimization iteration via a region-lifted preference mechanism. Tested on physics, chemistry, biology, and materials science benchmarks, LGBO reaches 90% of best observed value in 6 iterations for Fe-Cr battery electrolyte optimization, versus 10+ for standard BO.

Reasoning · Benchmarks · Papers
SIG 78 · HYP 25
arXiv cs.AI·

LAST-RAG: Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation for Knowledge-Conditioned Degradation Model Selection

LAST-RAG proposes a method for selecting stochastic degradation models to estimate remaining useful life (RUL). It combines observed trajectories and domain context via retrieval from a local evidence bank, with an RCRUS mechanism to prevent premature model elimination. In experiments, it outperforms statistical and prognostic baselines.

RAG · Reasoning · Benchmarks
SIG 72 · HYP 15
arXiv cs.AI·

KISS - Knowledge Infrastructure for Scientific Simulation: A Scaffolding for Agentic Earth Science

KISS introduces a Knowledge Infrastructure (KI) enabling AI agents to execute complex Earth science simulations. On 3,000 trials, KI-equipped agents produced physically plausible simulations in 84% of cases vs. <40% without KI. An automated Knowledge Dissection Toolkit (KDT) generated 119 KIs across 14 Earth-science domains, showing operational expertise is structured and extractable rather than ad hoc.

AI Agents · Reasoning · Benchmarks
SIG 78 · HYP 25
arXiv cs.AI·

EGI: A Multimodal Emotional AI Framework for Enhancing Scrum Master Real-time Self-Awareness

EGI is a multimodal framework for monitoring the unconscious emotions of Scrum Masters in real time. The system combines speech-to-text transcription (WER 10%), prosody analysis, emotional vocabulary matching, and context-aware suggestions via an open-source multi-module API. Testing shows significant improvement in emotional awareness during simulated agile meetings.

Voice · AI Agents · AI safety
SIG 45 · HYP 35
arXiv cs.AI·

SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation

SAPO improves generative recommendation by aligning reinforcement learning optimization to individual reasoning steps. Instead of assigning a single advantage to the entire response, SAPO computes separate group-relative advantages for each reasoning step and SID token, stabilizing training and outperforming baselines across three real-world datasets.

Reinforcement learning · Reasoning · Code generation
SIG 75 · HYP 15
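The contrast drawn in the SAPO summary, one advantage per response versus a separate group-relative advantage per reasoning step, can be illustrated with a small sketch. It assumes a GRPO-style normalization (reward minus group mean, divided by group standard deviation) applied independently at each step index. The `step_rewards[g][t]` layout and the handling of responses with different step counts are hypothetical choices, since the summary does not say how SAPO defines or aligns step rewards.

```python
from statistics import mean, pstdev


def group_relative(rewards, eps=1e-8):
    """Normalize rewards within a group: (r - mean) / (std + eps)."""
    m = mean(rewards)
    s = pstdev(rewards)
    return [(r - m) / (s + eps) for r in rewards]


def step_aligned_advantages(step_rewards):
    """Compute one advantage per (response, step) pair.

    step_rewards[g][t] is the reward of step t in sampled response g
    (hypothetical layout). Each step index is normalized only against
    the responses in the group that actually contain that step.
    """
    num_steps = max(len(r) for r in step_rewards)
    adv = [[0.0] * len(r) for r in step_rewards]
    for t in range(num_steps):
        members = [g for g, r in enumerate(step_rewards) if len(r) > t]
        normed = group_relative([step_rewards[g][t] for g in members])
        for g, a in zip(members, normed):
            adv[g][t] = a
    return adv
```

With a single response-level advantage, a response that gets step 1 right and step 2 wrong receives the same signal for both steps. Here each step is credited relative to the same step in the other samples.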
arXiv cs.AI·

Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents

Dual-process memory architecture for scientific agents: decouples episodic window (10 messages) from semantic consolidation (3 tokens/message). Evaluation on 15,000 messages across 6 LLMs (OpenAI, Anthropic, Google): maintains 70-85% accuracy at 10,000 messages with 62% fewer tokens. Identifies trade-offs: Dual Process excels at numeric/temporal queries, RAG for historical retrieval.

AI Agents · Reasoning · RAG
SIG 78 · HYP 25
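The decoupling described in the memory-architecture summary, a fixed episodic window of recent messages plus a compact semantic store for everything older, can be sketched as below. The window of 10 messages and the budget of 3 tokens per message come from the summary. The class name, the eviction policy, and above all the `_consolidate` stand-in (which simply keeps the first few whitespace-separated tokens) are assumptions for illustration; the paper's actual consolidation method is not described here.

```python
from collections import deque


class DualProcessMemory:
    """Episodic window of verbatim messages plus a compressed semantic store."""

    def __init__(self, window=10, tokens_per_message=3):
        self.window = window
        self.tokens_per_message = tokens_per_message
        self.episodic = deque()  # most recent messages, kept verbatim
        self.semantic = []       # compressed notes for older messages

    def _consolidate(self, message):
        # Placeholder summarizer (assumption): keep the first few tokens.
        # A real system would more likely call an LLM or an extractor here.
        return " ".join(message.split()[: self.tokens_per_message])

    def add(self, message):
        """Append a message; consolidate the oldest one if the window overflows."""
        self.episodic.append(message)
        if len(self.episodic) > self.window:
            evicted = self.episodic.popleft()
            self.semantic.append(self._consolidate(evicted))

    def context(self):
        """Prompt context: compressed history first, then the verbatim window."""
        return self.semantic + list(self.episodic)
```

Because each evicted message shrinks to a fixed token budget, context size grows far more slowly than the raw transcript, which is the kind of saving the reported 62% token reduction points to.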
arXiv cs.AI·

Memory-Guided Tree Search with Cross-Branch Knowledge Transfer for LLM Solver Synthesis

MEMOIR, a memory-guided tree-search framework, automatically synthesizes solvers for combinatorial optimization using LLMs. With a two-level memory hierarchy (branch-local and global), it achieves 96.7% solution validity across 7 problems (scheduling, routing, packing), outperforming baselines by 9.2 points and reducing run-to-run validity variance by over an order of magnitude.

AI Agents · Reasoning · Code generation
SIG 78 · HYP 25
arXiv cs.AI·

Multi-Party Multi-Objective Optimization as Consensus Search: Runtime Analysis of Cross-Party Recombination

Theoretical study of multi-objective evolutionary algorithms for multi-party multi-objective optimization problems (MPMOP). On the MP-JCG benchmark, payoff-guided mutation requires Θ(n²) fitness evaluations to cross a gap region, while CPR-NSGA-II achieves O(n log n) via cross-party recombination. Also includes a runtime analysis on BPBOMST (multi-party minimum spanning tree) with instance-parameterized bounds.

Multi-agent · Benchmarks · Papers
SIG 72 · HYP 08
arXiv cs.AI·

Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design

Theoretical paper on computational challenges of token economics in LLM systems. Introduces the "Token Economics Trilemma": tensions between fine-grained valuation, low-latency execution, and allocation optimality. Identifies three technical areas: real-time value accounting, constrained resource allocation, and economic-aware system architecture.

Infrastructure · Benchmarks · Reasoning
SIG 45 · HYP 25
arXiv cs.AI·

CyberCorrect: A Cybernetic Framework for Closed-Loop Self-Correction in Large Language Models

CyberCorrect formalizes LLM self-correction as a closed-loop control system. A tri-modal error detector (self-consistency, verbalized confidence, logic-chain verification) and type-directed correction controller achieve 79.8% accuracy on CyberCorrect-Bench (440 reasoning tasks), +6.2pp over existing methods, reducing overshoot by 41% via convergence control.

Reasoning · Evals · Papers
SIG 78 · HYP 25
arXiv cs.AI·

From Imitation to Interaction: Mastering Game of Schnapsen with Shallow Reinforcement Learning

Shallow neural network agents master the card game Schnapsen through reinforcement learning. RLBot, trained via asynchronous Monte Carlo updates, outperforms MLPBot (supervised imitation) and achieves statistically significant wins against RdeepBot, a search-based baseline. Combining learned value functions with deeper lookahead during gameplay improves performance.

Reinforcement learning · Benchmarks · Papers
SIG 72 · HYP 15
arXiv cs.AI·

MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop

MADP is a multi-agent architecture for enterprise document automation, combining deep learning classification and LLM extraction with human validation. Deployed on 955 real documents, it achieves 97% full-pipeline automation and reduces FTE requirements by 70%. 98.5% document-level accuracy with human-in-the-loop; 69% CO2 reduction vs manual processing.

Multi-agent · AI Agents · Code generation
SIG 78 · HYP 25
arXiv cs.AI·

Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models

Large Reasoning Models generate traces aligned with human reaction times, but this alignment persists regardless of inference-time reasoning budget. Study across GPT-OSS-20B and GPT-OSS-120B: three effort levels, six cognitive tasks. Token allocation tracks fine-grained human difficulty patterns and reflects a structure crystallized at training time, not modulated in real-time.

Reasoning · Benchmarks · Papers
SIG 72 · HYP 15
arXiv cs.AI·

The Alpha Illusion: Reported Alpha from LLM Trading Agents Should Not Be Treated as Deployment Evidence

Critical study of LLM-based trading agents (FinCon, FinMem, TradingAgents, FinAgent, QuantAgent, FLAG-Trader). Reported Sharpe ratios do not constitute deployment evidence: temporal contamination, unmodeled frictions, and insufficient predictive calibration invalidate claims. Proposes P1-P6 protocol and modular architecture with LLM as audit interface.

AI Agents · Benchmarks · Evals
SIG 75 · HYP 15
arXiv cs.AI·

PersonaArena: Dynamic Simulation for Evaluating and Enhancing Persona-Level Role-Playing in Large Language Models

PersonaArena is a dynamic simulation framework for evaluating and improving persona-level role-playing in LLMs. It leverages a filtered corpus of user-generated social content, constructs a nuanced persona bank, and simulates multi-turn interactions in social environments. A multi-agent debating judge provides holistic and unbiased assessment.

AI Agents · Multi-agent · Evals
SIG 65 · HYP 35
arXiv cs.AI·

Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management

Study of autonomous AI agents in multi-echelon supply chains using the MIT Beer Game. Reasoning models reduce costs by 67% vs human teams, but reveal an 'agent bullwhip effect': decision unreliability is amplified across echelons. A GRPO-based reinforcement-learning post-training framework using system-level rewards improves reliability and reduces tail events.

AI Agents · Multi-agent · Reasoning
SIG 78 · HYP 25
arXiv cs.AI·

Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

HT-GRPO, a hierarchical reinforcement learning method for diffusion multi-modal models, organizes optimization into three stages (global, structure, refinement). It solves multiple unmasking sequences and assigns differentiated rewards based on token importance. Tests on MMaDA and Lumina-DiMOO show gains on GenEval and DPG benchmarks.

Reinforcement learning · Image generation · Benchmarks
SIG 72 · HYP 25
arXiv cs.AI·

GRID: Graph Representation of Intelligence Data for Security Text Knowledge Graph Construction

GRID is an end-to-end framework for constructing security knowledge graphs from cyber threat intelligence articles. Using Qwen3-4B-Instruct, it combines graph extraction, text revision, and a task bank (multi-choice questions + regex) to generate stable rewards. On 249 CTI articles, the Task-bank Reward model achieves 84.62% precision, 64.91% recall, and 68.53% Avg F1.

Reinforcement learning · Benchmarks
SIG 72 · HYP 18
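A note on reading the GRID figures: the harmonic mean of the reported precision (84.62%) and recall (64.91%) is about 73.47%, which differs from the reported 68.53% "Avg F1". One plausible reading, and it is only an assumption since the summary does not say how the average is taken, is that F1 is computed per article and then averaged. The sketch below shows the arithmetic and uses made-up per-item values purely to show that the two aggregation orders can disagree.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_of_f1(pairs):
    """Mean of per-item F1 scores; pairs is a list of (precision, recall)."""
    return sum(f1(p, r) for p, r in pairs) / len(pairs)


def f1_of_averages(pairs):
    """F1 computed from the mean precision and the mean recall."""
    mean_p = sum(p for p, _ in pairs) / len(pairs)
    mean_r = sum(r for _, r in pairs) / len(pairs)
    return f1(mean_p, mean_r)


# F1 implied by the reported aggregate precision and recall.
implied = f1(0.8462, 0.6491)

# Illustrative (made-up) per-article values, not data from the paper.
toy = [(1.0, 0.2), (0.6, 0.9)]
```

On the toy values, `average_of_f1(toy)` is lower than `f1_of_averages(toy)`, the same direction as the gap between 68.53% and the implied 73.47%.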