Archives

May 2026

3148 articles

arXiv cs.AI

AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive

AutoLLMResearch introduces an agentic framework to automate configuration of expensive LLM experiments. The system learns from low-fidelity experiments to extrapolate toward promising high-fidelity configurations. LLMConfig-Gym provides a multi-fidelity environment with >1M GPU hours of verified experiment outcomes.

AI Agents · Reinforcement learning · Benchmarks
SIG 75 · HYP 25

arXiv cs.AI

EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

EnactToM is an evolving benchmark with 300 multi-agent embodied tasks in 3D household environments with partial observability. It tests functional Theory of Mind—acting optimally on implicit beliefs—rather than literal belief questions. All seven frontier models score 0.0% on hard task completion, with 93% of failures traced to epistemic coordination breakdowns.

Multi-agent · Reasoning · Benchmarks
SIG 78 · HYP 25

arXiv cs.AI

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

ScaleLogic, a synthetic logical reasoning framework, demonstrates that RL can teach long-horizon reasoning to LLMs. Training compute follows a power law with proof depth (T ∝ D^γ, R² > 0.99), with exponent γ increasing from 1.04 to 2.60 as logical expressiveness grows. Models trained on more expressive logics transfer better (+10.66 points on downstream benchmarks).
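
The power-law relation quoted above (T ∝ D^γ) can be illustrated with a log-log least-squares fit. This is only a minimal sketch: the function name `fit_power_law` and the data points are hypothetical, synthetic values chosen for illustration, not results from the paper, and the paper's own fitting procedure may differ.

```python
import math

def fit_power_law(depths, compute):
    """Fit compute ~ c * depth**gamma by least squares in log-log space.

    Returns (gamma, c, r_squared), where r_squared is computed on the logs.
    """
    xs = [math.log(d) for d in depths]
    ys = [math.log(t) for t in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx                      # slope in log-log space = exponent
    log_c = mean_y - gamma * mean_x        # intercept = log of the prefactor
    ss_res = sum((y - (log_c + gamma * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return gamma, math.exp(log_c), r_squared

# Synthetic, noise-free data (NOT from the paper): T = 3 * D**2.6
depths = [2, 4, 8, 16, 32]
compute = [3.0 * d ** 2.6 for d in depths]
gamma, c, r_squared = fit_power_law(depths, compute)
print(f"gamma={gamma:.3f}, c={c:.3f}, R^2={r_squared:.4f}")
```

On real measurements the fit would be noisy; an R² near 1 in log-log space is what a statement like "R² > 0.99" refers to.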

Reinforcement learning · Reasoning · Benchmarks
SIG 82 · HYP 18

arXiv cs.AI

HAAS: A Policy-Aware Framework for Adaptive Task Allocation Between Humans and Artificial Intelligence Systems

HAAS is a framework for adaptive task allocation between humans and AI systems in software engineering and manufacturing. It combines rule-based governance constraints with contextual-bandit learning. Results show governance is not binary but a tunable design variable: moderate governance improves operational performance and reduces fatigue in manufacturing while remaining competitive as the learner gains experience.

AI Agents · Multi-agent · Reinforcement learning
SIG 72 · HYP 18

arXiv cs.AI

MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization

MolClaw is an autonomous agent with a three-tier hierarchical architecture (70 skills) for drug molecule evaluation, screening, and optimization. It integrates 30+ specialized resources and achieves state-of-the-art performance on MolBench, a benchmark spanning 8 to 50+ sequential tool calls. Gains concentrate on structured workflow orchestration rather than ad hoc scripting.

AI Agents · Multi-agent · Benchmarks
SIG 78 · HYP 25

arXiv cs.AI

Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents

Neurosymbolic architecture with ontologies (Role, Domain, Interaction) for enterprise LLM agents. Controlled experiment (1,800 runs, Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B): ontology-constrained agents outperform ungrounded agents on metric accuracy and role consistency (p < .001). 2x greater lift in localized domains (Vietnam) where LLM training coverage is weak.

AI Agents · Claude · Reasoning
SIG 78 · HYP 25

arXiv cs.CL

Patients Speak, AI Listens: LLM-based Analysis of Online Reviews Uncovers Key Drivers for Urgent Care Satisfaction

arXiv study analyzing 10,000+ Google Maps reviews of urgent care facilities (DMV, Florida) using GPT and prompt engineering. Interpersonal factors and operational efficiency emerge as primary satisfaction drivers, while technical quality, finances, and facilities show no significant independent effects. Among socioeconomic factors, only population density correlates with ratings.

GPT · Prompt engineering · Papers
SIG 65 · HYP 25

arXiv cs.CL

Mitigating Extrinsic Gender Bias for Bangla Classification Tasks

Investigation of extrinsic gender bias in Bangla pretrained language models. Four manually annotated task-specific benchmark datasets constructed (sentiment analysis, toxicity detection, hate speech, sarcasm detection) with minimal-pair gender perturbations. RandSymKL debiasing strategy proposed, combining symmetric KL divergence and cross-entropy loss. Implementation and datasets publicly released.
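
As a rough illustration of combining symmetric KL divergence with cross-entropy, the sketch below penalizes disagreement between predictions on an original sentence and its gender-perturbed minimal pair. It is an assumption-laden sketch, not the paper's RandSymKL: the function names, the weighting parameter `lam`, and the 0.5 averaging convention are hypothetical, and the summary does not specify the randomization component.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p)."""
    return 0.5 * (kl(p, q) + kl(q, p))

def debias_loss(logits_orig, logits_perturbed, label, lam=1.0):
    """Cross-entropy on the original input plus a symmetric-KL consistency term.

    `lam` (hypothetical) trades task accuracy against prediction consistency
    across the minimal pair.
    """
    p = softmax(logits_orig)
    q = softmax(logits_perturbed)
    cross_entropy = -math.log(p[label])
    return cross_entropy + lam * symmetric_kl(p, q)

# Illustrative logits (made up): the perturbed input shifts the prediction.
loss_consistent = debias_loss([2.0, 0.5], [2.0, 0.5], label=0)
loss_shifted = debias_loss([2.0, 0.5], [0.5, 2.0], label=0)
```

When the two predictions match, the consistency term vanishes and the loss reduces to plain cross-entropy; any gender-dependent shift in the output distribution adds a positive penalty.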

Benchmarks · AI safety · Alignment
SIG 72 · HYP 15

arXiv cs.AI

When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents

Study reveals a safety vulnerability in personalized dialogue agents: long-term memory biases intent inference and legitimizes harmful queries. PS-Bench benchmark shows personalization increases attack success rates by 15.8%–243.7% versus stateless baselines. A lightweight detection-reflection method is proposed to mitigate this safety degradation.

AI safety · AI Agents · Benchmarks
SIG 78 · HYP 25

arXiv cs.AI

Inference-Time Diversity in RL-Trained Lean Theorem Provers: A Diagnostic Study

RL-trained Lean theorem provers suffer mode collapse at inference: doubling sampling from k=32 to k=64 on miniF2F-test with DeepSeek-Prover-V1.5-RL solves zero additional theorems (42/244). Fixed structural diversity of 15 tactic skeletons recovers +45% relative improvement at k=16 (+12.3±4.2 theorems). The phenomenon is RL-specific and orthogonal to scaling.

Reasoning · Reinforcement learning · Benchmarks
SIG 78 · HYP 15

arXiv cs.CL

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Vision-OPD introduces regional-to-global self-distillation to improve fine-grained visual understanding in MLLMs. The framework transfers the model's privileged perception on evidence-centered crops to its full-image policy via token-level KL divergence minimization on on-policy rollouts. Competitive results on fine-grained visual understanding benchmarks without external models or ground-truth labels.

Vision · Reinforcement learning · Papers
SIG 78 · HYP 25

arXiv cs.AI

QuickLAP: Quick Language-Action Preference Learning for Semi-Autonomous Agents

QuickLAP fuses physical and language feedback to learn robot reward functions in real time using a Bayesian framework. LLMs extract reward feature attention masks and preference shifts from free-form utterances, which are integrated with physical corrections via a closed-form update rule. Achieves 70% error reduction vs physical-only and heuristic multimodal baselines in a semi-autonomous driving simulator.

AI Agents · Reinforcement learning · Reasoning
SIG 78 · HYP 25

arXiv cs.AI

ALIGN: A Vision-Language Framework for High-Accuracy Accident Location Inference through Geo-Spatial Neural Reasoning

ALIGN is a vision-language framework to infer precise accident coordinates from Bangla news reports and map-based cues. Using an agentic architecture combining OCR, LLM, and vision-language models, the system reduces localization error from 10.9 km to 0.593 km on validation data and 0.465 km on official Dhaka Metropolitan Police records.

Vision · AI Agents · Multi-agent
SIG 78 · HYP 25

arXiv cs.AI

WELD: The First Naturalistic Long-Period Small-Team Workplace Emotion Dataset for Ubiquitous Affective Computing

WELD is the first emotion dataset collected in a naturalistic workplace context, spanning 30.1 months (Nov 2021–May 2024) with 49 employees from a Chinese software company. Its 733,780 seven-class facial-expression probability vectors validate three established phenomena and reveal six asymmetric emotional regimes. It also exposes FER model bias: over-prediction of 'angry' on neutral Asian faces (0.194 vs 0.05).

Vision · Evals · AI safety
SIG 82 · HYP 15

arXiv cs.AI

An AI system to help scientists write expert-level empirical software

ERA, an AI system combining an LLM with tree search, automatically generates expert-level scientific software. It discovered 40 novel bioinformatics methods outperforming top human-developed approaches, generated 14 epidemiological models surpassing the CDC ensemble for COVID-19 hospitalization forecasting, and produced expert-level solutions for geospatial analysis and neural activity prediction.

AI Agents · Reasoning · Code generation
SIG 82 · HYP 28

arXiv cs.AI

EndoCogniAgent: Closed-Loop Agentic Reasoning with Self-Consistency Validation for Endoscopic Diagnosis

EndoCogniAgent is a closed-loop agentic framework for iterative endoscopic diagnosis. It couples fine-grained visual evidence acquisition and multi-step reasoning via self-consistency validation (knowledge and temporal consistency). On EndoAgentBench (6,132 QA pairs from 11 datasets), the system achieves 85.23% accuracy on perception and 71.13% clinical acceptance on reasoning tasks.

AI Agents · Reasoning · Vision
SIG 78 · HYP 25

arXiv cs.AI

GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis

GenoMAS is an LLM-based multi-agent framework for gene expression analysis. Six specialized agents, orchestrated via typed message-passing protocols, combine structured workflows with autonomous adaptability. On the GenoTEX benchmark: 89.13% correlation for preprocessing, F1 of 60.48% for gene identification (+10.61% and +16.85% vs prior art).

Multi-agent · AI Agents · Code generation
SIG 82 · HYP 18

arXiv cs.AI

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Vision-OPD introduces regional-to-global self-distillation to improve fine-grained visual understanding in MLLMs. The framework transfers the model's privileged perception on evidence-centered crops to its full-image policy via KL divergence minimization between token distributions. Competitive results on fine-grained visual understanding benchmarks without external models or ground-truth labels.

Vision · Reinforcement learning · Benchmarks
SIG 72 · HYP 18

arXiv cs.AI

Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents

Reversa is a reverse documentation engineering framework converting legacy systems into operational specifications for AI agents. A multi-agent pipeline extracts implicit business rules, synthesizes architecture, and generates traceable specifications with confidence marking. Case study: COBOL-to-Go ATM migration producing 517 claims, 10 identified gaps, and 53 Gherkin scenarios.

AI Agents · Multi-agent · Code generation
SIG 72 · HYP 25

arXiv cs.AI

Data Presentation Over Architecture: Resampling Strategies for Credit Risk Prediction with Tabular Foundation Models

Comparative study of tabular foundation models (TFMs) vs classical models on credit default prediction. On Home Credit and Lending Club datasets, context construction strategy (balanced vs uniform sampling) explains more AUC-ROC variance than model choice: +3-4 AUC points. With 5K-10K balanced examples, TFMs match classical GBDTs while improving default-class recall.

Benchmarks
SIG 75 · HYP 15

arXiv cs.CL

Agentic Chunking and Bayesian De-chunking of AI Generated Fuzzy Cognitive Maps: A Model of the Thucydides Trap

Automatic generation of fuzzy cognitive maps (FCMs) from text using LLM agents that chunk text into overlapping segments. Convex mixing of chunk FCMs produces a cyclic FCM knowledge graph. Operator-level Bayesian inference generates "de-chunked" FCMs. Demonstration on Thucydides Trap model: 7 out of 8 FCMs predicted armed conflict. Gemini 3.1 served as chunking agent.
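
Convex mixing of several FCMs can be pictured as a weighted average of their edge-weight matrices, with nonnegative weights that sum to one. The sketch below assumes every chunk FCM is expressed over the same ordered concept list; the function name `convex_mix`, the matrices, and the equal weights are hypothetical illustrations, not values from the paper.

```python
def convex_mix(matrices, weights, tol=1e-9):
    """Return the convex combination of square edge-weight matrices.

    matrices: list of n x n lists of floats (one causal edge matrix per chunk).
    weights:  nonnegative mixing weights that must sum to 1.
    """
    if len(matrices) != len(weights):
        raise ValueError("need exactly one weight per matrix")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must be nonnegative and sum to 1")
    n = len(matrices[0])
    mixed = [[0.0] * n for _ in range(n)]
    for w, matrix in zip(weights, matrices):
        for i in range(n):
            for j in range(n):
                mixed[i][j] += w * matrix[i][j]
    return mixed

# Two made-up 2-concept chunk FCMs; entry [i][j] is the causal edge i -> j.
chunk_a = [[0.0, 0.8], [0.0, 0.0]]
chunk_b = [[0.0, 0.4], [-0.6, 0.0]]
mixed = convex_mix([chunk_a, chunk_b], [0.5, 0.5])
```

In this toy case the mixed map contains both the 0 -> 1 edge and the 1 -> 0 edge, so it has a feedback cycle even though `chunk_a` alone has none.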

AI Agents · Gemini · RAG
SIG 65 · HYP 25

arXiv cs.AI

Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation

RAT (Randomized Advantage Transformation) estimates Tikhonov-regularized natural policy gradients via direct backpropagation without explicit Fisher matrix construction. The method applies the Woodbury formula and randomized block Kaczmarz iterations on on-policy mini-batches. Results match or exceed established natural-gradient methods on continuous and visual control benchmarks.

Reinforcement learning · Reasoning · Papers
SIG 75 · HYP 15

arXiv cs.CL

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

Researchers identify Entropy-Gradient Inversion, a negative correlation between token entropy and logit gradients, as a geometric fingerprint of Large Reasoning Models' reasoning capability. They propose Correlation-Regularized Group Policy Optimization (CorR-PO), an RL method embedding this inversion signature into reward regularization, outperforming baselines across multiple reasoning benchmarks.

Reasoning · Reinforcement learning · Benchmarks
SIG 78 · HYP 25

arXiv cs.AI

DBES: A Systematic Benchmark and Metric Suite for Evaluating Expert Specialization in Large-Scale MoEs

DBES is a diagnostic framework for evaluating expert specialization in Mixture-of-Experts models. Five theoretically grounded metrics measure domain isolation and routing specialization. Testing on Qwen, DeepSeek, and GLM reveals distinct specialization paradigms. Targeted post-training on specialized expert paths improves performance by 66–94% using only 15% of original training resources.

Benchmarks · Qwen · DeepSeek
SIG 82 · HYP 18

arXiv cs.AI

Ensembling Tabular Foundation Models - A Diversity Ceiling And A Calibration Trap

Six modern tabular foundation models form a highly redundant ensemble (mean Q-statistic 0.961). On 153 OpenML classification tasks, the best ensemble (two-level cascade stacking) gains +0.18% accuracy at 253× compute cost. Friedman-Nemenyi analysis places three ensembles and the best single model in the same equivalence group. Greedy selection is recommended as practical default.
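
For readers unfamiliar with the Q-statistic, the sketch below computes Yule's Q for one pair of classifiers from their per-example correctness. It assumes the summary refers to this standard pairwise diversity measure from the ensemble literature; the function name and the correctness vectors are made up for illustration.

```python
def q_statistic(correct_a, correct_b):
    """Yule's Q for two classifiers, given per-example correctness flags.

    Q = (N11*N00 - N01*N10) / (N11*N00 + N01*N10), where N11 counts examples
    both classifiers get right, N00 both wrong, N10 only A right, N01 only B.
    Q near 1 means the classifiers err on the same examples (low diversity).
    """
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1
        elif not a and not b:
            n00 += 1
        elif a:
            n10 += 1
        else:
            n01 += 1
    denominator = n11 * n00 + n01 * n10
    if denominator == 0:
        raise ValueError("Q is undefined for this contingency table")
    return (n11 * n00 - n01 * n10) / denominator

# Made-up correctness flags (1 = correct) for two models on eight examples.
model_a = [1, 1, 1, 0, 0, 1, 1, 0]
model_b = [1, 1, 1, 0, 0, 1, 0, 1]
q = q_statistic(model_a, model_b)  # N11=4, N00=2, N10=1, N01=1
```

A mean Q over all model pairs close to 1 indicates that the ensemble members mostly succeed and fail together, which limits what stacking can add.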

Benchmarks · Papers
SIG 75 · HYP 15

arXiv cs.AI

Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights

Reinforcement learning framework for predicting customer trajectories in retail spaces. The RL-based approach outperforms TSP/PNN heuristics (average 28% deviation from shortest paths) by modeling bounded rationality. Validated on real convenience store data: RL predictions align better with observed behavior and give more accurate impulse-purchase rates and shelf-traffic estimates, enabling practical layout optimization.

Reinforcement learning · AI Agents · Business
SIG 72 · HYP 18

arXiv cs.CL

Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models

Researchers train KinGPT (25M parameters) on chess data and demonstrate that high benchmark scores of chess-trained LLMs stem primarily from pattern-matching rather than genuine rule understanding. LLM-Modulo, a verifier-in-the-loop framework, improves RedPajama 3B from 1.2% to 21.2% best-move accuracy. Training code, datasets, and model checkpoints open-sourced.

Benchmarks · Evals · Fine-tuning
SIG 75 · HYP 25

arXiv cs.AI

Qumus: Realization of An Embodied AI Quantum Material Experimentalist

Qumus is the first embodied AI quantum materials experimentalist: an autonomous robotic mini-laboratory capable of hypothesis generation, protocol planning, and experimental execution on 2D quantum materials. It achieved first-time AI creation of graphene and fabrication of atomically thin field-effect transistors via van der Waals stacking, with closed-loop error correction.

AI Agents · Multi-agent · Robotics
SIG 82 · HYP 35