Archives

May 2026

3148 articles

arXiv cs.AI

HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

HINT-SD proposes targeted self-distillation for training long-horizon LLM agents. The method uses full-trajectory hindsight to identify failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. On BFCL v3 and AppWorld, it improves over dense per-turn feedback baselines by up to 18.80% while achieving 2.26× lower time per training step.

AI Agents · Reinforcement learning · Reasoning
SIG 75 · HYP 15

arXiv cs.AI

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

BacktestBench is the first large-scale benchmark for automated quantitative backtesting, containing 18,246 annotated QA pairs from 6 million real market records. AutoBacktest, a multi-agent system, translates natural language strategies into reproducible backtests via Summarizer-Retriever-Coder coordination. Evaluation on 23 LLMs identifies key performance factors.

AI Agents · Multi-agent · Code generation
SIG 78 · HYP 25

arXiv cs.AI

SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models

SAFE-SVD proposes a compression method for physics foundation models (PFMs) that preserves physical fidelity. The technique models layer sensitivity in the output function space, avoiding severe performance degradation caused by conventional methods. Experiments show substantial gains in compression ratios while maintaining accuracy across multiple models and datasets.

Papers · Benchmarks · Infrastructure
SIG 72 · HYP 28

arXiv cs.AI

Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

Distinguishable Deletion (D²) unifies knowledge deletion and refusal for LLM unlearning. The method uses an energy index to erase undesirable knowledge in latent representations rather than specific tokens, avoiding biased deletion and re-emergence of harmful content. Energy-based Unlearning Alignment (EUA) applies this mechanism at training and inference.

AI safety · Alignment · Papers
SIG 72 · HYP 25

arXiv cs.AI

EmoMind: Decoding Affective Captions from Human Brain fMRI

EmoMind decodes affective captions directly from brain fMRI signals. The system first retrieves a neutral scene description from brain-decoded visual features, then rewrites it using a continuous 34-dimensional emotion vector extracted from the same fMRI recording. Evaluated on two independent emotion fMRI datasets, EmoMind outperforms GPT-4 with discrete emotion labels across all validation axes.

Vision · Reasoning · Evals
SIG 75 · HYP 25

arXiv cs.AI

Domain Incremental Learning for Pandemic-Resilient Chest X-Ray Analysis

A replay-based continual learning method for adapting pneumonia detection models across clinical domain shifts without catastrophic forgetting. It incorporates class-aware balanced replay and a dynamically reweighted class-imbalance loss, and achieves 88.66% accuracy on PneumoniaMNIST with 5 simulated domains, outperforming Experience Replay and Fine-Tuning baselines.

Reinforcement learning · Vision · Benchmarks
SIG 72 · HYP 15

arXiv cs.AI

Agents for Experiments, Experiments for Agents: A Design Grammar for AI-Enabled Experimental Science

SEED is a framework representing experimental conditions as typed actor-flow graphs to study multi-agent systems and human-AI workflows. It enables describing conditions, evaluating structural novelty, and generating candidate designs under constraints. Empirical test on medical-triage task shows SEED-guided designs provide clearer interaction changes, assumptions, and governance checks.

AI Agents · Multi-agent · Evals
SIG 72 · HYP 18

arXiv cs.AI

Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models

Alice is an online executable world-model learning system that discovers environment dynamics without rule descriptions or reward signals. The agent induces transition laws from interaction alone, treating preservation conflicts as structural signal to refine hypothesis classes. Evaluation on Baba in Wonderland shows substantial improvement under prior misalignment.

Reasoning · Reinforcement learning · Papers
SIG 72 · HYP 15

arXiv cs.AI

Multi-Paradigm Agent Interaction in Practice: A Systematic Analysis of Generator-Evaluator, ReAct Loop, and Adversarial Evaluation in the buddyMe Framework

buddyMe, an open-source multi-model framework, integrates three agent interaction paradigms: multi-agent orchestration (Generator-Evaluator), ReAct loops, and memory-augmented interaction. Its five-stage pipeline was tested on 4 real cases (museum guides, weather, tour planning). Results: 20% requirement omission detection, 30% redundant tool invocations, and adversarial consensus in 2-3 rounds (70% of scenarios).

AI Agents · Multi-agent · Reasoning
SIG 72 · HYP 28

arXiv cs.AI

MetaCogAgent: A Metacognitive Multi-Agent LLM Framework with Self-Aware Task Delegation

MetaCogAgent is a multi-agent LLM framework where each agent evaluates task-capability alignment via a Metacognitive Self-Assessment Unit before execution. The system combines verbalized uncertainty and historical capability profiles to route tasks to best-suited agents. On MetaCog-Eval benchmark (700 tasks), it achieves 82.4% accuracy (+8.7% vs baselines) with 5% fewer API calls than AutoGen.

Multi-agent · AI Agents · Reasoning
SIG 72 · HYP 28

arXiv cs.AI

Verify-Gated Completion as Admission Control in a Governed Multi-Agent Runtime: A Bounded Architecture Case Study

Study of verify-gated completion pattern for controlling persistent multi-agent systems. Bounded implementation: 99.5% verification success rate (1,791/1,800 events), 98.58% rule agreement with governance verifier. Results limited to decision inspectability and fail-closed behavior; no safety guarantees or task-level coverage claims supported.
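A fail-closed completion gate can be sketched in a few lines. This is only an illustration of the general pattern, not the runtime studied in the paper: `gated_complete`, the `claim` dictionary, and the verifier callable are hypothetical names. The second helper reproduces the reported 1,791/1,800 rate.

```python
from typing import Callable


def gated_complete(claim: dict, verifier: Callable[[dict], bool]) -> str:
    """Admit a completion claim only if the verifier explicitly returns True.

    Fail-closed: an exception or any non-True result rejects the claim.
    (Hypothetical interface, for illustration only.)
    """
    try:
        verified = verifier(claim) is True
    except Exception:
        verified = False
    return "admitted" if verified else "rejected"


def success_rate(passed: int, total: int) -> float:
    """Percentage of events that passed verification."""
    return 100.0 * passed / total


print(gated_complete({"task": "demo"}, lambda c: True))   # admitted
print(gated_complete({"task": "demo"}, lambda c: "yes"))  # rejected: not strictly True
print(success_rate(1791, 1800))                           # 99.5
```

The strict `is True` check is the design choice that makes the gate fail closed: an ambiguous verifier answer is treated the same as a failure.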

Multi-agent · AI Agents · AI safety
SIG 65 · HYP 15

arXiv cs.AI

WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

WebGameBench is a requirement-to-application benchmark evaluating whether coding agents can convert a web game specification into a browser-playable application. Across 111 tasks and 12 agents, the best configuration achieves 76.9% usable rate but only 20.2% excellent rate, revealing a gap between minimum delivery and full requirement satisfaction.

AI Agents · Code generation · Benchmarks
SIG 78 · HYP 25

arXiv cs.LG

Byzantine-Resilient Federated Learning via QUBO-Based Client Selection on Quantum Annealers

A quantum annealing approach for selecting trustworthy clients in federated learning under Byzantine attacks. It reformulates client selection as a QUBO problem that jointly optimizes over all subsets. The MultiSignal hybrid ensemble achieves 95.3% detection accuracy at 100 clients on MNIST vs 91.8% for classical MultiKrum, with major gains on Sparse Lie (+23.2 points) and Advanced Lie (+4.8 points).
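To make the QUBO framing concrete, here is a toy sketch that selects k of n clients by minimizing a quadratic binary objective. The objective (per-client suspicion scores, pairwise disagreement, and a cardinality penalty) is an assumed formulation for illustration and is not taken from the paper. The brute-force loop stands in for the annealer.

```python
from itertools import product


def build_qubo(suspicion, disagreement, k, penalty):
    """Upper-triangular Q for the assumed objective:
    sum_i s_i x_i + sum_{i<j} d_ij x_i x_j + penalty * (sum_i x_i - k)^2,
    with the constant penalty * k^2 dropped (x_i^2 == x_i for binary x)."""
    n = len(suspicion)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        Q[i][i] = suspicion[i] + penalty * (1 - 2 * k)
        for j in range(i + 1, n):
            Q[i][j] = disagreement[i][j] + 2 * penalty
    return Q


def energy(Q, x):
    """QUBO energy: sum over i <= j of Q[i][j] * x_i * x_j."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))


def solve_brute_force(Q):
    """Exhaustive search over all 2^n subsets (only feasible for tiny n)."""
    n = len(Q)
    return min(product((0, 1), repeat=n), key=lambda x: energy(Q, x))


# Hypothetical data: client 2 looks suspicious and disagrees with the others.
suspicion = [0.1, 0.2, 0.9, 0.15]
disagreement = [
    [0.0, 0.05, 0.5, 0.05],
    [0.0, 0.0, 0.5, 0.05],
    [0.0, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
]
best = solve_brute_force(build_qubo(suspicion, disagreement, k=3, penalty=10.0))
print(best)  # (1, 1, 0, 1): client 2 is excluded
```

Because the pairwise terms couple every pair of clients, the minimum is a property of the whole subset rather than of individually ranked clients.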

Reinforcement learning · AI safety · Benchmarks
SIG 72 · HYP 25

arXiv cs.AI

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

Researchers identify Entropy-Gradient Inversion, a negative correlation between token entropy and logit gradients, as a geometric fingerprint of Large Reasoning Models' reasoning capability. They propose Correlation-Regularized Group Policy Optimization (CorR-PO), embedding this inversion signature into RL reward regularization, outperforming baselines across multiple reasoning benchmarks.

Reasoning · Reinforcement learning · Benchmarks
SIG 78 · HYP 15

arXiv cs.AI

PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows

PROTEA is an interface for offline debugging and refinement of multi-agent LLM workflows. It evaluates intermediate outputs with configurable rubrics, localizes bottlenecks via workflow graph visualization, and generates targeted prompt revisions. On two production-adjacent workflows, PROTEA improves document-inspection accuracy from 64.3% to 83.9% and recommendation Hit@5 from 0.30 to 0.38.
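For readers unfamiliar with the metric, Hit@5 is commonly computed as the fraction of queries whose relevant item appears among the top five recommendations. The sketch below assumes that standard definition with one relevant item per query; the summary does not say how PROTEA computes it.

```python
def hit_at_k(ranked_lists, relevant_items, k=5):
    """Fraction of queries whose relevant item is in the top-k ranked results.

    Assumes one relevant item per query (the common Hit@k convention).
    """
    hits = sum(
        1 for ranked, relevant in zip(ranked_lists, relevant_items)
        if relevant in ranked[:k]
    )
    return hits / len(ranked_lists)


# Hypothetical toy data: two queries, one hit within the top 5.
ranked = [["a", "b", "c", "d", "e", "f"], ["u", "v", "w", "x", "y", "z"]]
relevant = ["c", "z"]
print(hit_at_k(ranked, relevant, k=5))  # 0.5
```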

Multi-agent · AI Agents · Prompt engineering
SIG 78 · HYP 18

arXiv cs.AI

Agentic Chunking and Bayesian De-chunking of AI Generated Fuzzy Cognitive Maps: A Model of the Thucydides Trap

Automatic generation of feedback fuzzy cognitive maps (FCMs) from text using LLM agents to chunk text with overlaps. Convex mixing of chunk FCMs produces representative cyclic FCM knowledge graphs. Operator-level Bayesian inference generates de-chunked posterior FCMs. Demonstrated on Allison's Thucydides Trap model: 7 out of 8 FCM knowledge graphs predicted war when stimulated.

AI Agents · Reasoning · Gemini
SIG 45 · HYP 35

arXiv cs.AI

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

SVFSearch is a multimodal benchmark for short-video frame search in the Chinese gaming domain. It contains 5,000 test examples and 4,198 training examples based on real game scenes. Evaluation compares direct QA, RAG, Plan-Act-Replan agents, and learned search models: best open-source model reaches 66.4%, best practical agent 79.1%, oracle 95.4%.

Benchmarks · AI Agents · RAG
SIG 78 · HYP 15

arXiv cs.AI

FLAG: Foundation model representation with Latent diffusion Alignment via Graph for spatial gene expression prediction

FLAG is a latent diffusion framework for predicting spatial gene expression from H&E images. It integrates a spatial graph encoder and Gene Foundation Model alignment to address the Gene Dimension Curse and preserve biological relationships (gene coordination, spatial distribution). Introduces novel structural evaluation metrics: GSC and SSC.

Papers · Vision · Reasoning
SIG 72 · HYP 18

arXiv cs.AI

VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation

VISAFF is a framework for Emotion Recognition in Conversation (ERC) using vision-language models. It combines two stages: speaker-centered affective grounding and reliability-guided affective complementation. The tuning-free approach leverages frozen VLMs' reasoning capabilities, integrating visual, textual, and acoustic signals to improve accuracy without expensive fine-tuning.

Vision · Multi-agent · Papers
SIG 72 · HYP 25

arXiv cs.AI

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

SCICONVBENCH benchmarks LLMs on multi-turn clarification of ill-posed scientific problems across fluid mechanics, solid mechanics, materials science, and PDEs. Best models resolve only 52.7% of disambiguation cases in fluid mechanics, but perform better on inconsistency detection. Evaluates clarification behavior, conversational grounding, and specification fidelity.

Benchmarks · Reasoning · Code generation
SIG 78 · HYP 15

arXiv cs.LG

QuantFPFlow: Quantum Amplitude Estimation for Fokker-Planck Policy Optimisation in Continuous Reinforcement Learning

QuantFPFlow integrates quantum amplitude estimation into stochastic policy optimization via a Fokker-Planck formulation. Grover-amplified estimation achieves a quadratic speedup, O(1/ε) versus the classical O(1/ε²). On continuous control, it outperforms SAC (1295.7 vs 1284.0 reward) and finds the global optimum 10.4% more frequently in relative terms (33.9% vs 30.7%).
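The two numeric claims can be checked with simple arithmetic. The sketch below compares the O(1/ε²) and O(1/ε) scalings with constant factors set to 1 (an assumption, since the summary gives no constants), and computes the relative gain implied by 33.9% vs 30.7%.

```python
def classical_samples(inv_eps: int) -> int:
    """Classical Monte Carlo cost scaling, O(1/eps^2), for eps = 1/inv_eps."""
    return inv_eps ** 2


def amplitude_estimation_queries(inv_eps: int) -> int:
    """Amplitude-estimation cost scaling, O(1/eps), for eps = 1/inv_eps."""
    return inv_eps


def relative_gain_percent(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


for inv_eps in (10, 100, 1000):
    print(inv_eps, classical_samples(inv_eps), amplitude_estimation_queries(inv_eps))

# 33.9% vs 30.7% is a 3.2-point absolute gap, about 10.4% in relative terms.
print(round(relative_gain_percent(33.9, 30.7), 1))  # 10.4
```

At ε = 0.01, for example, the idealized counts are 10,000 classical samples against 100 amplitude-estimation queries.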

Reinforcement learning · Reasoning · Papers
SIG 72 · HYP 28

arXiv cs.AI

Train the Trainers -- An Agentic AI Framework for Peer-Based Mental Health Support in Battlefield Environments

Agentic AI framework for peer-based mental health support in military operations. Recovered soldiers trained as peer facilitators supervise specialized AI agents (symptom triage, interventions, documentation) in air-gapped environments. Prototype developed with U.S. Army Health Center. Goal: reduce evacuations, accelerate care, maintain human oversight.

AI Agents · Multi-agent · AI safety
SIG 72 · HYP 28

arXiv cs.AI

Beyond Imperfect Alternatives with Rulemapping: A Neuro-Symbolic Case Study on Online Hate Speech

Neuro-symbolic study comparing LLMs constrained by deterministic logic scaffolds (Rulemapping) versus unconstrained prompting for hate speech moderation under German Criminal Code (§130). Rulemapping achieves precision 0.80-0.86 and recall 0.82-0.89 versus 0.34-0.49 with unconstrained prompting, eliminating conflation of moral offense with legal illegality.

Reasoning · AI safety · Regulation
SIG 75 · HYP 15

arXiv cs.LG

M²FedAQI: Multimodal Federated Learning for Air Quality Prediction on Heterogeneous Edge Devices

M²FedAQI introduces a lightweight multimodal federated framework for decentralized Air Quality Index (AQI) prediction across heterogeneous edge devices. The system fuses visual and tabular data through feature modulation-based fusion. Evaluated on PM25Vision and TRAQID datasets, it achieves 11% accuracy improvement, 3.53% AUC gain, 12.2% F1-score increase, and 18% R² improvement over baselines.

Vision · Benchmarks · Papers
SIG 72 · HYP 25

arXiv cs.AI

Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs

Fre-Res introduces adaptive video-token compression for video MLLMs. The framework separates spatial details (high-fidelity anchors) from temporal evolution (residual-frequency tokens via 1D-DCT). A Spatial-Guided Absorber aligns frequency dynamics with visual embeddings. Results: near full-token performance with substantial reduction in token length across short and long-video benchmarks.
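As a rough illustration of encoding temporal evolution with a 1D-DCT, the sketch below takes one scalar feature tracked across frames, subtracts an anchor value, transforms the residuals along time, and keeps only the lowest-frequency coefficients. The function names and the scalar-feature setup are assumptions for illustration; Fre-Res's actual tokenization and its Spatial-Guided Absorber are not modeled here.

```python
import math


def dct_ii(signal):
    """Unnormalized 1D DCT-II of a list of floats."""
    n = len(signal)
    return [
        sum(signal[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        for k in range(n)
    ]


def idct_ii(coeffs):
    """Inverse of dct_ii (a scaled DCT-III)."""
    n = len(coeffs)
    return [
        coeffs[0] / n
        + (2.0 / n) * sum(
            coeffs[k] * math.cos(math.pi * (t + 0.5) * k / n) for k in range(1, n)
        )
        for t in range(n)
    ]


def compress_residuals(frames, anchor, keep):
    """DCT the per-frame residuals (frame - anchor); keep `keep` low frequencies."""
    residuals = [value - anchor for value in frames]
    return dct_ii(residuals)[:keep]


def reconstruct(coeffs, anchor, num_frames):
    """Zero-pad the dropped coefficients, invert the DCT, add the anchor back."""
    padded = list(coeffs) + [0.0] * (num_frames - len(coeffs))
    return [anchor + r for r in idct_ii(padded)]


frames = [1.0, 1.5, 2.0, 2.5]  # hypothetical feature values over 4 frames
coeffs = compress_residuals(frames, anchor=1.0, keep=2)
print(reconstruct(coeffs, anchor=1.0, num_frames=4))
```

Keeping all coefficients reproduces the input up to floating-point error, while keeping only the first yields the anchor plus the mean residual for every frame.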

Vision · Video generation · Evals
SIG 72 · HYP 18

arXiv cs.AI

CheckSupport: A Local LLM-Powered Tool for Automated Manuscript Submission Checklist Selection and Completion

CheckSupport is an open-source system using locally-deployed LLMs to automate reporting checklist recommendation and completion for scientific manuscripts. Evaluated on peer-reviewed manuscripts, it achieves 90% accuracy for checklist recommendations and 88% for item-level completion, processing each manuscript in 12.5 seconds on CPU-only hardware.

Llama · Prompt engineering · Evals
SIG 75 · HYP 15