May 2026

3149 articles

Content-Style Identification via Differential Independence

Introduces CSDI (content-style differential independence) for identifying content and style factors in multi-domain generative models. It relaxes prior statistical independence conditions via blockwise orthogonality constraints on Jacobian subspaces, and demonstrates identifiability even with dependent content and style and with dense Jacobians.

Papers Image generation Reasoning

arXiv cs.AI·May 19

Generating Pretraining Tokens from Organic Data for Data-Bound Scaling

SynPro, a synthetic data generation framework, helps LLMs learn more thoroughly from limited organic corpora through rephrasing and reformatting operations. Optimized via reinforcement learning, it unlocks 3.7-5.2x more effective tokens than simple repetition on 400M and 1.1B models, even surpassing the non-data-bound oracle at 1.1B scale.

Reinforcement learning Benchmarks Open source

arXiv cs.AI·May 19

HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

HINT-SD proposes targeted self-distillation for training long-horizon LLM agents. The method uses full-trajectory hindsight to identify failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. On BFCL v3 and AppWorld, it improves over dense per-turn feedback baselines by up to 18.80% while achieving 2.26× lower time per training step.

AI Agents Reinforcement learning Reasoning

arXiv cs.AI·May 19

Multi-agent AI systems outperform human teams in creativity

Multi-agent LLM teams outperform human teams in creativity (Cohen's d=1.50) across 4,541 AI ideas versus 341 human ideas on six tasks. Advantage driven by novelty while maintaining usefulness. LLM teams benefit from efficient exploration (high semantic spread, shorter paths), while humans benefit from smooth conversational flow (high local coherence, frequent pivots).

Multi-agent Reasoning Benchmarks

arXiv cs.AI·May 19

Spiker-LL: An Energy-Efficient FPGA Accelerator Enabling Adaptive Local Learning in Spiking Neural Networks

Spiker-LL is an FPGA accelerator for spiking neural networks (SNNs) enabling adaptive on-device learning. Built on Spiker+ architecture, it implements the STSF local learning rule with minimal overhead. On MNIST/F-MNIST/DIGITS: 93% accuracy, sub-millisecond latency, <0.1 mJ per inference, DSP-free.

Reasoning Infrastructure Open source

arXiv cs.AI·May 19

AdaptiveLoad: Towards Efficient Video Diffusion Transformer Training

AdaptiveLoad optimizes video diffusion Transformer training (DiT, MMDiT) by addressing load imbalance from quadratic attention complexity. Two components: dual-constraint adaptive load balancing and fused LayerNorm-Modulate CUDA kernel. On Wan 2.1: computational imbalance reduced from 39% to 18.9%, peak VRAM utilization +22.7%, training throughput +27.2%.

Video generation Infrastructure Benchmarks

arXiv cs.AI·May 19

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

BacktestBench is the first large-scale benchmark for automated quantitative backtesting, containing 18,246 annotated QA pairs from 6 million real market records. AutoBacktest, a multi-agent system, translates natural language strategies into reproducible backtests via Summarizer-Retriever-Coder coordination. Evaluation on 23 LLMs identifies key performance factors.

AI Agents Multi-agent Code generation

arXiv cs.AI·May 19

A More Word-like Image Tokenization for MLLMs

DiVT (Disentangled Visual Tokenization) clusters patch embeddings into coherent semantic units for MLLMs, creating discrete meaningful visual tokens instead of continuous streams. Adapts token budget to image complexity, reducing memory and latency while improving LLM compatibility.

Vision Code generation

arXiv cs.AI·May 19

Self-Improving Tabular Language Models via Iterative Reward-Guided Post-Training

TabGRAA, a group-relative advantage alignment method, improves tabular language models through iterative reward-guided post-training. Across five benchmarks, it outperforms adapted DPO, KTO, and NPO baselines, optimizing the fidelity-utility-privacy trade-off beyond supervised fine-tuning alone.

Reinforcement learning Fine-tuning Benchmarks

arXiv cs.AI·May 19

Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling

Babel is a black-box jailbreak method exploiting a vulnerability in LLM safety alignment: safety relies on sparse attention heads, leaving representational space weakly monitored. Through optimized obfuscation and iterative refinement, Babel achieves 82.67% success on GPT-4o and 78.33% on Claude-3-5-haiku within ~40 queries.

AI safety Alignment GPT

arXiv cs.AI·May 19

SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models

SAFE-SVD proposes a compression method for physics foundation models (PFMs) that preserves physical fidelity. The technique models layer sensitivity in the output function space, avoiding severe performance degradation caused by conventional methods. Experiments show substantial gains in compression ratios while maintaining accuracy across multiple models and datasets.

Papers Benchmarks Infrastructure

arXiv cs.AI·May 19

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

ESI-Bench is a benchmark for embodied spatial intelligence spanning 10 task categories on OmniGibson. Agents must combine perception, locomotion, and manipulation to actively accumulate evidence. Experiments show active exploration outperforms passive approaches, but failures stem primarily from poor action choices rather than weak perception.

Vision Robotics Benchmarks

arXiv cs.AI·May 19

Echoes in Filter Bubble: Diagnosing and Curing Popularity Bias in Generative Recommenders

A study of popularity bias in Generative Recommenders (GRs). The authors trace the bias to a token-level optimization flaw and undifferentiated item tokenization. They propose Ghost, a GR with asymmetric unlikelihood optimization and skeleton-founded tokenization, validated across 3 datasets.

Papers Benchmarks Alignment

arXiv cs.AI·May 19

PRISMat: Policy-Driven, Permutation-Invariant Autoregressive Material Generation

PRISMat is a permutation-invariant autoregressive model for crystal material generation. Lighter and faster than LLMs, it reduces prediction error for cleavage energy and work function by 4× (MAE 0.188 eV/Å² and 2.79 eV). Applicable to high-throughput materials discovery.

Papers Benchmarks Code generation

arXiv cs.AI·May 19

A Global-Local Graph Attention Network for Traffic Forecasting

Proposes GLGAT (Global-Local Graph Attention Network) for traffic forecasting. The model combines a global attention matrix for the entire graph with local attention matrices per vertex, using pairwise encoding and an event-based adjacency matrix. Experiments on two real-world traffic datasets show competitive performance against state-of-the-art baselines.

Benchmarks Papers

arXiv cs.AI·May 19

Scalable Uncertainty Reasoning in Knowledge Graphs

Thesis proposing a modular framework for reasoning over uncertainty in knowledge graphs at three levels: imprecise attribute values, probabilistic triple existence, and incomplete schema. Combines probabilistic literals, tractable probabilistic circuits via SPARQL compilation, and topology-aware geometric embeddings for schema reasoning.

Reasoning RAG

arXiv cs.AI·May 19

Observation-Aligned Mask Priors for Learning Physical Dynamics from Authentic Occlusions

A framework learns authentic occlusion mask distributions using Bayesian Flow Networks to train diffusion-based reconstruction models on incomplete observations. Tested on oceanographic satellite data (256×256), it improves MSE and PSNR over diffusion baselines by preventing zero-query dead zones.

Papers Benchmarks Vision

arXiv cs.AI·May 19

From Prompts to Protocols: An AI Agent for Laboratory Automation

An AI agent integrating LLMs with laboratory orchestration enables scientists to create and monitor automated protocols using natural language. Tested on three simulated labs (chemistry, biology, materials science), the agent achieves 97% first-attempt success rate and reduces required interface actions by an order of magnitude.

AI Agents Reasoning Tools

arXiv cs.AI·May 19

AgentWall: A Runtime Safety Layer for Local AI Agents

AgentWall is a runtime safety layer for local AI agents. It intercepts proposed agent actions before execution, evaluates them against an explicit declarative policy, requires human approval for sensitive operations, and records a complete audit trail. Implemented as an MCP-enforcing proxy and native OpenClaw plugin, it achieves 92.9% policy enforcement accuracy with sub-millisecond overhead.
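The intercept, evaluate, approve, and audit flow described above can be sketched roughly as follows. This is a minimal illustration only: the `Policy` and `Gate` classes, the tool names, and the three-way allow / approve / deny split are hypothetical and are not AgentWall's actual policy format or API.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical declarative policy: tools not listed anywhere are denied."""
    allow: set
    require_approval: set


@dataclass
class Gate:
    """Hypothetical gate that checks each proposed action before it runs."""
    policy: Policy
    audit_log: list = field(default_factory=list)

    def check(self, tool, approver=lambda tool: False):
        # Decide before execution: allow, ask a human, or deny by default.
        if tool in self.policy.allow:
            decision = "allow"
        elif tool in self.policy.require_approval:
            decision = "allow" if approver(tool) else "deny"
        else:
            decision = "deny"
        # Every decision is recorded, including denials.
        self.audit_log.append((tool, decision))
        return decision


gate = Gate(Policy(allow={"read_file"}, require_approval={"delete_file"}))
print(gate.check("read_file"))
print(gate.check("delete_file", approver=lambda tool: True))
print(gate.check("open_socket"))
```

Denying anything the policy does not mention is a design choice made for this sketch; the summary does not say how AgentWall treats unlisted actions.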

AI Agents AI safety MCP

arXiv cs.AI·May 19

Cross-Domain Molecular Relational Learning: Leveraging Chemical Structure-Activity Analysis

DisTrans, a domain adversarial training network, optimizes cross-domain molecular relational learning by integrating topological structures and visual modalities. Using gradient reversal and semantic alignment of functional groups, the method outperforms 16 baselines across two cross-domain strategies.

Papers Benchmarks Vision

arXiv cs.AI·May 19

EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective

EvoMemBench is a unified benchmark evaluating LLM agent memory across two axes: scope (in-episode vs. cross-episode) and content (knowledge-oriented vs. execution-oriented). Comparison of 15 memory methods: long-context baselines remain highly competitive, retrieval-based methods dominate knowledge-intensive tasks, procedural methods excel for execution-oriented tasks.

AI Agents Benchmarks Reasoning

arXiv cs.AI·May 19

AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

AMARIS introduces persistent evaluation memory to improve rubrics in LLM RL fine-tuning. The system accumulates evaluation diagnostics over time, uses static and dynamic retrieval to contextualize rubric modifications, and adds ~5% time overhead. Experiments show consistent gains across closed and open-ended domains.

Reinforcement learning Fine-tuning Evals

arXiv cs.AI·May 19

MCQ Difficulty Prediction via Modeling Learner Heterogeneity Using Data-Driven Cognitive Profiling

A persona-driven framework for MCQ difficulty prediction. It applies latent class analysis (LCA) to the EEDI dataset, uses an LLM to simulate response distributions per persona, and aggregates them with topic context using Ridge Regression. MSE improves from 0.367 to 0.274, with R²=0.686.

Evals Reasoning

arXiv cs.AI·May 19

LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

LongMINT is a benchmark evaluating agents' memory management in long contexts (up to 1.8M tokens) with multi-target interference. 15.6k QA pairs across 4 domains (state tracking, dialogue, Wikipedia revisions, GitHub commits). 7 systems tested (LLMs, RAG, agents) achieve 27.9% average accuracy, limited by retrieval and memory construction.

AI Agents Benchmarks RAG

arXiv cs.AI·May 19

Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

Distinguishable Deletion (D²) unifies knowledge deletion and refusal for LLM unlearning. The method uses an energy index to erase undesirable knowledge in latent representations rather than specific tokens, avoiding biased deletion and re-emergence of harmful content. Energy-based Unlearning Alignment (EUA) applies this mechanism at training and inference.

AI safety Alignment Papers

arXiv cs.AI·May 19

VolTA-3D: Self-Supervised Learning for Brain MRI using 3D Volumetric Token Alignment

VolTA-3D is a self-supervised 3D Vision Transformer framework for brain MRI. It aligns global and local tokens in a student-teacher paradigm and enforces fine-grained structural reconstruction. Evaluated on hippocampal segmentation and classification tasks (sex, Alzheimer's), it outperforms random baselines and demonstrates improved transferability across domain shifts.

Vision Papers

arXiv cs.AI·May 19

SLASH the Sink: Sharpening Structural Attention Inside LLMs

LLMs spontaneously reconstruct graph topology via sawtooth attention patterns, but this structural understanding is diluted by attention sink. SLASH, a training-free solution, re-amplifies this understanding through plug-and-play attention redistribution, showing significant gains on graph tasks and molecular prediction across diverse LLMs.

Reasoning Papers Benchmarks

arXiv cs.AI·May 19

The Token Games: Evaluating Language Model Reasoning with Puzzle Duels

TTG (Token Games) is an evaluation framework where language models challenge each other by creating programming puzzles. The system uses pairwise duels and Elo ratings to compare 10 frontier models. Results match existing benchmarks (Humanity's Last Exam) for under $200 USD without human puzzle curation.
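As a rough illustration of how pairwise duels become a ranking, the standard Elo update can be sketched as below. The K-factor of 32, the 1500 starting rating, and the model names are assumptions chosen for illustration, not values reported by the paper.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(r_a, r_b, score_a, k=32.0):
    """Return new ratings after one duel; score_a is 1 (win), 0.5 (draw) or 0 (loss)."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Hypothetical duel log: (model A, model B, score for A).
ratings = {"model_x": 1500.0, "model_y": 1500.0}
duels = [("model_x", "model_y", 1.0), ("model_x", "model_y", 0.5)]
for a, b, score in duels:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score)
print(ratings)
```

Each update moves the two ratings by equal and opposite amounts, so the total rating in the pool stays constant.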

Benchmarks Reasoning Evals

arXiv cs.AI·May 19

Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning

IBAL is a new method for strengthening MARL robustness against disruptions to inter-agent interaction. The framework uses an information-theoretic approach to construct attacks that degrade coordination by perturbing observations and actions, then trains agents to remain reliable under them. It shows improvement over existing baselines, including in agent-missing scenarios.

Multi-agent Reinforcement learning

arXiv cs.AI·May 19

Stable Audio 3

Stable Audio 3 is a family of latent diffusion models (small, medium, large) for variable-length audio generation and editing. Models use a novel semantic-acoustic autoencoder and adversarial post-training to generate music and sounds in under 2s on H200 or seconds on MacBook Pro M4. Small and medium weights are released.

Open source

arXiv cs.AI·May 19

TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens

TTE-Flash replaces explicit Chain-of-Thought traces with latent think tokens to accelerate reasoning-aware multimodal representations. TTE-Flash-2B outperforms explicit-CoT counterparts on MMEB-v2 while maintaining constant inference cost. Latent tokens remain interpretable both textually and visually.

Reasoning Vision Embeddings

arXiv cs.AI·May 19

Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents

Calibrate-Then-Act (CTA) is a framework enabling LLM agents to explicitly reason about cost-uncertainty tradeoffs before acting. By providing inferred priors about environment state, CTA improves optimal decision-making on retrieval-augmented QA, synthetic tasks, and file-reading coding tasks.

AI Agents Reasoning Reinforcement learning

arXiv cs.AI·May 19

EmoMind: Decoding Affective Captions from Human Brain fMRI

EmoMind decodes affective captions directly from brain fMRI signals. The system first retrieves a neutral scene description from brain-decoded visual features, then rewrites it using a continuous 34-dimensional emotion vector extracted from the same fMRI recording. Evaluated on two independent emotion fMRI datasets, EmoMind outperforms GPT-4 with discrete emotion labels across all validation axes.

Vision Reasoning Evals

arXiv cs.AI·May 19

Improving BM25 Code Retrieval Under Fixed Generic Tokenization: Adaptive q-Log Odds as a Drop-In BM25 Fix

BM25 improvement for code retrieval using q-logarithmic transformation of RSJ-odds IDF. On CoIR CodeSearchNet Go, NDCG@10 rises from 0.2575 to 0.4874 (+89.3%). Drop-in fix with no latency cost, parameterized by corpus hapax density.
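A minimal sketch of what a q-logarithmic IDF might look like, assuming "q-logarithm" means the Tsallis form ln_q(x) = (x^(1-q) - 1) / (1 - q), which reduces to the natural log as q approaches 1, and assuming the RSJ odds are (N - n + 0.5) / (n + 0.5). The summary does not say how q is derived from hapax density, so q is left as a free parameter here and the function names are illustrative.

```python
import math


def rsj_odds(n_docs, doc_freq):
    """Robertson-Sparck Jones odds with the usual 0.5 smoothing."""
    return (n_docs - doc_freq + 0.5) / (doc_freq + 0.5)


def q_log(x, q):
    """Tsallis q-logarithm; equals math.log(x) in the limit q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def q_idf(n_docs, doc_freq, q):
    """IDF weight: q-log of the RSJ odds (q = 1 gives the classic log-odds IDF)."""
    return q_log(rsj_odds(n_docs, doc_freq), q)


# Compare the classic weight with a q < 1 variant for a rare and a common term.
for df in (1, 500):
    print(df, q_idf(1000, df, 1.0), q_idf(1000, df, 0.8))
```

Because only the IDF term changes, such a weight can be precomputed per term at index time, which is consistent with the "no latency cost" claim.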

Code generation RAG Benchmarks

arXiv cs.AI·May 19

Universal Dynamics of Punctuated Progress

Analysis of 6.8M solutions across 6.7K tasks in 9 domains (materials, structural biology, AI, computational biomedicine, data science, theoretical CS, F1, wheel building). Three universal patterns: heavy-tailed waiting times, sublinear record accumulation, temporal correlation of breakthroughs. Minimal model unifies radical innovation and incremental refinement.

Papers Benchmarks Reasoning

arXiv cs.AI·May 19

Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework

ConceptAgent, a training-free multi-agent framework, bypasses concept erasure in diffusion models by exploiting denoising dynamics. The black-box approach awakens suppressed concepts by initializing the denoising trajectory from surrogate-guided noisy states, without access to model parameters.

Multi-agent AI safety Image generation

arXiv cs.AI·May 19

Domain Incremental Learning for Pandemic-Resilient Chest X-Ray Analysis

Replay-based continual learning method for adapting pneumonia detection models across clinical domain variations without catastrophic forgetting. Incorporates class-aware balanced replay and dynamically reweighted class-imbalance loss. Achieves 88.66% accuracy on PneumoniaMNIST with 5 simulated domains, outperforming Experience Replay and Fine-Tuning baselines.

Reinforcement learning Vision Benchmarks

arXiv cs.AI·May 19

Multi-Object Tracking Consistently Improves Wildlife Inference

Researchers apply Multi-Object Tracking (MOT) to camera-trap data to improve wildlife species classification. By fusing softmax probabilities across temporal trajectories, the method gains 5.1% weighted F1-Score on best-performing MOT models, eliminating inconsistent predictions across consecutive frames.
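One simple way to fuse per-frame softmax outputs along a track is to average them and take the argmax, as sketched below. Plain averaging is an assumption made for illustration; the paper's actual fusion rule may differ, and the function name and input layout are hypothetical.

```python
def fuse_track_predictions(frames):
    """Average per-frame class probabilities within each track.

    frames: list of (track_id, probs), where probs is a list of softmax
    probabilities, one per class. Returns {track_id: fused_class_index}.
    """
    sums = {}
    counts = {}
    for track_id, probs in frames:
        if track_id not in sums:
            sums[track_id] = [0.0] * len(probs)
            counts[track_id] = 0
        for i, p in enumerate(probs):
            sums[track_id][i] += p
        counts[track_id] += 1

    fused = {}
    for track_id, total in sums.items():
        mean = [s / counts[track_id] for s in total]
        # One label per track: the class with the highest mean probability.
        fused[track_id] = max(range(len(mean)), key=mean.__getitem__)
    return fused


# Track 1 flips to class 1 in its middle frame when classified frame by frame.
frames = [(1, [0.6, 0.4]), (1, [0.45, 0.55]), (1, [0.7, 0.3]), (2, [0.2, 0.8])]
print(fuse_track_predictions(frames))
```

Assigning a single label per trajectory is what removes the frame-to-frame flips the summary mentions.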

Vision Benchmarks Evals

arXiv cs.AI·May 19

Agents for Experiments, Experiments for Agents: A Design Grammar for AI-Enabled Experimental Science

SEED is a framework representing experimental conditions as typed actor-flow graphs to study multi-agent systems and human-AI workflows. It enables describing conditions, evaluating structural novelty, and generating candidate designs under constraints. Empirical test on medical-triage task shows SEED-guided designs provide clearer interaction changes, assumptions, and governance checks.

AI Agents Multi-agent Evals

arXiv cs.AI·May 19

Stabilizing Temporal Inference Dynamics for Online Surgical Phase Recognition

New framework to stabilize temporal predictions in surgical phase recognition. Introduces TEC loss (training), EGTP (inference), and TFI (metric). Reduces prediction fragmentation on Cholec80 and AutoLaparo while maintaining frame-wise accuracy.

Vision Reasoning Evals

arXiv cs.AI·May 19

Enabling Off-Policy Imitation Learning with Deep Actor Critic Stabilization

Novel adversarial imitation learning algorithm combining off-policy learning with double Q-network stabilization. It improves on GAIL's sample efficiency by removing the dependence on an on-policy algorithm (TRPO) and the need for reward engineering.

Reinforcement learning AI Agents Papers

arXiv cs.AI·May 19

SLEIGHT-Bench: A Benchmark of Evasion Attacks Against Agent Monitors

SLEIGHT-Bench is a benchmark of 40 evasion attacks against LLM-based coding agent monitors. Claude Opus 4.6 with extended thinking catches only 23% of attacks (24/40 never detected). Evasion strategies exploit model priors, instruction ambiguity, and state manipulation.

AI Agents AI safety Benchmarks

arXiv cs.AI·May 19

Counterparty Modeling is Not Strategy: The Limits of LLM Negotiators

Researchers test LLM agents' negotiation abilities in a controlled multi-attribute bargaining environment. Agents accurately model counterparty preferences but fail to convert this knowledge into winning strategy. Final agreements are driven by opening anchors rather than actual utility weights.

Reasoning AI Agents Evals

arXiv cs.AI·May 19

Semantic Smoothing via Novel View Synthesis for Robust SAR Image Classification

Adversarial defense for SAR image classification using semantic smoothing. Replaces isotropic noise with structured geometric transformations generated by novel view synthesis, conditioned on acquisition geometry. Improves robustness against FGSM, PGD, OTSA, SMGAA while increasing clean classification accuracy.

AI safety Vision Evals

arXiv cs.AI·May 19

Recall Isn't Enough: Bounding Commitments in Personalized Language Systems

arXiv paper introducing CBEA+LCV, a method for validating commitments in personalized language systems. Rather than treating personalization as recall, the approach structures constraints before generation. Across 360 fixtures, achieves zero failures at 0.49-0.60 availability versus 0.003-0.092 for baselines, with 74-75% payload reduction.

Reasoning Evals AI safety

arXiv cs.AI·May 19

Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models

Alice is an online executable world-model learning system that discovers environment dynamics without rule descriptions or reward signals. The agent induces transition laws from interaction alone, treating preservation conflicts as structural signal to refine hypothesis classes. Evaluation on Baba in Wonderland shows substantial improvement under prior misalignment.

Reasoning Reinforcement learning Papers

arXiv cs.AI·May 19

Multi-Paradigm Agent Interaction in Practice: A Systematic Analysis of Generator-Evaluator, ReAct Loop, and Adversarial Evaluation in the buddyMe Framework

buddyMe, an open-source multi-model framework, integrates three agent interaction paradigms: multi-agent orchestration (Generator-Evaluator), ReAct loops, and memory-augmented interaction. Its five-stage pipeline was tested on 4 real cases (museum guides, weather, tour planning). Results: 20% requirement omission detection, 30% redundant tool invocations, adversarial consensus in 2-3 rounds (70% of scenarios).

AI Agents Multi-agent Reasoning

arXiv cs.AI·May 19

Evidential Information Fusion on Possibilistic Structure

Paper introduces a reversible transformation between belief functions and possibilistic structures to overcome Dempster's rule limitations. Proposes a belief evolution network and triangular norm family for flexible evidential information fusion from non-distinct sources with improved conflict management.

Papers Reasoning

arXiv cs.AI·May 19

AnchorDiff: Topology-Aware Masked Diffusion with Confidence-based Rewriting for Radiology Report Generation

AnchorDiff introduces a masked diffusion framework for radiology report generation, integrating clinical anchors derived from knowledge graphs. Unlike traditional autoregressive models, this bidirectional approach uses topology-aware training based on RadGraph and iterative refinement. SOTA on MIMIC-CXR and MIMIC-RG4 benchmarks.

Papers Benchmarks Code generation

arXiv cs.AI·May 19

F2IND-IT! -- Multimodal Fuzzy Fake Indian News Detection using Images and Text

Multimodal framework for detecting fake news in Indian media combining images and text. Uses ResNet-50 for visual features, DistilBERT for textual embeddings, and an Adaptive Neuro-Fuzzy Inference System (ANFIS) to generate fuzzy reliability scores. Evaluated on IFND dataset with superior results across accuracy, precision, recall, and F1-scores.

Vision Embeddings Evals

arXiv cs.AI·May 19

Towards Sustainable Growth: A Multi-Value-Aware Retrieval Framework for E-Commerce Search

GrowthGR, a retrieval framework for e-commerce, addresses the "Matthew effect" by balancing immediate conversion and long-term new item growth. Deployed on Taobao, it combines item long-term value prediction (ItemLTV) and multi-value-aware policy optimization (MoPO), achieving +5.3% new item GMV and +0.3% overall search GMV.

RAG Reinforcement learning Business

arXiv cs.AI·May 19

Towards Robust Argumentative Essay Understanding via TIDE: An Interactive Framework with Trial and Debate

TIDE is a prompt optimization framework using a Trial and Debate mechanism to improve argumentative essay understanding. Evaluated on three tasks (Automated Essay Scoring, Argument Component Detection, Argument Relation Identification), it mitigates noisy training data impact and enhances optimization stability.

Prompt engineering Reasoning Evals

arXiv cs.AI·May 19

MetaCogAgent: A Metacognitive Multi-Agent LLM Framework with Self-Aware Task Delegation

MetaCogAgent is a multi-agent LLM framework where each agent evaluates task-capability alignment via a Metacognitive Self-Assessment Unit before execution. The system combines verbalized uncertainty and historical capability profiles to route tasks to best-suited agents. On MetaCog-Eval benchmark (700 tasks), it achieves 82.4% accuracy (+8.7% vs baselines) with 5% fewer API calls than AutoGen.

Multi-agent AI Agents Reasoning

arXiv cs.AI·May 19

HyperPersona: A Multi-Level Hypergraph Framework for Text-Based Automatic Personality Prediction

HyperPersona introduces a hypergraph-based framework for text-based automatic personality prediction. The model explicitly captures language hierarchy (document, sentence, word) through hypergraph structure, then applies a transformer-based graph encoder to model multi-level dependencies. Achieves superior performance on Big Five personality dimensions.

Reasoning Benchmarks

arXiv cs.AI·May 19

Verify-Gated Completion as Admission Control in a Governed Multi-Agent Runtime: A Bounded Architecture Case Study

Study of verify-gated completion pattern for controlling persistent multi-agent systems. Bounded implementation: 99.5% verification success rate (1,791/1,800 events), 98.58% rule agreement with governance verifier. Results limited to decision inspectability and fail-closed behavior; no safety guarantees or task-level coverage claims supported.

Multi-agent AI Agents AI safety

arXiv cs.AI·May 19

Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models

Study showing chess-trained language models memorize rather than generalize. KinGPT (25M params) outperforms ChessGPT (3B) and C1-4B on chess benchmarks, but analysis reveals pattern-matching. LLM-Modulo, a verifier-in-the-loop framework, improves RedPajama 3B from 1.2% to 21.2% move accuracy. Code and models open-sourced.

Benchmarks Evals Fine-tuning

arXiv cs.AI·May 19

AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

AutoRubric-T2I automatically synthesizes explicit rubrics to guide Vision-Language Model judges for text-to-image alignment evaluation. Using <0.01% of annotation data required by traditional reward models, it outperforms baselines on MMRB2 and improves generation quality with Flow-GRPO on diffusion models.

Image generation Vision Evals

arXiv cs.AI·May 19

WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

WebGameBench is a requirement-to-application benchmark evaluating whether coding agents can convert a web game specification into a browser-playable application. Across 111 tasks and 12 agents, the best configuration achieves 76.9% usable rate but only 20.2% excellent rate, revealing a gap between minimum delivery and full requirement satisfaction.

AI Agents Code generation Benchmarks

arXiv cs.AI·May 19

Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents

CMI (Causal Memory Intervention) selects relevant memories for long-horizon LLM agents through controlled causal interventions rather than semantic similarity. The paper also introduces the Causal-LoCoMo benchmark, which includes useful memories, distractors, and synthetic harmful memories. CMI outperforms baselines (vector, graph, reflection, summary) in robustness against misleading memories.

AI Agents Reasoning Benchmarks

arXiv cs.LG·May 19

Byzantine-Resilient Federated Learning via QUBO-Based Client Selection on Quantum Annealers

Quantum annealing approach for selecting trustworthy clients in federated learning against Byzantine attacks. Reformulates client selection as QUBO problem jointly optimizing over all subsets. MultiSignal hybrid ensemble achieves 95.3% detection accuracy at 100 clients on MNIST vs 91.8% for classical MultiKrum, with major gains on Sparse Lie (+23.2 points) and Advanced Lie (+4.8 points).

Reinforcement learning AI safety Benchmarks

arXiv cs.AI·May 19

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

Researchers identify Entropy-Gradient Inversion, a negative correlation between token entropy and logit gradients, as a geometric fingerprint of Large Reasoning Models' reasoning capability. They propose Correlation-Regularized Group Policy Optimization (CorR-PO), embedding this inversion signature into RL reward regularization, outperforming baselines across multiple reasoning benchmarks.

Reasoning Reinforcement learning Benchmarks

arXiv cs.AI·May 19

Interactive Evaluation Requires a Design Science

Position paper on interactive evaluation of LLMs. Models deployed as systems acting over time (tools, environments, agents) require an evaluation paradigm distinct from static benchmarks. The authors propose a taxonomy, design principles, and reporting standards to assess process, recoverability, coordination, robustness, and system-level performance.

AI Agents Evals Benchmarks

arXiv cs.AI·May 19

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

SWIM aligns vision-language representations for fine-grained video object understanding from text prompts alone. Uses mask supervision during training to guide cross-modal attention. Constructs NL-Refer dataset with precise natural language referring expressions. Outperforms visual-prompt-based methods on fine-grained benchmarks.

Vision RAG Embeddings

arXiv cs.AI·May 19

PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows

PROTEA is an interface for offline debugging and refinement of multi-agent LLM workflows. It evaluates intermediate outputs with configurable rubrics, localizes bottlenecks via workflow graph visualization, and generates targeted prompt revisions. On two production-adjacent workflows, PROTEA improves document-inspection accuracy from 64.3% to 83.9% and recommendation Hit@5 from 0.30 to 0.38.

Multi-agent AI Agents Prompt engineering

arXiv cs.AI·May 19

Evaluating Cognitive Age Alignment in Interactive AI Agents

ChildAgentEval, an interactive benchmark inspired by the WISC scale, evaluates cognitive age alignment of multimodal AI agents on reasoning tasks matched to developmental stages. Results show current agents fail at simple tasks children solve easily, exposing a fundamental gap between AI and human intelligence.

AI Agents Multi-agent Evals

arXiv cs.AI·May 19

Agentic Chunking and Bayesian De-chunking of AI Generated Fuzzy Cognitive Maps: A Model of the Thucydides Trap

Automatic generation of feedback fuzzy cognitive maps (FCMs) from text using LLM agents to chunk text with overlaps. Convex mixing of chunk FCMs produces representative cyclic FCM knowledge graphs. Operator-level Bayesian inference generates de-chunked posterior FCMs. Demonstrated on Allison's Thucydides Trap model: 7 out of 8 FCM knowledge graphs predicted war when stimulated.

AI Agents Reasoning Gemini

arXiv cs.AI·May 19

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

SVFSearch is a multimodal benchmark for short-video frame search in the Chinese gaming domain. It contains 5,000 test examples and 4,198 training examples based on real game scenes. Evaluation compares direct QA, RAG, Plan-Act-Replan agents, and learned search models: best open-source model reaches 66.4%, best practical agent 79.1%, oracle 95.4%.

Benchmarks AI Agents RAG

SIG

HYP

arXiv cs.AI·May 19

Shared Backbone PPO for Multi-UAV Communication Coverage with Connection Preservation

Shared Backbone PPO algorithm for multi-UAV swarm communication coverage optimization. Sharing a base module between the Actor and Critic networks improves training efficiency. A graph information aggregation module is integrated to model inter-agent communication conditions.

Reinforcement learning Multi-agent AI Agents

SIG

HYP

arXiv cs.AI·May 19

TeleCom-Bench: How Far Are Large Language Models from Industrial Telecommunication Applications?

TeleCom-Bench is a 22,678-sample benchmark evaluating 8 LLMs on real telecom tasks (intent recognition, entity extraction, root cause analysis, solution generation). Models achieve 90% on linguistic tasks but collapse to 30% on procedural execution, revealing an 'Execution Wall': LLMs diagnose well but fail as field engineers.

Benchmarks Reasoning AI Agents

SIG

HYP

arXiv cs.AI·May 19

FLAG: Foundation model representation with Latent diffusion Alignment via Graph for spatial gene expression prediction

FLAG is a latent diffusion framework for predicting spatial gene expression from H&E images. It integrates a spatial graph encoder and Gene Foundation Model alignment to address the Gene Dimension Curse and preserve biological relationships (gene coordination, spatial distribution). Introduces novel structural evaluation metrics: GSC and SSC.

Papers Vision Reasoning

SIG

HYP

arXiv cs.LG·May 19

Edge-AI-Driven Learning-to-Rank for Decentralized Task Allocation in Circular Smart Manufacturing

Decentralized task allocation framework for circular smart manufacturing using Edge-AI and learning-to-rank. Each machine evaluates incoming tasks with local information (processing capability, queue state, resource contention). Results: improved delay, better deadline adherence, enhanced energy efficiency in simulation.

AI Agents Reinforcement learning Evals

SIG

HYP

arXiv cs.AI·May 19

LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models

LEAP introduces end-to-end unstructured pruning for LLMs via per-weight Bernoulli-Gumbel-sigmoid relaxation. Across five model families (0.5B–8B) at 50–60% sparsity, LEAP improves average zero-shot accuracy by +2.59 points over ADMM baseline.

Fine-tuning Benchmarks Papers

SIG

HYP

arXiv cs.AI·May 19

Evidence-Grounded Frontier Mapping and Agentic Hypothesis Generation in Nanomedicine

pArticleMap combines article embeddings, similarity-graph analysis, and audited LLM workflows to generate evidence-grounded research hypotheses in nanomedicine. The system targets low-density bridge regions and cluster interfaces for discovery support. Retrospective evaluation: 10.8% gold recovery rate, recall@10 of 15.9%, future-neighborhood rate of 61.0%.

AI Agents RAG Embeddings

SIG

HYP

arXiv cs.AI·May 19

A Practical Noise2Noise Denoising Pipeline for High-Throughput Raman Spectroscopy

Noise2Noise denoising pipeline for high-throughput Raman spectroscopy using a 1D convolutional autoencoder. Trained on repeated short acquisitions (5 ms), with no external reference required. Evaluated on a mineral sample using RMSE, SNR, SSIM, and K-means classification. Preserves chemical coherence while accelerating acquisition.

Papers Benchmarks Code generation

SIG

HYP

arXiv cs.AI·May 19

VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation

VISAFF is a framework for Emotion Recognition in Conversation (ERC) using vision-language models. It combines two stages: speaker-centered affective grounding and reliability-guided affective complementation. The tuning-free approach leverages frozen VLMs' reasoning capabilities, integrating visual, textual, and acoustic signals to improve accuracy without expensive fine-tuning.

Vision Multi-agent Papers

SIG

HYP

arXiv cs.AI·May 19

Query-Conditioned Knowledge Alignment for Reliable Cross-System Medical Reasoning

QCEA reformulates medical entity alignment as a query-conditioned correspondence problem, integrating semantic encoding and graph-based representation learning. Evaluated on TCM-WM knowledge graphs (SymMap), the model improves Hit@K and MRR metrics, and demonstrates gains in RAG for evidence retrieval and answer accuracy.

RAG Reasoning Benchmarks

SIG

HYP

arXiv cs.AI·May 19

Learning Lifted Action Models from Traces with Minimal Information About Actions and States

Learning lifted STRIPS+ action models from partial traces with minimal observability assumptions. Authors relax prior work by allowing partial observability of both actions and states. Three cases formalized: no state observability, full observability of selected predicates, local observability of predicates. Completeness results and experiments provided.

Reasoning Papers

SIG

HYP

arXiv cs.AI·May 19

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

SCICONVBENCH benchmarks LLMs on multi-turn clarification of ill-posed scientific problems across fluid mechanics, solid mechanics, materials science, and PDEs. Best models resolve only 52.7% of disambiguation cases in fluid mechanics, but perform better on inconsistency detection. Evaluates clarification behavior, conversational grounding, and specification fidelity.

Benchmarks Reasoning Code generation

SIG

HYP

arXiv cs.AI·May 19

Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment

Position paper arguing that a three-layer probabilistic architecture (semantic intent/policy compliance, environmental validity, dynamical feasibility) is structurally required for safe LLM agent deployment. Each layer must independently certify one safety dimension via composable probabilistic guarantees.

AI Agents AI safety Alignment

SIG

HYP

arXiv cs.AI·May 19

Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches

An agentic framework uses an LLM to assist users in real-time re-optimization of OR models. The LLM translates requests into structured model modifications, selects re-optimization techniques, and returns implementable solutions. Tested on supply chain and university exam scheduling.

AI Agents Reasoning RAG

SIG

HYP

arXiv cs.LG·May 19

QuantFPFlow: Quantum Amplitude Estimation for Fokker-Planck Policy Optimisation in Continuous Reinforcement Learning

QuantFPFlow integrates quantum amplitude estimation into stochastic policy optimization via a Fokker-Planck formulation. Grover amplification achieves a quadratic speedup, O(1/ε) versus the classical O(1/ε²). On continuous control, it outperforms SAC (1295.7 vs 1284.0 reward) and finds the global optimum 10.4% more often in relative terms (33.9% vs 30.7%).
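The two numeric claims in this summary can be checked with simple arithmetic. The sketch below is an illustration only: it ignores constant factors and assumes the target precision is given as an integer 1/ε. The helper names are hypothetical and not taken from the paper.

```python
def classical_samples(inv_eps: int) -> int:
    """Monte Carlo estimation needs on the order of 1/eps^2 samples."""
    return inv_eps ** 2


def amplitude_estimation_queries(inv_eps: int) -> int:
    """Quantum amplitude estimation needs on the order of 1/eps queries."""
    return inv_eps


# Precision eps = 0.01, written as 1/eps = 100 to keep the arithmetic exact.
inv_eps = 100
speedup = classical_samples(inv_eps) // amplitude_estimation_queries(inv_eps)
print(speedup)

# Relative gain in how often the global optimum is found (33.9% vs 30.7%).
relative_gain_pct = round((33.9 - 30.7) / 30.7 * 100, 1)
print(relative_gain_pct)
```

The reported 10.4% is therefore a relative improvement; the absolute gap is 3.2 percentage points.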

Reinforcement learning Reasoning Papers

SIG

HYP

arXiv cs.AI·May 19

Systematic Optimization of Real-Time Diffusion Model Inference on Apple M3 Ultra

Systematic optimization of real-time diffusion models on Apple M3 Ultra (60-core GPU, 512 GB unified memory). CoreML conversion of distilled SDXS-512 combined with 3-thread camera pipeline achieves 22.7 FPS at 512x512 resolution. Demonstrates that CUDA optimization insights don't transfer to Apple Silicon's unified memory architecture.

Image generation Benchmarks Infrastructure

SIG

HYP

arXiv cs.AI·May 19

AI Slop or AI-enhancement? Student perceptions of AI-generated media for an English for Academic Purposes course

Implementation study of Google Notebook LM in an English for Academic Purposes course (106 students, Hong Kong). Generated videos, podcasts, and infographics via RAG. Students rated visual and multimodal content highly; video preference correlated positively with academic performance. High cognitive load negatively associated with grades.

RAG Evals Tools

SIG

HYP

arXiv cs.AI·May 19

Generative AI in K-12 Classrooms: A Midyear Implementation Report

Mid-year report on Colleague AI usage across 12 Washington State school districts (September–December 2025). Joint study by Colleague AI and AmplifyLearn.AI (University of Washington) analyzing teacher engagement with generative AI in K-12 classrooms.

Tools Business

SIG

HYP

arXiv cs.AI·May 19

DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models

DACA-GRPO improves reinforcement learning for diffusion language models by addressing temporal credit assignment and mean-field likelihood bias. It introduces Denoising Progress Scores and Stratified Masking Likelihood, achieving gains up to 7.4pp on code generation and 5.6pp on math reasoning across seven benchmarks.

Reinforcement learning Reasoning Code generation

SIG

HYP

arXiv cs.AI·May 19

Train the Trainers -- An Agentic AI Framework for Peer-Based Mental Health Support in Battlefield Environments

Agentic AI framework for peer-based mental health support in military operations. Recovered soldiers trained as peer facilitators supervise specialized AI agents (symptom triage, interventions, documentation) in air-gapped environments. Prototype developed with U.S. Army Health Center. Goal: reduce evacuations, accelerate care, maintain human oversight.

AI Agents Multi-agent AI safety

SIG

HYP

arXiv cs.AI·May 19

SENSE: Satellite-based ENergy Synthesis for Sustainable Environment

SENSE is a generative diffusion-based framework that jointly synthesizes realistic urban satellite imagery and aligned building energy consumption and height maps. Tested on NYC, Boston, Lyon, and Busan, it generates annotated synthetic data using <20% labeled data, improving prediction performance by 10% IoU and reducing error by 3-11% NMBE.

Image generation Code generation Benchmarks

SIG

HYP

arXiv cs.AI·May 19

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Theoretical paper proposing optimizers respecting symmetries of modern neural architectures. Introduces equivariant update rules for embeddings, LM heads, SwiGLU MLPs, and MoE routers. Validation on dense and sparse MoE models (Qwen3, Gemma 3, OLMoE, gpt-oss) shows improved validation loss vs AdamW.

Papers Reinforcement learning Benchmarks

SIG

HYP

arXiv cs.AI·May 19

Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency

SAGE, a self-evolving framework, improves spatial reasoning in VLMs by enforcing logical consistency through geometric and linguistic duality operations. Applied as a lightweight GRPO post-training stage, it corrects inconsistencies under predictable transformations and shows gains on video and spatial reasoning benchmarks.

Vision Reasoning Reinforcement learning

SIG

HYP

arXiv cs.AI·May 19

RGB-only Active 3D Scene Graph Generation for Indoor Mobile Robots

Framework for active 3D scene graph generation from RGB cameras only, without depth sensors. Unifies perception and planning around a structured representation. On Replica dataset, achieves F1-score parity with depth-based baselines. Semantic-driven viewpoint selection detects 2× more objects than geometric frontier baseline.

Vision Robotics AI Agents

SIG

HYP

arXiv cs.AI·May 19

Beyond Imperfect Alternatives with Rulemapping: A Neuro-Symbolic Case Study on Online Hate Speech

Neuro-symbolic study comparing LLMs constrained by deterministic logic scaffolds (Rulemapping) versus unconstrained prompting for hate speech moderation under German Criminal Code (§130). Rulemapping achieves precision 0.80-0.86 and recall 0.82-0.89 versus 0.34-0.49 with unconstrained prompting, eliminating conflation of moral offense with legal illegality.

Reasoning AI safety Regulation

SIG

HYP

arXiv cs.AI·May 19

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

Systematic analysis of 40 agent safety benchmarks (2023-2026). Benchmarks exhibit incompatible threat models, fragmented metrics, and inconsistent risk coverage. Concordance test (Kendall's W = 0.10, p = 0.94) reveals no ranking alignment across evaluation dimensions. Releases structured metadata and proposes minimum reporting standards.
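As background on the concordance statistic quoted above, here is a minimal sketch of Kendall's W for rankings without ties. W ranges from 0 (no agreement among rankers) to 1 (identical rankings), so a value of 0.10 sits close to no agreement. The toy rankings are hypothetical and unrelated to the paper's data; the paper's exact procedure (for example, tie handling) is not described in the summary.

```python
from typing import List


def kendalls_w(rankings: List[List[int]]) -> float:
    """Kendall's coefficient of concordance for m rankings of n items (no ties).

    Each inner list holds the ranks 1..n that one ranker assigned to the n items.
    """
    m = len(rankings)       # number of rankers (e.g. evaluation dimensions)
    n = len(rankings[0])    # number of ranked items
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    # Sum of squared deviations of the rank sums from their mean.
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical examples: full agreement and exactly opposed rankings.
w_agree = kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]])
w_opposed = kendalls_w([[1, 2, 3, 4], [4, 3, 2, 1]])
print(w_agree, w_opposed)
```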

AI Agents AI safety Evals

SIG

HYP

arXiv cs.LG·May 19

Mixing Times of Glauber Dynamics on Masked Language Models

Masked language models (MLMs) define local conditional distributions incompatible with any consistent global joint distribution. The authors model iterative resampling as a Glauber dynamics Markov chain, proving O(n log n) mixing time under bounded cross-token influence, but showing exponential metastability at low temperature with persistent semantic basins.

Papers Reasoning Benchmarks

SIG

HYP

arXiv cs.AI·May 19

UVTran: Accurate Hole-Filling Parameterization with Transformers

UVTran, a transformer-based framework, solves N-sided hole filling in CAD by predicting an auxiliary projection surface via cross-attention biased toward nearby control points, voxelizing coordinates, and progressive-resolution training. On benchmark, it improves tolerance-satisfaction rate by 12% over industrial and academic baselines while producing fairer trimmed surfaces.

Papers Reasoning

SIG

HYP

arXiv cs.LG·May 19

M²FedAQI: Multimodal Federated Learning for Air Quality Prediction on Heterogeneous Edge Devices

M²FedAQI introduces a lightweight multimodal federated framework for decentralized Air Quality Index (AQI) prediction across heterogeneous edge devices. The system fuses visual and tabular data through feature modulation-based fusion. Evaluated on PM25Vision and TRAQID datasets, it achieves 11% accuracy improvement, 3.53% AUC gain, 12.2% F1-score increase, and 18% R² improvement over baselines.

Vision Benchmarks Papers

SIG

HYP

arXiv cs.AI·May 19

ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference

ProxyKV introduces a cross-model proxy pruning framework to accelerate long-context LLM inference. A lightweight Small-Model Proxy asynchronously scores KV cache importance for the target model. Tested on Llama-3.1, Qwen-2.5, and Qwen-3: recovers 98.7% of KVZip accuracy with up to 3.21× prefilling speedup (Llama-3.1-8B, dual-GPU) and sustains gains up to 170k tokens.

Llama Qwen Reasoning

SIG

HYP

arXiv cs.AI·May 19

Federated Nested Learning: Collaborative Training of Self-Referential Memories for Test-Time Adaptation

FedNL reformulates federated learning as a three-level nested optimization system. Embeds Titans-based linear attention for zero-shot test-time adaptation without additional training. Tested on Non-IID MMLU and long-context benchmarks with constant inference memory.

Reasoning Benchmarks

SIG

HYP

arXiv cs.AI·May 19

StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs

StrLoRA introduces a streaming continual visual instruction tuning (StrCVIT) framework for MLLMs. Unlike existing methods restricted to predefined tasks, StrCVIT handles data streams with dynamic, interleaved tasks. StrLoRA employs two-stage expert routing with task-aware selection and token-wise weighting, stabilized via routing-stability regularization.

Multi-agent Fine-tuning Vision

SIG

HYP

arXiv cs.AI·May 19

Leveraging Graph Structure in Seq2Seq Models for Knowledge Graph Link Prediction

GA-S2S combines T5-small with a Relational Graph Attention Network (RGAT) for knowledge graph link prediction. The model jointly encodes textual features and full k-hop subgraph topology around the query entity. On CoDEx, GA-S2S outperforms Seq2Seq baselines with a 19% relative accuracy gain.

Benchmarks RAG Papers

SIG

HYP

arXiv cs.AI·May 19

Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs

Fre-Res introduces adaptive video-token compression for video MLLMs. The framework separates spatial details (high-fidelity anchors) from temporal evolution (residual-frequency tokens via 1D-DCT). A Spatial-Guided Absorber aligns frequency dynamics with visual embeddings. Results: near full-token performance with substantial reduction in token length across short and long-video benchmarks.

Vision Video generation Evals

SIG

HYP