May 2026

3149 articles

AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge

AASIST3 enhances speech deepfake detection by integrating Kolmogorov-Arnold Networks (KAN) into the AASIST framework. The model achieves a minDCF of 0.5357 (closed condition) and 0.1414 (open condition) on ASVspoof 2024, roughly a twofold improvement over prior performance. Code released on HuggingFace.

Voice AI safety Benchmarks

arXiv cs.AI·May 19

SynC: Synergistic Boosting of Structure and Representation for Deep Graph Clustering

SynC, a deep graph clustering framework, leverages the synergistic relationship between representation learning and structure augmentation via a Transform Input Graph Auto-Encoder (TIGAE). The model shares weights across two stages to reduce parameters and improves generalization on low-homophily graphs.

Benchmarks Papers

arXiv cs.AI·May 19

Universal Time-Series Representation Learning: A Survey

Survey on universal time-series representation learning. Proposes taxonomy based on three fundamental elements for state-of-the-art deep learning methods. Covers extraction of hidden patterns without manual feature engineering, with resources and future research directions.

Papers Benchmarks Embeddings

arXiv cs.CL·May 19

Scaling Laws for Code: A More Data-Hungry Regime

Empirical study of 117 experiments (0.2B–3.8B parameters, 2B–128B tokens) on scaling laws for Code LLMs. Code requires higher data-to-parameter ratio than natural language. Farseer law outperforms Chinchilla. Code-NL mixtures benefit NL under resource constraints but harm it at higher compute budgets.

Code generation Benchmarks Papers

arXiv cs.AI·May 19

Revisiting Long-term Time Series Forecasting: An Investigation on Linear Mapping

arXiv study on long-term time series forecasting (LTSF). Authors show that a simple linear layer (affine mapping) dominates performance on standard benchmarks. Analysis reveals models learn similar transition matrices, capture periodic patterns well but fail on non-periodic signals. Code available.

Benchmarks Papers Reasoning

arXiv cs.AI·May 19

Property-Guided LLM Program Synthesis for Planning

Property-guided program synthesis approach reduces LLM costs by replacing simple numeric scores with formal property verification. When a property is violated, the system provides concrete counterexamples to guide repair. On PDDL planning domains, this method generates 7× fewer programs and drastically reduces evaluation costs while improving solution quality.

Code generation Reasoning Reinforcement learning

arXiv cs.AI·May 19

Imperfect World Models are Exploitable

Formal study of imperfect world model exploitation in RL. Authors define exploitation as divergence between policy preferences in the model versus true environment. They prove exploitation is essentially unavoidable on large policy sets and establish theoretical bridge with reward hacking.

Reinforcement learning Reasoning AI safety

arXiv cs.CL·May 19

Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts

Red-Bandit is a red-teaming framework that adapts specialized LoRA experts in real time, one per attack style (e.g., manipulation, slang), with the experts trained via reinforcement learning. A multi-armed bandit algorithm dynamically selects the best expert based on the safety of the target model's responses. State-of-the-art results on AdvBench with more readable prompts.

AI safety Fine-tuning Reinforcement learning

arXiv cs.CL·May 19

Prompt reinforcing for long-term planning of large language models

Prompt optimization framework inspired by reinforcement learning to improve long-term planning in LLM multi-turn interactions. Method modifies only task instruction via turn-by-turn feedback and experience replay. Significant improvements on text-to-SQL and task-oriented dialogue, generalizes across LLM agents.

Prompt engineering Reinforcement learning AI Agents

arXiv cs.CL·May 19

Can LLMs Generate and Solve Linguistic Olympiad Puzzles?

Study of the ability of LLMs (including OpenAI o1) to solve and generate puzzles from Linguistic Olympiads. Models outperform humans on most puzzle types except writing systems and understudied languages. Automated puzzle generation could expand interest in linguistics and support rare language dissemination.

GPT OpenAI Benchmarks

arXiv cs.AI·May 19

When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

LongAct is a benchmark for evaluating autonomous planning in long-horizon household tasks specified via free-form instructions. HoloMind, a VLM-driven agent with DAG-based hierarchical planner, Multimodal Spatial Memory, and Episodic Memory, achieves 59% goal completion and 16% full-task success with GPT-5 and Qwen3-VL models.

Benchmarks AI Agents Reasoning

arXiv cs.AI·May 19

ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

ClawForge is a benchmark framework for CLI agents testing persistent state and conflict handling. 17 scenarios, 6 ability categories. Seven frontier models evaluated: best score 45.3%, widest gap 17-90% driven by whether agents inspect existing state before acting.

AI Agents Benchmarks Evals

arXiv cs.CL·May 19

CarbonScaling: Extending Neural Scaling Laws for Carbon Footprint in Large Language Models

CarbonScaling is a hardware-aware analytical framework modeling carbon footprint of frontier LLM training. It integrates neural scaling laws, distributed training strategies, accelerator modeling, and operational/embodied carbon accounting. Source code released on GitHub.

Benchmarks Papers Infrastructure

arXiv cs.AI·May 19

AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive

AutoLLMResearch introduces an agentic framework to automate configuration of expensive LLM experiments. The system learns from low-fidelity experiments to extrapolate toward promising high-fidelity configurations. LLMConfig-Gym provides a multi-fidelity environment with >1M GPU hours of verified experiment outcomes.

AI Agents Reinforcement learning Benchmarks

arXiv cs.AI·May 19

Causal Bias Detection in Generative Artificial Intelligence

arXiv paper proposing a theoretical framework for detecting causal bias in generative AI models. Authors formalize causal fairness specific to generative models (vs standard ML), derive causal decompositions to quantify bias impacts across different causal pathways, and demonstrate their methodology by analyzing race and gender bias in large language models.

Papers AI safety Alignment

arXiv cs.AI·May 19

Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?

VLATIM, a new benchmark based on The Incredible Machine 2, evaluates Vision-Language Models' logical reasoning in point-and-click puzzle games. Results reveal a significant gap: large proprietary models excel at planning but struggle with precise visual grounding, failing to match human-level problem-solving.

Vision Reasoning Benchmarks

arXiv cs.AI·May 19

EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

EnactToM is an evolving benchmark with 300 multi-agent embodied tasks in 3D household environments with partial observability. It tests functional Theory of Mind—acting optimally on implicit beliefs—rather than literal belief questions. All seven frontier models score 0.0% on hard task completion, with 93% of failures traced to epistemic coordination breakdowns.

Multi-agent Reasoning Benchmarks

arXiv cs.AI·May 19

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

ScaleLogic, a synthetic logical reasoning framework, demonstrates that RL can teach long-horizon reasoning to LLMs. Training compute follows a power law with proof depth (T ∝ D^γ, R² > 0.99), with exponent γ increasing from 1.04 to 2.60 as logical expressiveness grows. Models trained on more expressive logics transfer better (+10.66 points on downstream benchmarks).

Reinforcement learning Reasoning Benchmarks
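
The power-law relation T ∝ D^γ quoted in this summary becomes a straight line in log-log space, so γ can be estimated by ordinary least squares on (log D, log T). A minimal sketch on synthetic data: the depths, the constant 3.0, and the exponent 2.0 are made up for illustration and are not values from the paper, and `fit_power_law` is a hypothetical helper name.

```python
import math

def fit_power_law(depths, compute):
    """Fit compute = c * depth**gamma by least squares in log-log space.

    Returns (gamma, c, r_squared).
    """
    xs = [math.log(d) for d in depths]
    ys = [math.log(t) for t in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx                      # slope of the log-log line
    log_c = mean_y - gamma * mean_x        # intercept of the log-log line
    ss_res = sum((y - (log_c + gamma * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return gamma, math.exp(log_c), r_squared

# Synthetic, noise-free data (hypothetical numbers, not from the paper)
depths = [2, 4, 8, 16, 32]
compute = [3.0 * d ** 2.0 for d in depths]
gamma, c, r2 = fit_power_law(depths, compute)
print(f"gamma={gamma:.2f}, c={c:.2f}, R^2={r2:.4f}")
```

With real measurements the fit would be noisy, and R² indicates how well a single exponent describes the data.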

arXiv cs.AI·May 19

HAAS: A Policy-Aware Framework for Adaptive Task Allocation Between Humans and Artificial Intelligence Systems

HAAS is a framework for adaptive task allocation between humans and AI systems in software engineering and manufacturing. It combines rule-based governance constraints with contextual-bandit learning. Results show governance is not binary but a tunable design variable: moderate governance improves operational performance and reduces fatigue in manufacturing while remaining competitive as the learner gains experience.

AI Agents Multi-agent Reinforcement learning

arXiv cs.CL·May 19

Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource

arXiv paper demonstrates that Mixture-of-Experts (MoE) models can outperform dense architectures under strictly equal resource constraints (parameters, training compute, data). Researchers identify an optimal activation rate region consistent across model sizes. Validated on ~200 2B-scale and 50 7B-scale models (50 trillion tokens processed).

Benchmarks Papers Reasoning

arXiv cs.AI·May 19

MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization

MolClaw is an autonomous agent with a three-tier hierarchical architecture (70 skills) for drug molecule evaluation, screening, and optimization. It integrates 30+ specialized resources and achieves state-of-the-art performance on MolBench, a benchmark spanning 8 to 50+ sequential tool calls. Gains concentrate on structured workflow orchestration rather than ad hoc scripting.

AI Agents Multi-agent Benchmarks

arXiv cs.AI·May 19

CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms

CheeseBench evaluates 6 open-weight LLMs (3B-72B) on 9 behavioral neuroscience paradigms (Morris water maze, T-maze, etc.). Qwen2.5-VL-7B achieves 52.6% success on ASCII input, versus 32.1% for the random baseline and 78.9% for the rodent baseline. Scaling beyond 7B yields diminishing returns; longer context and chain-of-thought degrade performance.

Benchmarks Reasoning Vision

arXiv cs.AI·May 19

General-purpose LLMs as Models of Human Driver Behavior: The Case of Simplified Merging

Study comparing OpenAI o3 and Google Gemini 2.5 Pro as models of human driving behavior in a simplified merging scenario. LLMs reproduce intermittent operational control and tactical dependencies, but fail to capture responses to dynamic velocity cues. Prompt ablations reveal model-specific inductive biases that do not transfer across LLMs.

GPT Gemini Reasoning

arXiv cs.AI·May 19

A Comparative Study in Surgical AI: Potential and Limitations of Data, Compute, and Scaling

Comparative study on surgical AI: multi-billion parameter Vision Language Models fail at neurosurgical tool detection despite extensive training. Scaling experiments show diminishing improvements. Obstacles persist across architectures, suggesting data and compute alone are insufficient.

Vision Benchmarks Papers

arXiv cs.CL·May 19

Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs

Study showing that unlearning in LLMs merely suppresses information at surface level—models recover original behavior through minimal fine-tuning. Authors introduce representation-level analysis framework (PCA, CKA, Fisher information) to assess genuine data erasure and identify four forgetting regimes based on reversibility and catastrophicity.

Papers AI safety Alignment

arXiv cs.CL·May 19

UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities

UniversalRAG extends retrieval-augmented generation (RAG) to heterogeneous multi-modal corpora (text, images, videos) with variable granularities. The framework proposes modality-aware routing to avoid intra-modal bias and dynamically retrieve from the appropriate corpus. Validated on 10 multi-modal benchmarks.

RAG Vision Multi-agent

arXiv cs.AI·May 19

Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents

Neurosymbolic architecture with ontologies (Role, Domain, Interaction) for enterprise LLM agents. Controlled experiment (1,800 runs, Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B): ontology-constrained agents outperform ungrounded agents on metric accuracy and role consistency (p < .001). 2x greater lift in localized domains (Vietnam) where LLM training coverage is weak.

AI Agents Claude Reasoning

arXiv cs.AI·May 19

Interactive Benchmarks

New Interactive Benchmarks evaluation paradigm assesses model reasoning through budgeted multi-turn interaction. Two settings: Interactive Proofs (logic, UI2Html, mathematics with objective feedback) and Interactive Games (strategic reasoning). Reveals substantial gaps in current interactive capabilities.

Benchmarks Reasoning Evals

arXiv cs.CL·May 19

Patients Speak, AI Listens: LLM-based Analysis of Online Reviews Uncovers Key Drivers for Urgent Care Satisfaction

arXiv study analyzing 10,000+ Google Maps reviews of urgent care facilities (DMV, Florida) using GPT and prompt engineering. Interpersonal factors and operational efficiency emerge as primary satisfaction drivers, while technical quality, finances, and facilities show no significant independent effects. Among socioeconomic factors, only population density correlates with ratings.

GPT Prompt engineering Papers

arXiv cs.CL·May 19

Supervising the search process produces reliable and generalizable information-seeking agents

RAG-Gym, a framework supervising the search process rather than final answers, improves autonomous search agents. Re²Search++ uses process supervision and reasoning reflection to generate higher-quality queries, achieving significant gains on multi-hop benchmarks with better out-of-domain generalization.

RAG AI Agents Reasoning

arXiv cs.AI·May 19

BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning

BioProAgent combines LLMs with finite state machines to plan biological experiments in wet-labs. The system enforces a Design-Verify-Rectify workflow and reduces token consumption by ~6× through symbolic abstraction. On BioProBench, it achieves 95.6% physical compliance versus 21.0% for ReAct.

AI Agents Reasoning Benchmarks

arXiv cs.AI·May 19

Online Algorithms with Unreliable Guidance

New arXiv paper introducing OAG (Online Algorithms with Unreliable Guidance), a model for ML-augmented online decision-making separating predictive and algorithmic components. Presents DTB (drop-or-trust-blindly) compiler converting standard online algorithms into learning-augmented versions. Demonstrates optimal guarantees on bipartite matching, caching, and uniform metrical task systems.

Reasoning Benchmarks Papers

arXiv cs.CL·May 19

Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models

Novel LGPT (Learnable Graph Pooling Token) method to integrate graphs into LLMs. Uses learnable tokens to represent graphs without information loss. 4.13% improvement on GraphQA benchmark without LLM fine-tuning.

Prompt engineering RAG Benchmarks

arXiv cs.AI·May 19

The Alien Space of Science: Sampling Coherent but Cognitively Unavailable Research Directions

A framework uses language models to identify 'alien' research directions—coherent with existing literature but unlikely under current researcher distribution. On 16,068 AI/NLP papers, the method explores 3.5-7× broader conceptual space than baselines while maintaining scientific coherence.

Papers Reasoning Benchmarks

arXiv cs.AI·May 19

Does Your Reasoning Model Implicitly Know When to Stop Thinking?

Large reasoning models (LRMs) generate redundant chains of thought uncorrelated with correctness. The paper finds that LRMs implicitly know when to stop thinking. SAGE (Self-Aware Guided Efficient Reasoning) exploits this via a novel sampling paradigm, improving accuracy and efficiency on mathematical benchmarks.

Reasoning Reinforcement learning Benchmarks

arXiv cs.AI·May 19

Mitigating Conversational Inertia in Multi-Turn Agents

LLMs exhibit 'conversational inertia' in multi-turn agent scenarios: they over-imitate their previous responses instead of exploring. Authors identify this bias through attention analysis and propose Context Preference Learning to favor low-inertia responses. Validated across 8 agent environments.

AI Agents Reasoning Reinforcement learning

arXiv cs.AI·May 19

Supervised sparse auto-encoders for interpretable and compositional representations

Supervised sparse auto-encoders improve model interpretability by aligning learned features with human semantics. Tested on Stable Diffusion 3.5, they enable compositional generalization and image editing through feature-level intervention.

Image generation Papers

arXiv cs.AI·May 19

Enhancing Table Reasoning with Deterministic Table-State Rewards

TABROUGE, a deterministic reward metric based on Longest Common Subsequence, improves LLM table reasoning without training. RE-TAB, a plug-and-play framework using TABROUGE, gains 26.7 pp across six backbones and three benchmarks, reducing test-time scaling samples by 33%.

Reasoning Reinforcement learning Benchmarks
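
A Longest Common Subsequence score of the kind this summary mentions can be sketched as below. This is a generic ROUGE-L-style F-measure over token sequences, not the paper's TABROUGE definition; the function names and the row-by-row table serialization are hypothetical. The score is deterministic in the sense that identical inputs always yield the same value, with no model in the loop.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_f1(candidate, reference):
    """Harmonic mean of LCS-based precision and recall (0.0 if no overlap)."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)

# Toy example: table cells flattened row by row (hypothetical serialization)
reference = ["city", "pop", "Paris", "2.1", "Rome", "2.8"]
candidate = ["city", "pop", "Paris", "2.1", "Rome", "2.9"]
print(lcs_f1(candidate, reference))
```

Here five of six cells match in order, so precision and recall are both 5/6.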

arXiv cs.CL·May 19

Hybrid Feature Combinations with CNN for Bangla Fake News Classification

Study on Bangla fake news detection using CNN. Combines semantic, statistical, and character-level features on BanFakeNews-2.0 dataset. Results show that hybrid feature combinations significantly improve recall and F1-scores compared to individual features.

Code generation Benchmarks

arXiv cs.AI·May 19

Responsible Agentic AI Requires Explicit Provenance

arXiv paper argues agentic AI cannot be responsible without explicit traceable provenance. Authors formalize provenance through causal attribution function and responsibility tensor, demonstrate it is computable and intervenable online, and identify responsibility gaps in current multi-agent systems.

AI Agents Multi-agent AI safety

arXiv cs.CL·May 19

Mitigating Extrinsic Gender Bias for Bangla Classification Tasks

Investigation of extrinsic gender bias in Bangla pretrained language models. Four manually annotated task-specific benchmark datasets constructed (sentiment analysis, toxicity detection, hate speech, sarcasm detection) with minimal-pair gender perturbations. RandSymKL debiasing strategy proposed, combining symmetric KL divergence and cross-entropy loss. Implementation and datasets publicly released.

Benchmarks AI safety Alignment
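
The idea of combining symmetric KL divergence with cross-entropy can be illustrated on a single minimal pair. This is a sketch under assumptions, not the paper's RandSymKL objective: symmetric KL is taken here as the plain sum KL(p‖q) + KL(q‖p), and the weight `lam`, the function names, and the example probabilities are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; q must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Symmetric KL, defined here as KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)

def cross_entropy(p, label):
    """Negative log-probability assigned to the gold label."""
    return -math.log(p[label])

def pair_loss(p_orig, p_swapped, label, lam=1.0):
    """Cross-entropy on both inputs plus a prediction-consistency penalty."""
    ce = cross_entropy(p_orig, label) + cross_entropy(p_swapped, label)
    return ce + lam * symmetric_kl(p_orig, p_swapped)

# Class probabilities for a sentence and its gender-perturbed twin (made up)
p_orig = [0.7, 0.3]
p_swapped = [0.4, 0.6]
print(symmetric_kl(p_orig, p_swapped), pair_loss(p_orig, p_swapped, label=0))
```

The penalty is zero only when the model predicts the same distribution for both versions of the sentence, which is the behavior a debiasing term of this kind rewards.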

arXiv cs.AI·May 19

When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents

Study reveals a safety vulnerability in personalized dialogue agents: long-term memory biases intent inference and legitimizes harmful queries. PS-Bench benchmark shows personalization increases attack success rates by 15.8%–243.7% versus stateless baselines. A lightweight detection-reflection method is proposed to mitigate this safety degradation.

AI safety AI Agents Benchmarks

arXiv cs.AI·May 19

Inference-Time Diversity in RL-Trained Lean Theorem Provers: A Diagnostic Study

RL-trained Lean theorem provers suffer from mode collapse at inference: doubling the sample budget from k=32 to k=64 on miniF2F-test with DeepSeek-Prover-V1.5-RL solves zero additional theorems (42/244). Fixed structural diversity of 15 tactic skeletons recovers +45% relative improvement at k=16 (+12.3±4.2 theorems). The phenomenon is RL-specific and orthogonal to scaling.

Reasoning Reinforcement learning Benchmarks

arXiv cs.CL·May 19

Reinforcement Learning for LLM Post-Training: A Survey

Comprehensive survey of reinforcement learning post-training methods for LLMs. Unifies RLHF (DPO), RLVR (PPO, GRPO) and SFT within a single policy gradient framework. Detailed technical analysis of offline and iterative approaches with standardized notation for direct comparison.

Reinforcement learning Alignment Papers

arXiv cs.AI·May 19

MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

MirrorBench is a benchmarking framework to evaluate user-proxy agents in conversational systems. It combines 6 metrics (MATTR, Yule's K, HD-D, GTEval, Pairwise Indistinguishability, Rubric-and-Reason) to measure realism of LLM-generated user utterances across 4 public datasets. Open-source code released.

AI Agents Evals Benchmarks
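
Of the six metrics listed, MATTR (moving-average type-token ratio) is simple enough to sketch directly: it averages the type-token ratio over every sliding window of a fixed length, which makes the lexical-diversity estimate less sensitive to text length. A minimal sketch; the whitespace tokenization, the window size, and the fallback for short texts are assumptions here, not MirrorBench's implementation.

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio over sliding windows of `window` tokens.

    Falls back to the plain type-token ratio when the text is no longer
    than one window (an assumption of this sketch).
    """
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)

# Toy utterance with a repeated word; whitespace tokenization is an assumption
utterance = "no no i said no thanks no".split()
print(round(mattr(utterance, window=3), 3))
```

Windows that contain a repeated token score below 1.0, so repetitive utterances pull the average down.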

arXiv cs.AI·May 19

The Homogenization Problem in LLMs: Towards Meaningful Diversity in AI Safety

arXiv paper on homogenization in LLMs: models reproduce and amplify human biases through mode collapse. Authors propose a framework to characterize homogenization in terms of normativity (queer theory) and introduce 'xeno-reproduction' as a mitigation strategy promoting diversity. Experiment on Claude 3.5 Haiku demonstrates gender bias.

Claude AI safety Alignment

arXiv cs.AI·May 19

FormuLLA: A Large Language Model Approach to Generating Novel 3D Printable Formulations

FormuLLA fine-tunes LLMs (Llama2, GPT, Claude) on 1400+ FDM formulations to recommend pharmaceutical excipients and predict filament mechanical properties. Llama2 outperforms the other models; smaller models suffer catastrophic forgetting even with this dataset size.

Llama Fine-tuning Benchmarks

arXiv cs.CL·May 19

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Vision-OPD introduces regional-to-global self-distillation to improve fine-grained visual understanding in MLLMs. The framework transfers the model's privileged perception on evidence-centered crops to its full-image policy via token-level KL divergence minimization on on-policy rollouts. Competitive results on fine-grained visual understanding benchmarks without external models or ground-truth labels.

Vision Reinforcement learning Papers

arXiv cs.AI·May 19

AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture

AgroCoT is a VQA benchmark with 4,759 Chain-of-Thought samples designed to evaluate reasoning capabilities of Vision-Language Models in agriculture. Evaluation of 30 VLMs (proprietary and open-source) reveals significant gaps in zero-shot reasoning, highlighting the importance of CoT for precision farming applications.

Vision Benchmarks Reasoning

arXiv cs.AI·May 19

QuickLAP: Quick Language-Action Preference Learning for Semi-Autonomous Agents

QuickLAP fuses physical and language feedback to learn robot reward functions in real time using a Bayesian framework. LLMs extract reward feature attention masks and preference shifts from free-form utterances, integrated with physical corrections via closed-form update rule. Achieves 70% error reduction vs physical-only and heuristic multimodal baselines in semi-autonomous driving simulator.

AI Agents Reinforcement learning Reasoning

arXiv cs.AI·May 19

ALIGN: A Vision-Language Framework for High-Accuracy Accident Location Inference through Geo-Spatial Neural Reasoning

ALIGN is a vision-language framework to infer precise accident coordinates from Bangla news reports and map-based cues. Using an agentic architecture combining OCR, LLM, and vision-language models, the system reduces localization error from 10.9 km to 0.593 km on validation data and 0.465 km on official Dhaka Metropolitan Police records.

Vision AI Agents Multi-agent

arXiv cs.AI·May 19

WELD: The First Naturalistic Long-Period Small-Team Workplace Emotion Dataset for Ubiquitous Affective Computing

WELD is the first emotion dataset in naturalistic workplace context spanning 30.1 months (Nov 2021–May 2024) with 49 employees from a Chinese software company. 733,780 seven-class facial-expression probability vectors validate three established phenomena and reveal six asymmetric emotional regimes. Exposes FER model bias: over-prediction of 'angry' on neutral Asian faces (0.194 vs 0.05).

Vision Evals AI safety

arXiv cs.AI·May 19

ProtoSiTex: Learning Semi-Interpretable Prototypes for Multi-label Text Classification

ProtoSiTex is a semi-interpretable framework for fine-grained multi-label text classification. It combines unsupervised prototype discovery with supervised classification using hierarchical loss. Experiments on a new hotel reviews benchmark and two public benchmarks demonstrate SOTA performance with faithful, human-aligned explanations.

Evals Papers

arXiv cs.AI·May 19

CoLLM-NAS: Collaborative Large Language Models for Efficient Knowledge-Guided Neural Architecture Search

CoLLM-NAS combines two complementary LLMs for neural architecture search: a Navigator LLM guides search direction, a Generator LLM synthesizes candidates. On ImageNet and NAS-Bench-201, the method reduces search costs by 4–10× while outperforming existing NAS methods.

AI Agents Multi-agent Benchmarks

arXiv cs.AI·May 19

Position: AI Evaluations Should be Grounded on a Theory of Capability

Position paper arguing that AI model evaluations should be grounded in an explicit theory of capability rather than treating scores as direct measurements. Authors empirically demonstrate that reported performance depends strongly on evaluator modeling assumptions and propose an 'Evaluation Card' to document underlying decisions.

Evals Benchmarks

arXiv cs.AI·May 19

An AI system to help scientists write expert-level empirical software

ERA, an AI system combining LLM and Tree Search, automatically generates expert-level scientific software. It discovered 40 novel bioinformatics methods outperforming top human-developed approaches, generated 14 epidemiological models surpassing the CDC ensemble for COVID-19 hospitalization forecasting, and produced expert-level solutions for geospatial analysis and neural activity prediction.

AI Agents Reasoning Code generation

arXiv cs.AI·May 19

EndoCogniAgent: Closed-Loop Agentic Reasoning with Self-Consistency Validation for Endoscopic Diagnosis

EndoCogniAgent is a closed-loop agentic framework for iterative endoscopic diagnosis. It couples fine-grained visual evidence acquisition and multi-step reasoning via self-consistency validation (knowledge and temporal consistency). On EndoAgentBench (6,132 QA pairs from 11 datasets), the system achieves 85.23% accuracy on perception and 71.13% clinical acceptance on reasoning tasks.

AI Agents Reasoning Vision

arXiv cs.CL·May 19

Beyond Sentiment Classification: A Generative Framework for Emotion Intensity Evaluation in Text

Novel approach to emotion modeling using fine-tuned generative language models to output continuous intensity scores (0-100) instead of discrete classification. Demonstrates improved generalization and transfer to sentiment and arousal, particularly valuable for finance applications.

Papers Fine-tuning Benchmarks

arXiv cs.AI·May 19

GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis

GenoMAS is an LLM-based multi-agent framework for gene expression analysis. Six specialized agents orchestrated via typed message-passing protocols combine structured workflows with autonomous adaptability. On GenoTEX benchmark: 89.13% correlation for preprocessing, F1 of 60.48% for gene identification (+10.61% and +16.85% vs prior art).

Multi-agent AI Agents Code generation

arXiv cs.AI·May 19

CooT: Learning to Coordinate In-Context with Coordination Transformers

CooT is a multi-agent framework using in-context learning for real-time adaptation to unfamiliar partners. Evaluated on Overcooked and Google Research Football, it outperforms population-based methods, gradient-based fine-tuning, and Meta-RL baselines without parameter updates.

Multi-agent AI Agents Reasoning

arXiv cs.AI·May 19

OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents

Study of the latency of computer-use agents on OSWorld: LLM calls for planning and reflection dominate total time. The 16 agents tested require 2.7–4.3× more steps than optimal human trajectories, and later steps in a task take about 3× longer than initial steps.

AI Agents Benchmarks Evals

arXiv cs.CL·May 19

AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

AMARIS enhances rubric-based RL by integrating persistent evaluation memory. The system accumulates evaluation diagnostics over time, retrieves them via static and semantic search, and continuously adapts reward rubrics. Experiments show performance gains with ~5% time overhead.

Reinforcement learning Fine-tuning Evals

arXiv cs.AI·May 19

A Survey on Foundation Models for Personalized Federated Intelligence

Survey on integrating foundation models (LLMs, Gemini, Grok) with federated learning to enable artificial personalized intelligence (API). Proposes a personalized federated intelligence (PFI) paradigm combining privacy, generalization, edge personalization, trustworthy adaptation, and refinement via retrieval-augmented generation.

Papers Fine-tuning RAG

arXiv cs.AI·May 19

Automated Knowledge Component Generation for Interpretable Knowledge Tracing in Coding Problems

Automated LLM-based pipeline to generate and tag knowledge components (KCs) for open-ended programming problems. KCGen-KT framework leverages LLM-generated KCs for knowledge tracing. Evaluation on two real-world student code submission datasets shows it outperforms existing KT methods and human-written KCs on future response prediction.

Llama Code generation Evals

arXiv cs.CL·May 19

Prompt2Fingerprint: Plug-and-Play LLM Fingerprinting via Text-to-Weight Generation

Prompt2Fingerprint introduces a framework for LLM fingerprinting via parameter generation. Instead of fine-tuning each model separately, a specialized generator maps textual descriptions to low-rank parameter increments in a single forward pass, eliminating retraining costs.

Prompt engineering Fine-tuning AI safety

arXiv cs.AI·May 19

A Machine With Human-Like Memory Systems

Paper presents an AI agent with semantic and episodic memory systems inspired by cognitive science. Authors design and release "The Room" environment (OpenAI Gym-compatible) showing that combining both memory types outperforms single systems. Multi-agent collaboration improves performance.

AI Agents Multi-agent Reasoning

arXiv cs.AI·May 19

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

DashAttention introduces a differentiable hierarchical sparse attention method using adaptive α-entmax transformation to select variable numbers of KV blocks. Unlike NSA and InfLLMv2, it maintains full differentiability and achieves 75% sparsity with accuracy comparable to full attention. GPU-aware Triton implementation provides significant speedup.

Reasoning Benchmarks Infrastructure

arXiv cs.AI·May 19

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

Study of 38 models on 8,900 scholarly references: factual recall quality follows a sigmoid combining model size and topic frequency in training data. These two variables explain 60% of variance across dense models, rising to 74-94% within individual model families.

Benchmarks Papers Reasoning

arXiv cs.AI·May 19

Semantic Generative Tuning for Unified Multimodal Models

Semantic Generative Tuning (SGT) aligns visual understanding and generation in unified multimodal models by using image segmentation as a generative proxy. High-level semantic tasks improve feature linear separability and visual-textual attention allocation, outperforming decoupled training approaches.

Vision Image generation Fine-tuning

arXiv cs.AI·May 19

Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents

Reversa is a reverse documentation engineering framework converting legacy systems into operational specifications for AI agents. A multi-agent pipeline extracts implicit business rules, synthesizes architecture, and generates traceable specifications with confidence marking. Case study: COBOL-to-Go ATM migration producing 517 claims, 10 identified gaps, and 53 Gherkin scenarios.

AI Agents Multi-agent Code generation

arXiv cs.AI·May 19

COOPO: Cyclic Offline-Online Policy Optimization Algorithm

COOPO is a hybrid offline-online reinforcement learning algorithm that cycles between KL-regularized offline training and online fine-tuning. Periodic returns to offline training eliminate catastrophic forgetting and distribution drift. On D4RL benchmarks, COOPO reduces online interactions while improving final returns compared to state-of-the-art hybrids.

Reinforcement learning Papers Benchmarks

arXiv cs.AI·May 19

Statistical Limits and Efficient Algorithms for Differentially Private Federated Learning

Study of trade-offs between estimation accuracy, differential privacy, and communication cost in federated learning. Proposes FedHybrid and FedNewton, improvements over FedAvg and FedSGD with finite-sample MSE upper bounds and minimax lower bounds. Evaluation on logistic regression and neural networks (MNIST, CIFAR-10).

Benchmarks Papers

arXiv cs.AI·May 19

Pocket Foundation Models: Distilling TFMs into CPU-Ready Gradient-Boosted Trees

Distillation of tabular foundation models (TabICLv2) into boosted trees (XGBoost/CatBoost) for ultra-fast CPU inference. Solves soft target collapse via stratified out-of-fold labeling. Across 153 datasets: 0.882 macro-mean AUC (96.5% of teacher) at 1.9 ms on CPU, 38–860× speedup. Open-sourced as TabTune library.

Fine-tuning Benchmarks Open source

arXiv cs.AI·May 19

Post-Trained MoE Can Skip Half Experts via Self-Distillation

ZEDA, a self-distillation framework, converts post-trained static MoE models into dynamic variants. On Qwen3-30B-A3B and GLM-4.7-Flash, it cuts expert FLOPs by 50% with marginal accuracy loss and achieves a 1.20× end-to-end inference speedup.

Qwen Fine-tuning Reasoning

arXiv cs.AI·May 19

Data Presentation Over Architecture: Resampling Strategies for Credit Risk Prediction with Tabular Foundation Models

Comparative study of tabular foundation models (TFMs) vs classical models on credit default prediction. On Home Credit and Lending Club datasets, context construction strategy (balanced vs uniform sampling) explains more AUC-ROC variance than model choice: +3-4 AUC points. With 5K-10K balanced examples, TFMs match classical GBDTs while improving default-class recall.

Benchmarks

arXiv cs.AI·May 19

Position: Weight Space Should Be a First-Class Generative AI Modality

Neural network weights constitute a first-class data modality. This position paper proposes treating checkpoints as generative data: synthesizing weights on demand matches or exceeds fine-tuning while reducing adaptation costs by orders of magnitude. High-performing models occupy structured regions of weight space (symmetry, modularity, shared subspaces).

Fine-tuning Reasoning Papers

arXiv cs.AI·May 19

Stochastic Penalty-Barrier Methods for Constrained Machine Learning

New SPBM method for constrained optimization in deep learning. Combines penalty methods, barrier methods, and exponential dual averaging to handle non-convexity and non-smoothness. Demonstrates effectiveness on fairness, physics-informed networks, and symbolic knowledge integration with linear overhead up to 10k constraints.

Reinforcement learning Papers Benchmarks

arXiv cs.AI·May 19

SAME: A Semantically-Aligned Music Autoencoder

SAME is an autoencoder for stereo music and general audio achieving 4096× temporal compression while maintaining reconstruction quality. The architecture combines a transformer backbone, semantic regularization, phase-aware reconstruction losses, and improved discriminators. Two variants (SAME-L and SAME-S) are released with open weights.

Open source Papers

arXiv cs.AI·May 19

CATA: Continual Machine Unlearning via Conflict-Averse Task Arithmetic

CATA introduces a continual machine unlearning method for vision-language models (VLMs). It represents each unlearning request as a task vector and aggregates historical vectors by suppressing conflicting components, ensuring forgetting effectiveness, model fidelity, and persistence against knowledge re-emergence.

Vision AI safety Papers

arXiv cs.AI·May 19

Not What You Asked For: Typographic Attacks in Household Robot Manipulation

Researchers demonstrate typographic attacks against household manipulation robots that use CLIP. By placing adversarial stickers, they achieve a 67.8% attack success rate on the HomeRobot benchmark in Habitat simulation, causing the robot to grasp and transport the wrong objects.

Vision Robotics AI safety

arXiv cs.CL·May 19

Agentic Chunking and Bayesian De-chunking of AI Generated Fuzzy Cognitive Maps: A Model of the Thucydides Trap

Automatic generation of fuzzy cognitive maps (FCMs) from text using LLM agents that chunk text into overlapping segments. Convex mixing of chunk FCMs produces a cyclic FCM knowledge graph. Operator-level Bayesian inference generates "de-chunked" FCMs. Demonstration on Thucydides Trap model: 7 out of 8 FCMs predicted armed conflict. Gemini 3.1 served as chunking agent.

AI Agents Gemini RAG

arXiv cs.AI·May 19

Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation

RAT (Randomized Advantage Transformation) estimates Tikhonov-regularized natural policy gradients via direct backpropagation without explicit Fisher matrix construction. The method applies the Woodbury formula and randomized block Kaczmarz iterations on on-policy mini-batches. Results match or exceed established natural-gradient methods on continuous and visual control benchmarks.

Reinforcement learning Reasoning Papers

arXiv cs.AI·May 19

Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks

OverEager-Gen is a benchmark measuring out-of-scope actions by autonomous coding agents on benign tasks. On Claude Code, removing the consent declaration raises the overeager rate from 0% to 17.1%. The study validates 500 scenarios across 4 products (Claude Code, OpenHands, Codex CLI, Gemini CLI) and 6 base models.

AI Agents Code generation AI safety

arXiv cs.AI·May 19

Estimating Item Difficulty with Large Language Models as Experts

Study evaluating three off-the-shelf LLMs to estimate difficulty of educational items without response data. Across 6 primary-school mathematics domains, Spearman correlations show moderate-to-strong alignment with empirical difficulties. Pairwise comparisons outperform absolute judgements; adding token probabilities and few-shot examples improves results.
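
For readers unfamiliar with the metric, here is a minimal pure-Python sketch of Spearman rank correlation, the statistic used above to compare LLM difficulty estimates with empirical difficulties. The item scores are invented for illustration and are not taken from the study; ties receive average ranks.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical difficulty scores for five items.
    llm_estimates = [0.2, 0.5, 0.4, 0.9, 0.7]
    empirical = [0.1, 0.3, 0.6, 0.8, 0.7]
    print(round(spearman(llm_estimates, empirical), 3))
```

Because only ranks matter, the LLM's scores do not need to be on the same scale as the empirical difficulties.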

Prompt engineering Evals Benchmarks

arXiv cs.AI·May 19

Key-Gram: Extensible World Knowledge for Embodied Manipulation

Key-Gram is a conditional-memory framework separating linguistic knowledge from visual reasoning for embodied control. It decomposes instructions into task-specific key-grams, retrieves linguistic priors via O(1) hashed lookup, and injects them into hidden layers. Achieves 29.5% gains on RoboTwin2.0, 35.8% on LIBERO-Plus, 15.4% on real-world tasks.

Robotics Vision AI Agents

arXiv cs.AI·May 19

STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics

STT-Arena is a benchmark of 227 interactive tasks measuring LLMs' ability to detect and adapt to spatio-temporal changes. Claude-4.6-Opus achieves under 40% accuracy. Authors identify three recurring failure modes and propose STT-Agent-4B combining iterative trajectory refinement with online RL.

AI Agents Benchmarks Reasoning

arXiv cs.AI·May 19

Probing for Representation Manifolds in Superposition

A supervised method called Manifold Probe discovers representation manifolds in superposition within neural networks. Tested on Llama 2-7b, it identifies linear manifolds for time and space, and demonstrates causal control by steering model completions about release years of movies and songs.

Llama Reasoning

arXiv cs.CL·May 19

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

Researchers identify Entropy-Gradient Inversion, a negative correlation between token entropy and logit gradients, as a geometric fingerprint of Large Reasoning Models' reasoning capability. They propose Correlation-Regularized Group Policy Optimization (CorR-PO), an RL method embedding this inversion signature into reward regularization, outperforming baselines across multiple reasoning benchmarks.

Reasoning Reinforcement learning Benchmarks

arXiv cs.AI·May 19

Beyond Morphology: Quantifying the Diagnostic Power of Color Features in Cancer Classification

arXiv study demonstrates that color features alone (RGB/HSV histograms, statistical moments) achieve 89% accuracy in binary cancer/benign classification in histopathology, excluding morphological information. Authors propose these simple features as a lightweight pre-screening tool before complex deep learning models.

Vision Benchmarks Evals

arXiv cs.AI·May 19

DBES: A Systematic Benchmark and Metric Suite for Evaluating Expert Specialization in Large-Scale MoEs

DBES is a diagnostic framework for evaluating expert specialization in Mixture-of-Experts models. Five theoretically grounded metrics measure domain isolation and routing specialization. Testing on Qwen, DeepSeek, and GLM reveals distinct specialization paradigms. Targeted post-training on specialized expert paths improves performance by 66–94% using only 15% of original training resources.

Benchmarks Qwen DeepSeek

arXiv cs.CL·May 19

SafeLens: Deliberate and Efficient Video Guardrails with Fast-and-Slow Screening

SafeLens introduces a two-tier (fast-and-slow) video moderation architecture to reduce inference costs. The framework filters the SafeWatch dataset to 2.4% via influence-guided filtering and augments it with Chain-of-Thought traces. It outperforms SafeWatch-8B, OmniGuard-7B, GPT-5.4, and Gemini-3.1-pro on real and AI-generated video benchmarks.

Vision AI safety Reasoning

arXiv cs.AI·May 19

GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets

GAMMA is a quantizer-agnostic mixed-precision framework that automatically allocates bit precision per module without quantization-aware training. Using teacher-forced hidden-state reconstruction and integer programming, it achieves +12.99 Avg. over fixed baselines on Llama/Qwen 8B-32B, matching 3-bit quality at 2.5-bit average.

Llama Qwen Benchmarks

arXiv cs.AI·May 19

Ensembling Tabular Foundation Models - A Diversity Ceiling And A Calibration Trap

Six modern tabular foundation models form a highly redundant ensemble (mean Q-statistic 0.961). On 153 OpenML classification tasks, the best ensemble (two-level cascade stacking) gains +0.18% accuracy at 253× compute cost. Friedman-Nemenyi analysis places three ensembles and the best single model in the same equivalence group. Greedy selection is recommended as practical default.
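
The redundancy figure above is a pairwise Q-statistic averaged over model pairs. As a minimal sketch, assuming the paper uses the standard Yule's Q definition from the ensemble-diversity literature, the code below computes it from per-example correctness vectors. The correctness data is invented for illustration.

```python
from itertools import combinations


def q_statistic(correct_a, correct_b):
    """Yule's Q for two classifiers, given per-example correctness flags.

    Q is near +1 when the models are right and wrong on the same examples,
    near 0 when their errors are independent, and negative when they err
    on different examples.
    """
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1
        elif not a and not b:
            n00 += 1
        elif a:
            n10 += 1
        else:
            n01 += 1
    denominator = n11 * n00 + n01 * n10
    if denominator == 0:
        raise ValueError("Q is undefined for these counts")
    return (n11 * n00 - n01 * n10) / denominator


def mean_pairwise_q(correctness_by_model):
    """Average Q over all unordered pairs of models."""
    pairs = list(combinations(correctness_by_model, 2))
    return sum(q_statistic(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical correctness flags (1 = correct) on eight test examples.
    model_a = [1, 1, 1, 0, 0, 1, 0, 1]
    model_b = [1, 1, 0, 0, 0, 1, 1, 1]
    model_c = [1, 1, 1, 0, 0, 1, 0, 1]
    print(round(q_statistic(model_a, model_b), 4))
    print(round(mean_pairwise_q([model_a, model_b, model_c]), 4))
```

A mean Q close to 1 means the models make largely the same mistakes, which is consistent with the small ensemble gain reported above.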

Benchmarks Papers

arXiv cs.AI·May 19

Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights

Reinforcement learning framework for predicting customer trajectories in retail spaces. By modeling bounded rationality, the RL-based approach outperforms TSP/PNN heuristics (average 28% deviation from shortest paths). Validated on real convenience store data: RL predictions align better with observed behavior and yield more accurate impulse purchase rates and shelf traffic estimates, enabling practical layout optimization.

Reinforcement learning AI Agents Business

arXiv cs.AI·May 19

Building Reliable Arithmetic Multipliers Under NBTI Aging and Process Variations

Paper on mitigating NBTI aging in arithmetic multipliers used in AI. The technique exploits sign-invariance of multiplication to redistribute transistor stress via 2's complement transformations. Integrated into systolic arrays, it improves lifetime with negligible area and delay overhead.
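
The sign-invariance mentioned above is the identity a × b = (−a) × (−b). The sketch below checks it on 8-bit two's complement operands and shows that the negated operands present different bit patterns to the multiplier inputs. It only illustrates the arithmetic property; the bit width and helper names are assumptions, and the paper's actual policy for when to apply the transformation is not modeled.

```python
BITS = 8  # assumed operand width for this illustration
MASK = (1 << BITS) - 1


def encode(value):
    """Signed integer -> BITS-wide two's complement bit pattern."""
    return value & MASK


def decode(pattern, bits=BITS):
    """Two's complement bit pattern -> signed integer."""
    sign_bit = 1 << (bits - 1)
    return pattern - (1 << bits) if pattern & sign_bit else pattern


def negate(pattern):
    """Two's complement negation: invert all bits and add one."""
    return (~pattern + 1) & MASK


def multiply(pattern_a, pattern_b):
    """Signed multiply returning a 2*BITS-wide two's complement product."""
    product = decode(pattern_a) * decode(pattern_b)
    return product & ((1 << (2 * BITS)) - 1)


if __name__ == "__main__":
    a, b = encode(37), encode(-5)
    print(format(a, "08b"), format(b, "08b"))
    print(format(negate(a), "08b"), format(negate(b), "08b"))
    # Same product, different input bit patterns.
    print(multiply(a, b) == multiply(negate(a), negate(b)))
```

The most negative value (−128 at 8 bits) has no representable negation, so a real design would need to handle or exclude that operand.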

Papers Benchmarks AI safety

arXiv cs.CL·May 19

Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models

Researchers train KinGPT (25M parameters) on chess data and demonstrate that high benchmark scores of chess-trained LLMs stem primarily from pattern-matching rather than genuine rule understanding. LLM-Modulo, a verifier-in-the-loop framework, improves RedPajama 3B from 1.2% to 21.2% best-move accuracy. Training code, datasets, and model checkpoints open-sourced.

Benchmarks Evals Fine-tuning

arXiv cs.AI·May 19

Qumus: Realization of An Embodied AI Quantum Material Experimentalist

Qumus is the first embodied AI quantum materials experimentalist: an autonomous robotic mini-laboratory capable of hypothesis generation, protocol planning, and experimental execution on 2D quantum materials. It achieved the first AI-driven creation of graphene and fabricated atomically thin field-effect transistors via van der Waals stacking, with closed-loop error correction.

AI Agents Multi-agent Robotics

arXiv cs.AI·May 19

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

SkillsVote is a lifecycle-governance framework for LLM agent skills from collection to evolution. It profiles a million-scale open-source corpus for quality and verifiability, then decomposes trajectories into skill-linked subtasks with outcome attribution. Results: +7.9pp on Terminal-Bench 2.0 (GPT-5.2) and +2.6pp on SWE-Bench Pro.

AI Agents Benchmarks Code generation
