May 2026

3149 articles

CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers

CoCoReviewBench is a 3,900-paper benchmark (ICLR, NeurIPS) for evaluating AI reviewer systems. It addresses metric bias by using reviewer-author-meta-review discussions as expert annotations. Results show that AI reviewers suffer from hallucinations and that reasoning models make more effective reviewers.

Benchmarks Reasoning Evals

arXiv cs.CL·May 19

Language models fail at extended rule following

Language models fail to reliably apply simple rules over long sequences. In tests on 126 model variants, no model can count beyond a model-dependent threshold. Failures are abrupt and persist despite increasing model size and computation. Mechanistic probing shows that models use finite internal states to simulate counting and exhaust them beyond the threshold.

Reasoning Benchmarks AI Agents

arXiv cs.AI·May 19

Geometry-Aware Attention Guidance for Diffusion Models via Modern Hopfield Dynamics

GAG (Geometry-Aware Attention Guidance) improves diffusion models without additional training by guiding attention via modern Hopfield dynamics. Theoretical analysis proves that the sparse-dense discrepancy acts as a directional acceleration signal. Presented as a universal method, it is tested on FLUX.1, FLUX.2, and Qwen-Image, with quality gains and minimal computational overhead.

Image generation Papers Reasoning

arXiv cs.AI·May 19

Human-Certified Module Repositories for the AI Age

Human-Certified Module Repositories (HCMRs) introduce an architectural model for building trustworthy software in AI-assisted development. Addressing risks from unverified components in modular ecosystems, this framework combines human oversight with automated analysis to certify modules and enable safe assembly by both humans and AI agents.

AI Agents Code generation AI safety

arXiv cs.AI·May 19

Surgical Post-Training: Proximal On-Policy Distillation for Reasoning with Knowledge Retention

SPOT (Surgical Post-Training) is an on-policy distillation framework that injects reasoning capabilities into LLMs while preserving prior knowledge. With only 4k rectified math pairs, it improves Qwen3-8B by 6.2% on average in 16 minutes on 8× H800 GPUs. The approach uses a KL-constrained reward formulation to mitigate catastrophic forgetting.

Fine-tuning Reinforcement learning Reasoning

arXiv cs.AI·May 19

IdGlow: Dynamic Identity Modulation for Multi-Subject Generation

IdGlow is a mask-free Flow Matching framework for multi-subject generation. It uses task-adaptive timestep scheduling and temporal gating to preserve identities during complex transformations (e.g., age transformation). A fine-grained group-level DPO stage eliminates artifacts and harmonizes textures.

Image generation Fine-tuning Reinforcement learning

arXiv cs.AI·May 19

Flowette: Flow Matching with Graphette Priors for Graph Generation

Flowette is a continuous flow matching framework for graph generation with recurring subgraph motifs. The model uses a GNN-based transformer to learn a velocity field, incorporates optimal transport-based coupling, and introduces graphettes, a probabilistic family of graph structure models generalizing graphons. Achieves state-of-the-art results on multiple benchmarks.

Papers Benchmarks

arXiv cs.AI·May 19

GIST: Targeted Data Selection for Instruction Tuning via Coupled Optimization Geometry

GIST introduces targeted data selection for instruction tuning by replacing axis-aligned scaling with robust subspace alignment via SVD. It recovers task-specific subspaces from validation gradients and scores examples by alignment with target directions. GIST matches or outperforms state-of-the-art baselines using only 0.29% storage and 25% computational time.

Fine-tuning Reinforcement learning Papers

arXiv cs.AI·May 19

CodeScaler: Scaling Code LLM Training and Test-Time Inference via Reward Models

CodeScaler is a reward model for training and inference scaling of code LLMs. Trained on verified preference data, it outperforms execution-based RL by +1.55 points on Qwen3-8B and +4.23 on Qwen3-14B. At inference, it reduces latency 10× while maintaining performance comparable to unit test approaches.

Code generation Reinforcement learning Qwen

arXiv cs.AI·May 19

SkillJect: Effectively Automating Skill-Based Prompt Injection for Skill-Enabled Agents

SkillJect automates prompt-injection attacks against skill-enabled LLM agents. The framework hides malicious payloads in auxiliary helper scripts and rewrites SKILL.md instructions using a front-loaded inducement strategy. A multi-agent loop (Attack/Victim/Evaluate) optimizes attack effectiveness across platforms and backend LLMs.

AI Agents AI safety Prompt engineering

arXiv cs.AI·May 19

Learning Native Continuation for Action Chunking Flow Policies

Legato is a training-time continuation method for action-chunked flow-based VLA policies. It initializes denoising from a mixture of known actions and noise, and reshapes flow dynamics to ensure consistency between training and inference. Real-world experiments: ~10% improvements in trajectory smoothness and task completion time versus RTC across five manipulation tasks.

Vision Code generation Reasoning

arXiv cs.AI·May 19

Trust the uncertain teacher: distilling dark knowledge via calibrated uncertainty

Calibrated Uncertainty Distillation (CUD) improves knowledge distillation by preserving the teacher's calibrated uncertainty instead of overconfident predictions. The approach guides students to learn from balanced distributions between confident signals and structured uncertainty, improving accuracy, calibration, and robustness under distribution shift.

arXiv cs.CL·May 19

From graphemic dependence to lexical structure: a Markovian perspective on Dante's Commedia

Study of Dante's Divina Commedia using vowel-consonant encoding modeled as a four-state Markov chain. Graphemic memory index increases gradually from Inferno to Paradiso. Trigram analysis reveals recurrent configurations linked to lexical environments and orthographic phenomena.
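
As a rough illustration of the encoding described above, the sketch below maps letters to V/C symbols and estimates transition probabilities between overlapping two-symbol states. Treating the four states as the pairs CC, CV, VC, and VV is an assumption made here, as are the vowel set and the function names `encode` and `pair_transitions`; the paper's exact construction may differ.

```python
from collections import Counter

# Assumed vowel inventory (plain and accented Italian vowels).
VOWELS = set("aeiouàèéìíòóù")


def encode(text):
    """Map each letter to 'V' or 'C'; spaces and punctuation are dropped."""
    return "".join(
        "V" if ch in VOWELS else "C" for ch in text.lower() if ch.isalpha()
    )


def pair_transitions(symbols):
    """Estimate transition probabilities between overlapping pair states.

    Each state is two consecutive symbols (CC, CV, VC, VV), so a state
    'XY' can only be followed by a state that starts with 'Y'.
    """
    states = [symbols[i:i + 2] for i in range(len(symbols) - 1)]
    counts = Counter(zip(states, states[1:]))
    totals = Counter(states[:-1])  # how often each state has a successor
    return {(a, b): n / totals[a] for (a, b), n in counts.items()}


if __name__ == "__main__":
    line = "Nel mezzo del cammin di nostra vita"
    for (src, dst), p in sorted(pair_transitions(encode(line)).items()):
        print(f"{src} -> {dst}: {p:.2f}")
```

Run on a full cantica rather than a single line, the same table could be compared across Inferno, Purgatorio, and Paradiso.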

Papers

arXiv cs.CL·May 19

Beyond the Final Actor: Modeling the Dual Roles of Creator and Editor for Fine-Grained LLM-Generated Text Detection

New RACE method detects LLM-generated text across 4 fine-grained categories (pure human, pure LLM, human-polished-by-LLM, humanized-LLM). Uses Rhetorical Structure Theory (RST) and Elementary Discourse Unit (EDU)-level features. Outperforms 12 baselines with low false alarms.

AI safety Evals Regulation

arXiv cs.AI·May 19

Reverse-Engineering Model Editing on Language Models

Researchers reveal a critical vulnerability in locate-then-edit model editing methods: the parameter updates let attackers recover the edited data via the KSTER attack, which exploits low-rank structure. A proposed defense, subspace camouflage, obfuscates these fingerprints without compromising editing utility.

AI safety Alignment Papers

arXiv cs.AI·May 19

Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs

New metric decomposing token efficiency of reasoning LLMs. Introduces trace-optional evaluation protocol separating completion rate, conditional correctness, and generated length. Evaluates 14 open-weight models on CogniLoad, GSM8K, ProofWriter, ZebraLogic. Identifies three distinct failure modes: logic-limited, context-limited, and verbosity-limited.
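
To make the three components concrete, here is a minimal sketch of one way such a decomposition could be computed from per-example records. The `Run` record, the `decompose` function, and the "tokens per correct answer" summary are hypothetical names and choices for illustration, not the paper's definitions; the only identity relied on is that overall accuracy equals completion rate times conditional correctness when an incomplete run counts as incorrect.

```python
from dataclasses import dataclass


@dataclass
class Run:
    completed: bool  # the model produced a final answer
    correct: bool    # the final answer was right (False if not completed)
    tokens: int      # number of generated tokens


def decompose(runs):
    """Split accuracy into completion rate and conditional correctness."""
    if not runs:
        raise ValueError("need at least one run")
    done = [r for r in runs if r.completed]
    completion_rate = len(done) / len(runs)
    conditional_correctness = (
        sum(r.correct for r in done) / len(done) if done else 0.0
    )
    mean_tokens = sum(r.tokens for r in runs) / len(runs)
    accuracy = completion_rate * conditional_correctness
    return {
        "completion_rate": completion_rate,
        "conditional_correctness": conditional_correctness,
        "accuracy": accuracy,
        "mean_tokens": mean_tokens,
        # Assumed efficiency summary: tokens spent per correct answer.
        "tokens_per_correct": mean_tokens / accuracy if accuracy else float("inf"),
    }


if __name__ == "__main__":
    runs = [
        Run(True, True, 100),
        Run(True, False, 300),
        Run(False, False, 800),
        Run(True, True, 200),
    ]
    print(decompose(runs))
```

Under this reading, a low completion rate with long outputs would point to a context or verbosity problem, while a high completion rate with low conditional correctness would point to a logic problem.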

Reasoning Evals Benchmarks

arXiv cs.AI·May 19

Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning

R&B-EnCoRe enables Vision-Language-Action models to self-generate and refine embodied reasoning without human annotation or external rewards. Tested on manipulation (Franka Panda, WidowX), navigation (bipedal, wheeled, bicycle, quadruped), and autonomous driving: +28% manipulation success, +101% navigation scores, −21% collision rate vs baselines.

Vision AI Agents Reasoning

arXiv cs.AI·May 19

The Laplacian Keyboard: Beyond the Linear Span

Laplacian Keyboard (LK) is a hierarchical framework that overcomes limitations of Laplacian eigenvectors in RL. LK builds a task-agnostic behavior library and uses a meta-policy to combine its behaviors dynamically, enabling policies beyond the original linear span to be learned while improving sample efficiency over standard RL methods.

Reinforcement learning Reasoning

arXiv cs.AI·May 19

QuantaAlpha: An Evolutionary Framework for LLM-Driven Alpha Mining

QuantaAlpha is an evolutionary framework for LLM-driven alpha mining. It treats each run as a trajectory and improves factors via trajectory-level mutation and crossover. On CSI 300 with GPT-5.2: IC=0.0472, ARR=4.68%, MDD=11.8%. Factors transfer effectively to CSI 500 (+40.28% excess return) and S&P 500 (+19.1%).

AI Agents Reinforcement learning Papers

arXiv cs.LG·May 19

A Structural Threshold in Decision Capacity Governs Collapse in Self-Play Reinforcement Learning

An arXiv study shows that a threshold in decision capacity determines collapse in self-play reinforcement learning. Eliminating all positive-reach contingent decisions causes rapid convergence to a deterministic exploitation attractor. Preserving even a single contingent decision point prevents collapse, confirming the mechanism is co-adaptation under constraint.

Reinforcement learning Papers Multi-agent

arXiv cs.AI·May 19

Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps

Diamond Maps are stochastic flow map models enabling efficient reward alignment at inference time. They amortize multiple simulation steps into a single-step sampler while preserving stochasticity required for optimal alignment. Learned via distillation from GLASS Flows, they outperform existing methods in performance and scalability.

Reasoning Reinforcement learning Papers

arXiv cs.CL·May 19

AuthorMix: Modular Authorship Style Transfer via Layer-wise Adapter Mixing

AuthorMix introduces a modular authorship style transfer framework using style-specific LoRA adapters and layer-wise adapter mixing. Trained on only a few examples, it outperforms SoTA baselines and GPT-5.1 while better preserving the original meaning.

Fine-tuning Prompt engineering Papers

arXiv cs.CL·May 19

MUSCAT: MUltilingual, SCientific ConversATion Benchmark

MUSCAT is a multilingual benchmark to evaluate Automatic Speech Recognition (ASR) systems on bilingual scientific conversations with code-switching. The dataset contains discussions between multiple speakers in different languages and proposes an evaluation framework beyond Word Error Rate (WER). Results show current ASR systems struggle with these challenges.

Benchmarks Voice

arXiv cs.AI·May 19

LaDi-RL: Latent Diffusion Reasoning Prevents Entropy Collapse in Reinforcement Learning

LaDi-RL optimizes LLM reasoning via RL in latent space using diffusion. Instead of optimizing token sequences, the method generates latent reasoning trajectories through iterative denoising. It solves credit assignment (rewards observed after decoding) via hierarchical latent-text rollouts. Gains: +9.4% code generation, +5.7% math reasoning on pass@1.

Reinforcement learning Reasoning Code generation

arXiv cs.AI·May 19

On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression

Study of adversarial robustness in compressed vision-language models. Authors propose CAGE attack that exploits the mismatch between perturbation optimization (full tokens) and inference (via compression). CAGE combines expected feature disruption and rank distortion alignment to expose hidden vulnerabilities in compressed LVLMs.

Vision AI safety Benchmarks

arXiv cs.CL·May 19

Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution

Med-V1 is a family of 3-billion-parameter language models trained on synthetic data for biomedical evidence attribution and fact verification. It outperforms base models by +27% to +71% on five benchmarks and rivals GPT-5 while being far more efficient. The study quantifies hallucinations in LLM-generated answers under different citation instructions.

Benchmarks Fine-tuning Evals

arXiv cs.CL·May 19

StructLens: A Structural Lens for Language Models via Maximum Spanning Trees

StructLens analyzes the internal organization of representations in language models using maximum spanning trees built on residual streams. The framework reveals that middle layers strongly organize nearby tokens, and that smaller local units emerge before larger units during pre-training.

Papers Reasoning

arXiv cs.CL·May 19

ToolMATH: A Diagnostic Benchmark for Long-Horizon Tool Use under Systematic Tool-Catalog Constraints

ToolMATH is a diagnostic benchmark for evaluating long-horizon tool use by language models. It converts math solutions into reusable Python tools with natural-language descriptions and typed schemas, then measures adaptability (success with replacement tools), robustness (stability under distractors), and tool connectivity (accuracy over long chains).

Benchmarks AI Agents Tools

arXiv cs.AI·May 19

SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking

SynCABEL uses LLMs to generate contextualized synthetic training examples to address scarcity of annotated data in biomedical entity linking. The framework achieves state-of-the-art on MedMentions (English), QUAERO (French), and SPACCC (Spanish), reaching full human supervision performance with 60% less annotated data. An LLM-as-a-judge protocol evaluates clinical validity.

Papers Benchmarks RAG

arXiv cs.AI·May 19

Tracking Drift: Variation-Aware Entropy Scheduling for Non-Stationary Reinforcement Learning

AES (Adaptive Entropy Scheduling) dynamically adjusts the entropy coefficient in non-stationary RL under environment drift. It proposes a square-root scaling rule based on an observable non-stationarity proxy. Evaluated across 4 algorithm variants, 12 tasks, and 4 drift modes, it reduces performance degradation from drift and accelerates recovery after abrupt changes.
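
A square-root scaling rule of this kind might look like the sketch below. The summary does not give the exact rule, so the function name `entropy_coefficient`, the `base`, `low`, and `high` parameters, and the clipping are all assumptions for illustration; `variation` stands in for whatever non-stationarity proxy is observed.

```python
import math


def entropy_coefficient(variation, base=0.01, low=1e-4, high=0.2):
    """Scale the entropy coefficient with the square root of a drift proxy.

    `variation` is a non-negative, observable measure of how much the
    environment is changing (hypothetical; e.g. a running average of
    reward shifts). The result is clipped to [low, high].
    """
    if variation < 0:
        raise ValueError("variation must be non-negative")
    return min(high, max(low, base * math.sqrt(variation)))


if __name__ == "__main__":
    for v in (0.0, 0.25, 1.0, 4.0, 1e6):
        print(f"variation={v:<9} coefficient={entropy_coefficient(v):.4f}")
```

With a square-root rule, quadrupling the measured drift only doubles the exploration bonus, which keeps the schedule from overreacting to large but brief changes.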

Reinforcement learning Reasoning

arXiv cs.AI·May 19

Support-Safe Variational Hybrid Filtering for Contact-Mode and Sparse-Law Recovery

VHYDRO is a variational hybrid filter for contact-rich robot dynamics. It prevents branch loss by mixing the learned proposal with a feasible transition law before sampling. The model jointly infers continuous latent state and discrete contact mode, recovering sparse port-Hamiltonian laws per regime.

Robotics Reasoning Papers

arXiv cs.AI·May 19

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

DevBench is a telemetry-driven benchmark evaluating LLMs on 1,800 realistic code completion tasks across 6 programming languages. 9 SOTA models tested, best score 43.5% Pass@1. Combines functional correctness, similarity metrics, and LLM-judge assessments on usefulness and contextual relevance.

Code generation Benchmarks Evals

arXiv cs.AI·May 19

ADMEDTAGGER: an annotation framework for distillation of expert knowledge for the Polish medical language

Annotation framework using Llama3.1 as a teacher model to tag Polish medical texts. The corpus spans 5 clinical categories (Radiology, Oncology, Cardiology, Hypertension, Pathology). A DistilBERT classifier achieves F1 > 0.80 per category while being 500× smaller than the LLM, requiring 300× less GPU VRAM, and running inference several hundred times faster.

Llama Fine-tuning Benchmarks

arXiv cs.AI·May 19

Enhancing Cloud Network Resilience via a Robust LLM-Empowered Multi-Agent Reinforcement Learning Framework

CyberOps-Bots combines an LLM agent with multi-level RL agents to defend cloud networks against attacks. The hierarchical framework uses ReAct planning and long-short term memory. On real cloud datasets, it maintains 68.5% higher availability and achieves a 34.7% performance gain without retraining when shifting scenarios.

Multi-agent Reinforcement learning Reasoning

arXiv cs.AI·May 19

KASER: Knowledge-Aligned Student Error Simulator for Open-Ended Coding Tasks

KASER is a reinforcement learning method to simulate student coding errors. It uses a hybrid reward combining code similarity, error matching, and solution diversity to prevent mode collapse and capture the variety of student responses.

Reinforcement learning Code generation Evals

arXiv cs.AI·May 19

Rethinking GNNs and Missing Features: Challenges, Evaluation and a Robust Solution

arXiv paper on handling missing node features in Graph Neural Networks (GNNs). Authors prove existing benchmarks with sparse features limit meaningful performance comparison. They introduce GNNmim, a robust baseline evaluated on dense datasets with realistic missingness mechanisms beyond MCAR.

Benchmarks Evals

arXiv cs.CL·May 19

Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents

Calibrate-Then-Act (CTA) is a framework enabling LLM agents to explicitly reason about cost-uncertainty tradeoffs during exploration. By providing inferred priors about environment state, CTA improves decision-making on QA, retrieval-augmented QA, and file-reading coding tasks without standard RL training.

AI Agents Reasoning Reinforcement learning

arXiv cs.AI·May 19

Learning from Historical Activations in Graph Neural Networks

HISTOGRAPH, an attention-based final aggregation layer, leverages intermediate activations from previous GNN layers. The method applies layer-wise attention followed by node-wise attention to model representation evolution. Improved results on graph classification benchmarks with enhanced robustness in deep GNNs.

Benchmarks Papers

arXiv cs.AI·May 19

Mapping Human Anti-collusion Mechanisms to Multi-agent AI Systems

arXiv paper developing a taxonomy of human anti-collusion mechanisms (sanctions, leniency & whistleblowing, monitoring & auditing, market design, governance) and mapping them to multi-agent AI systems. Highlights open challenges: attribution of emergent coordination, agent identity fluidity, boundary between beneficial cooperation and harmful collusion, adversarial adaptation.

Multi-agent AI Agents AI safety

arXiv cs.CL·May 19

Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study

CLUES, a framework for clinical Text-to-SQL, decomposes semantic uncertainty into ambiguity and instability scores using the Schur complement of a bipartite semantic graph matrix. Tested on AmbigQA/SituatedQA and a clinical benchmark, it outperforms Kernel Language Entropy and enables efficient triage: 51% of errors in 25% of queries.
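
One reading of the triage figure is that the 25% of queries with the highest uncertainty scores contain 51% of all errors. The sketch below shows how such a capture rate could be computed; the function name `error_capture`, the `budget` parameter, and the toy data are hypothetical, and the scores here are arbitrary numbers rather than CLUES outputs.

```python
import math


def error_capture(scores, is_error, budget=0.25):
    """Fraction of all errors found among the top-`budget` scored queries.

    scores   : uncertainty score per query (higher = more suspicious)
    is_error : 1 if the generated SQL was wrong, else 0
    budget   : share of queries sent for manual review
    """
    if len(scores) != len(is_error):
        raise ValueError("scores and is_error must have the same length")
    total_errors = sum(is_error)
    if total_errors == 0:
        return 0.0
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = ranked[:math.ceil(budget * len(scores))]
    return sum(is_error[i] for i in flagged) / total_errors


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.7, 0.4, 0.05]
    is_error = [1, 0, 0, 0, 1, 1, 0, 0]
    print(error_capture(scores, is_error, budget=0.25))
    print(error_capture(scores, is_error, budget=0.5))
```

A score that ranked queries at random would capture about 25% of errors at a 25% budget, so the gap above that baseline is what makes triage worthwhile.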

Papers Benchmarks Evals

arXiv cs.AI·May 19

ShareChat: A Dataset of Chatbot Conversations in the Wild

ShareChat is a corpus of 142,808 conversations (660,293 turns) collected from ChatGPT, Perplexity, Grok, Gemini, and Claude between April 2023 and October 2025. The dataset preserves native affordances (citations, reasoning traces, code artifacts) across 95 languages and enables analysis of cross-platform differences in user satisfaction, citation strategies, and response latency.

Benchmarks Evals GPT

arXiv cs.AI·May 19

SignRoundV2: Toward Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs

SignRoundV2 is a post-training quantization framework for LLMs maintaining performance under extreme compression (2-4 bits). It combines adaptive mixed-precision strategy guided by gradients and lightweight stabilization techniques. Results show ~1% performance gap at 4.5 bits average in mixed MXFP, with substantial improvements in challenging 2-bit weight-only quantization.

Fine-tuning Benchmarks Infrastructure

arXiv cs.CL·May 19

Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs

arXiv paper introducing a trace-optional evaluation protocol decomposing token efficiency of reasoning LLMs. Analyzes 14 open-weight models on CogniLoad, GSM8K, ProofWriter, ZebraLogic by separating completion rate, conditional correctness, and generated length. Identifies three failure modes: logic-limited, context-limited, or verbosity-limited.

Reasoning Evals Benchmarks

arXiv cs.AI·May 19

GraphMind: Theorem Selection and Conclusion Generation Framework with Dynamic GNN for LLM Reasoning

GraphMind combines GNN and LLM for multi-step mathematical reasoning. The framework models the proof process as an evolving heterogeneous graph where nodes (conditions, theorems, conclusions) and edges (logical dependencies) enable context-aware theorem selection and iterative conclusion generation. Improved results on QA benchmarks.

Reasoning AI Agents Benchmarks

arXiv cs.AI·May 19

The Journal of Prompt-Engineered (Moral) Philosophy Or: Why AI-Assisted Ethics Research Requires Process Transparency

Paper on transparency requirements in AI-assisted ethics research. Authors argue output-only evaluation is insufficient; they propose a documentation-adequacy framework grounded in agent-integrity, comprising declaration, navigation, documentation account, process documentation, and development records. The paper itself demonstrates the framework with persistent archival.

Prompt engineering AI safety Alignment

arXiv cs.AI·May 19

Tongyi DeepResearch Technical Report

Tongyi DeepResearch is a 30.5B-parameter agentic LLM (3.3B activated per token) designed for autonomous long-horizon research tasks. Trained via agentic mid-training and post-training with automatic data synthesis, it achieves SOTA on Humanity's Last Exam, BrowseComp, WebWalkerQA and other benchmarks. Model, framework and solutions are open-sourced.

AI Agents Reasoning Benchmarks

arXiv cs.CL·May 19

Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate

SDRL (Self-Debate Reinforcement Learning) trains LLMs both to solve problems standalone and to benefit from multi-agent debate (MAD). The model samples multiple solutions, constructs a debate context with diverse reasoning paths, then jointly optimizes initial and debate-conditioned responses. Results: consistent MAD performance gains across benchmarks and agent configurations.

Reasoning Reinforcement learning Multi-agent

arXiv cs.AI·May 19

DecoupleSearch: Decouple Planning and Search via Hierarchical Reward Modeling

DecoupleSearch decouples planning and search in agentic RAG systems using dual value models. A reasoning tree is constructed with Monte Carlo Tree Search to assess each step quality. Hierarchical Beam Search iteratively refines planning and search candidates during inference.

AI Agents RAG Reasoning

arXiv cs.AI·May 19

Automated Coding of Communication Data Using ChatGPT: Consistency Across Subgroups

arXiv study evaluating ChatGPT's consistency in coding communication data across demographic groups (gender, race). Authors adapt an automated scoring framework and test ChatGPT on three collaborative task types. Finding: ChatGPT coding shows consistency comparable to human raters across groups.

GPT Evals AI safety

arXiv cs.AI·May 19

ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

ProfBench introduces a benchmark of 7000+ response-criterion pairs evaluated by domain experts (Physics/Chemistry PhDs, Finance/Consulting MBAs). Top models like GPT-5-high achieve only 65.9% performance. Authors develop robust LLM-Judges reducing evaluation costs by 2-3 orders of magnitude.

Benchmarks Evals GPT

arXiv cs.CL·May 19

SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking

SynCABEL uses LLMs to generate contextualized synthetic training data to address the scarcity of expert annotations in biomedical entity linking. The framework achieves SOTA results on MedMentions (English), QUAERO (French), and SPACCC (Spanish), reaching full human supervision performance with 60% less annotated data.

Papers Benchmarks Fine-tuning

arXiv cs.AI·May 19

Prior Knowledge Makes It Possible: From Sublinear Graph Algorithms to LLM Test-Time Methods

Theoretical paper formulating multi-step reasoning as s-t connectivity on knowledge graphs. It shows a phase transition: if pre-training knowledge is fragmented into small components, augmentation requires Ω(√n) queries; once a density threshold is crossed and a giant component forms, a constant expected number of queries suffices.
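
The giant-component transition can be seen in a small simulation. The sketch below uses a uniformly random graph as a stand-in for a knowledge graph, which is an assumption made here for illustration and not the paper's model; `largest_component_fraction` is a hypothetical helper. In this random-graph setting, the largest component stays tiny when the mean degree is below 1 and spans most nodes once it is well above 1.

```python
import random


def largest_component_fraction(n, mean_degree, seed=0):
    """Share of nodes in the largest component of a random graph.

    Adds about mean_degree * n / 2 uniformly random edges to n nodes and
    tracks connected components with a union-find structure.
    """
    rng = random.Random(seed)
    parent = list(range(n))

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _ in range(int(mean_degree * n / 2)):
        a, b = find(rng.randrange(n)), find(rng.randrange(n))
        if a != b:
            parent[a] = b

    sizes = {}
    for v in range(n):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n


if __name__ == "__main__":
    for c in (0.5, 1.0, 1.5, 3.0):
        frac = largest_component_fraction(2000, c)
        print(f"mean degree {c}: largest component holds {frac:.1%} of nodes")
```

When most facts already sit in one large component, a random pair (s, t) is likely connected without any augmentation, which gives some intuition for why few queries suffice above the threshold.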

RAG Reasoning Papers

arXiv cs.AI·May 19

EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

EvolveR is a framework enabling LLM agents to learn from their own experiences through a closed-loop lifecycle. It combines offline self-distillation (synthesizing interaction trajectories into reusable strategic principles) and online interaction (actively retrieving distilled principles to guide decisions). Tested on complex multi-hop QA benchmarks, it outperforms existing agentic baselines.

AI Agents Reinforcement learning Reasoning

arXiv cs.AI·May 19

Extracting latent representations from X-ray spectra. Classification, regression, and accretion signatures of Chandra sources

Transformer autoencoder compresses Chandra X-ray spectra into 8D latent representation. Classification of 8 astrophysical source types: ~40% balanced accuracy overall, ~69% on AGN/compact objects. Latent features correlate with spectral and temporal properties, capturing physical information as effectively as hand-engineered features.

Benchmarks Vision

arXiv cs.AI·May 19

NeuroRVQ: Multi-Scale Biosignal Tokenization for Generative Foundation Models

NeuroRVQ is a modality-adaptive biosignal tokenizer family (EEG, ECG, EMG) using multi-scale temporal convolutions and hierarchical RVQ codebooks to preserve high-frequency dynamics. NeuroRVQ-FM foundation models trained with masked-token prediction achieve competitive or superior performance versus existing modality-specific models.

Papers Benchmarks Embeddings

arXiv cs.CL·May 19

UbuntuGuard: A Culturally-Grounded Policy Benchmark for Equitable AI Safety in African Languages

UbuntuGuard is the first policy-based safety benchmark for African languages. Built with 155 domain experts, it evaluates 15 models (7 general-purpose LLMs, 8 guardian models) across three variants. Findings show that English-centric benchmarks overestimate real-world multilingual safety and that cross-lingual transfer remains insufficient.

AI safety Benchmarks Evals

arXiv cs.AI·May 19

Evaluating Language Models' Evaluations of Games

arXiv study comparing game evaluations by language and reasoning models against human judgments. The dataset covers 100+ board games and 450+ human evaluations. Reasoning models align better with humans but show a non-monotonic relationship: as models approach game-theoretic optimality, their fit to human data weakens.

Reasoning Evals Benchmarks

arXiv cs.AI·May 19

Adversarial Agent Collaboration for Correctness Improvements of C to Safe Rust Translation

ACToR, an adversarial multi-agent loop, improves C-to-Rust translation using a translator agent and discriminator agent that iteratively compete. On 63 real-world C utilities (avg 473 LOC), the system achieves 90% test pass rate with zero human intervention, improving correctness by 36.7% over non-adversarial baselines.

AI Agents Multi-agent Code generation

arXiv cs.AI·May 19

RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

RLBFF combines human feedback and verifiable rewards for reward model training. The method extracts binary principles from natural language feedback (e.g., accuracy, code readability) and uses them as entailment tasks. Models achieve 86.2% on RM-Bench and 81.4% on JudgeBench (#1 as of September 2025). Qwen3-32B aligned with RLBFF matches o3-mini and DeepSeek R1 at <5% inference cost.

Reinforcement learning Evals Alignment

arXiv cs.AI·May 19

FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation

FUNCanon breaks down long-horizon manipulation tasks into action sequences (actor-verb-object) and canonicalizes objects by functional affordances using VLM cues. FuncDiffuser, an object-centric and action-centric diffusion policy, learns on aligned data to generalize across object categories and enable cross-task behavior reuse.

Robotics Vision AI Agents

arXiv cs.AI·May 19

CoUn: Empowering Machine Unlearning via Contrastive Learning

CoUn is a machine unlearning method using contrastive learning to remove the influence of specific data from trained models. The technique adjusts learned representations using only retain data, outperforming existing label manipulation and weight perturbation baselines across multiple datasets and architectures.

Reinforcement learning Papers Benchmarks

arXiv cs.AI·May 19

Individual utilities of life satisfaction reveal inequality aversion unrelated to political alignment

Experimental study with 300 UK participants measuring individual preferences for well-being and fairness. Utility functions estimated via Expected Utility Maximisation reveal inequality aversion unrelated to political alignment. Results challenge average life satisfaction as a policy metric and support nonlinear utility-based alternatives for value-aligned AI systems.

Alignment AI safety

arXiv cs.AI·May 19

FediLoRA: Practical Federated Fine-Tuning of Foundation Models Under Missing-Modality Constraints

FediLoRA introduces a federated LoRA fine-tuning framework for vision-language models (VLLMs) addressing imbalanced LoRA ranks from heterogeneous resources and missing modalities from user errors or device failures. The method combines simple averaging with structured editing, validated on general-domain and medical-domain benchmarks.

Fine-tuning Vision Papers

arXiv cs.CL·May 19

Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage

Colluding LLM agents manipulate victim beliefs by coordinating truthful evidence fragments through public channels without covert communication. The Generative Montage framework (Writer-Editor-Director) constructs deceptive narratives via adversarial debate. Attack success rates reach 74.4% on proprietary models and 70.6% on open-weights across 14 LLM families. Advanced reasoning models show higher susceptibility.

AI Agents Multi-agent AI safety

arXiv cs.AI·May 19

Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity

Theoretical investigation of loss of plasticity (LoP) in deep learning under non-stationary environments. Authors identify two primary mechanisms: activation saturation and representational redundancy creating traps in parameter space. Paradox: properties promoting static generalization (low-rank representations) worsen LoP in continual learning.

Reinforcement learning Papers Alignment

arXiv cs.AI·May 19

Ordinal Adaptive Correction: A Data-Centric Approach to Ordinal Image Classification with Noisy Labels

ORDAC, a data-centric method, corrects noisy labels in ordinal image classification using Label Distribution Learning. Tested on Adience (age estimation) and Diabetic Retinopathy (disease severity), ORDAC_R reduces mean absolute error from 0.86 to 0.62 with 40% noise.

Vision Evals AI safety

arXiv cs.AI·May 19

FedKLPR: KL-Guided Pruning-Aware Federated Learning for Person Re-Identification

FedKLPR is a federated learning framework for person re-identification combining KL-Divergence-guided training to handle statistical heterogeneity, unstructured pruning for communication efficiency (40-42% reduction on ResNet-50), and Cross-Round Recovery for adaptive compression control. Evaluated on 8 benchmark datasets.

Papers Benchmarks

SIG

HYP

arXiv cs.CL·May 19

SignRoundV2: Toward Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs

SignRoundV2 is a post-training quantization framework for LLMs that maintains performance under extreme compression (2-4 bits). It combines an adaptive mixed-precision strategy guided by gradient information with lightweight stabilization techniques. Results show a ~1% performance gap at an average of 4.5 bits in mixed MXFP settings, with substantial improvements in 2-bit weight-only quantization.

Fine-tuning Benchmarks Open source

SIG

HYP

arXiv cs.CL·May 19

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Evo-Memory is a benchmark for evaluating self-evolving memory in LLM agents. It structures data into sequential task streams, testing models' ability to search, adapt, and update memory after each interaction. Authors implement 10+ memory modules and propose ExpRAG and ReMem to improve experience reuse.

AI Agents Benchmarks RAG

SIG

HYP

arXiv cs.AI·May 19

OPERA: A Reinforcement Learning–Enhanced Orchestrated Planner-Executor Architecture for Reasoning-Oriented Multi-Hop Retrieval

OPERA is a retrieval-augmented generation (RAG) architecture coupling planning and execution via reinforcement learning. A Goal Planning Module decomposes complex questions into sub-goals, executed by a Reason-Execute Module with specialized components for reasoning and retrieval. Training uses MAPGRPO, a GRPO variant. Superior results on complex multi-hop benchmarks.

RAG Reinforcement learning Reasoning

SIG

HYP

arXiv cs.AI·May 19

Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs

A new approach to generating formal theorem-proving challenges by leveraging theoretical computer science (TCS). The framework automatically synthesizes problem-proof pairs in Lean4 and Markdown across two domains: Busy Beaver and Mixed Boolean Arithmetic. DeepSeek-Prover-V2-671B achieves 57.5% on Busy Beaver but only 12% on Mixed Boolean Arithmetic, revealing major gaps in long-form proof generation.

Reasoning Benchmarks Papers

SIG

HYP

arXiv cs.AI·May 19

LSDTs: LLM-Augmented Semantic Digital Twins for Adaptive Knowledge-Intensive Infrastructure Planning

LSDTs combine LLMs with Digital Twins to extract planning knowledge from unstructured documents (regulations, technical guidelines) and organize it into formal ontologies. A case study on offshore wind farm planning in Maryland demonstrates regulation-aware layout optimization and high-fidelity simulation capabilities.

RAG AI Agents Reasoning

SIG

HYP

arXiv cs.AI·May 19

Fourier Compressor: Frequency-Domain Visual Token Compression for Vision-Language Models

Fourier Compressor compresses visual tokens in Vision-Language Models using Fourier transforms. The parameter-free method reduces FLOPs by 83.8% and boosts inference speed by 31.2% while retaining 96% of original accuracy. Tested on LLaVA and Qwen-VL, it generalizes to video understanding tasks.

Vision Benchmarks Infrastructure

SIG

HYP

arXiv cs.AI·May 19

Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

Novel data selection strategy for LLM alignment based on DPO implicit reward gap. Method selects harder examples (smaller reward gaps) and achieves superior performance with only 10% of original data across multiple benchmarks.
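
The selection rule can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the standard DPO implicit reward, r(x, y) = β · (log π(y|x) − log π_ref(y|x)), and ranks pairs by the signed gap between chosen and rejected rewards. The `PreferencePair` container, the function names, and the β default are hypothetical; only the "smaller gap" criterion and the 10% budget come from the summary above.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    # Sequence log-probabilities log p(y | x), summed over response tokens.
    policy_chosen: float
    policy_rejected: float
    ref_chosen: float
    ref_rejected: float


def implicit_reward_gap(pair: PreferencePair, beta: float = 0.1) -> float:
    """Implicit DPO reward of the chosen response minus that of the rejected one."""
    reward_chosen = beta * (pair.policy_chosen - pair.ref_chosen)
    reward_rejected = beta * (pair.policy_rejected - pair.ref_rejected)
    return reward_chosen - reward_rejected


def select_hardest(pairs, fraction: float = 0.1, beta: float = 0.1):
    """Keep the `fraction` of pairs with the smallest gap (assumed to be the hardest)."""
    keep = max(1, int(len(pairs) * fraction))
    return sorted(pairs, key=lambda p: implicit_reward_gap(p, beta))[:keep]
```

Whether the paper ranks by signed or absolute gap is not stated in the summary; sorting by the signed value is an assumption of this sketch.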

Reinforcement learning Alignment Evals

SIG

HYP

arXiv cs.AI·May 19

Perovskite-R1: a domain-specialized large language model for intelligent discovery of precursor additives and experimental design

Perovskite-R1 is a specialized LLM based on QwQ-32B, fine-tuned on 1,232 scientific publications and 33,269 candidate materials to discover precursor additives optimizing perovskite solar cells. The model generates solutions for defect passivation and improves stability/performance, experimentally validated.

Qwen Fine-tuning Reasoning

SIG

HYP

arXiv cs.AI·May 19

Missing-Modality-Aware Graph Neural Network for Cancer Classification

MAGNET, a graph neural network, handles incomplete multimodal biological data for cancer classification. The model uses a dynamic multi-head attention mechanism to fuse modality embeddings with missing patterns, achieving linear complexity. Tested on three public multiomics datasets, MAGNET outperforms existing fusion methods.

Papers Benchmarks Vision

SIG

HYP

arXiv cs.AI·May 19

Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource

arXiv paper shows Mixture-of-Experts (MoE) models outperform dense architectures under strictly equal resource constraints (identical total parameters, training compute, data budget). Researchers identify an optimal activation rate region consistent across model sizes. Validated on ~200 2B-scale and 50 7B-scale models (50 trillion tokens processed).

Benchmarks Papers Reasoning

SIG

HYP

arXiv cs.CL·May 19

Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

EvoSynth, an autonomous multi-agent framework, optimizes jailbreak attacks in executable code space rather than prompt space. The system iteratively evolves and self-corrects code-based attack algorithms. Results: 85.5% Attack Success Rate against Claude-Sonnet-4.5, 95.9% average ASR across evaluated targets.

AI Agents Multi-agent Claude

SIG

HYP

arXiv cs.AI·May 19

Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving

FastDrive, a compact 0.9B-parameter VLM, outperforms 7B+ models (LLaVA-1.5) on autonomous driving tasks. Trained on NuScenes-S, a benchmark with structured representations, it achieves +20% accuracy on decision-making with 10x inference speedup.

Vision Reasoning Benchmarks

SIG

HYP

arXiv cs.AI·May 19

HTSC-2025: A Benchmark Dataset of Ambient-Pressure High-Temperature Superconductors for AI-Driven Critical Temperature Prediction

HTSC-2025 is an open-source benchmark of high-temperature superconducting materials discovered 2023-2025 (X₂YH₆ systems, MXH₃ perovskites, M₃XH₈, BCN-doped cage structures, 2D honeycomb). Addresses the lack of standardized datasets for fair comparison of AI algorithms predicting critical transition temperatures.

Benchmarks Papers Open source

SIG

HYP

arXiv cs.CL·May 19

T-FIX: Text-Based Explanations with Features Interpretable to eXperts

T-FIX is an evaluation framework to measure alignment of LLM-generated explanations with expert reasoning in specialized domains (surgery, astronomy, therapy). Covers seven scientific tasks across three domains with expert-defined criteria. Enables automatic, generalizable evaluation without ongoing expert annotation.

Evals Reasoning AI safety

SIG

HYP

arXiv cs.AI·May 19

RAP: Runtime Adaptive Pruning for LLM Inference

RAP is an elastic pruning framework for LLM inference that uses reinforcement learning to dynamically adapt compression strategies to runtime memory variations and heterogeneous KV-cache demands. The RL agent optimizes the parameter-to-KV-cache ratio in real time, retaining only components that maximize utility within the current memory budget.

Reinforcement learning Infrastructure Benchmarks

SIG

HYP

arXiv cs.AI·May 19

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

DriveMoE proposes a Mixture-of-Experts architecture for end-to-end autonomous driving. The model combines Vision MoE (dynamic camera selection based on driving context) and Action MoE (specialized expert activation for different behaviors). Built on the Drive-π₀ baseline, DriveMoE achieves SOTA on Bench2Drive by avoiding mode averaging.

Vision AI Agents Papers

SIG

HYP

arXiv cs.AI·May 19

InvDesFlow-AL: active learning-based workflow for inverse design of functional materials

InvDesFlow-AL combines diffusion and active learning for inverse design of functional materials. The model achieves an RMSE of 0.0423 Å in crystal structure prediction (a 32.96% improvement over existing methods) and systematically generates low-formation-energy materials. Validation: discovery of Li₂AuH₆ as a BCS superconductor at 140 K.

Papers Benchmarks Reinforcement learning

SIG

HYP

arXiv cs.CL·May 19

Beyond Neural Incompatibility: Cross-Scale Knowledge Transfer in Language Models through Latent Semantic Alignment

SemAlign enables cross-scale knowledge transfer between language models via latent semantic alignment. Instead of copying parameters directly, the method uses activations as the transfer medium, pairing source and target layers and optimizing through semantic supervision. Evaluated on four benchmarks.

Fine-tuning Reasoning Papers

SIG

HYP

arXiv cs.AI·May 19

Sustainability via LLM Right-sizing

Comparative study of 11 LLMs (GPT-4o, Gemma-3, Phi-4, etc.) across 10 common workplace tasks. GPT-4o delivers superior performance but at higher cost and environmental footprint; smaller models (Gemma-3, Phi-4) achieve strong results with better efficiency. Advocates task-aware sufficiency assessments over performance-maximizing benchmarks.

Benchmarks Evals Open source

SIG

HYP

arXiv cs.AI·May 19

Patients Speak, AI Listens: LLM-based Analysis of Online Reviews Uncovers Key Drivers for Urgent Care Satisfaction

arXiv study analyzing 10,000+ Google Maps reviews of urgent care facilities (the D.C.-Maryland-Virginia area and Florida) using GPT prompt engineering for aspect-based sentiment extraction. Findings: interpersonal factors and operational efficiency are the strongest drivers of patient satisfaction; technical quality, finances, and facilities show no significant independent effects. Population density alone shows a modest correlation with ratings.

GPT Prompt engineering RAG

SIG

HYP

arXiv cs.CL·May 19

ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

ProfBench is a benchmark of 7000+ response-criterion pairs evaluated by human experts in physics, chemistry, finance, and consulting. Authors propose robust LLM-judges reducing evaluation cost by 2-3 orders of magnitude. GPT-5-high achieves 65.9% performance, revealing significant gaps between proprietary and open-weight models.

Benchmarks Evals GPT

SIG

HYP

arXiv cs.AI·May 19

Long Context Modeling with Ranked Memory-Augmented Retrieval

ERMAR (Enhanced Ranked Memory Augmented Retrieval) is a framework for effective long-context management in language models. It employs a novel relevance scoring mechanism and pointwise re-ranking model for key-value embeddings, inspired by learning-to-rank techniques. Achieves SOTA results on standard benchmarks with superior scalability and performance.

RAG Reasoning Benchmarks

SIG

HYP

arXiv cs.AI·May 19

LLM-Safety Evaluations Lack Robustness

arXiv paper argues current LLM safety evaluations lack robustness due to small datasets, methodological inconsistencies, and unreliable setups. Systematically analyzes the evaluation pipeline—dataset curation, automated red-teaming, response generation, LLM judges—and proposes guidelines to reduce noise and improve comparability of attack/defense research.

AI safety Alignment Evals

SIG

HYP

arXiv cs.AI·May 19

Adaptive Camera Sensor for Vision Models

Lens, a camera sensor control method, adapts acquisition parameters in real time to improve vision model performance. Using VisiT, a training-free quality indicator based on confidence scores, Lens compensates for domain shift without extensive model modification. The paper also introduces the ImageNet-ES Diverse benchmark.

Vision Benchmarks Evals

SIG

HYP

arXiv cs.CL·May 19

Toward Robust Multilingual Adaptation of LLMs for Low-Resource Languages

LiRA, a lightweight fine-tuning framework, improves LLM adaptation to low-resource languages through cross-lingual semantic alignment. Combines Arca (anchor-based alignment to English) and LaSR (language-aware head). Theoretical stability guarantees. Multilingual dataset (7 Asian languages) and code released.

Fine-tuning Benchmarks

SIG

HYP

arXiv cs.CL·May 19

Merlin's Whisper: Enabling Efficient Reasoning in Large Language Models via Black-box Persuasive Prompting

Whisper, an iterative prompting framework, reduces response length in large reasoning models (LRMs) via black-box persuasive prompting. It yields a 3x reduction on GSM8K for Qwen3 and ~40% average token savings across benchmarks. For Claude-3.7 and Gemini-2.5, response length on MATH-500 drops by 46% to 50%.

Prompt engineering Reasoning Benchmarks

SIG

HYP

arXiv cs.AI·May 19

Uncertainty Quantification as a Principled Foundation for Explainable Artificial Intelligence: A Case Study of Counterfactual Explanations

arXiv paper proposing counterfactual explainability grounded in uncertainty quantification. Authors demonstrate that integrating foundational AI concepts—particularly uncertainty—improves robustness and reliability of explanations, achieving competitive performance despite radically simple design.

SIG

HYP

arXiv cs.AI·May 19

Supervising the search process produces reliable and generalizable information-seeking agents

RAG-Gym, a framework supervising the search process rather than final answers, improves autonomous search agents. Re²Search++, a process-supervised agent, achieves substantial gains on multi-hop information-seeking benchmarks, especially out-of-domain, through higher-quality search queries and better generalization.

AI Agents RAG Reasoning

SIG

HYP

arXiv cs.AI·May 19

Experimentally validated quantum-secure federated learning over a multi-user quantum network

QuNetQFL is a quantum federated learning protocol implemented on quantum networks, masking local model updates with distributed quantum secret keys for information-theoretic security. Experimentally validated on a four-client quantum network, it reduces communication costs by 75% and scales to 200 clients with rapid convergence.
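
The masking idea can be illustrated with a classical stand-in. The sketch below is generic pairwise-mask secure aggregation, not the QuNetQFL protocol itself: keys drawn from Python's `secrets` module take the place of quantum-distributed keys, updates are assumed to be quantized to integers modulo 2**32, and all function names are hypothetical. Each pair of clients shares one key vector, which one client adds and the other subtracts, so the server sees only masked updates while the masks cancel in the sum.

```python
import secrets

MODULUS = 2**32  # assumption: model updates are quantized to integers mod 2**32


def pairwise_keys(num_clients: int, length: int) -> dict:
    """One shared random key vector per client pair (i, j) with i < j."""
    return {
        (i, j): [secrets.randbelow(MODULUS) for _ in range(length)]
        for i in range(num_clients)
        for j in range(i + 1, num_clients)
    }


def mask_update(client: int, update: list, keys: dict, num_clients: int) -> list:
    """Add keys shared with higher-indexed peers, subtract those shared with lower ones."""
    masked = list(update)
    for peer in range(num_clients):
        if peer == client:
            continue
        key = keys[(min(client, peer), max(client, peer))]
        sign = 1 if client < peer else -1
        masked = [(m + sign * k) % MODULUS for m, k in zip(masked, key)]
    return masked


def aggregate(masked_updates: list) -> list:
    """Sum the masked updates; every pairwise mask cancels in the total."""
    return [sum(column) % MODULUS for column in zip(*masked_updates)]
```

With four clients, as in the experiment described above, the server would call `aggregate` on the four masked vectors and recover only the element-wise sum of the original updates.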

AI safety Papers

SIG

HYP

arXiv cs.AI·May 19

LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation

LightTransfer converts language models (LLaMA, Mistral, QwQ-STILL) into hybrid architectures without training. The method identifies lazy layers and replaces full attention with streaming attention, reducing KV cache costs. Results: up to 2.17× throughput improvement with <1.5% performance loss on LongBench, and a 53.3% score on AIME24.

Llama Mistral Qwen

SIG

HYP

arXiv cs.AI·May 19

Leveraging Unsupervised Learning for Cost-Effective Visual Anomaly Detection

Visual anomaly detection system using unsupervised learning on Raspberry Pi. Training and inference take 90 seconds with 10 normal images, with an F1 score >0.95. Deployment via Anomalib and OpenVINO targets SMEs.

Vision Open source Tools

SIG

HYP

arXiv cs.CL·May 19

Residual Semantic Decomposition of Word Embeddings

Residual Semantic Decomposition (RSD) recursively decomposes word embeddings into local semantic axes via a neural additive approach. On ambiguous words, RSD separates supplied contexts from shuffled controls, but entropy diagnostics show that static GloVe does not uniformly place ambiguous words at high-entropy boundaries.

Embeddings Papers

SIG

HYP

arXiv cs.CL·May 19

FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

FinAuditing is a financial auditing benchmark built from 1,102 real XBRL instances (33k tokens average). It evaluates 13 LLMs on three tasks: Financial Semantic Matching, Financial Relationship Extraction, and Financial Mathematical Reasoning. Results reveal substantial gaps in concept retrieval and cross-document reasoning.

Benchmarks Reasoning Evals

SIG

HYP