May 2026

3149 articles

Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

VideoDR is the first benchmark for open-domain video question answering, combining cross-frame visual extraction, iterative web retrieval, and multi-hop reasoning. Evaluation of multimodal models (closed/open-source) shows Agentic paradigm is not consistently superior to Workflow; key challenges are goal drift and long-horizon consistency.

AI Agents Vision Reasoning

SIG

HYP

arXiv cs.AI·May 19

Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception

Speech-Hands is a voice-agentic framework learning when to trust its predictions versus consulting external audio perception. The model reduces WER by 12.1% across 7 OpenASR benchmarks and achieves 77.37% accuracy on audio QA, using a self-reflection mechanism to avoid noisy hypotheses.

AI Agents Voice Reasoning

SIG

HYP

arXiv cs.AI·May 19

"The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework

COMPACT, a multi-teacher CoT distillation framework, adaptively fuses supervisions from multiple LLMs into compact student models. It dynamically weights teacher gradients using three metrics: graph-based consensus, mutual-information-based adaptability, and loss-based difficulty. Achieves SOTA results without catastrophic forgetting.

Reasoning Fine-tuning Papers

SIG

HYP

arXiv cs.CL·May 19

Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

Researchers introduce IBPO (Implicit Behavior Policy Optimization), a credit assignment method for reinforcement learning with LLMs. By comparing multiple reasoning trajectories, the framework transforms sparse terminal rewards into step-sensitive learning signals, reducing gradient variance and improving stability on mathematical and code reasoning benchmarks.

Reinforcement learning Reasoning Code generation

SIG

HYP

arXiv cs.AI·May 19

OmniCode: A Benchmark for Evaluating Software Engineering Agents

OmniCode is a benchmark for evaluating AI agents on software engineering tasks. It contains 1794 tasks across Python, Java, and C++ covering bug fixing, test generation, code review fixing, and style fixing. Evaluations show SWE-Agent achieves only 25% on C++ test generation with DeepSeek-V3.1.

Benchmarks Code generation AI Agents

SIG

HYP

arXiv cs.CL·May 19

Protection Is (Nearly) All You Need: Structural Protection Dominates Scoring in Globally Capped KV Eviction

Study of KV cache eviction policies (LRU, H2O, SnapKV, StreamingLLM, Ada-KV, QUEST, Random) under global cap. Without structural boundary protection, all collapse to F1≤0.064. Reserving 10% cache at each boundary recovers 69–90% quality on LongBench at C=256 (13% retention). Position-0 holds ~75% attention mass; protecting structurally critical tokens dominates over scoring differences.

Reasoning Benchmarks Papers

SIG

HYP

arXiv cs.AI·May 19

Multi-layer Cross-attention is Provably Optimal for Multi-modal In-context Learning

Theoretical study proving multi-layer cross-attention is optimal for multi-modal in-context learning. Authors show single-layer linear self-attention fails to recover Bayes-optimal predictor, but linearized cross-attention mechanism achieves Bayes optimality with gradient flow.

Reasoning Papers Benchmarks

SIG

HYP

arXiv cs.AI·May 19

SuReNav: Superpixel Graph-based Constraint Relaxation for Navigation in Over-constrained Environments

SuReNav proposes a superpixel graph-based navigation method for over-constrained environments. The system combines constraint map generation, relaxation via GNN trained on human demonstrations, and interleaved execution. Evaluated on 2D/3D OpenStreetMap maps and Spot quadruped robot, it achieves highest human-likeness score while balancing safety and efficiency.

AI Agents Robotics Papers

SIG

HYP

arXiv cs.AI·May 19

GRAFT: Decoupling Ranking and Calibration for Survival Analysis

GRAFT is a hybrid AFT model for survival analysis that decouples prognostic ranking from calibration of survival estimates. It combines a linear AFT model with a non-linear residual neural network and stochastic gates for feature selection. Trained on C-index-aligned ranking loss with conditional imputation, it outperforms baselines in discrimination and calibration.

Papers Benchmarks Reasoning

SIG

HYP

arXiv cs.CL·May 19

WASIL: In-the-Wild Arabic Spoken Interactions with LLMs

WASIL is a dataset of real-world Arabic voice interactions with LLMs: 8,529 turns with audio, ASR hypotheses, assistant responses, and like/dislike feedback (14.2% dislikes). Includes 2,000 test turns covering Modern Standard Arabic and 4 major dialects. Answerability annotations separate ASR errors from intrinsic limitations.

Voice Benchmarks Evals

SIG

HYP

arXiv cs.AI·May 19

Perception-based Image Denoising via Generative Compression

Paper proposes generative compression framework for perception-based image denoising. Two approaches: conditional WGAN-based denoiser explicitly controlling rate-distortion-perception trade-off, and conditional diffusion-based iterative reconstruction guided by compressed latents. Theoretical guarantees and perceptual improvements demonstrated on synthetic and real-noise benchmarks.

Image generation Papers Benchmarks

SIG

HYP

arXiv cs.AI·May 19

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

OptimusVLA, a hierarchical Vision-Language-Action model, improves robotic manipulation via two memories: Global Prior Memory (replaces Gaussian noise with trajectory priors) and Local Consistency Memory (enforces temporal coherence). Results: 98.6% on LIBERO, +13.5% vs pi_0 on CALVIN, 2.9x inference speedup.

Vision Robotics AI Agents

SIG

HYP

arXiv cs.AI·May 19

Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving

Systematic investigation of diffusion models for end-to-end autonomous driving (E2E AD). Hyper Diffusion Planner (HDP) framework achieves 10x performance improvement over base model, tested on 200 km real-world driving and 6 urban scenarios. Includes reinforcement learning post-training to enhance safety and robustness.

Reinforcement learning Robotics

SIG

HYP

arXiv cs.AI·May 19

Forgetting is Competition: Rethinking Unlearning as Representation Interference in Diffusion Models

SurgUn, an unlearning method for text-to-image diffusion models, treats forgetting as controlled competition rather than direct deletion. Using target-gradient ascent and descent over semantically diverse distractors, it reduces erase-retain imbalance and limits collateral damage. Tested on Stable Diffusion v1.5, SDXL, and SANA-1.5.

Image generation AI safety Alignment

SIG

HYP

arXiv cs.AI·May 19

Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain

Self-evolution loops in LLMs plateau when they fail to generate learnable information. This study identifies three roles (Proposer, Solver, Verifier) and three system designs (asymmetric co-evolution, capacity growth, proactive information seeking) to sustain information gain across iterations on coding tasks.

Reasoning Reinforcement learning Code generation

SIG

HYP

arXiv cs.AI·May 19

Old Habits Die Hard: How Conversational History Geometrically Traps LLMs

History-Echoes framework investigates how conversational history biases LLM outputs. Using Markov chain modeling and geometric analysis of hidden representations across three model families and six datasets, researchers show behavioral persistence manifests as a geometric trap constraining model trajectories. Code released.

Reasoning Evals Alignment

SIG

HYP

arXiv cs.AI·May 19

Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution

Med-V1 is a family of 3-billion-parameter language models trained on synthetic data for biomedical evidence attribution. It outperforms base models by +27% to +71% on five benchmarks and matches frontier LLMs like GPT-5, while detecting hallucinations and misattributions in clinical guidelines.

Benchmarks Fine-tuning AI safety

SIG

HYP

arXiv cs.AI·May 19

Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

Vocabulary adaptation approach to improve LLM efficiency on specialized domains (legal, medical). Combines tokenizer adaptation with selective pretraining on Llama-3.1-8B and Qwen2.5-7B. Reduces training time by 35-55% and parameters by 37% vs expansion-only methods.

Llama Qwen Fine-tuning

SIG

HYP

arXiv cs.AI·May 19

Toward Template-Free Explainability for Monte Carlo Tree Search

Framework enabling LLMs to generate evidence-grounded explanations of MCTS decisions from search traces end-to-end, without hand-crafted formal logic constraints. Maps natural-language questions to intent categories, triggers targeted tree expansion when needed, and generates explanations using visit counts, value estimates, and risk information.

Reasoning Evals

SIG

HYP

arXiv cs.AI·May 19

No Plan, Yet Human: A Reactive Robotics Model Predicts Human Planning Failures on a Clinical Task

AICON, a reactive robotics model using gradient descent, better predicts human planning failures on the Tower of London cognitive test than planning baselines. Without lookahead, it reproduces the difficulty ordering of 24 problems and fails similarly to Parkinson's patients, suggesting reduced planning capacity shifts behavior toward reactive modes.

Robotics Reasoning Papers

SIG

HYP

arXiv cs.AI·May 19

\textsc{MasFACT}: Continual Multi-Agent Topology Learning via Geometry-Aware Posterior Transfer

MasFACT introduces a geometry-aware posterior transfer framework for LLM-powered multi-agent systems. It addresses topology forgetting by preserving historical collaboration structures when adapting to new tasks, using Fused Gromov-Wasserstein optimal transport and PAC-Bayes-guided conservative posterior adaptation.

Multi-agent AI Agents Llama

SIG

HYP

arXiv cs.AI·May 19

Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment

Novel black-box, single-sample membership inference attack against Vision-Language Models. Exploits cross-modal semantic alignment: training data exhibits stronger image-caption alignment than non-members. Achieves AUC 0.821 against LLaVA-1.5 on VL-MIA/Flickr dataset.

Vision AI safety Benchmarks

SIG

HYP

arXiv cs.AI·May 19

OmniVL-Guard Pro: A Tool-Augmented Agent for Omnibus Vision-Language Forensics

OmniVL-Guard Pro is a tool-augmented agent for open-world vision-language forgery detection. It integrates real-time event search, face detection, video frame extraction, and SAM3-based segmentation. A FSTR dataset and checker-guided agentic RL (CGARL) improve multi-step reasoning and achieve state-of-the-art performance.

AI Agents Vision Reasoning

SIG

HYP

arXiv cs.AI·May 19

Policy-Grounded Dynamic Facet Suggestions for Job Search

LinkedIn presents dynamic facet suggestion (DFS) for refining job search queries. 80% of queries contain ≤3 keywords. System combines semantic retrieval, distilled small language model scoring, and real-time ranking to disambiguate user intent. Online A/B tests show significant improvements in engagement and search outcomes.

RAG Embeddings Evals

SIG

HYP

arXiv cs.AI·May 19

Strategic Over-Parameterization for Generalizable Low-Rank Adaptation

LoRA-Over improves parameter-efficient fine-tuning (PEFT) by enriching the optimization landscape during training via auxiliary over-parameterization, then collapsing this enrichment into standard LoRA structure at inference. Evaluated on GLUE, MT-Bench, GSM8K, and HumanEval with LLaMA 2-7B and 3.1-8B, the framework consistently outperforms vanilla LoRA with no additional inference cost.

Fine-tuning Llama Benchmarks

SIG

HYP

arXiv cs.AI·May 19

LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails

LPG (Latent Policy Guardrail) is a safety framework for LLMs that adapts security policies at inference time without retraining. It compresses reasoning into 10 latent tokens, achieves 84.5% accuracy and 77.9% F1 on benchmarks, while running 11× faster than Qwen3-4B-Thinking.

AI safety Alignment Reasoning

SIG

HYP

arXiv cs.AI·May 19

MHMamba: Multi-Head Mamba for 3D Brain Tumor Segmentation

MHMamba combines a U-shaped architecture with a multi-head state-space model (Mamba) for 3D brain tumor segmentation in MRI. The method preserves Mamba's linear complexity while improving long-range dependency modeling and multimodal training stability. Experiments on BraTS2021/2023 demonstrate gains in overall accuracy, boundary smoothness, and small-lesion detection.

Vision Benchmarks

SIG

HYP

arXiv cs.AI·May 19

Alignment Drift in Long-Term Human-LLM Interaction: A Mechanism-Oriented Framework

Study of alignment drift: gradual process where LLM outputs become less constrained by current user message and more shaped by interaction history, while remaining coherent. Proposed mechanism-oriented framework distinguishes signals A/B, explains feedback loops and sub-pattern selection across three interactional regimes.

Alignment AI safety Papers

SIG

HYP

arXiv cs.AI·May 19

Asking Back: Interaction-Layer Antidistillation Watermarks

New watermarking approach against unauthorized LLM distillation: behavioral markers (follow-up questions, low-frequency variants, restatements) injected via system prompt. Tested on 63 LoRA-distilled models from Llama-3.3-70B, with transfer rates 88.9% (Gemma) to 45.2% (Qwen). Robustness validated against DIPPER paraphrasing and user study (N=20) confirming imperceptibility.

AI safety Alignment Llama

SIG

HYP

arXiv cs.AI·May 19

Conservative AI for Safety-Sensitive Medical Image Restoration: Residual-Bounded CT-CTA Enhancement for Intracranial Aneurysm-Relevant Signal Recovery

2.5D residual-bounded image restoration framework for enhancing intracranial CT/CTA without uncontrolled modification of clinically sensitive regions. Model adds learned residual via edit-control map limiting magnitude and spatial extent. On 50 out-of-distribution cases: PSNR 37.51 dB, iatrogenic-edit rate 4.0%, net positive in 85.4% of 1,000 Monte Carlo runs.

Vision AI safety Evals

SIG

HYP

arXiv cs.AI·May 19

Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model

Feature-free initialization method for monocular visual-inertial navigation systems (VINS) using feed-forward 3D models. Reduces initialization time to <1.2s (vs 3-4s), achieves >90% success rate, eliminates visual feature tracking. Code and dataset released.

Vision Robotics Open source

SIG

HYP

arXiv cs.AI·May 19

Physics-Guided Geometric Diffusion for Macro Placement Generation

MacroDiff+ is a physics-guided geometric diffusion framework for macro placement optimization in VLSI physical design. Using a dual-domain architecture (heterogeneous GNNs + Transformer) and physics-guided gradient-based sampling, it achieves 6.1-6.2% wirelength reduction on ISPD2005 benchmarks with superior stability on large-scale designs.

Benchmarks Papers Reasoning

SIG

HYP

arXiv cs.AI·May 19

Nested Spatio-Temporal Time Series Forecasting

Nested spatio-temporal forecasting framework coupling macro-level regional trends with micro-level historical observations. Uses spectral clustering to construct semantically coherent regions, filtering systematic noise while preserving trends. Progressive coarse-to-fine predictor integrates features to anticipate dynamic anomalies. Outperforms state-of-the-art baselines on high-dimensional datasets.

Benchmarks Papers

SIG

HYP

arXiv cs.AI·May 19

Avoiding Structural Failure Modes in Tabular Fair SSL: Online Primal-Dual Allocation under Confidence Gating

arXiv paper identifies two structural failure modes in fair tabular semi-supervised learning: Masking Collapse and Trivial Saturation. Proposes OPDA (Online Primal-Dual Allocation), an adaptive controller that dynamically adjusts fairness penalties without per-dataset tuning. Evaluated on Adult, ACSIncome, COMPAS benchmarks.

Papers Benchmarks AI safety

SIG

HYP

arXiv cs.AI·May 19

Improving MLLM Training Efficiency via Stage-Aware Sparsity

Sparse Training Scheme (STS) improves MLLM training efficiency through stage-aware sparsity: Visual Token Compressor reduces visual token load during modality alignment, Layer Dynamic Skipper skips unnecessary layers during instruction tuning. Framework adapts to varying redundancy across training stages.

Vision Fine-tuning Infrastructure

SIG

HYP

arXiv cs.AI·May 19

Two-Valued Symmetric Circulant Matrices: Applications in Deep Learning

Paper proposes Two-Valued Symmetric Circulant Matrices (TVSCM) to drastically reduce neural network parameters. Achieves 80× parameter reduction (623k→7.8k on MNIST) with minor accuracy loss (97.6%→93.5%). Designed for edge computing and embedded systems.

Fine-tuning Infrastructure Benchmarks

SIG

HYP

arXiv cs.AI·May 19

DeepArrhythmia: Segment-Contextualized ECG Arrhythmia Classification via Selective Evidence Acquisition

DeepArrhythmia is a multimodal framework for beat-level ECG arrhythmia classification. It combines raw signal and waveform image, localizes R peaks, and uses specialized tools for rhythm and morphology extraction. The system selectively routes between minimal and rich evidence states based on segment-level confidence.

AI Agents Vision MCP

SIG

HYP

arXiv cs.AI·May 19

L-Drive: Beyond a Single Mapping-Latent Context Drives Time Series Forecasting

L-Drive introduces a change-aware forecasting framework for multivariate time-series that uses latent context to characterize evolving dynamics and gating mechanisms to modulate representations. Patch-shared relative positional basis functions strengthen structural modeling and reduce overfitting, improving accuracy at regime transitions.

Benchmarks Reasoning

SIG

HYP

arXiv cs.AI·May 19

Weak-to-Strong Elicitation via Mismatched Wrong Drafts

Injecting mathematically wrong drafts from a smaller model (Qwen2.5-Math-1.5B) into stronger learner (Mathstral-7B) GRPO training improves performance on MATH-500 (+1.62pp) and AIME 2025/2026 (+14.2pp at pass@1024). Intentional mismatch between problems and drafts is critical: 71.98% on MATH-500, highest published result for this model.

Reinforcement learning Reasoning Benchmarks

SIG

HYP

arXiv cs.AI·May 19

Byzantine-Resilient Federated Learning via QUBO-Based Client Selection on Quantum Annealers

Quantum annealing approach for selecting trustworthy clients in federated learning against Byzantine attacks. Reformulates client selection as QUBO problem jointly optimizing closest client subsets. MultiSignal ensemble achieves 95.3% detection accuracy at 100 clients on MNIST vs 91.8% for classical MultiKrum.

AI safety Benchmarks Papers

SIG

HYP

arXiv cs.AI·May 19

Edge-AI-Driven Learning-to-Rank for Decentralized Task Allocation in Circular Smart Manufacturing

Decentralized task allocation framework for circular manufacturing using Edge-AI and ranking-aware learning. Each machine evaluates tasks using local information (processing capability, queue state, resource contention). Results: reduced delays, improved deadline adherence, enhanced energy efficiency in discrete-event simulation.

AI Agents Reinforcement learning Infrastructure

SIG

HYP

arXiv cs.AI·May 19

Deep Reinforcement Learning Framework for Diversified Portfolio Management Across Global Equity Markets

Deep reinforcement learning framework for dynamic portfolio allocation across global equity markets. Soft Actor-Critic optimizes continuous weights with transaction costs and diversification constraints. Evaluation on Nasdaq-100, Nikkei 225, Euro Stoxx 50 (2003-2026): significant abnormal returns on Euro Stoxx 50, but no statistically significant outperformance vs Buy and Hold across all markets.

Reinforcement learning Benchmarks Papers

SIG

HYP

arXiv cs.AI·May 19

A Theory of Training Profit-Optimal LLMs

Economic model combining scaling laws and microeconomic theory to characterize rational behavior of LLM training firms. Analyzes profit maximization under compute-bound and data-bound regimes: in compute-bound, optimal model size tracks hardware efficiency (FLOPs/$) at near-linear rate; in data-bound, optimal training expenditure scales as D²/E.

Benchmarks Papers Business

SIG

HYP

arXiv cs.AI·May 19

The Impact of AI Search on the Online Content Ecosystem: Evidence from Google and Reddit

Empirical study on the impact of Google AI Overviews on Reddit. Using identification strategy based on moderation policy (SFW vs NSFW communities), authors find AI Overviews increase engagement in SFW communities by +12% (comments) and +12.3% (users), but only for experience-based content. Introduction of Google AI Mode eliminates these gains.

DeepMind Benchmarks Business

SIG

HYP

arXiv cs.AI·May 19

CarbonScaling: Extending Neural Scaling Laws for Carbon Footprint in Large Language Models

CarbonScaling is a hardware-aware analytical framework modeling carbon emissions during frontier LLM training. It integrates neural scaling laws, distributed training strategies, accelerator modeling, and operational/embodied carbon accounting. Source code available on GitHub.

Benchmarks Papers Infrastructure

SIG

HYP

arXiv cs.AI·May 19

When Efficiency Backfires: Cascading LLMs Trigger Cascade Failure under Adversarial Attack

LLM cascade systems, designed to balance efficiency and performance by routing complex queries to powerful models, are vulnerable to targeted adversarial attacks. A novel attack exploits lightweight models and internal decision mechanisms to simultaneously degrade accuracy and cost-efficiency.

AI safety AI Agents Benchmarks

SIG

HYP

arXiv cs.AI·May 19

CLAP: Contrastive Latent-space Prompt Optimization for End-to-end Autonomous Driving

CLAP optimizes prompts in the latent space of Vision-Language-Action models to improve autonomous driving in rare safety-critical situations. Using contrastive learning and directional regularization, the method reduces planning error by 24% on challenging scenes (NAVSIM benchmark) with no regression on normal cases.

Vision Prompt engineering Reasoning

SIG

HYP

arXiv cs.AI·May 19

Agentic Pipeline for Self-Synchronized Multiview Joint Angle Monitoring in Uncalibrated Environments

Agentic pipeline for multi-view joint angle monitoring without calibration in uncalibrated environments. Uses two cameras, automatic synchronization via multimodal LLM, 2D pose detection and agent-based selection to identify target subject. Validation against Vicon system: MAE 5.97° ± 2.36°, Pearson correlation 0.962 ± 0.014. Application: spinal cord injury rehabilitation.

AI Agents Vision Reasoning

SIG

HYP

arXiv cs.AI·May 19

CAVE: A Structured Credit Assignment Approach for Fragmented Visual Evidence Reasoning

CAVE is a credit assignment method based on GRPO to improve fragmented visual reasoning in VLMs. It evaluates intermediate steps via three signals: belief update, evidence acquisition, and adaptive focus control. TRACER-Bench, a new benchmark, assesses reasoning across four nonlocal and semantically confusable dimensions.

Vision Reasoning Benchmarks

SIG

HYP

arXiv cs.CL·May 19

How Loud Rumbles Hit Newsstands: A Data Analysis of Coverage and Spatial Bias in German News about Landslides Around the World

Analysis of 60k German news articles covering 5.5k landslide events over 25 years. Reveals overreporting of Southern and Western Europe relative to actual landslide susceptibility. Study of spatial bias in media coverage of international natural disasters.

Benchmarks

SIG

HYP

arXiv cs.AI·May 19

STRIDE-AI: A Threat Modeling Framework for Generative AI Security Assessment

STRIDE-AI is a threat modeling framework for assessing generative AI system security. It bridges NIST AI RMF standards and OWASP LLM Top 10, defines a six-phase assessment lifecycle, and operationalizes through a web tool. Validation on a deployed LLM chatbot reduced attack success rate from 80% to 15%.

AI safety Alignment Regulation

SIG

HYP

arXiv cs.AI·May 19

Trajectory-Aware Adaptive Inference in Object Detection Models

Adaptive inference method for YOLOv8 in autonomous maritime navigation. Early-exit mechanism leverages GPS trajectory data (inter-vessel distances, convergence speeds) to partially activate the network. Reduces inference time and computational cost while maintaining detection performance.

Code generation Evals Vision

SIG

HYP

arXiv cs.AI·May 19

DARE-EEG: A Foundation Model for Mining Dual-Aligned Representation of EEG

DARE-EEG is a self-supervised foundation model for EEG that learns representations invariant to incomplete observations through dual-aligned learning (mask alignment + anchor alignment). Evaluated across multiple EEG benchmarks, it achieves state-of-the-art accuracy with low parameter complexity and superior cross-dataset portability.

Papers Benchmarks Embeddings

SIG

HYP

arXiv cs.AI·May 19

JSON-Bag: A generic game trajectory representation

JSON-Bag tokenizes JSON descriptions of game trajectories and uses Jensen-Shannon distance for comparison. Tested on 6 tabletop games (7 Wonders, Dominion, Connect4, etc.), the model outperforms baselines on agent, parameter, and seed classification. Efficient in few-shot settings and enables automatic feature extraction.

Benchmarks Papers

SIG

HYP

arXiv cs.AI·May 19

Overcoming the Intrinsic Performance Limitations of MEMS IMU via Diffusion-Based Generative Learning

A conditional diffusion model based on U-Net architecture synthesizes high-fidelity virtual IMU data from low-cost IMU measurements. Trained with high-grade IMU measurements as ground-truth priors, the model significantly improves positioning and attitude estimation accuracy, and produces thinner, more consistent point clouds in airborne mapping experiments.

Vision Robotics

SIG

HYP

arXiv cs.AI·May 19

Haptic Rendering of Fractional-Order Viscoelasticity: Passivity and Rendering Fidelity

Paper on haptic rendering of fractional-order viscoelastic materials. Authors derive passivity conditions for fractional-order SLS (Standard Linear Solid) models under short-memory discretization, generalizing results for integer-order Kelvin-Voigt, Maxwell, and SLS models. Includes experimental validation and human-subject evaluations.

Papers Robotics

SIG

HYP

arXiv cs.AI·May 19

Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice

TaTok introduces adaptive image tokenization grounded in information entropy theory. The framework adds global tokens modeling mutual information across patch tokens and a Dynamic Token Filtering algorithm eliminating redundancy. Results: 1.3x gFID improvement and 8.7x inference speedup.

Vision Benchmarks Papers

SIG

HYP

arXiv cs.AI·May 19

A neurosymbolic Approach with Epistemic Deep Learning for Hierarchical Image Classification

Neurosymbolic framework combining Swin Transformers, focal set reasoning and differentiable fuzzy logic for hierarchical image classification. Captures epistemic uncertainty through focal sets in embedding space and enforces logical constraints via fuzzy membership functions to ensure consistency between fine and coarse predictions.

Vision Reasoning AI safety

SIG

HYP

arXiv cs.AI·May 19

StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video

StreamPro introduces StreamPro-Bench, a benchmark evaluating proactive video streaming understanding across three dimensions: perception, temporal reasoning, and proactive agency. The framework proposes CB-Stream Loss to address supervision imbalance and applies GRPO with multi-grained rewards. Results: 41.5 on StreamPro-Bench vs 10.4 previously, 78.9 on StreamingBench-RTVU.

Vision Reasoning Reinforcement learning

SIG

HYP

arXiv cs.AI·May 19

ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics

ManiSoft is a benchmark for vision-language manipulation with soft robotic arms. It includes a simulator coupling realistic soft-body dynamics with contact-rich interactions, 4 deformable control tasks, and 6,300 scenes with expert trajectories. Testing 3 policy models shows promising results in clean scenes but substantial performance drop under randomization.

Vision Robotics Benchmarks

SIG

HYP

arXiv cs.AI·May 19

Lost or Hidden? A Concept-Level Forgetting in Supervised Continual Learning

Study of forgetting in continual learning using Sparse Autoencoders (SAEs). Authors propose a diagnostic framework to analyze how task-specific information evolves at concept-level granularity. Finding: much apparent forgetting stems from representational accessibility loss rather than complete information erasure.

Papers Reasoning Evals

SIG

HYP

arXiv cs.AI·May 19

Cross-Source Supervision for Bone Infection Segmentation in Dual-Modality PET-CT

Bimodal PET-CT segmentation method for bone infection lesions using early-fusion multimodal representation. Dual-source learning framework trained on independent expert annotations (high-sensitivity vs high-specificity). Rigorous patient-level 3D volumetric evaluation to mitigate inter-slice correlation bias.

Vision Evals AI safety

SIG

HYP

arXiv cs.AI·May 19

GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning

GeoSym127K is a dataset of 127K geometric questions with exact symbolic ground truths, generated by an automated neuro-symbolic engine. Fine-tuning on Qwen3-VL-8B: +22.21% on MathVerse Vision-Only, 61.52% on WeMath. RLVR via GRPO further improves performance.

Benchmarks Vision Reasoning

SIG

HYP

arXiv cs.AI·May 19

WASIL: In-the-Wild Arabic Spoken Interactions with LLMs

WASIL is a dataset of 8,529 turns of Arabic voice interactions with LLMs, including audio, ASR hypotheses, responses, and user feedback (14.2% dislikes). Covers Modern Standard Arabic and 4 major dialects. Enables isolating ASR errors from intrinsic unanswerability via annotation and multi-judge LLM evaluation.

Voice Benchmarks Evals

SIG

HYP

arXiv cs.AI·May 19

When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search

Researchers formalize activation steering (LLM control without retraining) as budget-constrained optimization. They introduce "concept granularity" to explain why some concepts are expensive to steer, and propose GRACE, a framework using activation geometry to diagnose difficulties and reduce evaluations needed by 39.8% on average.

Reasoning Alignment Papers

SIG

HYP

arXiv cs.AI·May 19

LEAF: A Living Benchmark for Event-Augmented Forecasting

LEAF is a living benchmark to evaluate LLM forecasting capabilities using multidimensional events. The system employs recursive retrieval agents and dual-agent cross-validation to provide textual context for models. Testing shows LLMs leverage signals from complex events to improve stock price predictions.

Benchmarks AI Agents Multi-agent

SIG

HYP

arXiv cs.AI·May 19

Learning Displacement-Aware WiFi Representations for Weakly Supervised Relative Localization

Relative WiFi localization without dense coordinate annotations. Intersection Pathway aligns WiFi fingerprint traces and inertial motion vectors in a shared additive latent space, enabling direct relative-displacement inference. Validated on synthesized dataset from real measurements.

Reinforcement learning Embeddings

SIG

HYP

arXiv cs.AI·May 19

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

arXiv paper proposing a formal framework for combining LLM and human evaluations. Uses a doubly robust estimator (missing data approach) to determine optimal sample sizes of human ratings needed for benchmark validation, shifting LLMs from substitutive to auxiliary role.

Evals Benchmarks AI safety

SIG

HYP

arXiv cs.AI·May 19

DiPRL: Learning Discrete Programmatic Policies via Architecture Entropy Regularization

DiPRL introduces a programmatic reinforcement learning method that learns discrete, interpretable policies without post-hoc discretization. Using architecture entropy regularization, the approach converges toward discrete programs during training, avoiding performance collapse and eliminating the need for additional fine-tuning.

Reinforcement learning Reasoning Papers

SIG

HYP

arXiv cs.AI·May 19

How Do Electrocardiogram Models Scale?

Systematic study of scaling laws for ECG models: 120 models (20K–200M parameters) pre-trained on CODE (2.3M records). SSL models outperform SL on out-of-distribution generalization; ResNets 1.3–2.5× more parameter-efficient than Transformers; SSL 16× more data-efficient. Architecture and paradigm choice matter more than brute-force scaling.

Benchmarks

SIG

HYP

arXiv cs.AI·May 19

State-of-the-Art Claims Require State-of-the-Art Evidence

Critical study of state-of-the-art claims in AI/ML. Analysis of 10 public benchmarks reveals that over 50% of top-model comparisons fail to support implicit superiority properties (meaningful effect size, cross-task consistency, robustness). Aggregate gains often driven by outlier datasets. Proposes more honest claim language without additional experiments.

Benchmarks Evals

SIG

HYP

arXiv cs.AI·May 19

LARGER: Lexically Anchored Repository Graph Exploration and Retrieval

LARGER is a context retrieval framework for repository-level coding agents combining lexical search with structural graph exploration (imports, call chains, type hierarchies) without external databases. On LocBench, it improves file-level Acc@5 by +13.9 points (or +11.8 with fixed hyperparameters) and shows consistent gains on test generation and codebase QA benchmarks.

AI Agents Code generation Benchmarks

SIG

HYP

arXiv cs.AI·May 19

Drift Flow Matching

Drift Flow Matching (DFM) combines drift models (one-step generation) with iterative flow matching. The framework preserves efficiency of direct transport maps while enabling multi-step refinement. Validated across multiple tasks and datasets.

Papers Benchmarks Image generation

SIG

HYP

arXiv cs.AI·May 19

Systematic Evaluation of Vision Transformers for Automated Cervical Cancer Classification: Optimization, Statistical Validation, and Clinical Interpretability

Systematic optimization of Vision Transformers (ViT-Tiny) for cervical cancer screening on Herlev dataset (917 images). Optimal configuration: 94.9%-95.2% cross-validation accuracy with horizontal flipping and class weighting (0.7 x 1.3). Grad-CAM validates clinical interpretability: attention on nuclei, cell boundaries, and chromatin texture.

Vision Benchmarks Evals

SIG

HYP

arXiv cs.AI·May 19

PropGuard: Safeguarding LLM-MAS via Propagation-Aware Exploration and Remediation

PropGuard is a security framework for LLM-based multi-agent systems (LLM-MAS). It detects and neutralizes malicious injections propagated across agents using a dual spatio-temporal graph and a GE-GRPO-trained inspector. Tests across 4 architectures and 5 attack scenarios show significant reduction in attack success rates.

Multi-agent AI safety AI Agents

SIG

HYP

arXiv cs.AI·May 19

Goal-Conditioned Supervised Learning for LLM Fine-Tuning

New offline fine-tuning method for LLMs: Goal-Conditioned Supervised Learning (GCSL) treats feedback signals as explicit goals and trains models via pure supervised learning. Evaluated on non-toxic generation, code generation, and recommendation; outperforms SFT and DPO without external reward models.

Fine-tuning Reinforcement learning Alignment

SIG

HYP

arXiv cs.AI·May 19

LoopQ: Quantization for Recursive Transformers

LoopQ introduces a post-training quantization (PTQ) method for recursive language models (LoopLMs) that reuse Transformer blocks. It addresses three challenges: distribution shift across roles, state reuse between loops, and recursive error accumulation. Results: +68.8% downstream accuracy and -87.7% perplexity reduction in W4A4 vs strongest static PTQ baseline.

Fine-tuning Benchmarks Reasoning

SIG

HYP

arXiv cs.AI·May 19

Detecting Verbatim LLM Copy-Paste in Homework

SteganoPrompt, an open-source web tool, detects verbatim copies of assignment prompts submitted to LLMs. It encodes an invisible instruction in the prompt via the Unicode Tags block (U+E0000–U+E007F), creating a detectable signature in the model's response. Tested across 7 LLM families, the approach bypasses limitations of post-hoc detectors and requires no cooperation from model providers.

Evals AI safety Prompt engineering

SIG

HYP

arXiv cs.AI·May 19

A Machine Learning Framework for EEG-Based Prediction of Treatment Efficacy in Chronic Neck Pain

ML framework using EEG to predict treatment efficacy in chronic neck pain patients. Rigorous preprocessing pipeline (baseline removal, ICA, spectral analysis) applied to resting-state and motor EEG. Systematic review of 763 studies (16 patient, 47 healthy-control studies) to inform post-processing strategy.

Evals Papers

SIG

HYP

arXiv cs.AI·May 19

Modality vs. Morphology: A Framework for Time Series Classification for Biological Signals

Unified review on time series classification of biological signals (EEG, EMG, ECG, PPG, ocular). The morphology-modality framework shows that waveform structure (spikes, bursts, oscillations, drift) determines performance more strongly than model class. Inductive biases aligned with physiological dynamics improve interpretability and generalization.

Benchmarks Vision Papers

SIG

HYP

arXiv cs.AI·May 19

Phase Transitions in Driven Informational Systems: A Two-Field Perspective on Learning Theory and Non-Equilibrium Chemistry

Theoretical paper proposing a unified framework for phase transitions in deep learning (grokking, emergent capabilities) and non-equilibrium chemistry. Introduces two gradient fields (entropy production rate and information quasi-potential) and two order parameters (adversarial breakdown threshold α†, self-referential coupling threshold κc) to describe driven informational systems.

Reasoning Alignment Papers

SIG

HYP

arXiv cs.AI·May 19

AdaGraph: A Graph-Native Clustering Algorithm That Overcomes the Curse of Dimensionality and Enables Scientific Discovery

AdaGraph is a graph-native clustering algorithm that overcomes the curse of dimensionality by operating on kNN topology rather than Euclidean metrics. Without specifying k a priori, it identifies gene modules in genomics (GSE14520, 10k genes), achieves ARI=0.751 on text clustering (20NG-6cat vs HDBSCAN 0.464), and outperforms Silhouette/Davies-Bouldin on 10 benchmarks up to d=5000.

Benchmarks Papers

SIG

HYP

arXiv cs.AI·May 19

When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning

Study of adversarial attacks via action removal in self-play reinforcement learning. An attacker selectively removes legal actions from the victim's available set. Across poker games (6 to 5,531 states) and two non-poker domains, learned masking causes more damage than random masking. The attack persists across Q-learning, PPO, NFSP, DQN and shows no recovery under extended masked training.

Reinforcement learning AI safety Benchmarks

SIG

HYP

arXiv cs.AI·May 19

Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

New credit assignment method for reinforcement learning with LLMs. IBPO (Implicit Behavior Policy Optimization) uses counterfactual trajectories to convert sparse terminal rewards into step-sensitive learning signals, reducing gradient variance and improving stability on mathematical and code reasoning benchmarks.

Reinforcement learning Reasoning Code generation

SIG

HYP

arXiv cs.AI·May 19

MANTA: Multi-turn Assessment for Nonhuman Thinking & Alignment

MANTA is a multi-turn evaluation framework on Inspect AI that stress-tests LLMs (Claude Sonnet 4, GPT-4o) against adversarial follow-up arguments on animal welfare alignment. Results show models capitulate at Turn 2 under economic/social pressure, and evidence-based capacity attribution is the weakest dimension across all models.

Claude GPT Evals

SIG

HYP

arXiv cs.AI·May 19

Consent Chain Degradation in Embodied Multi-Agent Systems: Bridging the Gap Between AI Agent Governance and Robot Ethics

Theoretical paper on consent degradation in delegation chains between autonomous robots. Introduces CoRVE framework to verify consent across multi-agent architectures. Analyzes regulatory gaps in EU AI Act, GDPR, Machinery Regulation, and Product Liability Directive.

Multi-agent Robotics AI safety

SIG

HYP

arXiv cs.AI·May 19

A Conflict-aware Evidential Framework for Reliable Sleep Stage Classification

ConfSleepNet, an evidential framework, resolves inter-view conflicts for sleep stage classification. The method extracts category-related evidence from different modalities and aggregates view-specific opinions via a conflict-aware mechanism. Code available on GitHub.

Evals Reasoning

SIG

HYP

arXiv cs.AI·May 19

MusicSynth: An Automated Pipeline for Generating Violin Fingerboard Animations from Sheet Music Using Optical Music Recognition

MusicSynth is an open-source web tool that automatically converts violin sheet music (photo or file) into animated videos showing finger positioning on the fingerboard. The system combines optical music recognition (OMR), MusicXML parsing, and video rendering. Tested on 110 scores: 91.2% note recognition accuracy on printed music, 99.1% finger position accuracy on digital files.

Vision Code generation Open source

SIG

HYP

arXiv cs.AI·May 19

Task-Level AI Readiness Assessment for Business Process Management:The T-IPO Model and LARA Matrix in Financial-Services IT Operations

arXiv paper introducing T-IPO and LARA, tools to assess LLM agent readiness for business tasks. LARA is a 5-dimension rubric scoring tasks into 4 levels (L1-L4), with 1.5× weight on compliance sensitivity. Validated on 127 tasks (κ=0.80), replicated across 3 institutions (κ=0.73). Auto-completion decays from 95% (L1) to 40% (L3).

AI Agents Evals Papers

SIG

HYP

arXiv cs.AI·May 19

ANVIL: Analogies and Videos for Lecturers

ANVIL is a multimodal generative system automating production of analogy-based instructional animations for computer science. Given a concept definition, it generates textual analogies, compiles them into structured visual screenplays, and produces executable manim code. Evaluation includes teacher studies and user adoption assessment.

Video generation Code generation Evals

SIG

HYP

arXiv cs.AI·May 19

Are Researchers Being Replaced by Artificial Intelligence?

A 2023 Nature survey of 1,600 researchers shows tension between excitement and concern about AI tools in research. The article argues replacement is underway: shift from researcher-as-creator to researcher-as-curator. Key risk: humans retain responsibility while losing intellectual ownership and deep understanding of science.

AI Agents Papers AI safety

SIG

HYP

arXiv cs.AI·May 19

AI of the People, by the People, for the People: A Social Choice Approach to Collective Control of Artificial Intelligence

Theoretical framework grounded in social choice theory to incorporate collective control throughout AI development, from data collection to alignment. Proposes axiomatic criteria for evaluating democratic control mechanisms across multiple stages of the ML pipeline.

Alignment AI safety Regulation

SIG

HYP

arXiv cs.AI·May 19

Homoglyph-based Adversarial Perturbation of Introductory Computer Science Theory Problems

Method using homoglyph-based adversarial perturbation to modify computer science problem statements without changing semantic meaning. Aims to prevent ChatGPT, Gemini, and Claude from directly solving student homework. Interactive tool provided.

Claude GPT Gemini

SIG

HYP

arXiv cs.AI·May 19

Measuring Changes in Instructor Class Design and Student Learning After the Release of Large Language Models (LLMs)

Mixed-methods multi-course study at a New England university examining LLM impact on teaching and learning. Retrospective quantitative analysis, instructor and student surveys, historical grade data pre/post-LLM. Documents shifts in study methods, course design, and learning outcomes.

Evals Business AI safety

SIG

HYP

arXiv cs.AI·May 19

AI4BayesCode: From Natural Language Descriptions to Validated Modular Stateful Bayesian Samplers

AI4BayesCode translates natural-language Bayesian model descriptions into validated, modular MCMC samplers. The system decomposes models into sampling blocks mapped to built-in components, with pre- and post-generation validation. A novel recursively stateful architecture enables coherent composition of independently developed sampling components.

Code generation AI Agents Reasoning

SIG

HYP

arXiv cs.AI·May 19

Evolutionary Extreme Learning Machine of ab-initio Energy Landscapes for Crystal Structure Prediction using Manta Ray Optimization with Levy Flight

Manta Ray Foraging Optimization algorithm enhanced with Lévy Flight to train Extreme Learning Machines (ELMs) for predicting crystal formation energies. EELM-MRFO-LF uses MRFO-Lévy for input weight selection and Moore-Penrose generalized inverse for analytical output weight determination, improving population diversity and avoiding local optima.

Benchmarks Papers

SIG

HYP

arXiv cs.AI·May 19

From Reactive to Proactive: A Multi-Regulatory Empirical Analysis of 480 AI Incidents and a Data-Driven Governance Compliance Framework

Analysis of 480 real-world AI incidents from AIID against EU AI Act, NIST AI Risk Management Framework, and GDPR post-deployment provisions. Reveals substantial governance gaps in post-deployment accountability. Proposes Proactive AI Governance Compliance Framework (PAGCF), a four-phase lifecycle methodology shifting from reactive incident response to pre-deployment compliance assurance.

Regulation AI safety Alignment

SIG

HYP

arXiv cs.AI·May 19

Contrastive Conceptor Activation Steering (COAST): Unlocking Vision-Language-Action Models through Hidden States

COAST (Contrastive Conceptor Activation Steering) improves Vision-Language-Action models by identifying and steering latent representations toward success-critical subspaces. Across three distinct architectures, COAST increases task success rates by +20% in simulation and +40% on real robots, without additional training.

Vision Robotics Reasoning

SIG

HYP

arXiv cs.AI·May 19

Keeping an Eye on AI: A Framework for Effective Human Oversight of AI Systems

Multidisciplinary framework for human oversight of AI systems in high-risk decision-making scenarios. Defines oversight architectures, actor roles, and implementation processes. Synthesizes open research challenges in this emerging field.

AI safety Alignment Regulation

SIG

HYP

arXiv cs.AI·May 19

Harnessing AI for Inverse Partial Differential Equation Problems: Past, Present, and Prospects

Comprehensive review of AI methods for solving inverse partial differential equation (PDE) problems. Covers three categories: inverse problems, inverse design, and control. Applications: medical imaging, geophysics, aerodynamics, thermal systems. Challenges: physics-informed architectures, limited real-world data, uncertainty quantification, inverse foundation models.

Papers Reasoning Benchmarks

SIG

HYP