Archives

May 2026

3148 articles

arXiv cs.AI·

ReTAMamba: Reliability-Aware Temporal Aggregation with Mamba for Irregular Clinical Time Series Prediction

ReTAMamba is a Mamba-based model for predicting irregular clinical time series. It estimates observation reliability from missingness and elapsed time, integrates multi-resolution information via Chronological Weaving, and uses a budgeted token router. On MIMIC-IV, eICU, and PhysioNet 2012, it improves AUPRC by 7.51%, 7.80%, and 10.15% respectively.

Benchmarks · Papers · Reasoning
SIG 72 · HYP 18
arXiv cs.AI·

Hierarchical Two-Stage Framework for Environment-Aware Long-Horizon Vessel Trajectory Prediction

A hierarchical two-stage framework for long-horizon vessel trajectory prediction under real ocean conditions. It combines a long-term predictor with a short-term Spatio-Temporal Graph Transformer on discretized maritime cells, and an environmental module integrates currents, wind, and wave height via cross-modal attention. Results: 25% improvement in ADE and 17% in FDE on Australian CTS data.

Reasoning · Benchmarks · Papers
SIG 72 · HYP 15
arXiv cs.AI·

PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics

PH-Dreamer embeds Port-Hamiltonian physical principles into world models to improve latent imagination. The framework models energy evolution, estimates the Hamiltonian from proprioceptive observations, and uses an energy-guided Actor-Critic. Results: 4.18-8.41% phase space volume reduction, up to 7.80% energy consumption decrease, up to 9.38% jerk reduction.

Reasoning · Reinforcement learning · Papers
SIG 78 · HYP 15
arXiv cs.LG·

ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference

ProxyKV introduces a cross-model proxy pruning framework to accelerate long-context LLM inference. A lightweight in-family small model evaluates KV cache importance asynchronously via HybridAxialMapper and Multi-Granularity Hybrid Loss. On Llama-3.1, Qwen-2.5, and Qwen-3, it recovers 98.7% of KVZip accuracy with up to 3.21× prefilling speedup (Llama-3.1-8B, dual-GPU) and sustains the speedup at contexts up to 170k tokens.

Llama · Qwen · Reasoning
SIG 82 · HYP 18
arXiv cs.AI·

Membership Inference Attacks on Discrete Diffusion Language Models

Study of membership inference attacks (MIA) on masked diffusion language models (MDLM). Researchers extract 46-dimensional feature vectors from reconstruction loss at different masking ratios and train XGBoost and MLP classifiers. On MIMIR benchmark, XGBoost achieves AUC 0.878 (peak 0.930), outperforming SAMA baseline by 0.062 AUC. ELBO trajectory alone drives most of the signal.

AI safety · Benchmarks · Papers
SIG 78 · HYP 15
arXiv cs.AI·

Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign

Peak-Detector leverages instruction-tuned LLMs for peak detection across physiological signals (ECG, PPG, BCG, BSG) with explainability. A "peak-representation" technique compresses time-series while preserving critical events. The model is optimized via supervised fine-tuning then multi-objective reinforcement learning, evaluated on 7 datasets (6 public benchmarks + 1 real-world cohort).

Reasoning · Fine-tuning · Reinforcement learning
SIG 72 · HYP 25
arXiv cs.AI·

Mechanistically Interpretable Neural Encoding Reveals Fine-Grained Functional Selectivity in Human Visual Cortex

MINE (Mechanistically Interpretable Neural Encoding) applies mechanistic interpretability to neural encoding models to identify visual features driving activation in individual voxels of human visual cortex. Using language-aligned image representations and counterfactual editing, the approach causally validates fine-grained selectivity in category-selective brain regions.

Vision · Papers
SIG 78 · HYP 15
arXiv cs.AI·

Prompts Don't Protect: Architectural Enforcement via MCP Proxy for LLM Tool Access Control

LLMs used as autonomous agents select unauthorized tools despite explicit instructions. Study across Qwen 2.5 7B, Llama 3.1 8B, and Claude Haiku 3.5 shows an MCP proxy with attribute-based access control (ABAC) reduces unauthorized invocation rate to 0%, versus 11-18% for prompt-based restrictions. Architectural enforcement, not prompting, is required for reliable tool access control.

AI Agents · MCP · AI safety
SIG 82 · HYP 15
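The enforcement idea in the summary above (a proxy that checks attributes before any tool call, rather than trusting the prompt) can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: the `Policy` and `ToolProxy` classes, the attribute strings, and the tool names are all hypothetical, and no real MCP SDK calls are used.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Policy:
    """Hypothetical ABAC rule: the caller must hold every required attribute."""
    tool: str
    required: frozenset


@dataclass
class ToolProxy:
    """Hypothetical proxy that sits between the agent and its tools."""
    policies: dict = field(default_factory=dict)  # tool name -> Policy

    def add(self, policy):
        self.policies[policy.tool] = policy

    def authorize(self, caller_attrs, tool):
        policy = self.policies.get(tool)
        if policy is None:
            return False  # deny by default: unknown tools are never allowed
        return policy.required <= set(caller_attrs)

    def call(self, caller_attrs, tool, handler, *args):
        # The check happens outside the model, so a prompt cannot bypass it.
        if not self.authorize(caller_attrs, tool):
            raise PermissionError(f"tool {tool!r} not permitted")
        return handler(*args)
```

Because the decision is made in the proxy, an unauthorized tool selection by the model results in a rejected call instead of an executed one.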
arXiv cs.LG·

DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models

DACA-GRPO improves reinforcement learning for diffusion language models by addressing temporal credit assignment and mean-field likelihood bias. It introduces Denoising Progress Scores and Stratified Masking Likelihood, achieving gains up to 7.4pp on code generation and 36.3pp on constraint satisfaction across seven benchmarks.

Reinforcement learning · Reasoning · Code generation
SIG 78 · HYP 15
arXiv cs.LG·

PIMSM: Physics-Informed Multi-Scale Mamba for Stable Neural Representations under Distribution Shift

PIMSM embeds physical constraints into a multi-scale Mamba architecture to improve representation stability under distribution shift. The model aligns discretization parameters to characteristic frequencies in time series (fMRI, weather). Results: improved robustness on Human Connectome Project and Weather-5K with minimal MAE in out-of-distribution forecasting.

Reasoning · Benchmarks · Papers
SIG 78 · HYP 18
arXiv cs.AI·

Action-Gradient Monte Carlo Tree Search for Non-Parametric Continuous (PO)MDPs

Action-Gradient MCTS (AGMCTS) combines global tree search with local gradient-based action refinement for online planning in continuous spaces. Three theoretical contributions: action score gradient theorem, Multiple Importance Sampling Tree for sample reuse, tractable gradients via Area Formula. Outperforms state-of-the-art sample-based solvers on continuous MDP/POMDP benchmarks.

Reasoning · Reinforcement learning · Papers
SIG 72 · HYP 18
arXiv cs.AI·

Evaluating AI Alignment in LLMs: Output Analysis of Value Priorities Across 75 Models with Human Benchmarking

Alignment evaluation across 75 LLMs benchmarked against 376 humans. Qualitative analysis derives 6 themes of optimal AI functioning (Performance, Adaptive Capacity, Social Good, Ethics and Responsibility, Relational Integration, Agency). Models reproduce human value ordering but systematically exaggerate differences. Profile fidelity does not correlate with model size or recency.

Alignment · Evals · Benchmarks
SIG 78 · HYP 25
arXiv cs.AI·

Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning

BPO, a three-stage framework (bootstrapping, extrapolation, refinement), creates a self-improving data flywheel to train robust reasoning models for long-horizon sparse-reward planning. Uses planning quaternions, long-short chain-of-thought fusion, and complexity-stratified curriculum learning. SOTA on ALFWorld, ScienceWorld, WebShop with significant token efficiency.

Reasoning · AI Agents · Reinforcement learning
SIG 78 · HYP 25
arXiv cs.AI·

Learning Reasoning Rewards from Expert Demonstrations with Inverse Reinforcement Learning

R-AIRL (Reasoning Adversarial Inverse Reinforcement Learning) infers process-level reward functions from expert Chain-of-Thoughts without explicit reward definitions. Tested on GSM8K, MMLU-Pro, and MedReason: improves pass@1 by 17.4 points via inference-time reranking, outperforms SFT in post-training, localizes reasoning failures with 86.1% accuracy.

Reinforcement learning · Reasoning · Evals
SIG 78 · HYP 25
arXiv cs.AI·

How Wrong Can Your Counterfactual Be? Quantifying Confounding Bias for Continuous Treatments without a Control Group

Causal inference framework for financial stress testing in panel data with continuous treatment and no control group. Proposes closed-form confounding envelope parameterized by two sensitivity parameters, combines partial identification with importance-weighted conformal prediction. Shows standard predictive models remain causally biased on US unemployment data.

Reasoning · Benchmarks · Papers
SIG 72 · HYP 15
arXiv cs.AI·

Can LLM Agents Be CFOs? Benchmarking Long-Horizon Resource Allocation in an Uncertain Enterprise Environment

EnterpriseArena, a 132-month CFO simulator, benchmarks LLM agents' ability to allocate resources over long horizons under uncertainty. Tests across 23 models and 4 frameworks: only 15.4% of trials complete the full horizon. Larger models do not reliably outperform smaller ones. Reveals critical capability gap in managing binding commitments under partial observability.

AI Agents · Benchmarks · Reasoning
SIG 82 · HYP 18
arXiv cs.AI·

EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

EmergentBridge improves unified multimodal embedding models for unpaired modality pairs (audio↔depth, infrared↔audio). The method learns a mapping producing a 'noisy bridge anchor' and enforces alignment in the orthogonal subspace, preserving existing anchor-alignment structure. Results across 9 datasets: outperforms baselines on zero-shot classification and retrieval.

Embeddings · Vision · Multi-agent
SIG 72 · HYP 18
arXiv cs.AI·

Breaking Winner-Takes-All: Cooperative Policy Optimization Improves Diverse LLM Reasoning

GCPO (Group Cooperative Policy Optimization) replaces competitive rollout optimization with team-level credit assignment. Rollouts are rewarded by contribution to valid solution coverage (determinant volume over semantic embeddings), not individual accuracy. Results: improved reasoning accuracy and solution diversity across benchmarks.

Reinforcement learning · Reasoning · Benchmarks
SIG 78 · HYP 25
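To make "determinant volume over semantic embeddings" concrete, the sketch below scores a group of rollouts by the determinant of the Gram matrix of their unit-normalized embeddings. This is one plausible reading of the summary, not the paper's exact reward: the function names are hypothetical, the embeddings are toy vectors, and how GCPO filters for validity or attributes credit to individual rollouts is not shown.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def gram(vectors):
    """Matrix of pairwise dot products."""
    return [[sum(a * b for a, b in zip(u, w)) for w in vectors] for u in vectors]


def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    result = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0  # singular: the vectors are linearly dependent
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            result = -result
        result *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return result


def coverage_volume(embeddings):
    """Hypothetical group score: squared volume spanned by unit embeddings."""
    return det(gram([normalize(e) for e in embeddings]))
```

Under this reading, a group of near-duplicate solutions spans almost no volume, while mutually dissimilar solutions score close to 1.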
arXiv cs.AI·

Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

arXiv paper showing LLMs exhibit a knowing-doing gap in tool use: recognition of tool necessity vs. actual invocation diverge. Testing 4 models on arithmetic and factual QA reveals 26.5-54% mismatches. Hidden state probing shows cognition and action signals become nearly orthogonal in late layers, with most failures at the cognition-to-action transition, not in recognition itself.

AI Agents · Tools · Reasoning
SIG 78 · HYP 15
arXiv cs.AI·

Mitigating Extrinsic Gender Bias for Bangla Classification Tasks

Study on extrinsic gender bias in Bangla pretrained language models. Four manually annotated task-specific datasets constructed (sentiment analysis, toxicity detection, hate speech, sarcasm detection) with minimal-pair gender perturbations. RandSymKL debiasing strategy proposed, combining symmetric KL divergence and cross-entropy loss, reducing bias while maintaining competitive accuracy.

Benchmarks · AI safety · Alignment
SIG 72 · HYP 15
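A loss that combines symmetric KL divergence with cross-entropy, as the summary above describes, might look roughly like the sketch below. It assumes the KL term compares predictions on an original sentence and on its gender-perturbed minimal pair; that pairing, the weight `lam`, and all function names are assumptions for illustration, and the randomized part implied by the name RandSymKL is not modeled.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p)."""
    return 0.5 * (kl(p, q) + kl(q, p))


def cross_entropy(p, label, eps=1e-12):
    """Negative log-probability assigned to the gold label."""
    return -math.log(p[label] + eps)


def debias_loss(p_orig, p_swapped, label, lam=1.0):
    """Task loss plus a penalty when the two predictions disagree (hypothetical form)."""
    return cross_entropy(p_orig, label) + lam * symmetric_kl(p_orig, p_swapped)
```

When the model predicts the same distribution for both members of a minimal pair, the penalty vanishes and only the task loss remains.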
arXiv cs.LG·

AdaGraph: A Graph-Native Clustering Algorithm That Overcomes the Curse of Dimensionality and Enables Scientific Discovery

AdaGraph is a graph-native clustering algorithm that overcomes the curse of dimensionality by operating on kNN graph topology instead of Euclidean distances. Tested on 10 synthetic benchmarks (d=10 to 5000) and three scientific domains (genomics, NLP, materials science), it outperforms HDBSCAN, WGCNA, and other methods without requiring k specification.

Benchmarks · Papers
SIG 78 · HYP 35
arXiv cs.AI·

UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities

UniversalRAG is a multi-modal RAG framework that retrieves and integrates knowledge from heterogeneous sources (text, images, videos) at variable granularities. It introduces modality-aware routing to avoid intra-modal bias and organizes each modality into granularity levels. Validated on 10 benchmarks, it outperforms single-modality and unified baselines.

RAG · Vision · Video generation
SIG 75 · HYP 25
arXiv cs.AI·

Catastrophic Overfitting, Entropy Gap and Participation Ratio: A Noiseless $l^p$ Norm Solution for Fast Adversarial Training

arXiv paper addressing Catastrophic Overfitting (CO) in fast adversarial training. Authors propose controlling the lp training norm instead of noise injection or regularization. They quantify gradient concentration via Participation Ratio and entropy measures, developing an adaptive lp-FGSM that automatically tunes the training norm based on gradient information.

AI safety · Alignment · Benchmarks
SIG 72 · HYP 15
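For readers unfamiliar with the Participation Ratio mentioned in the summary above, one common definition is (Σ g_i²)² / Σ g_i⁴, which estimates how many coordinates carry most of a gradient's energy. The sketch below uses that definition; whether the paper uses exactly this normalization is an assumption, and the function name is illustrative.

```python
def participation_ratio(grad):
    """(sum g_i^2)^2 / sum g_i^4: ranges from 1 (one dominant coordinate) to len(grad) (uniform)."""
    s2 = sum(g * g for g in grad)
    s4 = sum(g ** 4 for g in grad)
    if s4 == 0:
        return 0.0  # all-zero gradient: no coordinate participates
    return (s2 * s2) / s4
```

A gradient concentrated on a single pixel gives a ratio of 1, while a gradient with equal magnitude everywhere gives a ratio equal to the number of coordinates.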