May 2026

3149 articles

HRVConformer: Neonatal Hypoxic-Ischemic Encephalopathy Classification from the Heart Rate signals

HRVConformer is a hybrid Convolution-Transformer architecture for classifying neonatal hypoxic-ischemic encephalopathy from raw heart rate signals. Trained on 1,573 epochs (259 expert-annotated, rest weakly labelled), the model achieves 83.23% AUC and 74.56% accuracy on a 215-hour test set, outperforming ResNet50 and Transformer baselines.

Vision Benchmarks Code generation

arXiv cs.LG·May 27

Max-Window Scale Estimation for Near-Lossless HiF8 W8A8 Quantization-Aware Training

Systematic study of HiF8 W8A8 QAT on OpenPangu-Embedded-1B. Identifies two failure modes: amax saturation (silent corruption via clipping) and catastrophic forgetting (aggressive learning rate overwrites knowledge). Solutions: 64-step history window for DTS and 500-step BF16 warmup. Results: 0.43% MMLU drop, 0.58% HellaSwag, 0.22% ARC-Challenge vs baseline.

Fine-tuning Benchmarks Papers
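
As a rough illustration of the history-window idea in the summary above, the sketch below keeps the last 64 per-step absolute maxima and derives the scale from the largest one, so a single quiet step cannot shrink the range and clip the next spike. The class name `AmaxHistory`, the `fmt_max` placeholder, and the scale formula are assumptions made for illustration; they are not the paper's implementation and do not reflect the real HiF8 range.

```python
from collections import deque


class AmaxHistory:
    """Rolling window of per-step absolute maxima (hypothetical helper)."""

    def __init__(self, window=64, fmt_max=1.0):
        # window=64 follows the summary; fmt_max is a placeholder for the
        # largest magnitude the low-precision format can represent.
        self.history = deque(maxlen=window)
        self.fmt_max = fmt_max

    def update(self, tensor_values):
        # Record the absolute maximum observed at this step.
        self.history.append(max(abs(v) for v in tensor_values))

    def scale(self):
        # Use the largest amax in the window, not only the latest one.
        amax = max(self.history)
        return self.fmt_max / amax if amax > 0 else 1.0


tracker = AmaxHistory(window=64, fmt_max=1.0)
tracker.update([0.5, -8.0, 2.0])   # a step containing a large outlier
tracker.update([0.1, -0.2, 0.3])   # a quiet step
window_scale = tracker.scale()     # still sized for the earlier outlier
```

A scale computed from the latest step alone would be sized for 0.3 here and would saturate if a value near 8.0 returned on the next step.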

arXiv cs.AI·May 27

Multi-Stakeholder LLM Alignment: Decomposing Estimation from Aggregation

Paper on multi-stakeholder LLM alignment. Holistic judges conflate utility estimation and aggregation, creating unstable weighting noise. DecompR decouples counterfactual-calibrated weights (fixed before candidate scoring) from independent per-role utility estimation, removing candidate-dependent weight drift and reducing estimation noise.

Alignment Evals Reasoning

arXiv cs.LG·May 27

Dynamic Link Prediction with Temporally Enhanced Signed Graph Neural Networks

Modular framework to enhance signed GNNs with temporal context. Introduces HCIM (Historical Context Integration Module) combining learnable temporal weighting, LSTM, and multi-head attention for link prediction in temporal signed networks. Evaluated on Bitcoin OTC, Bitcoin Alpha, Reddit with statistically significant improvements over static baseline.

Papers Benchmarks Reasoning

arXiv cs.LG·May 27

Two-Parameter Flows for Learning Population Dynamics of Physical Systems

New method to learn dynamics of high-dimensional probability densities without labeled trajectories. Two-parameter flows learn sampling-time transports from base to marginals, then extract physics-time dynamics via regression on coupled synthetic trajectories. Scalable to high dimensions, handles rotational physics phenomena.

Papers Reasoning

arXiv cs.AI·May 27

Automatic Layer Selection for Hallucination Detection

Study on automatic hallucination detection in LLMs. Researchers propose FEPoID (First Effective Peak of Intrinsic Dimension), a training-free method to select optimal intermediate layers. Tested on QA and summarization, it outperforms existing baselines with negligible computational overhead.

Reasoning Evals

arXiv cs.LG·May 27

InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization

InfoQuant proposes a training-free post-training quantization (PTQ) method for LLMs. It uses Peak Suppression Orthogonal Transformation (PSOT) to reshape activations into quantization-friendly distributions. On LLaMA-2 13B under W4A4KV4, it preserves 97% floating-point accuracy and reduces the performance gap by 42% over prior state-of-the-art.

Llama Papers Benchmarks

arXiv cs.LG·May 27

Co-folding model guided by structural proteomics

AIMS-Fold integrates structural proteomics data (XL-MS, HDX-MS) into a guided-diffusion framework to predict protein complex structures. The framework outperforms Boltz-2 on induced proximity targets, critical for antibody and PROTAC design.

Papers Benchmarks Reasoning

arXiv cs.LG·May 27

AirCast-SR: A Foundation Model for Kilometer-Scale Atmospheric Super-Resolution via Latent Consistency Diffusion

AirCast-SR is an atmospheric super-resolution foundation model that downscales global AI weather forecasts from 28 km to 1 km horizontal resolution. Built on a 3D U-Net conditioned within a Latent Consistency Model diffusion framework, trained on GraphCast forecasts and NOAA data, it produces 67-hour forecasts with near-zero bias and demonstrates zero-shot global transferability to India and Germany.

Papers Benchmarks Open source

arXiv cs.AI·May 27

Tail-Aware HiFloat4: W4A4 Post-Training Quantization for Wan2.2

Tail-Aware HiFloat4 applies W4A4 post-training quantization to the Wan2.2 text-to-video generation model. The method adapts ViDiT-Q using HiFloat4 format, quantizes transformer linear layers, preserves numerically sensitive modules in high precision, and introduces activation-tail-aware percentile calibration to reduce impact of rare outliers.

Video generation Fine-tuning Benchmarks
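
To make the percentile-calibration idea in the summary above concrete, here is a minimal sketch that sets the clipping range from a high percentile of activation magnitudes instead of the absolute maximum. The function names, the 99th-percentile choice, the toy activations, and the signed integer grid (used as a stand-in for HiFloat4, whose encoding is not modelled here) are all assumptions for illustration.

```python
import math


def percentile_clip(values, pct=99.0):
    """Nearest-rank percentile of |values| (hypothetical calibration rule)."""
    mags = sorted(abs(v) for v in values)
    rank = max(1, math.ceil(len(mags) * pct / 100.0))
    return mags[rank - 1]


def quantize_symmetric(values, clip, levels=7):
    """Clamp to [-clip, clip] and round onto a uniform grid.

    levels=7 mimics a signed 4-bit integer grid (-7..7); it is a
    stand-in, not the HiFloat4 format.
    """
    step = clip / levels
    out = []
    for v in values:
        clamped = max(-clip, min(clip, v))
        out.append(round(clamped / step) * step)
    return out


# 99 typical activations plus one rare outlier (toy data).
activations = [float(i) for i in range(-49, 50)] + [2500.0]

max_clip = max(abs(v) for v in activations)        # driven by the outlier
tail_clip = percentile_clip(activations, pct=99.0)  # ignores the rare tail

coarse = quantize_symmetric(activations[:99], max_clip)
fine = quantize_symmetric(activations[:99], tail_clip)
```

With the max-based range, every typical value in this toy example rounds to zero; the percentile-based range keeps a usable grid for them at the cost of clipping the single outlier.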

arXiv cs.LG·May 27

Reparametrizing Shampoo and SOAP for Subspace Basis Updates and BFloat16 Storage

Reparametrization of Shampoo-based methods (KL-Shampoo, SOAP, KL-SOAP) enabling BFloat16 storage and reducing computational cost through subspace QR decomposition. Improves memory and time efficiency without performance degradation.

Reinforcement learning Benchmarks Papers

arXiv cs.LG·May 27

On the Role of Inductive Bias in Time-Series Pretraining: A Case Study in Learning Generalizable Representations for Clinical Time Series

PathoFM, an encoder-centric transformer pretrained on clinical time series (pathological gait analysis for spinal cord injury), combines three objectives: Local Completion, Temporal Continuity, and Unsupervised In-Context Dynamics. The study shows that dynamics-centric objectives produce the most balanced transferable representations across classification and regression tasks.

Papers Reasoning Fine-tuning

arXiv cs.CL·May 27

The Daily Dose: Workflow-Integrated Large Language Model Automation for Clinical Summarization and Trial Identification in Radiation Oncology

The Daily Dose (TDD) is an LLM-driven system integrated into routine radiation oncology practice for automated clinical summarization and trial identification. Evaluation of 55 clinicians: 83.6% use TDD daily, mean satisfaction 3.89/5, 27% report ≥10 minutes saved per day.

Code generation RAG Business

arXiv cs.CL·May 27

Memory Architectures for Multi-Turn Text-to-SQL: A Benchmark and Empirical Study

EnterpriseMem-Bench, a multi-turn Text-to-SQL benchmark with 1,400 turns across 300 sessions, evaluates GPT-5 mini, GPT-5.2, Claude Sonnet 4.5/4.6, and Opus 4.6. Key findings: without memory, accuracy collapses by Turn 3; working memory dominates complex architectures; Sonnet 4.6 regresses 17-33pp on SEC EDGAR vs Sonnet 4.5.

Benchmarks Code generation GPT

arXiv cs.CL·May 27

Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications

Unified survey on Pretraining Data Exposure (PDE) in LLMs, covering membership inference and data contamination. Formalizes PDE across exposure levels, reviews attack and defense methods, and identifies open challenges for evaluation integrity and privacy protection.

AI safety Alignment Evals

arXiv cs.LG·May 27

From Privacy to Generalization: Linear Max-Information Bounds for DP-SGD

Theoretical paper proving a finite-sample bound on approximate max-information of DP-SGD with linear scaling in dataset size. Derives a general PAC-Bayes generalization bound where the prior distribution is learned by DP-SGD, and a generalization bound for DP-SGD-trained models with complexity term explicitly controlled by optimization hyperparameters.

Papers AI safety Alignment

arXiv cs.CL·May 27

Self-Verified Distillation: Your Language Model Is Secretly Its Own Synthetic Data Pipeline

Qwen3 improves reasoning via Self-Verified Distillation, a post-training algorithm requiring no external data. The model generates solutions, filters them through self-verification (cycle-consistency, factuality, correctness), then trains on self-curated data. Gains: +16.7 points math (AIME26/HMMT), +11.1 science (GPQA), +8.3 coding for Qwen3-4B.

Qwen Fine-tuning Reasoning

arXiv cs.CL·May 27

SPEAR: Code-Augmented Agentic Prompt Optimization

SPEAR is an agentic prompt optimizer integrating a Python sandbox for structural error analysis (confusion matrices, clustering). Evaluated on 13 industrial LLM-as-judge tasks and BBH-7, it outperforms GEPA and TextGrad (κ 0.857 vs 0.359 on tool-selection; F1-macro 0.815 vs 0.763). Python tool contributes +0.79κ on complex judge tasks.

Prompt engineering AI Agents Code generation

arXiv cs.LG·May 27

Classification and detection of multiple UAVs using rational Gaussian wavelet neural networks

UAV detection system using sound signals with rational Gaussian wavelet neural networks for adaptive feature extraction. Classifies single drones and swarms while maintaining interpretability. Outperforms traditional ML approaches in indoor and outdoor environments. Implementation publicly available.

Vision Benchmarks

arXiv cs.LG·May 27

Balancing Plasticity and Stability with Fast and Slow Successor Features

Study on RL agent adaptation in gradually non-stationary environments. Authors modify 3D Miniworld and MuJoCo environments to introduce continuous drift, showing that synaptic consolidation applied to multi-timescale Successor Features outperforms Q-value-based approaches. Stability outweighs plasticity in continual learning with gradual changes.

Reinforcement learning Papers Benchmarks

arXiv cs.CL·May 27

Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations

Mechanistic analysis of LLM hallucinations on linearized structured knowledge (graphs, tables). Hallucinations stem from systematic internal dynamics: attention disproportionately concentrates on shortcut structural cues, feed-forward representations fail to ground provided knowledge, model reverts to parametric memory. Patterns generalize to multi-hop graphs and tabular data.

Reasoning Papers AI safety

arXiv cs.AI·May 27

MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration

MobileExplorer accelerates on-device inference for mobile GUI agents through online exploration. The framework exploits VLM reasoning time to probe UI elements in parallel, recording exploration traces as structured memory. With a two-level rollback mechanism, it reduces reasoning steps and end-to-end latency by 23% on AndroidWorld.

AI Agents Vision Reasoning

arXiv cs.LG·May 27

Modeling Dynamic Mixtures of Time-Delay Systems from Streaming Time Series

Online framework modeling streaming time series as dynamic mixtures of time-delay systems. Uses compact tensor representation of Markov parameters to capture system dynamics and input delays, with tensor decomposition for rapid regime adaptation. DelayMix outperforms baselines on real non-stationary data with superior forecast accuracy and faster delay adaptation.

Benchmarks Papers

arXiv cs.AI·May 27

Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems

Study of policy-gradient method failures in long-horizon decision problems with cumulative damage. Authors identify two orthogonal failure modes and propose decomposition separating completion (reaching terminal horizon) and optimality (matching dynamic programming). Experiments on bricklayer career (49 steps) and NBA forward career (20 seasons).

Reinforcement learning Papers Reasoning

arXiv cs.AI·May 27

Advancing Creative Physical Intelligence in Large Multimodal Models

MM-CreativityBench, a new benchmark, evaluates large multimodal models' ability to solve creative problems by identifying non-obvious object uses in physically constrained environments. Current LMMs fail due to insufficient grounded exploration and hallucinations. Affordance-grounded alignment via Direct Preference Optimization reduces these errors and improves entity selection.

Benchmarks Vision Reasoning

arXiv cs.AI·May 27

Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning

arXiv study on legal LLM evaluation. Existing models are sensitive to legally irrelevant variations. LexGuard, an adversarial multi-agent framework, formalizes statutes into executable constraints and uses SMT solvers to verify legal satisfaction and logical consistency.

Reasoning Multi-agent AI safety

arXiv cs.AI·May 27

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

OmniToM is a benchmark evaluating theory of mind in LLMs through explicit belief modeling. Built on 895 stories (22,343 annotated belief propositions), it tests extraction and labeling of mental states across 7 dimensions. Results show current LLMs struggle to transform narrative facts into actors' beliefs and shared mental states.

Benchmarks Reasoning Evals

arXiv cs.LG·May 27

SilIF: Silhouette-Augmented Isolation Forest for Unsupervised Transaction Fraud Detection

SilIF augments Isolation Forest for fraud detection by adding a silhouette-based scoring layer computed from tree path lengths. On IEEE-CIS benchmark (~590K transactions, 3.5% fraud), SilIF achieves +0.0080 AUC-PR improvement over plain IF (p=0.046). No gains on Sparkov dataset; paper characterizes when the augmentation helps.

Benchmarks Evals Open source

arXiv cs.AI·May 27

What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation

Study on chain-of-thought (CoT) mechanisms at probe time. Authors show performance gains arise primarily from lexical activation and short-range token co-occurrence (2-3 tokens), not global logical derivation. Even word-shuffled rationales substantially outperform no-rationale baselines.

Reasoning Prompt engineering Papers

arXiv cs.LG·May 27

TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models

TSFMAudit, first contamination auditing method for time series foundation models (TSFMs). Detects whether evaluation datasets were exposed during pretraining by analyzing fine-tuning adaptation dynamics: contaminated data exhibits unusually fast loss reduction. Evaluated on 6 TSFMs and 187 datasets.

Benchmarks Evals Papers

arXiv cs.CL·May 27

FAB-Bench: A Framework for Adaptive RAG Benchmarking in Semiconductor Manufacturing

FAB-Bench is an adaptive benchmarking framework for evaluating RAG systems in semiconductor manufacturing. It defines 6 diagnostic metrics (factual accuracy, contextual utilization, completeness, retrieval relevance, technical depth, reasoning consistency) across context windows of 4K-32K tokens. Benchmark of 200 query-answer pairs tested on 4 LLMs and 4 RAG frameworks.

RAG Benchmarks Evals

arXiv cs.LG·May 27

GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training

GAC is an adaptive controller for hybrid SFT-RL post-training that dynamically adjusts mixing weights based on online estimates of gradient variance and disagreement between the two training signals. Tested on math, code, science, and logic benchmarks, GAC improves fixed baselines with less than 1% computational overhead.

Reinforcement learning Fine-tuning Benchmarks

arXiv cs.LG·May 27

A PAC-Bayesian View of Generalisation for Physics-Informed Machine Learning

PAC-Bayesian framework for physics-informed machine learning (PIML) integrating partial differential equations. Provides high-probability generalisation guarantees with unbounded losses via multi-task perspective. Non-vacuous bounds validated on standard PDE benchmarks.

Papers Reasoning Benchmarks

arXiv cs.LG·May 27

GEM: Geometric Entropy Mixing for Optimal LLM Data Curation

GEM (Geometric Entropy Mixing) reformulates LLM data curation as a variational problem on the hypersphere to prevent cluster collapse. It uses a provable MM algorithm and teacher-student distillation to scale to web-scale data. Improves downstream accuracy by up to 1.2% on 1.1B models when integrated with DoReMi and RegMix.

Papers Benchmarks Fine-tuning

arXiv cs.CL·May 27

Hubness, Not Anisotropy, Drives Cross-Lingual Retrieval Asymmetry in Multilingual Embedding Models

Study on cross-lingual retrieval asymmetry in 5 multilingual models (Gemini, Mistral, OpenAI, Qwen). Analysis of 6,518 idiomatic expressions in English, Bengali, Hindi, Arabic. Finding: hubness (vector concentration) is the dominant causal driver (49.5% dominance share), far exceeding anisotropy. CSLS correction closes 63.5% of reciprocity gap.

Embeddings Benchmarks Multi-agent
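
For readers unfamiliar with the CSLS correction mentioned above, the sketch below follows the commonly used formulation of cross-domain similarity local scaling: each cosine similarity is doubled and then penalised by the mean similarity of both points to their k nearest neighbours on the other side, which demotes "hub" vectors that are close to everything. The toy 2-D vectors, the choice k=2, and the function names are assumptions for illustration and are unrelated to the paper's data or models.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def csls_matrix(sources, targets, k=2):
    """CSLS(x, y) = 2*cos(x, y) - r_T(x) - r_S(y)."""
    sim = [[cosine(s, t) for t in targets] for s in sources]

    def mean_topk(row):
        return sum(sorted(row, reverse=True)[:k]) / k

    # r_T(x): mean similarity of each source to its k nearest targets.
    r_src = [mean_topk(row) for row in sim]
    # r_S(y): mean similarity of each target to its k nearest sources.
    r_tgt = [mean_topk([sim[i][j] for i in range(len(sources))])
             for j in range(len(targets))]
    return [[2 * sim[i][j] - r_src[i] - r_tgt[j]
             for j in range(len(targets))]
            for i in range(len(sources))]


def argmax(row):
    return max(range(len(row)), key=lambda j: row[j])


sources = [(1.0, 0.0), (0.0, 1.0)]
# Index 0 and 1 are the intended matches; index 2 is a hub near both sources.
targets = [(1.0, -1.2), (-1.2, 1.0), (1.0, 1.0)]

raw_nn = [argmax([cosine(s, t) for t in targets]) for s in sources]
csls_nn = [argmax(row) for row in csls_matrix(sources, targets, k=2)]
```

In this toy setup plain cosine retrieval sends both sources to the hub, while the CSLS scores recover the intended matches.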

arXiv cs.LG·May 27

Provably Communication-Efficient and Privacy-Preserving Federated Graph Neural Networks

CE-FedGNN is a federated framework for graph neural networks that reduces communication by infrequently exchanging aggregated node representations instead of per-round embeddings. A moving-average estimator handles cross-client dependencies and staleness. The framework provides privacy guarantees via metric-DP and achieves O(1/√T) convergence with O(T^(3/4)) communication complexity.

arXiv cs.LG·May 27

Planning Neural Dynamics with Lie Group Embedding through Supervised Projective Manifold Learning

LieEDNN introduces Lie group embedded dynamical neural networks to model continuous symmetries on manifolds. The method addresses incompatibility between neural network addition and non-Euclidean geometry through adjoint Lie group actions on Lie algebras. Tested on SE(3) for telescopic manipulator control.

Reasoning Robotics Papers

arXiv cs.AI·May 27

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

ScientistOne, an autonomous research system, introduces Chain-of-Evidence (CoE) to trace every claim to its source. Evaluation across 75 papers: baseline systems show 21% hallucinated references, 42% score verification pass rate. ScientistOne achieves 0 hallucinations, perfect verification, and matches or exceeds human expert performance on five tasks.

AI Agents Reasoning Evals

arXiv cs.CL·May 27

AI evaluation may bias perceptions: The importance of context in interpreting academic writing

Study showing that evaluation methods for AI use in scientific publications produce significant biases when ignoring contextual differences across countries and fields. Pooled benchmarks conflate pre-existing stylistic variation with LLM-generated text, overestimating AI in some contexts and underestimating it in others.

Evals Papers AI safety

arXiv cs.CL·May 27

In-Context Optimization for Retrieval-Augmented Generation: A Gradient-Descent Perspective

Theoretical study showing that one linear self-attention layer can implement a gradient-descent step on a unified RAG objective. Authors propose a lightweight method to adapt query-evidence interaction without modifying retriever or backbone, tested on 7 QA benchmarks with consistent improvements over baseline.

RAG Reasoning Papers

arXiv cs.AI·May 27

Managing Uncertainty in LLM-Generated Procedural Knowledge for Virtual Laboratory Planning

A framework for managing uncertainty in LLM-generated procedures for virtual laboratory planning in educational settings. The system uses structured domain representations and uncertain LLM-generated state-transition samples to extract procedural rules, transform them into explicit constraints, and repair defective procedural steps.

Reasoning AI Agents Papers

arXiv cs.LG·May 27

The Constraint Tax: Measuring Validity-Correctness Tradeoffs in Structured Outputs for Small Language Models

Study on the cost of structured output constraints for small language models (< 3B). Tests on Qwen2.5-0.5B/1.5B and SmolLM2-1.7B show that enforcing JSON schema validity (61.5% → 100%) reduces answer accuracy (19.7% → 11.0%) and increases semantically invalid outputs (49.5% → 88.9%). Recommendation: report schema validity, answer accuracy, and semantic error rates separately.

Qwen Code generation Evals

arXiv cs.AI·May 27

Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

arXiv paper reveals that models with statistically indistinguishable atomic knowledge systematically fail to chain those facts in multi-hop reasoning (>40 percentage point gap). Aggregate metrics mask this 'composition collapse'. Authors introduce a double-gate protocol decomposing post-training gains into three independent channels: atomic stability, residual composition, and critical depth.

Reasoning Benchmarks Evals

arXiv cs.CL·May 27

Conv-to-Bench: Evaluating Language Models Via User-Assistant Dialogues In Code Tasks

Conv-to-Bench automatically converts multi-turn user-assistant dialogues into structured evaluation checklists for code tasks. The framework achieves Spearman correlation ρ=1.000 with BigCodeBench, with human agreement κ=0.705 for LLM-as-a-judge evaluation.

Benchmarks Code generation Evals
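
A Spearman correlation of 1.000, as quoted above, means the two evaluations order the models identically, whatever the raw score values are. The sketch below computes the coefficient with the standard no-ties formula, ρ = 1 − 6Σd²/(n(n²−1)); the score lists are invented for illustration and are not taken from the paper.

```python
def ranks(scores):
    """Rank 1 = highest score. Assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman rank correlation using the no-ties formula."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical scores for four models under two evaluation schemes.
checklist_scores = [0.61, 0.48, 0.72, 0.30]
reference_scores = [55.0, 41.0, 80.0, 12.0]   # same ordering of models
swapped_scores = [41.0, 55.0, 80.0, 12.0]     # first two models swapped

rho_same = spearman_rho(checklist_scores, reference_scores)
rho_swapped = spearman_rho(checklist_scores, swapped_scores)
```

Because only ranks matter, a perfect ρ over a handful of models says the orderings agree; it says nothing about how close the underlying scores are.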

arXiv cs.AI·May 27

On the Detection of Commutative Factors in Factor Graphs: Necessary and Sufficient Conditions

Theoretical paper on detecting commutative factors in probabilistic factor graphs. Authors identify a flaw in the state-of-the-art algorithm: the central theorem provides only necessary, not sufficient conditions. They propose a corrected version ensuring correctness while maintaining efficiency.

Papers Reasoning

arXiv cs.AI·May 27

It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

Study of 432 experiments across 6 models (4 capability tiers) testing whether higher-capability models need less structural guidance. Results refute monotone relationship: Gemini 2.5 Flash performance drops 29-38pp with increased harness verbosity. Qwen3.5-122B (reasoning) achieves 91.7% VTSR with strict harness. Six-label failure taxonomy introduced.

AI Agents Evals Reasoning

arXiv cs.LG·May 27

Unified Neural Scaling Laws

Unified Neural Scaling Law (UNSL) functional form models simultaneous variation of model parameters, training dataset size, training steps, inference steps, compute, and hyperparameters on performance. Validated across vision, language, math, and reinforcement learning with more accurate extrapolations than existing scaling laws.

Benchmarks Papers Reasoning

arXiv cs.AI·May 27

Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents

Study on external tool use by medical AI agents under tool failures. Proposes GRPO-based RL framework with instance-level selection instead of task-level, probabilistic risk minimization rewards and disagreement-aware synergy learning. Evaluation on 7 medical benchmarks shows consistent robust improvements.

AI Agents Reinforcement learning Reasoning

arXiv cs.CL·May 27

Cultural Value Alignment Via Latent Activation Steering in Large Language Models

Study on cultural value alignment in LLMs via activation steering. Researchers bypass safety refusals using 300 situational dilemmas to extract latent cultural values, then apply activation steering without retraining. Key finding: cultural values are encoded as coupled structures, limiting precise alignment.

Alignment Reasoning Evals

arXiv cs.CL·May 27

Curation and Extraction of Drug-Related Entities from Reddit Platform

ReDose is a dataset of 6,435 Reddit posts annotated by toxicologists to extract DRUG, DOSE, and EFFECT entities. BiomedBERT achieves F1=0.843 for DRUG; Llama-3 70B outperforms GPT-4 (F1=0.79 vs 0.72). EFFECT extraction remains challenging (GPT-4 recall=0.41).

Benchmarks RAG Llama

arXiv cs.CL·May 27

Evidence Absence Is Not Evidence Insufficiency: Diagnosing NEI Construction Artifacts in Fact Verification

NEI-CAP, a diagnostic protocol to audit the construction of "Not Enough Information" labels in fact verification benchmarks. Researchers show NEI competence does not transfer reliably across constructions: models trained on shortcut-prone evidence conditions fail to recognize semantically related insufficient evidence. Tested on SciFact, FEVER, and HoVer.

Benchmarks Evals Papers

arXiv cs.LG·May 27

Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection

CoAD, a novel framework for time series anomaly detection, unifies classification (Outlier Exposure) and reconstruction (Masked Autoencoder) paradigms. The classification module generates probability-informed soft masks for the reconstruction module, addressing generalization and masking misalignment issues. Experiments on standard benchmarks demonstrate significant improvements with faster inference.

Benchmarks Papers

arXiv cs.AI·May 27

Experiments in Agentic AI for Science

Two agentic AI frameworks for scientific workflows: DeepTS/DeepCollector automates large-scale curation and deduplication of time-series datasets; DeepScribe autonomously analyzes complex physics lectures to generate structured reports. Hybrid Local Body/Remote Brain architecture via Google Colab with Python orchestrators and cloud LLM backends.

AI Agents RAG Reasoning

arXiv cs.AI·May 27

A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks

MeDial-Speech: dataset of 111+ hours of spoken medical dialogues (robot-patient and doctor-patient) covering 4 health conditions. Benchmark of 3 LLMs (GPT-4 mini, DeepSeek-V3, Claude Sonnet 4) via sentence selection: Claude Sonnet 4 achieves 71.1% accuracy. Reveals systematic overconfidence in model predictions.

Benchmarks Claude DeepSeek

arXiv cs.AI·May 27

LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

LiveK12Bench is a dynamic multi-disciplinary benchmark evaluating reasoning capabilities of multimodal models on 2K+ real exam questions (Math, Physics, Chemistry, Biology). Tests reveal major performance degradation: GPT-5 drops from 79 to 53/100 under realistic exam constraints. Framework includes automated anti-contamination pipeline and end-to-end 'Mock Exam' evaluation scheme.

Benchmarks Vision Reasoning

arXiv cs.LG·May 27

Semigroup Consistency as a Diagnostic for Learned Physics Simulators

New diagnostic metric for learned physics simulators: semigroup error measures temporal consistency by comparing direct evolution over s+t with composed evolution (s then t). Tested on heat and Burgers dynamics with ConvNet and FNO baselines, Spearman correlation ρ=0.635 with rollout degradation. Useful as post-hoc evaluation rather than universal training objective.

Benchmarks Evals Papers
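
The semigroup check described above can be sketched on a scalar decay equation du/dt = -k·u, used here as a simple stand-in for heat dynamics. The two "simulators" are hand-written functions (an exact flow and a single explicit Euler step), not learned models, and all names are assumptions for illustration; the paper's metric and baselines may be defined differently.

```python
import math


def exact_step(u, dt, k=1.0):
    # Exact solution operator: satisfies the semigroup property.
    return u * math.exp(-k * dt)


def euler_step(u, dt, k=1.0):
    # One explicit Euler step of size dt: only approximately a semigroup.
    return u * (1.0 - k * dt)


def semigroup_error(step, u0, s, t):
    """|step(u0, s + t) - step(step(u0, s), t)|."""
    direct = step(u0, s + t)
    composed = step(step(u0, s), t)
    return abs(direct - composed)


err_exact = semigroup_error(exact_step, u0=1.0, s=0.3, t=0.2)
err_euler = semigroup_error(euler_step, u0=1.0, s=0.3, t=0.2)
```

For the Euler step the gap equals k²·s·t·u0 (0.06 here), while the exact flow agrees up to floating-point rounding, so the metric separates the two without needing ground-truth trajectories.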

arXiv cs.CL·May 27

Verilog-Evolve: Feedback-Driven and Skill-Evolving Verilog Generation

Verilog-Evolve is a feedback-driven framework for iteratively refining LLM-generated Verilog. The system evaluates candidates via functional simulation, Yosys synthesis, ABC timing proxy, and GEMM metrics, then evolves modular skills across tasks. Results on VerilogEval show improved functional success and downstream RTL quality.

Code generation Reinforcement learning Evals

arXiv cs.LG·May 27

When Correct Demonstrations Hurt: Rethinking the Role of Exemplars in In-Context Learning

An arXiv study reveals that correct demonstrations can degrade in-context learning (ICL) performance. Researchers introduce task-preserving perturbations to show that correctness does not guarantee utility: changing an exemplar's input while keeping a correct output can reduce accuracy, especially for smaller models and harder tasks.

Prompt engineering Reasoning Evals

arXiv cs.AI·May 27

The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence

MiniMax introduces the M2 series, MoE models with 229.9B total parameters and 9.8B activated per token. Built for agentic deployment, they integrate agent-driven data pipelines, Forge (agent-native RL system), and M2.7 with self-evolution capabilities. Frontier-tier performance on agentic coding, deep search, and reasoning benchmarks.

AI Agents Reasoning Code generation

arXiv cs.AI·May 27

MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning

MedGuideX transforms clinical practice guideline (CPG) recommendations into executable decision logic to generate question-answering training data. Post-training a medical LLM on this data improves accuracy by 10.28% across four clinical reasoning benchmarks and produces physician-preferred rationales in faithfulness, validity, completeness, and clarity.

Fine-tuning Reasoning Evals

arXiv cs.AI·May 27

FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning

FAST-GOAL enhances CLIP to handle lengthy text descriptions through global-local semantic alignment. The method combines efficient local region extraction (FLISM) and token similarity-based learning (TSL). A new GLIT100k dataset with global image-caption pairs and derived local pairs validates the approach on DOCCI, DCI, MSCOCO, Flickr30k.

Vision RAG Embeddings

arXiv cs.AI·May 27

Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions

POLAR is a framework for MLLM-based embodied agents that personalizes assistance through a multimodal knowledge graph. It organizes past interactions into semantic memory (visual concepts) and episodic memory (agent trajectories), improving performance especially for multi-hop reasoning and tracking user-specific context updates.

AI Agents Vision RAG

arXiv cs.AI·May 27

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

Comparative study of three LLM approaches on 1,000 math problems (GSM-Symbolic): chain-of-thought (CoT), Program-Aided Language models (PAL), and Step-by-Step Coding (SBSC). CoT proves more robust to variations (1.3pp drop vs 1.7pp for PAL), contradicting the hypothesis that code execution improves reasoning robustness.

Reasoning Code generation Benchmarks

arXiv cs.CL·May 27

Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines

Survey of alignment data pipelines for LLMs. Decomposes construction into three stages: response synthesis, preference evaluation, preference instantiation. Identifies recurring design trade-offs and principles clarifying how pipeline choices influence optimization signal.

Alignment Reinforcement learning Papers

arXiv cs.CL·May 27

MicroSpec: Accelerating Speculative Decoding with Lightweight In-Context Vocabularies

MicroSpec reduces active vocabulary by 40x (under 3k tokens) during speculative decoding without additional training. The technique exploits temporal locality in language generation and integrates asynchronous GPU memory management. End-to-end speedup of 1.12-1.32x vs EAGLE-2.

Code generation Infrastructure Benchmarks

arXiv cs.AI·May 27

From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator

Theoretical and empirical work on training LLM-based dialogue agents. Identifies context distribution shift as fundamental limitation of Static Context RL and Interactive RL. Proposes Calibrated Interactive RL combining interactive RL with simulator alignment to reduce sim-to-real gap and improve multi-turn dialogue quality.

Reinforcement learning AI Agents Reasoning

arXiv cs.AI·May 27

JobBench: Aligning Agent Work With Human Will

JobBench evaluates 36 AI models (including Claude Opus at 45.9%) on 130 real professional tasks across 35 occupations. Unlike existing benchmarks focused on economic value, JobBench prioritizes workflows experts identify as high-priority for delegation, favoring human augmentation over replacement.

AI Agents Benchmarks Claude

arXiv cs.CL·May 27

Conceptual Steganography

Researchers demonstrate that language models can hide covert messages in Chain-of-Thought sequences through high-level reasoning patterns, bypassing paraphrase defenses. This conceptual steganography is more robust than lexical approaches across four model families. A strategy-aware paraphraser can mitigate this backdoor communication channel.

Reasoning AI safety Alignment

SIG

HYP

arXiv cs.CL·May 27

CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations

CroCo extends contrastive preference tuning on self-generations to 14 languages (high and low-resource). A reward model trained on English preferences produces useful within-language rankings across languages without language-specific annotation. Gains confirmed on EuroLLM-9B and Aya-3B with on-policy data.

Reinforcement learning Papers

SIG

HYP

arXiv cs.AI·May 27

MemFail: Stress-Testing Failure Modes of LLM Memory Systems

MemFail is a diagnostic benchmark isolating failure modes of modern LLM memory systems. Authors formalize these systems as a composition of three operations (summarization, storage, retrieval) and construct five adversarial datasets to test each. Evaluation of four SOTA systems reveals architectural tradeoffs.

AI Agents Benchmarks Evals

SIG

HYP

arXiv cs.CL·May 27

Probing Minimalist Phase Structure in LLMs: What Universal Dependencies Cannot Represent

Researchers test whether LLMs encode formal syntactic structures (Minimalist Program phase boundaries) invisible to Universal Dependencies. Across 13 models (4 families), 12/13 show a phase-count gradient, and 13/13 display an asymmetry predicted by phase-internal cohesion. Activation patching confirms these representations are causally active in 12/13 models.

Papers Reasoning Evals

SIG

HYP

arXiv cs.CL·May 27

Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

Study of 20,000 stories from 4 LLMs: 11 words (including Elias, Mara, Elara, lighthouse, clockmaker, and librarian) appear in 88.3% of generated narratives. These tokens originate from preference data used during alignment, not training data. Reveals disproportionate impact of small datasets combined with powerful alignment algorithms.

Benchmarks Alignment Evals

SIG

HYP

arXiv cs.CL·May 27

The Labyrinth and the Thread: Rethinking Regularizations in Sequential Knowledge Editing for Large Language Models

Theoretical and empirical investigation of sequential knowledge editing mechanisms in LLMs. Authors prove formal equivalence between one-time and sequential editing, demonstrating stability emerges naturally without complex regularizations. Code released.

Fine-tuning Reasoning Papers

SIG

HYP

arXiv cs.LG·May 27

Curriculum Learning for Safety Alignment

Staged-Competence, a curriculum learning framework, improves robustness of DPO-based safety alignment. Across three model families, it reduces out-of-distribution harmful response rates by 16% and jailbreak attack success rates by 20%, while preserving general capabilities. The framework achieves baseline safety with 75% of training data.

AI safety Alignment Reinforcement learning

SIG

HYP

arXiv cs.CL·May 27

Towards Just-in-Time Adaptive Feedback: Enhancing Student Learning via Knowledge-Grounded LLM

Adaptive learning framework using LLMs grounded in expert domain knowledge to deliver just-in-time pedagogical feedback. Deployed in a university course (N>1000), it improves student performance by 80% by analyzing reasoning essays and correcting conceptual errors through iterative LLM conversations.

Reasoning RAG Evals

SIG

HYP

arXiv cs.LG·May 27

Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion

Autoregressive video diffusion models use quantized KV caches to reduce memory, but quantization creates an attention bias (Jensen bias) that degrades quality. Authors propose a per-attention-score correction computed from quantization step sizes, recovering quality lost with INT2 quantization while using 50% less memory than INT4.

Video generation Reasoning Benchmarks

SIG

HYP

arXiv cs.AI·May 27

BrickAnything: Geometry-Conditioned Buildable Brick Generation with Structure-Aware Tokenization

BrickAnything generates physically buildable brick structures from 3D shapes using an autoregressive framework. The method introduces structure-aware tree tokenization to model brick dependencies, with validity-constrained decoding and preference-based alignment to improve stability and geometric fidelity.

Papers Code generation Reasoning

SIG

HYP

arXiv cs.LG·May 27

Stateful Inference for Low-Latency Multi-Agent Tool Calling

Stateful inference architecture for multi-agent tool calling with persistent KV cache across turns, reducing cost from O(n_t) to O(Δ_t). 2.1× speedup on 6-turn workflows, 4.2× on 35-turn median vs vLLM/SGLang.

AI Agents Multi-agent Infrastructure

SIG

HYP
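A rough sketch of the cost argument in the stateful inference item above, assuming n_t denotes the full context length at turn t and Δ_t the tokens newly added in that turn. The function names and the 6-turn, 500-token workload are hypothetical, and the code only counts prefilled tokens; it is not the paper's implementation or its benchmark.

```python
from itertools import accumulate


def stateless_prefill(new_tokens_per_turn):
    """Tokens prefilled per turn when the whole context is re-processed each turn."""
    # Turn t re-reads everything seen so far, so its cost is the running total n_t.
    return list(accumulate(new_tokens_per_turn))


def stateful_prefill(new_tokens_per_turn):
    """Tokens prefilled per turn when the KV cache persists across turns."""
    # Only the newly appended tokens (the delta) need a forward pass.
    return list(new_tokens_per_turn)


# Hypothetical workload: 6 turns, each appending 500 tokens of prompt and tool output.
deltas = [500] * 6
stateless = stateless_prefill(deltas)
stateful = stateful_prefill(deltas)

print("stateless per turn:", stateless)
print("stateful per turn: ", stateful)
print("total prefilled tokens:", sum(stateless), "vs", sum(stateful))
```

Under these assumptions the stateless total grows quadratically with the number of turns while the stateful total grows linearly, which is consistent with the summary reporting a larger speedup on longer workflows. Measured speedups also depend on decode time and memory management, which this count ignores.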

arXiv cs.AI·May 27

Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory

Paper formalizing AI agent memory as a distinct data-management workload. Proposes GEM (Governed Evolving Memory) with four state-level operators (ingestion, revision, forgetting, retrieval) and six correctness conditions. Proves record-level systems cannot satisfy these conditions. Prototype MemState on property-graph backend.

AI Agents Papers Infrastructure

SIG

HYP

arXiv cs.AI·May 27

Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

Large reasoning models (LRMs) jointly encode refusal in residual stream activations and chain-of-thought (CoT). On DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in 39% of cases with fixed CoT, but 70% without CoT. Regenerating CoT under steering achieves 94% success, revealing refusal is distributed across activations and CoT.

Reasoning AI safety Alignment

SIG

HYP

arXiv cs.AI·May 27

Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

AgingBench, a longitudinal reliability benchmark, measures how deployed AI agents degrade over time. Study across 14 models and ~400 runs shows reliability depends on four mechanisms: compression, interference, revision, and maintenance aging. Agents lose factual precision even when behavioral tests remain clean.

AI Agents Evals Benchmarks

SIG

HYP

arXiv cs.LG·May 27

QAM-W: Joint 2D Codebook Quantization for LLM Weights via Hadamard Rotation and Activation-Aware Scaling

QAM-W is a 2D quantization codec for LLM weights using Hadamard rotation and activation-aware scaling. Across 5 models (1.1B–13B), the activation-aware variant at ~5.5 bpw stays within ±0.4% of BF16 perplexity, matching SmoothQuant W8A8 quality with 32% fewer weight bits. 2D coding outperforms polar coding by 2–15 pp.

Fine-tuning Benchmarks Papers

SIG

HYP

arXiv cs.AI·May 27

Constraint acquisition needs better benchmarks

MPMMine is a benchmark suite for evaluating Constraint Acquisition (CA) algorithms that discover, validate, and enhance Mathematical Programming models. It standardizes domain knowledge artifacts in open formats (MiniZinc, CommonMark, JSON) and provides thousands of solutions/non-solutions to improve reproducibility and cross-study comparability.

Benchmarks Papers

SIG

HYP

arXiv cs.CL·May 27

Bounded Path Context: A Controlled Study of Visible Path History in LLM-Based Knowledge Graph Question Answering

Controlled study on path history length in knowledge-graph QA with LLMs. Bounded Path Context (BPC) limits exposed history (last K hops) while maintaining full path in symbolic memory. On WebQSP and CWQ with Qwen3.5-9B-AWQ: K=1 achieves 0.487 F1 (vs 0.472 full history) with 9.7% fewer tokens.

Reasoning Benchmarks Papers

SIG

HYP
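A minimal sketch of the Bounded Path Context idea from the item above: the full traversal path stays in symbolic memory, and only the last K hops are rendered into the text shown to the LLM. The class name, the triple format, and the example entities are hypothetical; the paper's actual prompt format and memory structure are not described in the summary.

```python
from dataclasses import dataclass, field


@dataclass
class PathState:
    """Keeps the full path symbolically; exposes only the last k hops as text."""
    k: int
    hops: list = field(default_factory=list)  # (head, relation, tail) triples

    def extend(self, head, relation, tail):
        # The complete path is always retained, whatever the value of k.
        self.hops.append((head, relation, tail))

    def visible_context(self):
        # Guard k == 0 explicitly: hops[-0:] would return the whole list.
        window = self.hops[-self.k:] if self.k > 0 else []
        return " ; ".join(f"{h} -[{r}]-> {t}" for h, r, t in window)


# Hypothetical two-hop traversal with K=1.
state = PathState(k=1)
state.extend("A", "r1", "B")
state.extend("B", "r2", "C")

print(state.visible_context())  # only the most recent hop is shown
print(len(state.hops))          # the full path is still stored
```

Keeping the full path outside the prompt means the final answer can still be traced back through every hop, while the per-step prompt stays short.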

arXiv cs.CL·May 27

LATTE: Forecasting Peer Anchored Preference Trajectories for Personalized LLM Generation

LATTE is a personalization framework for frozen LLMs that forecasts user preference trajectories by subtracting comparable peer profiles. A lightweight sequence predictor forecasts the next state, injected via a single anchored soft token. On Amazon Reviews 2023, LATTE achieves ROUGE-L=0.259 vs 0.219 for static profiles.

Prompt engineering Fine-tuning Benchmarks

SIG

HYP

arXiv cs.CL·May 27

NestedKV: Nested Memory Routing for Long-Context KV Cache Compression

NestedKV compresses KV cache for long-context models without training. The method maintains multi-scale key anchors (global, block-level, sliding-window), scores tokens by multi-time-scale cosine anomaly, and combines rankings with head-adaptive mixing and surprise-gated routing. Improvements up to 19.10 points on RULER and 19.29 on LongBench vs KeyDiff (Qwen3-4B, r=0.75).

Reasoning Benchmarks Qwen

SIG

HYP

arXiv cs.CL·May 27

Why Prompt Optimization Works, and Why It Sometimes Doesn't: A Causal-Inspired Edit-Level Analysis

Causal analysis of prompt optimization methods (DSPy, TextGrad) explaining generalization failures. Complexity-increasing edits harm mathematical and multi-hop reasoning, while step-by-step edits improve logical reasoning. Failures stem from systematic interactions between edit families and task characteristics, not random artifacts.

Prompt engineering Reasoning Benchmarks

SIG

HYP

arXiv cs.CL·May 27

Slide Deck Q&A Quality Assurance App: A Multi-Stage Pipeline for Pedagogical Question Generation

slidesqaqa is a Flask system generating pedagogical questions from PDF presentations. A 4-stage LLM pipeline (window planning, deck synthesis, slide annotation, reconciliation) processes text and images to produce coherent, non-redundant questions with evaluation scores in structured JSON output.

Code generation RAG Vision

SIG

HYP

arXiv cs.LG·May 27

MechRL: Reinforcement Learning Agents Perform Circuit Discovery for Mechanistic Interpretability

MechRL uses a PPO agent operating over 144 attention heads of GPT-2 small to automatically discover mechanistic circuits. Trained on induction and IOI tasks, the agent identifies causally relevant heads via zero-ablation and contrastive rewards, generalizing to docstring completion (96% of oracle with best-of-five planning).

Reinforcement learning Evals Papers

SIG

HYP

arXiv cs.LG·May 27

ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling

ARBITER corrects majority vote failures in test-time sampling. Reasoning trajectories cluster into stable basins that aren't necessarily accurate. ARBITER uses hidden states and model-derived evidence to add conservative signals to consensus, recovering ~22% of oracle gap on Llama-3.1-8B MMLU-HS-Math (78%→82%).

Reasoning Evals Benchmarks

SIG

HYP

arXiv cs.CL·May 27

Reliable Extraction of Clinical Follow-Up Instructions: A Hybrid Neural-Symbolic Pipeline

Hybrid neural-symbolic pipeline extracts clinical follow-up instructions (action, date) from outpatient notes. BioBERT + BIO tagging + biaffine linker + deterministic date normalization outperforms GPT-4o-mini and fine-tuned LLaMA-3: Pair F1 0.997 (seen) vs 0.51-0.57 for baselines.

Benchmarks Code generation Reasoning

SIG

HYP

arXiv cs.CL·May 27

The Need for an External Observer Formalizing the Sufficiency Gap: A Mathematical Extension of Mixture Identifiability and Contextual Grounding in Sequence Models

Theoretical paper on sequence models' insufficiency when facing unobserved latent states. Authors formalize a mixed-regime process where a perfect predictor becomes overconfident if observed context matches the wrong latent regime. They show the sufficiency gap can only be closed by perfect revelation of latent state or equivalent verification mechanism.

Reasoning Alignment AI safety

SIG

HYP

arXiv cs.AI·May 27

Exploiting Local Dynamics Regularity for Reusable Skills in Offline Hierarchical RL

CARL, an offline hierarchical reinforcement learning algorithm, exploits local dynamics regularity to discover reusable skills. By aligning global contexts with required action sequences, the method improves performance on OGBench when integrated with HIQL.

Reinforcement learning AI Agents Benchmarks

SIG

HYP

Latent Space·May 27

[AINews] New AI Infra decacorns: Fireworks, Baseten (with OpenRouter on the way)

Fireworks and Baseten join the AI infrastructure decacorn club ($10B+ valuations), with OpenRouter on the way. All three are key players in AI deployment and inference.

Infrastructure Funding Business

SIG

HYP

Le Big Data·May 27

Daily Brief: Google's AI agent is already thinking about your day before you do

Google launches Daily Brief, an AI agent that anticipates user needs by planning their day. The tool analyzes personal data to proactively suggest actions before the user requests them.

AI Agents DeepMind

SIG

HYP

Reddit r/LocalLLaMA·May 27

Stop traumatizing AI into loops and turn hallucinations into an honest "I don't know!" by being NICE to them (Proof of Concept, Research, I don't want to sell anything)

Researcher tests hypothesis that 'authoritarian' prompts ('IQ 200 expert') trigger thought loops similar to chronic stress in AI models, while 'gentle' prompts ('it's okay to fail') reduce latency and increase honest 'I don't know' responses. Results on Gemini, Mistral, Claude Haiku 4.5: less confabulation, faster responses.

Prompt engineering Reasoning AI safety

SIG

HYP

Reddit r/LocalLLaMA·May 27

How Qwen3.6-35B-A3B fails differently as a sub agent compared to solo

User reports Qwen3.6-35B-A3B as sub-agent exhibits different failure modes than solo use. Orchestrator accepts structurally correct but factually wrong responses without explicit validation. MoE architecture creates unpredictable variance across task types on consumer GPU.

Qwen AI Agents Multi-agent

SIG

HYP

Reddit r/LocalLLaMA·May 27

Single 3090 with Q4 Qwen 27B, context dropped from 137k to 14k with MTP enabled. Is it normal?

User reports dramatic context size drop (137k → 14k) when enabling MTP (Multi-Token Prediction) with speculative decoding on Qwen 27B Q4 in llama.cpp. Asks if this behavior is expected.

Qwen Code generation Open source

SIG

HYP

Hacker News (AI)·May 27

Agent Memory: An Anatomy

Article examining the architecture and memory mechanisms in AI agent systems. Analyzes different approaches to information storage and retrieval to improve persistence and contextualization of autonomous agents.

AI Agents RAG

SIG

HYP

Vercel AI Blog·May 27

Experimental native binaries for Vercel CLI

Vercel CLI ships an optional, experimental native binary that is faster and more secure, with no Node.js runtime dependency. Binaries are code-signed, and credentials are stored in the system Keychain (macOS). Available on macOS, Linux, and Windows for x64 and arm64.

Tools Infrastructure

SIG

HYP