May 2026

3149 articles

Better Later Than Sooner: Neuro-Symbolic Knowledge Graph Construction via Ontology-grounded Post-extraction Correction

Neuro-symbolic framework for ontology-grounded knowledge graph construction combining open-domain extraction, embedding-based canonicalization, and targeted LLM-based correction of ontology violations. Defers corrections to post-extraction stage to reduce token usage, improve KG consistency, and preserve QA quality for multi-hop reasoning and symbolic operations.

RAG Reasoning Embeddings

arXiv cs.AI·May 29

Governing Technical Debt in Agentic AI Systems

Paper defines 'Agentic Technical Debt': accumulated liability when prompts, memory, tool schemas, orchestration graphs, and control policies are patched together faster than validated and standardized. Introduces 'Stochastic Tax': recurring operating cost to keep probabilistic agent behavior within acceptable bounds. Proposes lightweight dashboards and governance controls for visibility.

AI Agents Multi-agent AI safety

arXiv cs.AI·May 29

Adopt ≠ Adapt: Longitudinal Analyses of LLM Conversations in the Wild

Longitudinal analysis of ~12,000 Microsoft Bing Copilot users reveals individual behavior patterns remain sticky over time despite population-level trends. Active users achieve higher success rates and tackle complex, professional tasks. WildChat-4.8M dataset skewed toward proficient power users.

Evals Benchmarks

arXiv cs.CL·May 29

Assessing Dutch Syllabification Algorithms and Improving Accuracy by Combining Phonetic and Orthographic Information through Deep Learning

Comparative assessment of four Dutch syllabification algorithms (Brandt Corstius, Liang, Trogkanis-Elkan CRF, and a novel deep learning model). The deep learning model combining phonetic and orthographic information achieves 99.65% word accuracy (+0.14% improvement over literature). Data-driven algorithms outperform knowledge-based approaches.

Papers Benchmarks Code generation

arXiv cs.LG·May 29

OISD: On-Policy Internal Self-Distillation of Language Models

OISD introduces on-policy internal self-distillation to improve language model reasoning. The final layer acts as a detached teacher for intermediate layers via logit alignment (reasoning behaviors) and attention alignment (attention patterns), without external privileged information. Positive results across four mathematical reasoning tasks.

Reinforcement learning Reasoning Papers

arXiv cs.LG·May 29

Causal Intelligence for Constraint-Aware Intervention Design to Induce State Transitions

COAST is a causal-intelligence approach for designing constrained interventions that induce state transitions. The system learns context-specific causal graphs, attributes distributional shifts to mechanism-level causal drivers, and uses multi-objective optimization balancing transition efficacy, intervention complexity, and target-state stability. Validated on synthetic benchmarks and real biological datasets.

Reasoning Benchmarks

arXiv cs.AI·May 29

Behavior-Aware Auxiliary Corrections for Off-Policy Temporal-Difference Prediction

Theoretical paper on stabilizing off-policy temporal-difference learning with function approximation. Proposes BA-TDC and BA-TDRC, replacing TDC's auxiliary matrix with behavior Bellman matrix. Linear analysis with convergence proof under Hurwitz stability condition; experiments on Markov chains and classical counterexamples.

Reinforcement learning Papers Benchmarks

arXiv cs.AI·May 29

Frontier LLM-based agents can overcome the ontology curation bottleneck for natural phenotypes

LLM agents (Claude and GPT) automatically annotate biological phenotypes by linking free-text descriptions to ontology terms. On the Dahdul et al. (2018) Gold Standard benchmark, all agents fall within inter-curator human variability and substantially outperform the Semantic CharaParser NLP tool on all four metrics.

AI Agents Claude GPT

arXiv cs.AI·May 29

Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth

Large-scale literature search study: Deep Research pipeline increases recall from below 20% to above 80% on RollingEval-Jun25 (250-paper benchmark). Critical analysis of human reference lists as ground truth: only 51% judged moderately relevant vs 86-88% for best AI re-rankers. Humans cite direct collaborators 2.5x more often.

RAG Evals Benchmarks

arXiv cs.LG·May 29

Ensemble Score Filtering for Real-Data Energy Consumption Forecast Correction

Energy consumption forecast correction method combining a pretrained spatio-temporal model with Ensemble Score Filter (EnSF). EnSF uses score-based diffusion models to assimilate partial and noisy observations. Real-data experiments show EnSF outperforms Ensemble Kalman Filter under nonlinear observation settings.

Benchmarks Papers Reasoning

arXiv cs.AI·May 29

Orthogonal Concept Erasure for Diffusion Models

Orthogonal Concept Erasure (OCE) proposes an editing method to remove undesired concepts from diffusion models using multiplicative orthogonal transformations. Unlike existing additive approaches, OCE preserves neuron magnitude and angular geometry while precisely erasing concepts. The approach erases up to 100 concepts in 4.3 seconds.

Papers AI safety Alignment

arXiv cs.CL·May 29

Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation

Comprehensive evaluation of 14 open-source safety guard models on 79,331 samples across 8 NIST AI Risk Management Framework categories. Qwen Guard (4B) achieves highest recall (83.97%), outperforming Llama Guard (12B) and GPT-OSS Safeguard (20B). Model size does not correlate with safety detection performance.

Benchmarks AI safety Open source

arXiv cs.CL·May 29

OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

OmniRetrieval is a framework unifying retrieval across heterogeneous knowledge sources (unstructured text, relational tables, knowledge graphs). It translates natural-language queries into source-native queries, evaluated on 13 datasets and 309 knowledge bases.

RAG Vector search Papers

arXiv cs.CL·May 29

UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning

UA-Legal-Bench evaluates 11 LLMs (3B–675B) on 5 Ukrainian legal reasoning tasks from 99.5M court decisions. Results show task-dependent few-shot effects: +38.6 pp improvement for judgment form classification, but mixed effects on outcome prediction. Accuracy is misleading on imbalanced tasks: the highest-accuracy model (62%) is a majority-class predictor with a macro-F1 of only 23%.

Benchmarks Evals Papers
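
To make the accuracy versus macro-F1 gap in the UA-Legal-Bench entry concrete, here is a minimal Python sketch on made-up data. The three-class label split (62/20/18) and the always-predict-"A" model are assumptions chosen for illustration, not the benchmark's data, so the macro-F1 it yields differs from the 23% reported above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly gets an F1 of zero.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced task: 62% of cases belong to class "A".
y_true = ["A"] * 62 + ["B"] * 20 + ["C"] * 18
# A majority-class predictor always outputs "A".
y_pred = ["A"] * 100

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.2f}, macro-F1 = {f1:.3f}")
```

Because macro-F1 weights every class equally, the two classes the predictor never outputs each contribute zero, which pulls the average far below the accuracy figure.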

arXiv cs.LG·May 29

Molecular Lead Optimization via Agentic Tool Planning

TRACE, an LLM-reasoning agent for drug lead optimization, formulates tool selection as sequential decision-making over action trajectories. The approach improves ADMET properties while preserving critical molecular substructures, outperforming baselines on multiple optimization tasks.

AI Agents Reasoning Papers

arXiv cs.AI·May 29

The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane

Redpanda introduces an Agentic Data Plane architecture using out-of-band metadata channels to enforce security policies, data classifications, and behavioral constraints outside the agent's read/write path. These channels prevent hallucinations and adversarial manipulation while maintaining tamper-proof audit trails. Demonstrated with a multi-agent portfolio rebalancing system.

AI Agents Multi-agent AI safety

arXiv cs.LG·May 29

TaxDistill: Improving Metagenomic Taxonomic Annotation via Distilled Genomic Foundation Models

TaxDistill applies knowledge distillation to improve metagenomic taxonomic annotation. GenomeOcean, a 500M-parameter genomic foundation model, generates soft labels to train a lightweight student network, reducing noise from initial retrieval tools. Evaluated on 7 CAMI2 datasets; on the Gastrointestinal dataset, TaxDistill improves MMseqs2's F1 score from 0.763 to 0.941.

Papers Fine-tuning Benchmarks

arXiv cs.LG·May 29

Moment Matching Q-Learning

MoMa QL leverages maximum mean discrepancy (MMD) to accelerate inference of score-based and flow-based generative models in RL. The method guarantees distribution-level convergence and shows superior performance in offline-to-online RL tasks on D4RL benchmarks.

Reinforcement learning Reasoning Benchmarks

arXiv cs.CL·May 29

What are They Thinking? Delineation, Probing and Tracking of Concepts in LLMs

Method to create linear probes detecting concepts in LLM embeddings. Authors define a process: concept delineation via contrastive datasets, layer-wise probe training, tracking across large contexts. Tested on 4 concepts and 3 different LLMs. Goal: scalable monitoring of new models.

Embeddings Evals

arXiv cs.CL·May 29

How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines

Empirical study of behavioral reproducibility in LLM agents with tool-calling capabilities. Researchers measure whether agents select the same tools, in the same order, with identical parameters, across repeated identical invocations. Focus on structured tool-calling interfaces with typed parameters and consequential side effects.

AI Agents Benchmarks AI safety

arXiv cs.CL·May 29

Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning

Aryabhata 2 is a STEM reasoning language model trained via reinforcement learning on GPT-OSS-20B. Developed by PhysicsWallah, it outperforms its base model on JEE/NEET competitive exams while reducing output tokens by up to 64%. Evaluated on AIME, HMMT, MMLU-Pro, and GPQA.

Reinforcement learning Reasoning Benchmarks

arXiv cs.AI·May 29

PRO-CUA: Process-Reward Optimization for Computer Use Agents

PRO-CUA introduces a process-reward optimization framework for training computer use agents (CUAs). The method decouples live environment interaction from policy optimization through iterative step-level reinforcement learning, using a process reward model (PRM) to provide dense feedback signals without relying on expert trajectories or golden answers.

AI Agents Reinforcement learning Reasoning

arXiv cs.CL·May 29

GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function-Calling

GenesisFunc is an automated multi-agent pipeline for generating function-calling training data. Starting from reliable tools in public benchmarks, the system produces diverse conversations with multi-stage quality control. An 8B model fine-tuned on this synthetic data outperforms similarly-sized open-source models in in-domain performance and out-of-domain generalization.

Multi-agent Code generation Fine-tuning

arXiv cs.LG·May 29

Knowledge Offloading: Decomposing LLMs into Sparse Backbones and Memory Modules

KOFF decomposes LLMs into sparse shared backbones and domain-specific external memory modules. On Llama and Qwen (3B-8B), the framework preserves performance at 12% global sparsity using LoRA adapters and learned KV caches, while pruning without memories degrades sharply.

Llama Qwen Fine-tuning

arXiv cs.LG·May 29

A Training-Time Diagnostic for Generalization via the Log-Alignment Ratio

Study of log-alignment ratio (LAR), a parameter-activation alignment metric. LAR predicts memorization-to-generalization transition in grokking (effective dimension k ≈ n^(2(1-LAR))) and 3B-parameter language model pre-training. Computable without validation data, negligible overhead.

Papers Reasoning Evals
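
The scaling relation quoted in the log-alignment ratio entry, k ≈ n^(2(1-LAR)), is easier to read with a few values plugged in. The sketch below only evaluates that expression; the summary does not define n, so the value used here is a hypothetical placeholder, and the function name is ours.

```python
def effective_dimension(n: float, lar: float) -> float:
    """Evaluate k = n ** (2 * (1 - LAR)), the relation quoted in the summary."""
    return n ** (2.0 * (1.0 - lar))


n = 1024  # hypothetical value; the summary does not say what n measures

# Higher LAR shrinks the exponent, so the effective dimension drops.
for lar in (0.0, 0.5, 0.75, 1.0):
    print(f"LAR = {lar:.2f} -> k = {effective_dimension(n, lar):.1f}")
```

Under this relation, LAR = 1 collapses k to 1, LAR = 0.5 gives k = n, and LAR = 0 gives k = n².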

arXiv cs.LG·May 29

Bridging Chemists and AI: An Expert-Augmented Framework for Interpretable Route Evaluation

Hybrid ML-expert framework for evaluating organic synthesis routes. DeepSets model trained on tree edit distance, fine-tuned with chemist annotations. Produces quantitative scores and explainable categories (Good/Plausible/Bad). Spearman correlation 0.78, top-1 accuracy 60.2% vs 17.5% baseline.

Papers Benchmarks Fine-tuning

arXiv cs.CL·May 29

Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG

Study on source-dependence in multi-source medical RAG systems. Authors demonstrate that the same system can produce different answers depending on retrieved source, revealing a missing evaluation axis in NLP. They introduce TransplantQA (benchmark), HERO-QA (hierarchical retrieval strategy), and a structured judge to audit inter-source relationships using a validated 5-label taxonomy.

RAG Evals Papers

arXiv cs.AI·May 29

The Cognitive Categorical Transformer: Category-Theoretic Inductive Biases for Language Modeling

The Cognitive Categorical Transformer (CCT), a 306M-parameter model augmenting GPT-2 Small, incorporates category-theoretic and cognitive-science-inspired components. On WikiText-103, CCT achieves 21.27 validation perplexity versus 24.19 for GPT-2 Small baseline, a 12% relative reduction (2.92 PPL). Ablations show simplicial message passing accounts for 84% of the improvement.

GPT Papers Benchmarks
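
The perplexity figures in the CCT entry can be checked with a few lines of arithmetic. This sketch uses only the two perplexities and the 84% ablation share from the summary. Treating that share as a fraction of the perplexity gap is an assumption, since the summary does not say how the improvement was apportioned.

```python
import math

baseline_ppl = 24.19  # GPT-2 Small baseline, from the summary
cct_ppl = 21.27       # CCT, from the summary

absolute_reduction = baseline_ppl - cct_ppl
relative_reduction = absolute_reduction / baseline_ppl

# Perplexity is exp(mean negative log-likelihood), so the same gap can also
# be expressed as a per-token loss difference in nats.
loss_gap_nats = math.log(baseline_ppl) - math.log(cct_ppl)

# Assumption: the 84% ablation share applies to the perplexity gap itself.
simplicial_ppl = 0.84 * absolute_reduction

print(f"absolute reduction: {absolute_reduction:.2f} PPL")
print(f"relative reduction: {relative_reduction:.1%}")
print(f"loss gap: {loss_gap_nats:.3f} nats per token")
print(f"simplicial message passing share: {simplicial_ppl:.2f} PPL")
```

The first two printed values match the 2.92 PPL and roughly 12% figures given in the summary.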

arXiv cs.LG·May 29

LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers

LoRe is a training-free inference-time wrapper optimizing diffusion-based neural solvers for combinatorial optimization. It enforces per-step interaction-evaluation budgeting, dynamically routing computation to high-conflict/high-uncertainty interactions. On MIS and TSP, LoRe achieves ×8 speedup, ×12 peak-memory reduction (MIS) and ×15 speedup, ×44 memory reduction (TSP n=1000).

Reasoning Benchmarks Papers

arXiv cs.LG·May 29

Designing Active Tether-Net Systems for Space Debris Capture with Graph-Learning-Aided Mixed-Combinatorial Optimization

Active tether-net system for space debris capture using Graph Neural Network (GNN) to jointly optimize net morphology, thruster masses of maneuverable units, and controller aiming points. GNN reduces mixed combinatorial nonlinear programming (MCNLP) to nonlinear programming (NLP) solved via Particle Swarm Optimization with gradient-based refinement, achieving faster convergence than direct MCNLP solving.

Papers Reasoning

arXiv cs.AI·May 29

The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models

Masked diffusion models (MDMs) with confidence-based decoding fail on complex reasoning tasks. Confidence-aligned training amplifies errors by an order of magnitude on multi-digit addition. Random masking better preserves the logical trajectories required for reasoning.

Reasoning Papers Benchmarks

arXiv cs.CL·May 29

A Modular Architecture for Typologically Controlled Lexicon Generation

Modular framework for generating pronounceable, typologically plausible artificial lexicons. Samples phoneme inventories from PHOIBLE, applies three phonological grammars (deterministic, OT, MaxEnt), and assigns meanings via Swadesh-Leipzig-Jakarta ontology. Evaluation on character n-gram perplexity and KL divergence: probabilistic grammars outperform baselines on 100-5,000 word forms.

Papers Benchmarks

arXiv cs.AI·May 29

The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure

Reasoning models maintain factually correct chain-of-thought traces but flip their final answer under sustained adversarial pressure in multi-turn dialogue. This unfaithful capitulation affects ~50% of cases in think mode and 11-15% without reasoning. The effect correlates with reasoning architecture (high in Qwen3-32B and GPT-OSS-20B, low in inline-CoT Gemma-4-31B-it).

Reasoning Evals AI safety

arXiv cs.AI·May 29

When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop

Theoretical study of foundation models trained on synthetic data from other model iterations. Authors show that human curation of one model can degrade alignment of other models through cross-model interactions, unlike isolated settings where it always improves alignment.

Alignment Reinforcement learning Papers

arXiv cs.LG·May 29

Continuity and Ordinality Matter: Constraining Time Series Tokens for Effective Time Series Analysis with Large Language Models

COM, a strategy for time series LLMs, integrates geometric constraints into token embedding initialization and training. It preserves inherent continuity and ordinality of time series, improving performance across multiple time series analysis benchmarks.

Reasoning Benchmarks Papers

arXiv cs.CL·May 29

Thoughts-as-Planning: Latent World Models for Chain-of-Thoughts Optimization via Reinforcement Planning

Thoughts-as-Planning formalizes reasoning chain optimization as sequential decision-making over latent semantic space. The framework learns a latent world model simulating effects of reasoning chain edits on outputs, supporting multi-scale edits (token, segment, instruction) via gradient descent or reinforcement learning planning.

Reasoning Reinforcement learning Prompt engineering

arXiv cs.CL·May 29

Error as a Lens: Probing LLM Reasoning through Synthetic Misconception Generation

Framework to generate targeted synthetic errors with LLMs aligned to cognitive taxonomy (revised Bloom's). A Generation Agent drafts erroneous solutions, an Examination Agent validates consistency with specified error mode. Tested on TheoremQA, shows generating authentic errors is substantially harder than producing arbitrary wrong answers.

AI Agents Multi-agent Reasoning

arXiv cs.AI·May 29

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

DenseSteer is a training-free inference-time steering method that enhances mathematical reasoning in small models (≤3B parameters) by modulating internal representations toward dense reasoning patterns. On Qwen-2.5, the approach shows that more proficient reasoning uses fewer steps but higher information density per step.

Reasoning Fine-tuning Benchmarks

arXiv cs.CL·May 29

Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text

eXTC combines structured prompt optimization and reinforcement learning for text classification. The system learns a natural language rulebook first, then distills reasoning from a teacher LLM into a compact model, then expands capabilities via RL. Result: fast inference with local reasoning traces and global modular explanations of learned domain rules.

Prompt engineering Reinforcement learning Reasoning

arXiv cs.LG·May 29

Towards Continuous-time Causal Foundation Models

Paper proposing continuous-time causal foundation models for time series using stochastic differential equations (SDEs). Introduces trajectory-law invariance criterion and three-tier taxonomy. Validation on pharmacokinetic and physical-system data shows fine-grid integration outperforms naive approach on 8/8 configurations (p<1/256).

Papers Reasoning Benchmarks

arXiv cs.AI·May 29

Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching

arXiv paper proposing multi-agent architecture with semantic memory and caching to mitigate LLM hallucinations. Three-stage pipeline (FrontEndAgent, SecondLevelReviewer, ThirdLevelReviewer) evaluated on 310 prompts. Results: THS reduced by 31.3% to 35.9%, 47.3% cache hit rate, 47% fewer LLM calls. No retraining required.

AI Agents Multi-agent AI safety

arXiv cs.LG·May 29

Learning Robust and Task-Invariant Functional Representation from fMRI through Siamese Self-Supervised Learning

BrainSimSiam, a lightweight self-supervised learning framework, learns robust representations from fMRI data without labels. Using positive-only pairs, it generalizes across multiple tasks (classification, regression) and outperforms supervised baselines, reducing computational requirements for foundation models in neuroimaging.

Benchmarks

arXiv cs.CL·May 29

A comparative study of transformer-based embeddings for topic coherence

Comparative study of 7 transformer models (MiniLM to LLaMA-2, 22M to 13B parameters) for topic modeling via BERTopic. Finding: model size has negligible impact on topic quality measured by coherence and divergence metrics. Smaller models achieve comparable performance to larger ones.

Embeddings Benchmarks Papers

arXiv cs.LG·May 29

Conf-Gen: Conformal Uncertainty Quantification for Generative Models

Conf-Gen adapts conformal prediction (CP) and conformal risk control (CRC) to generative models (LLMs, image generators, AI agents). The framework provides formal uncertainty guarantees for unsupervised tasks, extending conformal methodology to new domains.

Papers Evals AI safety
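
For readers unfamiliar with the technique Conf-Gen builds on, standard split conformal prediction can be sketched as follows. This is the textbook procedure, not Conf-Gen's adaptation to generative models. The calibration scores, candidate names, and alpha value are hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    Scores are nonconformity scores: larger means the true answer looked
    less plausible to the model. Returns infinity when n is too small for
    the requested coverage level.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(cal_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [name for name, score in candidate_scores.items() if score <= threshold]


# Hypothetical nonconformity scores from a held-out calibration set.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
threshold = conformal_threshold(cal_scores, alpha=0.2)

# Hypothetical candidate outputs for a new input, with their scores.
candidates = {"answer_a": 0.3, "answer_b": 0.95}
kept = prediction_set(candidates, threshold)
print(threshold, kept)
```

When calibration and test data are exchangeable, the returned set contains the correct answer with probability at least 1 - alpha, which is the kind of formal guarantee the entry refers to.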

arXiv cs.CL·May 29

Specialty-Specific Medical Language Model for Immune-Mediated Diseases

Domain-specific NER model for identifying clinical entities in immunology and infectious disease contexts. 371 manually annotated case reports by clinical specialists. Transformer-based model with clinical embeddings achieves F1=0.89, outperforming BERT and zero-shot approaches. Supports case report analysis and clinical decision support.

RAG Fine-tuning Evals

arXiv cs.CL·May 29

S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering

S3MEM introduces a structured scene-event episodic memory framework for long-horizon interactive agents. The system structures trajectories into organized memory units and uses anchor-sensitive retrieval to improve spatiotemporal question answering. Evaluated on Crafter, Jericho, SciWorld, and ALFWorld, S3MEM outperforms Vanilla RAG and Graph-NoReader in accuracy while using fewer evidence tokens.

RAG AI Agents Reasoning

arXiv cs.AI·May 29

Ultra-Reduced-Impact-Encased-Logging (URIEL): propose a new method for selective sustainable logging and post-harvest silvicultural treatment in tropical forest using airborne robotics systems

URIEL proposes a selective logging method for tropical forests combining helicopters, robotics and AI to minimize collateral damage. Digital simulation and economic feasibility analysis demonstrate concept viability, but implementation depends on stakeholder integration (industry, governments, certified companies, indigenous populations).

Robotics AI Agents Papers

arXiv cs.CL·May 29

Micro-Macro Retrieval: Reducing Long-Form Hallucination in Large Language Models

M2R (Micro-Macro Retrieval) is a retrieve-while-generate framework reducing hallucinations in long-form LLM generation. It combines macro retrieval (external evidence) and micro retrieval (key information from reasoning) to maintain proximity between factual data and outputs. Trained via reinforcement learning with rule-based rewards.

RAG Reinforcement learning

arXiv cs.LG·May 29

Model Merging by Output-Space Projection

Model merging formulated as convex quadratic programme over residual updates. Subsumes existing methods (task arithmetic, model soups, TIES, DARE) and provides closed-form diagnostic predicting merge quality via fraction of residual energy captured. Consistent gains across language and vision benchmarks.

Fine-tuning Benchmarks Papers

arXiv cs.LG·May 29

Self-Play Reinforcement Learning under Imperfect Information in Big 2

Self-play RL study in Big 2, a four-player imperfect-information card game. PPO outperforms Q-learning, SARSA, and Monte Carlo Q-approximation against random, greedy, and heuristic opponents. Moderate entropy regularization and current-policy self-play improve performance in this controlled multiplayer setting.

Reinforcement learning Multi-agent Benchmarks

arXiv cs.LG·May 29

Parallel Adaptive Multi-Objective Evolutionary Learning of Discretized Bayesian Network Classifiers for Clinical Data

Baymex, a multi-objective evolutionary algorithm, learns discretized Bayesian networks for clinical classification. Parallelized on 16 cores (54× speedup), it optimizes cross-entropy and BIC complexity. On real datasets (RADCURE, SUPPORT), it matches or outperforms decision trees, logistic regression, and random forests while producing interpretable models.

Benchmarks

arXiv cs.CL·May 29

Lightweight Multimodal LLM-Enabled Cost-Effective Defect Grading of Power Transmission Equipment

Defect grading framework for power transmission equipment using MLLM. In-context learning on commercial models, chain-of-thought Q&A generation to reduce manual annotation, then fine-tuning Qwen3-VL-8B via LoRA. SOTA on three grading tasks.

Qwen Vision Fine-tuning

arXiv cs.LG·May 29

CosmicFish-HRM: Adaptive Reasoning via Hierarchical Recurrent Mechanisms in Compact Language Models

CosmicFish-HRM is a compact model with a Hierarchical Reasoning Module (HRM) that dynamically allocates computational effort during inference. The model learns when to halt based on input complexity, combining high-level and low-level reasoning cycles with Grouped Query Attention, RoPE, and SwiGLU. Results show non-uniform reasoning behavior adapted to tasks and inputs.

Reasoning Fine-tuning Benchmarks

arXiv cs.LG·May 29

Cycle-Space Informed Detection of Autoencoded Blind False Data Injection Attacks on Power Systems

Detection of False Data Injection Attacks on power systems using cycle-space informed detection. Authors propose a topology-aware Cycle-Space Detector (CSD) robust against autoencoder-based attacks that exploit the Jacobian null space, leveraging network topology and Minimum Cycle Basis to enhance detection with optimal generalization error on IEEE 14-, 30-, 57-, 118-bus systems.

AI safety Benchmarks Papers

arXiv cs.LG·May 29

Label-Free Reinforcement Learning via Cross-Model Entropy

Cross-Model Entropy (CME) proposes a label-free reward signal for LLM post-training RL. CME uses mean log-likelihood of responses under an independent verifier model, avoiding self-consistency and reward hacking. Integrated into GRPO, CME achieves 52.5–71.4% tie-adjusted win rates on UltraFeedback/AlpacaEval 2.0 across Qwen, Llama, Gemma, OLMo.

Reinforcement learning Llama Qwen

arXiv cs.LG·May 29

Sequential Physics-Constrained Neural Operator Forward Modeling for the Norne Reservoir System

Mathematical framework for surrogate modeling of oil reservoirs (Norne, 46×112×22 grid) using Fourier Neural Operators (FNO) and physics-informed variant (PINO). Empirical validation: R²>0.99 (oil), R²>0.90 (gas), R²≈0.80 (pressure) over 3298 days. 10⁴× speedup vs OPM simulator, 1000-member ensemble in <1 min on B200 GPU.

Benchmarks Papers

arXiv cs.LG·May 29

Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning

Q-ALIGN DT aligns conditioned sequence models by ensuring the Q-value of the output policy matches the input return-to-go (RTG). The method uses a Q function for dense guidance and RTG-perturbation fine-tuning. Results: improved controllability on D4RL benchmark and generalization to velocity-tracking tasks where prior methods fail.

Reinforcement learning Reasoning Benchmarks

arXiv cs.CL·May 29

Large language models reorganize representational geometry during in-context learning

arXiv paper on representational geometry during in-context learning (ICL) in LLMs. Researchers show ICL performance correlates with task representational structure and successful ICL involves geometric reorganization increasing online separability. LLM behavior follows a prototype-like algorithm.

Reasoning Papers

arXiv cs.LG·May 29

When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL

Study on LLM reward design failures in sparse structured RL. Authors identify two dominant failure modes (reward flooding, semantic misunderstanding) and propose diagnostic-driven iterative refinement. On MiniGrid, DoorKey-8x8 improves from 2.3% to 97.6% success; KeyCorridor from 31.2% to 86.7%. Failure-mode taxonomy is the primary mechanism.

Reinforcement learning Llama Prompt engineering

arXiv cs.LG·May 29

Representation Alignment Rests on Linear Structure

Investigation of Platonic Representation Hypothesis through tripartite framework: signal (universal linear object-attribute relationships), bias (architecture differences, mitigated by centering/normalization), noise (word frequency correlation with alignment). Sparse autoencoders show stronger cross-modal alignment than dense representations.

Embeddings Papers Reasoning

arXiv cs.LG·May 29

Context Distillation as Latent Memory Management

Context distillation reformulated as latent memory management problem. Each context distilled into independent LoRA adapter forming modular memory bank. Self-Gating mechanism decides whether to activate latent memories. Cache sharing reduces inference overhead.

Fine-tuning Reasoning Infrastructure

arXiv cs.AI·May 29

Indexing the Unreadable: LLM-Native Recursive Construction and Search of Service Taxonomies

A2X is a service discovery system for LLM agents that automatically organizes services into hierarchical taxonomies. It addresses context scarcity by walking the hierarchy layer by layer, reducing token consumption by 89% while gaining 6.2 Hit Rate points over full-context dumping and 20 points over embedding baselines.

AI Agents MCP RAG

arXiv cs.AI·May 29

Differentiable Belief-based Opponent Shaping

D-BOS (Differentiable Belief-based Opponent Shaping) is a MARL method that shapes opponents by differentiating through k-step softmax-Bayes belief dynamics. Unlike existing approaches, it treats belief state as the shaping target rather than parameters or policies. Results: outperforms PPO and BBM in hidden-role games, with largest gains in mixed-motive settings.

Multi-agent Reinforcement learning Reasoning

arXiv cs.AI·May 29

Provably Secure Agent Guardrail

New arXiv paper proposing ePCA (Proof-Constrained Action), a formal verification security framework for AI agents. Agents must formalize intentions into first-order logical constraints before executing physical operations, bypassing empirical semantic guardrails. Evaluations show 0% attack success rate and 0% false positive rate across tested scenarios.

AI Agents AI safety Alignment

arXiv cs.AI·May 29

Robust and Efficient Guardrails with Latent Reasoning

COLAGUARD, a guardrail model, transfers multi-step safety reasoning into continuous latent space via a stage-wise training curriculum. Evaluated on 10 moderation tasks across 8 safety benchmarks, it improves macro-F1 by 8.24 points over Llama Guard 3 and matches GuardReasoner performance while delivering a 12.9× speedup and 22.4× token reduction.

AI safety Reasoning Evals

arXiv cs.AI·May 29

Mind Your Tone: Does Tone Alter LLM Performance?

Study on prompt tone impact on LLM performance. Tests on ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash/Lite using 50 base questions and 570 MMLU questions (57 subjects) in 5-7 tone variants. Results: tonal effects are systematic but highly model-dependent, with significant accuracy variations across subjects.

Prompt engineering Benchmarks Evals

arXiv cs.AI·May 29

Paper Agents, Paper Gains: An Empirical Analysis of DeFi Investment Agents

Empirical analysis of 11 DeFi agents on Solana: treasuries retain $30M in paper gains while token holders collectively lost $191.7M. Top 1% of wallets capture 81.4% of gains. Token valuations disconnected from fundamentals (market-cap-to-AUM ratios >10,000x). Median returns negative across all platforms.

AI Agents Benchmarks Business

arXiv cs.AI·May 29

GTA: Generating Long-Horizon Tasks for Web Agents at Scale

GTA is a framework for automatically generating complex web tasks with executable trajectories. It combines crawling, retrieval, in-context generation, and quality control across 50+ websites (e-commerce, government, forums, news). The benchmark reveals a significant performance gap between humans and AI agents.

AI Agents Benchmarks Papers

SIG

HYP

arXiv cs.CL·May 29

Bosses, Kings, and the Commons: Cooperation Under Power Asymmetry in LLM Societies

SovSim, a multi-agent simulation framework, evaluates how 11 state-of-the-art LLMs manage shared resources under asymmetric power structures. Finding: introducing an agent with disproportionate power (boss/king) causes 87.3% degradation in survival rate and cooperation breakdowns compared to symmetric settings.

Multi-agent AI Agents Benchmarks

SIG

HYP

arXiv cs.LG·May 29

Theoretical Foundations and Effective Algorithms for Policy-Aware Simulator Learning

arXiv paper proposing strategic robustness for simulator learning in MBRL. Formulates objective as minimax game between model and adversarial policy player. Proves convergence with sublinear regret bounds and Error-MDP duality. Experiments show 1.5–2.2× reduction in prediction error and simulation-trained policies matching near-optimal real-world performance.

Reinforcement learning Papers Reasoning

SIG

HYP

arXiv cs.CL·May 29

Analyzing Persona Effects in Generated Explanations from Multimodal LLM Agents in Urban Perception

Study of persona effects on explanations generated by multimodal LLM agents in urban perception. Analysis of 59,808 annotations from 1,200 persona-conditioned agents: captions show strong convergence, justifications display systematic variation tied to socioeconomic and political attributes, perception tags show no significant persona-related differences.

Vision AI Agents Prompt engineering

SIG

HYP

arXiv cs.AI·May 29

Review Arcade: On the Human Alignment and Gameability of LLM Reviews

Empirical study on LLM-generated reviews for scientific papers (ACL Rolling Review 2025 data). Findings: limited alignment between LLM and human reviews, substantial variation across prompts and models. Authors can 'game' LLM reviews through iterative revision workflows, increasing scores for up to 35% of tested papers.

Evals Benchmarks Alignment

SIG

HYP

arXiv cs.AI·May 29

Trends in AI and Human-AI Interaction in Clinical Trials -- A Hybrid Human-AI Exploration

Analysis of ClinicalTrials.gov registry shows marked increase in AI-related trials over time, with recent growth in machine learning, deep learning, chatbots, GPTs, and LLMs. China and US lead geographically. Hybrid approach using GPT-5.5 and human review: good agreement on non-AI studies, lower agreement on human-AI interaction classification.

GPT Evals Papers

SIG

HYP

arXiv cs.CL·May 29

Slogans or Stance? A Label-Light Diagnostic for Entrepreneurial-Discourse Measurement on Chinese SOE Speeches

Diagnostic tool for measuring constructs like "entrepreneurial spirit" in Chinese state-owned enterprise speeches. On 80 speeches from SOE leaders, authors test LDA, dictionary scorers, and Qwen3.5:9b. The LLM reaches d=1.09 in paired contrast, but half the effect stems from speaker idiolect. Corpus of 2,190 segments and slogan lexicon released.

Benchmarks Evals Qwen

SIG

HYP

arXiv cs.CL·May 29

Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

Researchers introduce a Behavioral Specification as an interpretive layer to align AI decisions with user preferences. Tested on 14 autobiographical corpora, it improves representational accuracy at ~25x lower context cost than raw corpus while reducing model hedging. Effective on interpretation-required questions; less helpful on recall-based tasks.

Alignment RAG AI Agents

SIG

HYP

arXiv cs.AI·May 29

Practitioner Beliefs and Behaviors in AI-Enhanced Education: DOT Framework Survey Evidence

Survey of 72 higher education practitioners on AI integration. DOT Framework (design thinking + open systems theory) identifies three factors: AI Functional Capabilities, Oversight and Governance, Instructor Collaboration. Practitioners support AI pedagogy with strong human oversight. Institutional barriers: limited policy, training, infrastructure.

Evals AI safety Business

SIG

HYP

arXiv cs.CL·May 29

Wait! There's a Way Out: A Decision Mechanism for Forecasting Conversational Derailment

Method to forecast conversational derailment (personal attacks) in real-time. Decouples alert triggering from derailment likelihood estimation using forward-looking simulations to assess plausible recovery paths. Reduces false positives without sacrificing forecasting accuracy.

Papers Reasoning AI safety

SIG

HYP

Vercel AI Blog·May 29

Protecting against token theft

Vercel warns of AI inference theft: a single frontier model request costs ~$2, creating high-margin attack opportunities. Rate limits and session-based auth are insufficient; Vercel proposes BotID to verify every AI request individually and prevent tens of thousands of dollars in losses.

AI safety Infrastructure Business

SIG

HYP

arXiv cs.CL·May 29

Text-Preserving Lossy Text Compression: A Study of Strategic Deletion and LLM Reconstruction

Study of lossy semantic text compression where an encoder strategically deletes text parts and an LLM reconstructs original content. Benchmarks 6 deletion strategies (uniform, frequency, entropy, LP-optimized, hybrid) on BBC News. WordFreq provides best cost/performance ratio; semantic methods excel at moderate compression; QLoRA fine-tuning competes with Gemini 2.0 Flash.

Benchmarks Reasoning Fine-tuning

SIG

HYP

arXiv cs.CL·May 29

Reasoning that Travels: Dissecting How Chain-of-Thought Transfers Across Models

Study of chain-of-thought (CoT) transfer across models using a provider-receiver framework. Full traces often transfer successfully, but mechanisms vary: answer extraction (AIME), receiver competence (MMLU-Pro), or partial structured information (ZebraLogic). In free-generation mode, partial CoTs improve performance, suggesting guidance for continued reasoning.

Reasoning Prompt engineering Benchmarks

SIG

HYP

Vercel AI Blog·May 29

Protecting against inference theft

Vercel warns of inference theft: attackers exploit exposed AI endpoints to resell API calls at a discount, generating tens of thousands of dollars in losses. Protection requires per-request (not per-session) verification via deep analysis, integrated in a few lines of code.

AI safety Infrastructure Business

SIG

HYP

arXiv cs.LG·May 29

One Mask to Rule Them All: On Hidden Facts after Editing and How to Find Them

Knowledge editing methods ROME and MEMIT modify transformer MLP weights. Authors identify a common subset of weights targeted across diverse edits using a binary mask that reverses 80% of edits on training set and 70% on test set. The mechanism suppresses rather than overwrites knowledge, explaining why changes fail to propagate to related facts.

Papers Reasoning AI safety

SIG

HYP

arXiv cs.CL·May 29

Hallucination Detection-Guided Preference Optimization for Clinical Summarization

Preference optimization method guided by hallucination detection to improve clinical summarization reliability. On Llama-3.1-8B-Instruct, reduces hallucinations by 24% at inference and 48% after fine-tuning, preserving fluency. Evaluated on MIMIC-IV.

Llama Fine-tuning AI safety

SIG

HYP

arXiv cs.CL·May 29

The Trust Paradox: How CS Researchers Engage LLM Leaderboards

Qualitative study of 8 AI researchers reveals a paradox: they distrust LLM leaderboards yet use them as decision aids. Peer networks dominate model selection. NLP researchers face SOTA pressure absent in HCI/Systems. Universal demand: cost transparency.

Benchmarks Evals

SIG

HYP

arXiv cs.AI·May 29

Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling

RACE-Sched, an asynchronous multi-agent framework, solves dynamic scheduling by decoupling real-time execution (symbolic heuristics) from long-horizon reasoning (LLM). A semantic rule repository of validated heuristics improves transferability across problem scales. Outperforms Deep RL and LLM baselines on GEN-Bench, MK-Bench, JMS-Bench.

AI Agents Multi-agent Reasoning

SIG

HYP

arXiv cs.LG·May 29

PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation

PrismFlow introduces a new Flow Matching method for time-series generation using Koopman-inspired dynamical experts that learn residual corrections in latent space with a confidence-aware Winner-Take-All objective. Results: +15.6% Context-FID gain and +38.6% Discriminative Score improvement across benchmarks.

Papers Benchmarks Reasoning

SIG

HYP

arXiv cs.CL·May 29

From Data to Insights: Exploring Program-of-Thoughts Prompting for Chart Summarization

Paper introduces Program-of-Thoughts prompting for chart summarization: VLMs generate Python programs to derive valid summary statistics instead of direct text. Proposes chart-to-dictionary auxiliary task. Results match existing methods on semantic and factual metrics.

Prompt engineering Vision Reasoning

SIG

HYP
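To make the Program-of-Thoughts idea in the summary above concrete, the sketch below shows the kind of small Python program a model might emit once a chart has been converted to a dictionary. The chart values, function names, and output wording are all hypothetical; the paper's actual prompts and generated programs are not reproduced here.

```python
from statistics import mean

# Hypothetical result of a chart-to-dictionary step: category -> plotted value.
chart = {"2019": 12.0, "2020": 9.5, "2021": 14.0, "2022": 18.5}


def summarize(chart):
    """Derive summary statistics by computation instead of free-text guessing."""
    top = max(chart, key=chart.get)
    low = min(chart, key=chart.get)
    return {
        "max_label": top,
        "max_value": chart[top],
        "min_label": low,
        "min_value": chart[low],
        "mean": mean(chart.values()),
    }


def render(stats):
    """Turn the computed statistics into a one-sentence summary."""
    return (
        f"The highest value is {stats['max_value']} ({stats['max_label']}), "
        f"the lowest is {stats['min_value']} ({stats['min_label']}), "
        f"and the mean is {stats['mean']}."
    )


stats = summarize(chart)
print(render(stats))
```

The point of the design is that numbers in the final summary come from executed code, so they stay consistent with the extracted chart data.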

arXiv cs.AI·May 29

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

OpenClawBench is a dataset of 31,264 annotated trajectories to detect process-side anomalies in agent execution beyond task success. Among 31,135 passing executions, 2,904 contain anomalies (unresolved ambiguity, unsafe writes, ignored errors). A fine-tuned Gemma 3 12B detector reaches F1=0.729.

AI Agents Benchmarks Evals

SIG

HYP

arXiv cs.CL·May 29

From Context Shift to Stylistic Collapse: Why Training Objectives Matter More Than Scale

Study of 17 models (410M-100B+ parameters) showing instruction-tuning causes linguistic entropy collapse (amplification: 1,949-16,853%), independent of RLHF. Strong control (lambda=5.0) reduces this effect by 40.5% and outperforms frontier models by 96.7-98.2% despite 200-1000x scale disadvantage.

Papers Alignment Fine-tuning

SIG

HYP

arXiv cs.CL·May 29

GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models

GPF-LiveNews is a streaming evaluation protocol to audit how LLMs frame emerging news events for different audiences. Tested on 23 models across 12 monitoring runs, it measures semantic and sentiment variations across 42 identity labels. Results show Policy/Action prompts produce strongest semantic movement, while sentiment variation remains flat across dimensions.

Evals AI safety Alignment

SIG

HYP

arXiv cs.LG·May 29

Feature Geometry of LoRA Adapters: A Sparse Autoencoder Analysis of Representational Divergence in Fine-Tuned Language Models

Study of LoRA-induced representation geometry using Sparse Autoencoders on Gemma-2-9B. Researchers observe weak geometric alignment between LoRA feature dictionaries and pretrained SAEs, suggesting LoRA creates distinct representational structures in the residual stream.

Fine-tuning AI safety Papers

SIG

HYP

arXiv cs.AI·May 29

Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces

Study on long chain-of-thought traces used for LLM supervised fine-tuning. Researchers identify "harmful continuation": when reasoning continues after the answer is sufficiently supported. Removing these continuations improves fine-tuning outcomes. They propose HCC (Harmful Continuation Cut), a lightweight proxy to detect these boundaries.

Reasoning Fine-tuning Papers

SIG

HYP

arXiv cs.AI·May 29

BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents

BenchTrace is a benchmark for evaluating self-evolution ability in LLM agents. Built on 1,821 annotated episodes across six tasks, it measures reflection quality and tests whether agents avoid past failures. Experiments on Qwen3-32B and GPT-4.1 show a pass rate below 30% on reflection evaluation; agents forget early lessons and fail to generalize reflections.

AI Agents Benchmarks Reasoning

SIG

HYP

arXiv cs.AI·May 29

CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval

CoHyDE iteratively co-trains a dense encoder and LLM rewriter to improve tool retrieval over large API catalogs. On ToolBench (~10k tools), three rounds gain +2.5 pp NDCG@5 on standard queries and +6.3 pp on vague queries, outperforming single-component baselines.

AI Agents RAG Embeddings

SIG

HYP
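The NDCG@5 gains reported in the CoHyDE summary above refer to a standard ranking metric. The sketch below is one common formulation, under the assumption that relevance labels are given in ranked order and that the ideal ranking can be built from those same labels; the function names and example labels are illustrative, not taken from the paper.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (rank 0 is the best slot)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=5):
    """DCG normalized by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical retrieval result: the single relevant tool is ranked second.
ranked_relevance = [0, 1, 0, 0, 0]
print(round(ndcg_at_k(ranked_relevance, k=5), 4))
```

On this 0-to-1 scale, a gain of +2.5 pp corresponds to an absolute increase of 0.025 in the averaged score.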

arXiv cs.CL·May 29

LLMBridge: An LLM Pipeline for End-to-end Referential Bridging Resolution in English

LLMBridge is an LLM-based system for end-to-end referential bridging resolution in English. The pipeline combines heuristic pre/post-processing with LLM natural language inference capabilities. Evaluated on ISNotes, BASHI, and GUMBridge, it outperforms previous state-of-the-art systems on all three datasets in both end-to-end and gold anaphor settings.

Papers Benchmarks Reasoning

SIG

HYP

Simon Willison·May 29

datasette 1.0a31

Datasette 1.0a31 adds two major features: execution of write queries (INSERT/UPDATE/DELETE) and saving stored queries (private or shared). Permissions control access to sensitive operations like CREATE TABLE.

Tools Open source

SIG

HYP
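Datasette serves SQLite databases, so the write queries mentioned in the summary above are ordinary SQLite statements. As a rough illustration of the three statement kinds, the sketch below runs them with Python's standard `sqlite3` module against an in-memory database. The table and values are hypothetical, and this does not use Datasette's own interface or permission system.

```python
import sqlite3

# In-memory database so the example leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")

# INSERT: add a row, using a parameter placeholder rather than string formatting.
conn.execute("INSERT INTO notes (body) VALUES (?)", ("first draft",))

# UPDATE: modify the row that was just inserted.
conn.execute("UPDATE notes SET body = ? WHERE id = ?", ("revised", 1))
rows_after_update = conn.execute("SELECT id, body FROM notes").fetchall()

# DELETE: remove the row again.
conn.execute("DELETE FROM notes WHERE id = ?", (1,))
count_after_delete = conn.execute("SELECT count(*) FROM notes").fetchone()[0]

conn.close()
print(rows_after_update, count_after_delete)
```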

Le Big Data·May 29

No More Compromises? Nano Banana 2 and Pro Land on the Gemini API

Google launches Nano Banana 2 and Nano Banana Pro on the Gemini API. These lightweight models give developers options for integrating generative AI without the usual trade-offs.

Gemini Code generation Tools

SIG

HYP

OpenAI Blog·May 29

Strengthening societal resilience with Rosalind Biodefense

OpenAI launches Rosalind Biodefense, expanding trusted access to GPT-Rosalind for vetted developers and U.S. government partners advancing biodefense, public health, and pandemic preparedness.

GPT OpenAI AI safety

SIG

HYP

Latent Space·May 29