May 2026

3149 articles

Toward Reliable Design of LLM-Enabled Agentic Workflows: Optimizing Latency-Reliability-Cost Tradeoffs

arXiv paper analyzing latency-reliability-cost tradeoffs in LLM-enabled multi-agent workflows. Introduces performance models for LLM and non-LLM agents, proposes water-filling token allocation policy, and characterizes optimal workflow reliability via shadow prices under latency and cost constraints.

AI Agents Multi-agent Reasoning
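
The summary names a water-filling token allocation policy without stating the objective. As a rough illustration only, the sketch below shows the textbook water-filling rule under an assumed setup: each agent has a hypothetical "floor" value, and a shared token budget is poured in so that every agent that receives tokens ends at the same level. The function name, the floors, and the budget are assumptions, not the paper's formulation.

```python
def water_fill(floors, budget, iters=100):
    """Textbook water-filling: x_i = max(0, level - floors[i]), sum(x) == budget.

    `floors` and `budget` are hypothetical inputs chosen for illustration.
    """
    lo, hi = min(floors), max(floors) + budget
    for _ in range(iters):
        level = (lo + hi) / 2
        used = sum(max(0.0, level - f) for f in floors)
        if used > budget:
            hi = level   # level too high, budget exceeded
        else:
            lo = level   # level feasible, try pouring more
    level = (lo + hi) / 2
    return [max(0.0, level - f) for f in floors]


allocation = water_fill([1.0, 2.0, 5.0], budget=3.0)
```

With floors 1, 2, and 5 and a budget of 3, the level settles at 3, so the third agent receives nothing.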

arXiv cs.CL·May 26

Found in Conversation: LLMs Teach Themselves to Close the Multi-Turn Gap

Found in Conversation (FiC) is a training framework where LLMs self-teach to close the multi-turn gap (Lost-in-Conversation). Via View-Asymmetric Self-Distillation, the model distills between single-turn (teacher) and multi-turn (student) views. Tested on Llama, Qwen, Phi, OLMo (3B-14B), FiC recovers 92-100% of single-turn performance.

Llama Qwen Fine-tuning

arXiv cs.CL·May 26

Structure-Aware RAG: Structured Retrieval Augmented Generation from Noisy Data for Conversational Agents

SA-RAG uses structured tables as intermediate representation to improve RAG for conversational agents. A quality-aware metadata generation framework enhances table quality from noisy data. Generation validation and direct preference optimization outperform RAG baselines on two real-world datasets.

RAG AI Agents Papers

arXiv cs.AI·May 26

In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models

Researchers replicate Picbreeder, an interactive image evolution platform, by replacing human users with Vision Language Models (VLMs). Results differ qualitatively from the human baseline. The study examines candidate causal factors: exploratory noise, behavioral diversity between agents, and memory of past actions.

Vision AI Agents Open source

arXiv cs.CL·May 26

How Much Structure Do LLMs Need? Evaluating LLMs for Bibliometric Cluster Description

Study evaluating 6 LLM pipelines for generating bibliometric cluster descriptions. On 100 published analyses, LLMs produce descriptions semantically close to the human-written versions but hallucinate references and fail to infer bibliometric structure on their own. The best performance comes from a hybrid workflow: algorithms define clusters, LLMs generate readable descriptions.

Benchmarks Evals RAG

arXiv cs.AI·May 26

Fuzzy, Neutrosophic, and Uncertain Graph Theory: Properties and Applications

Comprehensive survey of graph theory under uncertainty, covering fuzzy, neutrosophic, and related models. Addresses fundamental structures, graph classes, uncertain digraphs, hypergraphs, dynamic graphs. Applications include uncertain molecular graphs, decision-making systems, graph neural networks, knowledge graphs, cognitive maps.

Benchmarks Papers

arXiv cs.LG·May 26

Federated Learning over Human-Body Communication for On-Body Edge Intelligence: A Survey, Taxonomy, and BODYFED-HBC Scheduling Vignette

Survey paper at the intersection of human-body communication (HBC) and federated learning for wearable sensor networks. Proposes taxonomy of FL deployments (intra-body, body-hub, cross-user, clinical-cloud) and introduces BODYFED-HBC reference architecture with scheduling algorithm and reproducible simulation combining public datasets with empirical HBC signal-loss models.

Benchmarks

arXiv cs.AI·May 26

LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition

LC-ERD is a self-alignment framework for LLMs that mines latent logical structures via consistency-regulated reward decomposition. Addresses three challenges: label noise from mimetic bias, coarse-grained supervision, and distributional collapse. Uses Variational Logic Potential and multi-agent value decomposition based on IGM principle.

Reasoning Reinforcement learning Alignment

arXiv cs.CL·May 26

DRInQ: Evaluating Conversational Implicature with Controlled Context Variation

DRInQ is a benchmark evaluating LLM pragmatic reasoning on conversational implicature. Researchers reveal a generation-inference asymmetry: models generate plausible pragmatic scenarios but fail to recover intended implications at inference time. Structured prompting improves alignment for smaller models.

Benchmarks Reasoning Evals

arXiv cs.CL·May 26

SEAL: Synergistic Co-Evolution of Agents and Learning Environments

SEAL is a closed-loop co-evolution framework for tool-use LLM agents. It collects verifiable trajectories, diagnoses turn-level failures, and uses these signals to jointly adapt the learning environment and agent policy. With 400 training samples, SEAL achieves +8.25 to +26.25 point gains across three backbones and shows positive out-of-distribution transfer.

AI Agents Reinforcement learning Reasoning

arXiv cs.AI·May 26

Right-Sizing Communication and Recommendation Set Size in AI-Assisted Search

Theoretical paper on optimizing user-AI recommendation system interaction. Models communication cost (precision of user message) and search cost (size of recommendation set). For large d, characterizes how optimal message precision and recommendation set size depend on cost parameters under two sampling schemes: posterior belief and optimized tilted distribution.

AI Agents RAG

arXiv cs.CL·May 26

Toxicity in Twitch Chats: An LLM-Based Analysis Across Gaming Communities

Analysis of 20 million Twitch chat messages (4,452 streams, 7 genres) using LLM zero-shot classification. 2.4% of messages classified as toxic per Twitch taxonomy (harassment, discrimination, sexual content, profanity). F1=94.5% on TextDetox. MOBA games show 3.2% toxicity, sports games 2%. Significant within-genre variation reveals game-specific community norms.

AI safety Evals Benchmarks

arXiv cs.CL·May 26

CUNY at CLPsych 2026: A Pipeline Approach to Classification and Summarization of Mental Health Changes

CUNY submits a pipeline for CLPsych 2026 shared task: classifying mental health states via ensemble of three open-weight LLMs with majority voting, predicting timeline changes with supervised classifiers, and summarizing mood dynamics through augmented in-context learning. Rankings: 1st (Task 1.1), 4th (1.2, 2), 3rd (3.1).

Benchmarks Reasoning Open source
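
The ensemble step described above (three models, majority voting) can be sketched in a few lines. This is a minimal illustration, not the team's code: the function name and the tie-breaking rule (the earliest-listed model wins) are assumptions.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label; ties go to the earliest-listed label.

    The tie-break rule is an assumption made for this sketch.
    """
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:
        if counts[label] == best:
            return label


# Hypothetical labels from three models for one post.
prediction = majority_vote(["adaptive", "maladaptive", "adaptive"])
```

With three voters and more than two classes, a three-way split is possible, which is why some explicit tie-break is needed.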

arXiv cs.CL·May 26

Decompose-and-Refine: Structured Legal Question Answering with Parametric Retrieval

DaR (Decompose-and-Refine) is a framework for answering complex legal questions by decomposing them into atomic sub-questions and generating statute-aligned parametric queries. Evaluated on KoBLEX (Korean multi-hop benchmark) using Qwen3-32B and Gemma3-27B, DaR improves retrieval accuracy and answer quality while reducing hallucinations.

Reasoning RAG Qwen

arXiv cs.CL·May 26

Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs

Researchers apply Direct Preference Optimization (DPO) to improve English-Mandarin code-switching transcription in Audio LLMs. Three failure modes identified: language omission, translation-instead-of-transcription, hallucination. Training on 100K pairs (570 hours) reduces MER up to 89.6% (in-distribution) and 20.0% (out-of-distribution).

Reinforcement learning Alignment Voice

arXiv cs.CL·May 26

Guarded Repair for Harm-Aware Post-hoc Replacement of LLM Mathematical Reasoning

GuardedRepair is a guarded best-of-N repair framework for LLM mathematical reasoning that selectively fixes incorrect traces while preserving correct answers. On GSM8K (95.60% → 96.89%), it fixes 17 of 58 errors with no measured broken-correct cases. On weak-reasoner ASDiv, accuracy improves from 78.40% to 87.60%.

Reasoning Evals AI safety

arXiv cs.CL·May 26

Raon-Speech Technical Report

Raon-Speech is a 9B multilingual speech language model (English/Korean) that understands and generates speech while preserving text capabilities. Trained on 1.38M hours of curated data, it outperforms 8 comparable audio models (Qwen2.5-Omni, Fun-Audio-Chat) across 42 benchmarks. Raon-SpeechChat extends it with real-time full-duplex conversation trained on 119K hours of dialogue.

Voice Benchmarks Open source

arXiv cs.CL·May 26

EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

EchoDistill introduces an alignment-based noisy-to-clean self-distillation framework to improve Audio LLM robustness against real-world noise. A noisy student is optimized via GRPO using a frozen clean-audio teacher as semantic reference. Results: +4.18% GSR improvement under strong noise vs strongest baseline, +3.02% Acc on Qwen-Omni.

Reinforcement learning Fine-tuning

arXiv cs.CL·May 26

QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks

QUEST is a family of open-source models (2B to 35B) trained as deep research agents via data synthesis pipeline and RL. With only 8K synthetic tasks, QUEST matches or exceeds proprietary systems across 8 research benchmarks, excels at citation grounding and report synthesis. Models, data, and training scripts released.

AI Agents Reinforcement learning Open source

arXiv cs.CL·May 26

An Interactive Paradigm for Deep Research

SteER is a framework for interactive deep research using LLMs. It introduces interpretable control points allowing users to correct course mid-process via cost-benefit formulation. Results: +22.80% alignment improvement vs baselines, preferred by human readers in 85%+ of pairwise judgments.

AI Agents Reasoning RAG

arXiv cs.CL·May 26

SLAP: Stratified Loss-based Pruning for On-Policy Data-Efficient Instruction Tuning

SLAP is a batch-aware data selection framework for instruction tuning that evaluates learnability at batch composition level rather than individual samples. Using stratified sampling and relative distance optimization with Hessian-approximated gradients, it matches full dataset performance with 20-40% less training data across LLaMA, ChatGLM, and diverse tasks (dialogue, translation, QA).

Fine-tuning Llama Benchmarks

arXiv cs.CL·May 26

AERIC: Anticipatory Hidden-State Monitoring for Implicit Harmful Dialogue

AERIC is a lightweight safety monitor (387 parameters) detecting implicit harmful dialogue by analyzing hidden states during decoding without additional forward passes. On DiaSafety and Harmful Advice, it improves AUROC from 0.683→0.714 and 0.822→0.858. Deployment adds only 2.34% latency versus 79.40% for Qwen3Guard-Stream-4B.

AI safety Alignment Reasoning

arXiv cs.CL·May 26

Document Classification Pattern Recognition via Information Fusion: A Systematic Review of Multimodal and Multiview Representation Approaches

Systematic review of 139 studies on information fusion for document classification. Meta-analysis shows multimodal fusion improves accuracy by +5.28 percentage points (p=0.0016) and multiview fusion by +4.67%. Critical finding: only 11.8% of multimodal and 23.3% of multiview studies use statistical validation, undermining reproducibility.

Benchmarks Evals Papers

arXiv cs.AI·May 26

Accelerating Long-Tail Generation in Synchronous RLHF Training via Adaptive Tensor Parallelism

PAT, an adaptive tensor parallelism method, optimizes the generation stage in synchronous RLHF. It dynamically reconfigures parallelization during decoding to compensate for response-length skew. Implemented on SGLang/VeRL, PAT reduces generation latency by up to 34.6% on LLaMA3.1-8B and Qwen3-14B.

Reinforcement learning Infrastructure Benchmarks

arXiv cs.CL·May 26

Grammatically-Guided Sparse Attention for Efficient and Interpretable Transformers

Novel sparse attention approach using grammatical roles (POS tags) to reduce quadratic complexity of Transformers. Two masking strategies tested on SST-2 with DistilBERT: hard mask (0.8200) and soft mask (0.8165) maintain full attention performance (0.8200) while reducing computational overhead.

Reasoning Evals Papers

arXiv cs.LG·May 26

PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection

PromptAudit evaluates how prompting strategies affect LLM-based vulnerability detection. Across 5 open-weight models and 1,000 CVEs (6,074 samples), standard chain-of-thought achieves strongest performance, while few-shot provides model-dependent gains. Adaptive chain-of-thought suppresses recall; self-consistency induces excessive abstention.

Prompt engineering Evals AI safety

arXiv cs.LG·May 26

Cascade-KDE: Robust Time-Series Restoration under Out-of-Distribution Impulse Corruptions

Cascade-KDE is a training-free restoration method for time series corrupted by Gaussian noise and impulse outliers. It estimates temporal-amplitude density, applies Density-Truncated Robust Expectation to limit anomaly influence, then refines via exponential cascade. Tested on ECG and battery degradation, it preserves derivative peaks better than classical filters.

Benchmarks Evals

arXiv cs.CL·May 26

AstroMind: A High-Fidelity Benchmark for Spacecraft Behavior Reasoning Based on Large Language Models

AstroMind is a benchmark for evaluating LLM reasoning on spacecraft behavior. Built on high-fidelity astrodynamics simulations, it tests intent inference, maneuver parameter estimation, and threat assessment. Qwen3 (32B) leads intent inference, QwQ (32B) leads threat assessment, GPT-OSS (20B) produces strongest reasoning quality.

Benchmarks Reasoning Qwen

arXiv cs.LG·May 26

Parameter Efficient Multi-Class Intelligent Scheduling for Multimodal Online Distributed Industrial Anomaly Detection

MODIAD framework for multimodal industrial anomaly detection in distributed online settings. Introduces SMG algorithm for multi-class scheduling and REC-LoRA strategy reducing computational overhead. Validated on MVTec 3D-AD and Eyecandies datasets.

Benchmarks Fine-tuning Vision

arXiv cs.AI·May 26

MAPLE: Multi-State Aggregated Policy Evaluation for AlphaZero in Imperfect-Information Games

MAPLE, a tree search method, extends AlphaZero to imperfect-information games by aggregating policy and value evaluations from multiple sampled world states. Tested on Phantom Go and Dark Hex, MAPLE outperforms PIMC-AlphaZero baseline with Elo gains of 291 and 136.

Reasoning Reinforcement learning Benchmarks

arXiv cs.AI·May 26

Neuro-Inspired Inverse Learning for Planning and Control

Neuro-inspired framework for embodied planning and control. The Inverter uses Inverse Learning (IL) to generate multi-step action sequences. Outperforms offline-RL and diffusion-planner baselines on D4RL (+24.2% average) with 100-1000x less inference compute. Application: single-qubit quantum gate synthesis matching GRAPE fidelity at 1000x faster per-gate compute.

Reasoning Reinforcement learning Robotics

arXiv cs.LG·May 26

CAFD: Concept-Aware DNN Fault Detection using VLMs

CAFD is a DNN fault detection method combining model signals, distance features, and a novel Concept Failure Ratio (CFR) leveraging Vision-Language Models. Evaluated on ImageNet and three models, CAFD outperforms 5 baselines with average 18.3% improvement in Fault Detection Rate.

Vision Evals Benchmarks

arXiv cs.LG·May 26

Truthful Online Preference Aggregation for LLM Fine-Tuning in Mobile Crowdsourcing

arXiv paper proposing an online aggregation mechanism to align LLMs with human feedback in mobile crowdsourcing. The system incentivizes truthful preference reporting from strategic workers via a dynamic Bayesian game, reducing regret from O(T) to O(√T) over T time slots.

Fine-tuning Reinforcement learning Papers

arXiv cs.LG·May 26

Spectral Probe-Circuits: A Three-Step Recipe for Identifying Attention-Head Circuits in Pretrained Transformers

Method to identify attention-head circuits in pretrained transformers using spectral signal (time-integrated participation ratio), task-pattern filtering, and group ablation against matched-random control. Validated across 51M to 7B parameters, two architectures, four pretraining pipelines. Finding: 2-6 head induction circuit causally necessary in all models tested (94-100% drop after ablation).

Papers Reasoning Evals
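
For readers unfamiliar with the participation ratio, the standard definition is (sum of eigenvalues)^2 divided by the sum of squared eigenvalues; it estimates how many directions carry the variance. The sketch below computes only this basic quantity. How the paper integrates it over time is not stated in the summary, so that part is left out.

```python
def participation_ratio(eigenvalues):
    """Standard participation ratio: (sum λ)^2 / sum(λ^2).

    Equals n when n eigenvalues are equal, and 1 when a single one dominates.
    """
    total = sum(eigenvalues)
    squares = sum(v * v for v in eigenvalues)
    if squares == 0:
        raise ValueError("at least one eigenvalue must be non-zero")
    return total * total / squares


flat = participation_ratio([1.0, 1.0, 1.0, 1.0])    # spread over 4 directions
peaked = participation_ratio([1.0, 0.0, 0.0, 0.0])  # concentrated in 1
```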

arXiv cs.AI·May 26

Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security

Comprehensive survey on trustworthy agentic AI systems (LLMs augmented with planning, tool use, memory). Examines safety, robustness, privacy, and system security. Proposes unified metrics, benchmarks, and stage-targeted mitigation strategies across agent workflows. Identifies open challenges: self-evolving agents, runtime verification, privacy-preserving personalization.

AI Agents AI safety Benchmarks

arXiv cs.CL·May 26

Word Class Representations Spontaneously Emerge from Successor Representations Trained on Natural Language

Researchers train neural networks on WikiText-103 (103M tokens) using Successor Representations from RL to predict future word distributions. Without explicit linguistic supervision, grammatical categories (nouns, verbs, adjectives) spontaneously emerge and become separable via unsupervised clustering, organized by predictive horizon.

Papers Reasoning Embeddings

arXiv cs.LG·May 26

From One-Pass SGD to Data Reuse: Mini-Batch Scaling Laws in Sketched Linear Regression

Theoretical study of scaling laws for sketched linear regression with mini-batches. Comparative analysis of one-pass SGD, multi-pass SGD with and without replacement. Key result: variance O(min(M,(T_eff*γ)^(1/a))/(B*T_eff)), 1/B reduction in multi-pass without-replacement regime, zero fluctuation at B=N.

Papers Benchmarks Reinforcement learning

arXiv cs.CL·May 26

Improving the Completeness and Comparability of Segment Disclosures: A Large Language Model Approach

An LLM-based framework extracts segment disclosures from Form 10-K filings to improve completeness and comparability of financial data. The system uses RAG to integrate information across multiple periods and firms, demonstrating effectiveness for longitudinal analysis and cross-firm geographic alignment.

RAG Benchmarks

arXiv cs.LG·May 26

Filtered Posterior Mean Collections: A Unified Framework for Analytical Models of Diffusion Generalization

Unified framework (FPMC) modeling denoising functions in diffusion models. Consolidates existing approaches through query precision vectors, response weights, and source distributions. Improves performance via soft relaxations and distribution augmentations.

Image generation Papers Benchmarks

arXiv cs.LG·May 26

Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning

Agent-ToM is a learning-to-monitor framework using Theory-of-Mind reasoning to detect covert malicious behavior in autonomous LLM agents. It infers agent beliefs, intent hypotheses, and behavioral deviations from task-consistent baselines. Evaluated on SHADE-Arena and CUA-SHADE-Arena benchmarks, it outperforms ensemble monitoring baselines with a two-call reasoning pipeline.

AI Agents AI safety Reasoning

arXiv cs.LG·May 26

Knowledge Graph Modulated Deep Learning for Limited-Sample Clinical Data Analysis

Graph-in-Graph (GiG) integrates biological knowledge graphs into deep learning for clinical analysis with limited data. Tested on ~9,700 patients across 5 tasks (cancer detection, prostate diagnosis, pan-cancer classification), GiG outperforms existing methods with gains up to 49 macro-F1 points in limited-sample settings.

Papers Benchmarks RAG

arXiv cs.LG·May 26

Rethinking Continual Anomaly Detection on the Edge: Benchmarking Under Realistic Industrial Conditions

New arXiv paper proposing DINOSaur, a training-free method for continual anomaly detection in industrial settings. Combines a frozen DINOv3 backbone, spatially-indexed coreset memory, and neighborhood-restricted anomaly scoring. Achieves zero forgetting, outperforms all baselines across 5 protocols, and runs inference in under 100 ms on a Jetson Orin Nano, with on-device adaptation in under 30 s.

Benchmarks Vision

arXiv cs.LG·May 26

Mixture of Complementary Agents for Robust LLM Ensemble

Study on optimal model selection in LLM ensembles. Authors reframe proposer selection as a combinatorial problem based on complementarity rather than accuracy alone. Greedy algorithms tested on small labeled sets to balance performance and computational cost.

Multi-agent Evals Benchmarks
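
One way to read "complementarity rather than accuracy alone" is as a coverage problem: on a small labeled set, pick models whose correct answers overlap as little as possible. The sketch below is a generic greedy max-coverage routine under that assumption. The paper's actual objective and algorithm are not given in the summary, and the model names and item IDs are made up.

```python
def greedy_select(correct, k):
    """Greedily pick up to k models that maximize newly covered items.

    `correct` maps a model name to the set of labeled items it answers
    correctly (hypothetical data for this sketch).
    """
    chosen, covered = [], set()
    remaining = dict(correct)
    for _ in range(min(k, len(remaining))):
        name = max(remaining, key=lambda m: len(remaining[m] - covered))
        if not remaining[name] - covered:
            break  # no remaining model adds anything new
        chosen.append(name)
        covered |= remaining.pop(name)
    return chosen, covered


correct = {"A": {1, 2, 3, 4}, "B": {1, 2, 3}, "C": {5, 6}}
chosen, covered = greedy_select(correct, k=2)
```

Here model B is more accurate than C on its own, but C is selected second because it solves items A misses.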

arXiv cs.LG·May 26

Generative Representation Learning on Hyper-relational Knowledge Graphs via Masked Discrete Diffusion

KREPE, a generative representation learning method for hyper-relational knowledge graphs, uses masked discrete diffusion to generate complete facts from partially observed queries. Unifies link prediction and fact generation in a single framework, outperforming LLM-based baselines on standard benchmarks.

Papers Benchmarks Reasoning

arXiv cs.LG·May 26

Iterative Refinement Neural Operators are Learned Fixed-Point Solvers: A Principled Approach to Spectral Bias Mitigation

IRNO (Iterative Refinement Neural Operator) enhances neural operators with an iterative refinement module using fixed-point iteration theory. A progressive spectral loss explicitly targets high-frequency errors. Results: 56% improvement on turbulent flow, error reduction to 1.48-2.04% in high frequencies on Active Matter.

Papers Benchmarks Reasoning

arXiv cs.LG·May 26

LLM-AutoSciLab: Closed-Loop Scientific Discovery via Active Experimentation with LLMs

LLM-AutoSciLab proposes a closed-loop scientific discovery framework coupling hypothesis generation, hypothesis-conditioned experiment selection, and mechanism refinement. Evaluated on ActiveSciBench (57 enzyme-kinetics tasks, 45 gene-regulatory-network tasks), the system achieves 67.6% symbolic accuracy and 2-5x better sample efficiency than competing baselines.

Reasoning AI Agents Benchmarks

arXiv cs.LG·May 26

Verified SHAP: Provable Bounds for Exact Shapley Values of Neural Networks

Algorithm to compute exact bounds on SHAP values for neural networks by leveraging neural network verification. Reduces exponential complexity and scales to orders of magnitude larger search spaces than existing exact methods.

Evals Papers Reasoning

arXiv cs.LG·May 26

ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale

ChaosBench-Logic v2 is a 40,886-question benchmark evaluating logical reasoning of 14 LLMs on 165 dynamical systems. The CARE protocol reveals critical failures: regime-transition reasoning remains near-random (MCC=0.05), while FOL deduction reaches MCC=0.52. Qwen 2.5-32B outperforms proprietary models on indicator diagnostics.

Benchmarks Reasoning Qwen
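
To put the reported MCC values in context, and assuming MCC here means the Matthews correlation coefficient for binary answers, the metric can be computed from confusion-matrix counts as below. A value of 0 corresponds to chance-level agreement and 1 to perfect agreement. The zero-denominator convention used here (return 0) is a common choice, not something taken from the paper.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column is empty
    return (tp * tn - fp * fn) / denom


perfect = mcc(tp=50, tn=50, fp=0, fn=0)
chance = mcc(tp=25, tn=25, fp=25, fn=25)
```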

arXiv cs.LG·May 26

Riemannian Archetypal Analysis: Interpretable non-linear data analysis on deformed star distributions

Riemannian archetypal analysis using data-driven pullback geometry on deformed star distributions. Combines interpretability of classical archetypal analysis with non-linear model expressiveness. Riemannian archetypal mapping (RAM) projects onto manifolds of geodesically convex archetype combinations. Experiments on MNIST demonstrate meaningful geodesics and geometry-aware denoising.

Papers Reasoning

arXiv cs.LG·May 26

Feature Lottery? A Bifurcation Theory of Concept Emergence

Applies bifurcation theory to detect, in real time, the emergence of structured representations in neural networks. A dynamic ratio β(t)/βc(t) based on the loss Hessian predicts four distinct transition regimes (SAE on Pythia, SSL CIFAR, arithmetic grokking). At 5% of training, early atom purity predicts final convergence with a 12x improvement over baseline.

Papers Reasoning Fine-tuning

arXiv cs.LG·May 26

Algometrics: Forecasting Under Algorithmic Feedback

A theoretical framework (algometrics) analyzes deployment risks of predictive models in algorithmic markets, where predictions modify future data. Authors prove deployment risk is not identifiable from historical data alone, and model rankings can invert under crowding.

Benchmarks Papers

arXiv cs.LG·May 26

A lift for input-convex neural network training

Novel training method for input-convex neural networks (ICNNs) using an unconstrained hypernetwork that emits inter-layer weights. Approach inspired by parameter-extension lifts from PDE-constrained inverse problems, circumvents limitations of projected gradient descent and softplus reparametrization. Results on log-concave density estimation and convex-potential normalizing flows show improved convergence.

Papers Reasoning Reinforcement learning

arXiv cs.LG·May 26

Interdomain Attention: Beyond Token-Level Key-Value Memory

Interdomain Attention merges transformers and state space models via kernel methods: attention features are projected onto basis functions maintained by an SSM, enabling query-conditioned attention over fixed-size state. On FineWeb-Edu (125M–1.3B), outperforms softmax baselines at 1.3B on validation perplexity and commonsense tasks, with length-flat behavior up to 3.5× training context.

Reasoning Benchmarks Papers

arXiv cs.AI·May 26

From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems

Survey on reproducibility of AI systems in regulated finance (credit, fraud, AML). Identifies three sources of nondeterminism: post-hoc explanation variance (tabular models), stochastic sampling (graph networks), batch-dependent divergence (LLM agents). Proposes evaluation framework with RBO, D_cos, TDI, PSD metrics for audit readiness.

Evals AI safety Regulation

arXiv cs.AI·May 26

Why We Need World Models for AGI: Where LLMs Fail and How World Models May Outperform

arXiv paper arguing LLMs fail at causal reasoning and long-horizon planning due to lack of world models. Authors introduce Latent Dynamics Inference (LDI) and Flux, a sequential reasoning environment specified in natural language. RL agents with explicit latent state access achieve 79% win rate vs 11% for LLMs, revealing failures in persistent state tracking.

Reasoning Reinforcement learning Papers

arXiv cs.AI·May 26

Reason--Imagine--Act: Closed-Loop LLM Decision Making with World Models for Autonomous Driving

RIA couples an LLM with an action-conditioned world model for autonomous driving. At each step, the LLM proposes actions, the world model validates via short-horizon rollouts, and a safety scorer selects the safest action. On CARLA (1000 episodes): 80.05% route completion, 51.10% arrival rate, 0.20% collision rate.

Reasoning AI Agents Benchmarks

arXiv cs.AI·May 26

EvoSci: A Bio-Inspired Multi-Agent Framework for the Evolution of Scientific Discovery

EvoSci is a bio-inspired multi-agent framework for scientific discovery using LLMs. It integrates evolution, knowledge graphs, and specialized agents (mentor, researcher, reviewer) to iteratively generate, evaluate, and refine research ideas. On real-world topics, EvoSci achieves ICLR peer-review score of 4.90 and Top-10 ranking of 54%.

Multi-agent AI Agents Reasoning

arXiv cs.AI·May 26

BoxLitE: A Faithful Knowledge Base Embedding Based on Convex Optimization

BoxLitE is a knowledge base embedding model for DL-Lite^H using convex optimization. It maps concepts to convex regions in vector space to represent ontological hierarchies. For any satisfiable DL-Lite^H KB, BoxLitE produces a weakly faithful embedding.

Embeddings Reasoning Papers

arXiv cs.AI·May 26

DRIVE: Modeling Skills at the Reasoning and Interaction Levels for Web Agents under Continual Learning

DRIVE is a dual-level skill modeling framework for web agents under continual learning. It separates experiences into reasoning skills (transferable task logic across websites) and interaction skills (executable page-specific operations). On WebArena, DRIVE achieves 52.8% task success rate, +7.3pp over skill-free baseline.

AI Agents Reasoning Papers

arXiv cs.AI·May 26

BODHI: Precise OS Kernel Specification Inference

BODHI, a domain knowledge prompting method, improves automated OS kernel specification generation via LLMs. Tested on 9 models (Anthropic, Mistral, Amazon, DeepSeek, Meta, Alibaba), it reaches 96.73% Pass@1 with Claude Opus 4.6 versus 55.10% baseline, by structuring C-to-Python translation across pattern categories.

Prompt engineering Benchmarks Code generation

arXiv cs.AI·May 26

Practical Quantum CIM Empowerment via All-Domestic-Core Agentic Large Model

Femtosecond laser-pumped Coherent Ising Machine (CIM) integrated with LLM-driven agentic system using LangGraph and LangChain. Large language models automatically calibrate QUBO/Ising models, iterate constraint weights, and validate schemes. Fully implemented on domestic Chinese models and hardware.

AI Agents MCP Reasoning

arXiv cs.AI·May 26

A Dynamical Framework for Cognitive Processes Based on Transformations and Semantic Equivalence

Dynamical framework for modeling cognitive processes via feedback systems. Cognitive states evolve through X_{t+1} = π(F(f(X_t))) where f describes internal transformations, F interpretative mappings, π enforces semantic equivalence. Categorical formulation and stability analysis via fixed-point arguments. Linguistic application: context-dependent interpretation as trajectory toward stable semantic class.

Reasoning Papers
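
The update rule X_{t+1} = π(F(f(X_t))) and its fixed-point reading can be made concrete with a toy example. In the sketch below the three maps are arbitrary integer functions chosen only so that the iteration visibly settles. They are assumptions for illustration and have no connection to the paper's cognitive or linguistic models.

```python
def iterate_to_fixed_point(x0, f, F, pi, max_steps=100):
    """Apply X_{t+1} = pi(F(f(X_t))) until the state stops changing."""
    trajectory = [x0]
    for _ in range(max_steps):
        nxt = pi(F(f(trajectory[-1])))
        if nxt == trajectory[-1]:
            return trajectory  # last entry is a fixed point
        trajectory.append(nxt)
    raise RuntimeError("no fixed point reached within max_steps")


# Toy integer maps, chosen only for illustration.
f = lambda x: (x + 8) // 2   # internal transformation
F = lambda x: x              # interpretative mapping (identity here)
pi = lambda x: x - x % 2     # representative of the class {2k, 2k + 1}

path = iterate_to_fixed_point(20, f, F, pi)
```

Starting from 20, the trajectory is 20, 14, 10, 8, and 8 maps to itself, which plays the role of the stable class.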

arXiv cs.AI·May 26

How Much Thinking is Enough? Quantifying and Understanding Redundancy in LLM Reasoning

Study quantifying reasoning redundancy in LLMs: 61-93% of thinking steps can be truncated without affecting correct answers. Analysis covers 4 frontier models and 2 math benchmarks (including MATH-500). Redundancy is structural, stemming from length-agnostic outcome rewards, not a model-specific artifact.

Reasoning Benchmarks Papers

arXiv cs.CL·May 26

CSP-Atlas: Concept-Specific Neural Circuits in a Sparse Python Transformer

Study identifies 106 dedicated neural circuits in a sparse 8-layer transformer trained on Python code. Circuits organize by computational principles (atomicity, lexical ambiguity) rather than semantics. Up to 62.5% of loudest-firing neurons at mid-to-late layers are concept-specific for AST constructs.

Code generation Reasoning Papers

arXiv cs.CL·May 26

Measuring the Depth of LLM Unlearning via Activation Patching

New UDS (Unlearning Depth Score) metric to evaluate whether knowledge is truly erased in LLMs. Via activation patching, UDS measures mechanistic depth of unlearning layer-by-layer. Evaluation on 150 models and 8 methods: UDS outperforms 20 existing metrics in faithfulness and robustness.

AI safety Alignment Evals

arXiv cs.LG·May 26

LLMs Show No Signs Of Individuated Metacognition

Analysis of 20 frontier LLMs across 6 benchmarks: stated confidence does not reflect individual model capabilities. Tetrachoric factor analysis reveals confidence matrix is approximately rank-one. Models share a common item-difficulty axis and differ mainly in decision thresholds. No evidence of significant verbalised individuated metacognition found.

Evals Benchmarks Reasoning

arXiv cs.AI·May 26

Low-Cost Labels, Reliable Choices: Rollout-Calibrated Hyper-Heuristics for Job Shop Scheduling

Learning-assisted hyper-heuristics for Job Shop Scheduling (JSSP). Proposed selector uses regret-normalized rollout labels, contextual KNN uncertainty estimation, and a gate that acts only when predicted gain exceeds uncertainty-adjusted margin. Reduces Random-HH mean RPD by over an order of magnitude on synthetic instances.

Reinforcement learning Benchmarks Papers
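
The gating idea, acting only when the predicted gain exceeds an uncertainty-adjusted margin, can be sketched with a plain k-nearest-neighbor estimate. Everything specific below is an assumption made for illustration: the feature vectors, the use of the neighbors' mean as the prediction and their standard deviation as the uncertainty, and the `margin` and `kappa` parameters. The paper's regret-normalized labels are not modeled.

```python
import math
import statistics


def knn_gate(query, examples, k=3, margin=0.0, kappa=1.0):
    """Decide whether to act, from the k nearest labeled examples.

    `examples` is a list of (feature_vector, observed_gain) pairs.
    Returns (act, predicted_gain, uncertainty).
    """
    nearest = sorted(examples, key=lambda e: math.dist(query, e[0]))[:k]
    gains = [gain for _, gain in nearest]
    predicted = statistics.fmean(gains)
    uncertainty = statistics.pstdev(gains)  # spread among the neighbors
    act = predicted > margin + kappa * uncertainty
    return act, predicted, uncertainty


examples = [((0.0, 0.0), 1.0), ((0.0, 1.0), 1.0),
            ((1.0, 0.0), 1.0), ((5.0, 5.0), -3.0)]
act, predicted, uncertainty = knn_gate((0.0, 0.0), examples)
```

When the neighbors disagree, the uncertainty term grows and the gate falls back to the default heuristic instead of acting.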

arXiv cs.AI·May 26

EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

EvoCode-Bench evaluates 13 coding agents on 26 tasks with 5-15 iterative rounds. Agents must maintain a working codebase as specifications change. Results: 22-40 point gap between single-round (SR) and multi-turn (MT@4) performance, <50% success on multi-turn metrics, and progressive degradation (pass rate halved by round 5).

Code generation AI Agents Benchmarks

SIG

HYP

arXiv cs.AI·May 26

HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models

HyperGuide uses hyperbolic geometry to guide multi-step reasoning in LLMs. A lightweight head projects hidden states into hyperbolic space, where distance-to-origin encodes solution proximity. A low-rank adapter is fine-tuned interactively. Consistent gains across benchmarks, with larger improvements on deeper reasoning chains.

Reasoning Fine-tuning

SIG

HYP

arXiv cs.CL·May 26

Distinguishing Right from Wrong in Debates: Attribution Analysis of Chinese Harmful Memes

New arXiv paper on interpretable detection of harmful Chinese memes. Authors create Ex-ToxiCN-MM, first explanation dataset with opposing interpretations (harmful/non-harmful), and C-HarmKB, Chinese cultural knowledge base. They propose RIKE, attribution analysis framework with AKE and RIR modules, outperforming baselines. Code and data open-sourced.

Vision AI safety Evals

SIG

HYP

arXiv cs.CL·May 26

Generating Legal Commentaries from Case Databases via Retrieval, Clustering, and Generation

Automated pipeline transforms 4,555 German Federal Court decisions into legal commentaries. Extracts paragraph-level chunks, summarizes reasoning, embeds and clusters keywords. LLMs generate headings and citation-rich sections merged into coherent commentaries. Evaluated on 5 dimensions, including topical relevance, citation faithfulness, cluster distinction, and logical ordering.

RAG Code generation Evals

SIG

HYP

arXiv cs.CL·May 26

Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval

Unveil is a visual-textual embedding framework for multi-modal document retrieval. It integrates textual and visual features through knowledge distillation, transferring semantic capabilities from a visual-textual model to a purely visual model. Results: improved retrieval accuracy and efficiency without parsing.

RAG Embeddings Vision

SIG

HYP

arXiv cs.CL·May 26

WhenLoss: Diagnosing Write and Retrieval Bottlenecks in Long-Context Memory Systems

WhenLoss introduces a diagnostic protocol to identify bottlenecks in long-context memory systems. Expected Predictive Compression (EPC) uses an LLM to anticipate future questions and preserve minimal evidence at write time. On LongMemEval (500 questions), EPC achieves 0.49 CSM score vs 0.44 for strongest baseline, reducing write-side gap to 0.04.

RAG Reasoning Benchmarks

SIG

HYP

arXiv cs.CL·May 26

Temporal Concept Drift in Legal Judgment Prediction: Neural Baselines Across Three Epochs of Ukrainian Court Decisions

Study of temporal concept drift in legal NLP on 428K Ukrainian court decisions (2008-2026). Four transformer models (XLM-RoBERTa, legal variants) show severe forward degradation (−27.2 pp macro-F1) but robust backward transfer. Chronological continual learning eliminates catastrophic forgetting.

Benchmarks Fine-tuning Papers

SIG

HYP

arXiv cs.CL·May 26

Improving Labeling Consistency with Detailed Constitutional Definitions and AI-Driven Evaluation

Method to improve consistency in automated labeling pipelines for content moderation. Authors propose an AI-driven workflow where an LLM writes detailed per-category constitutions (harassment, hate speech, non-violent crime), then a frontier LLM interprets them to generate golden labels. Result: 57x reduction in cross-model inconsistency vs paragraph definitions.

Evals AI safety Alignment

SIG

HYP

arXiv cs.CL·May 26

Phonetic Modeling of Dialectal Variation in Vietnamese Speech

Dialect-aware phonetic framework for Vietnamese speech recognition. Decomposes syllables into structured phonetic components mapped to dialect-specific IPA representations. On UIT-ViMD dataset, matches wav2vec2-base-vi-250h performance with fewer parameters and no external pretraining.

SIG

HYP

arXiv cs.CL·May 26

Side-by-side Comparison Amplifies Dialect Bias in Language Models

arXiv paper demonstrating that language models amplify dialect bias (AAVE vs Standard American English) when comparing tweet pairs side-by-side, far more than in isolated evaluation. Counterfactual fairness finetuning partially mitigates bias in isolation but fails in contrastive settings, exposing a critical gap in current evaluation frameworks.

Benchmarks AI safety Alignment

SIG

HYP

arXiv cs.CL·May 26

End-to-End Intracortical Speech Decoding from Neural Activity

Speech decoding from intracortical recordings in an ALS patient without external language model. End-to-end Conformer decoder achieves 23.80% character error rate on held-out validation data. Main errors stem from word boundary segmentation failures.

Reasoning Benchmarks AI safety

SIG

HYP

arXiv cs.CL·May 26

Teaching Through Analogies: A Modular Pipeline for Educational Analogy Generation

Modular pipeline for educational analogy generation in four stages (source finding, sub-concept generation, explanation, evaluation). Evaluation of 12 LLMs across two annotated datasets (SCAR, ParallelPARC). Sub-concepts improve explanation quality and retrieval precision. Claude Sonnet 4.6 aligns better with human rankings than absolute scores.

Claude Papers Evals

SIG

HYP

arXiv cs.LG·May 26

Overcoming "Physics Shock" in Earth Observation: A Heteroscedastic Uncertainty Framework for PINN-based Flood Inference

Heteroscedastic uncertainty-aware PINN framework for flood extent mapping from SAR data. Attention-Gated FNO-UNet with dynamic Warm-Start protocol and aleatoric uncertainty modeling prevents gradient divergence ("Physics Shock"). On Sen1Floods11: +25% relative IoU improvement over deterministic baselines, with calibrated confidence bounds for disaster response.

Papers Reasoning Evals

SIG

HYP

arXiv cs.LG·May 26

Signs Beat Floats: Low-Rank Double-Binary Adaptation for On-Device Fine-Tuning

LoRDBA replaces low-rank LoRA adapter factors with binary sign carriers and channel-wise magnitude scales, reducing adapter footprint by over 10× while matching fp16 LoRA quality. Outperforms low-bit baselines at matched model sizes with ≤8% prefill latency overhead and ~1.6× training memory overhead versus fp16 LoRA.

Fine-tuning

SIG

HYP

arXiv cs.LG·May 26

TUBE: Tangent Upper Bound on Evidence for Discrete Diffusion Language Models

TUBE is a variational upper bound on log-likelihood for discrete diffusion models. Unlike existing ELBOs, TUBE admits an unbiased Monte Carlo estimator and applies to masked diffusion models, any-order ARMs, and block variants. Experiments show discrete diffusion models lie strictly below exact ARM baselines in likelihood.

Papers Benchmarks Evals

SIG

HYP

arXiv cs.CL·May 26

Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges

Study on rationalization bias in LLM judges. Researchers test whether model explanations remain stable when non-evidential cues are perturbed (verbosity, confidence). They propose PROOF-BEFORE-PREFERENCE to improve cue invariance and reduce explanation anchoring.

Evals Reasoning Alignment

SIG

HYP

arXiv cs.LG·May 26

Towards Verifiable Transformers: Solver-Checkable Circuit Explanations

Verifiable Transformers framework converts task-localized Transformer circuits into solver-checkable formal claims. Extracts circuits and verifies functional equivalence, edge necessity, invariance, and robustness via SMT encoding. Demonstrates direct verification on symbolic tasks and surrogate-mediated verification at GPT-2 scale with SMT-representable operators (Signed L1 BandNorm, sparsemax, LeakyReLU).

Reasoning AI safety Papers

SIG

HYP

arXiv cs.CL·May 26

Discovering Lexical Gaps Using Embeddings from Multilingual LLMs

Automated framework to detect lexical gaps (words absent in certain languages) using embeddings from multilingual LLMs. On Korean-English translation pairs, 4000 embedding spaces show gap words have weaker cross-lingual semantic alignment. Logistic classifiers achieve AUC 0.81–0.76 and retrieve 18/19 and 26/27 gap words.

Embeddings Benchmarks Papers

SIG

HYP

Le Big Data·May 26

MiniCPM5-1B: this tiny 0.5 GB AI already buries much bigger models

MiniCPM5-1B, a 1-billion-parameter model weighing 0.5 GB, outperforms significantly larger models. Demonstrates that efficiency and performance do not require massive scale.

Open source Benchmarks

SIG

HYP

Hacker News (AI)·May 26

SK Group chairman says memory chip shortage will last until 2030

SK Group chairman forecasts memory chip shortage extending to 2030. Statement reflects sustained high demand for semiconductors driven by AI infrastructure needs.

Infrastructure

SIG

HYP

Reddit r/LocalLLaMA·May 26

Added direct model downloads right from the UI in Anubis OSS - if anyone would help test that would be great

Anubis OSS v3.6, macOS app for benchmarking local LLMs (Ollama, LM Studio, MLX), adds direct model downloads from UI. Available via Homebrew and direct download. Call for testing on Apple Silicon. GPL-3.0, open-source, leaderboard with 400+ runs.

Open source Tools Benchmarks

SIG

HYP

Reddit r/LocalLLaMA·May 26

New local model reaching near frontier on PII removal at 9 ms CPU inference

A new local model achieves near-frontier performance on PII removal with 9 ms CPU inference. The author seeks feedback on the project.

Open source Code generation AI safety

SIG

HYP

Vercel AI Blog·May 26

Sandbox persistence is now GA

Filesystem persistence for Vercel Sandboxes is now GA and enabled by default. Snapshots are automatic, and sandboxes resume from the latest saved state. New additions: fork(), getOrCreate(), and delete() methods, plus custom tags and lifecycle hooks.

Tools Infrastructure

SIG

HYP

Vercel AI Blog·May 26

Vercel Domains now supports price sorting and availability filtering

Vercel Domains adds price sorting and availability filtering. Lower-cost domains appear first, unavailable domains are pushed to the bottom of search results.

Tools

SIG

HYP

Vercel AI Blog·May 26

Microfrontends routing now applies to vc alias and branch domains

Vercel rolls out routing update for Microfrontends. Aliases created with `vc alias` now inherit full routing config from source deployment. Branch-assigned domains now route to that branch across all projects in the Microfrontend, not just the owning project.

Infrastructure Tools

SIG

HYP

Vercel AI Blog·May 26

Firecrawl joins the Vercel Marketplace

Firecrawl now available on Vercel Marketplace. Vercel teams can power AI agents and applications with structured web data without managing crawling infrastructure. Key features: scrape pages to markdown/HTML/structured data, search and retrieve full page content, interact with dynamic websites via AI prompts.

AI Agents RAG Tools

SIG

HYP

Simon Willison·May 25

Notes on Pope Leo XIV's encyclical on AI

Vatican releases Pope Leo XIV's encyclical Magnifica Humanitas on AI and human dignity. The document addresses ethics of AI integration into modern society, referencing Pope Leo XIII's 1891 Rerum novarum on capital and labor. Leo XIV discusses challenges posed by this new industrial revolution.

Regulation AI safety Alignment

SIG

HYP

Hacker News (AI)·May 25

Using AI to write better code more slowly

Article exploring the paradox of AI for programming: tools generate code faster but encourage counterproductive practices. The author advocates a deliberate approach prioritizing quality and understanding over raw speed.

Code generation Prompt engineering

SIG

HYP

Hacker News (AI)·May 25

Cox Media fined after bragging it spied on users through their phones

Cox Media fined for spying on users through their phones. The company collected location data without explicit user consent.

Regulation AI safety

SIG

HYP

Reddit r/MachineLearning·May 25

Aiki my local Wikipedia Retrieval-Augmented Generation system [R]

Aiki is a lightweight local RAG tool for chatting with Wikipedia offline. It downloads and chunks Wikipedia articles, uses a custom TF-IDF + cosine similarity retriever, supports query expansion via Wikipedia links, and optional LLM-based answer generation. Minimal dependencies, fully local execution.
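A TF-IDF + cosine similarity retriever of the kind described can be sketched with the standard library alone. This is a minimal illustration, not Aiki's code: the tokenizer, the smoothed IDF formula, and the names `build_index` and `retrieve` are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vector(tokens, idf):
    """L2-normalized sparse TF-IDF vector; terms unknown to idf are dropped."""
    counts = Counter(tokens)
    vec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def build_index(chunks):
    """Compute smoothed IDF weights and one TF-IDF vector per chunk."""
    tokenized = [tokenize(c) for c in chunks]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(chunks)
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return idf, [tfidf_vector(toks, idf) for toks in tokenized]

def retrieve(query, chunks, idf, vectors, k=2):
    """Return the k chunks with the highest cosine similarity to the query."""
    q = tfidf_vector(tokenize(query), idf)
    # Vectors are unit length, so the dot product is the cosine similarity.
    scores = [sum(w * v.get(t, 0.0) for t, w in q.items()) for v in vectors]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [(chunks[i], scores[i]) for i in ranked[:k]]

chunks = [
    "Paris is the capital of France.",
    "The mitochondria is the powerhouse of the cell.",
    "Berlin is the capital of Germany.",
]
idf, vectors = build_index(chunks)
results = retrieve("capital of France", chunks, idf, vectors)
```

The retrieved chunks would then be passed to the optional LLM step as context; query expansion via Wikipedia links is not shown here.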

RAG Vector search Open source

SIG

HYP

Reddit r/LocalLLaMA·May 25

Update on 12x32gb sxm v100 cluster / local AI for legal drafting

A lawyer shares his experience running a cluster of 12 V100-SXM2 32GB GPUs for local legal document drafting. After abandoning vLLM because of Volta GPU incompatibility with MoE models, he switched to llama.cpp with Gemma-4-26B and Qwen3.5-122B. Dense models on V100 are inefficient (~20-28 tok/s); MoE models achieve 50-113 tok/s decode on long-context legal prompts.

Llama Open source Infrastructure

SIG

HYP

Reddit r/LocalLLaMA·May 25

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

ThriftAttention introduces selective mixed precision for optimized FP4 attention on long contexts. The method reduces memory consumption and accelerates inference by applying varying precision levels to critical attention regions.

Llama Fine-tuning Infrastructure

SIG

HYP

Reddit r/LocalLLaMA·May 25

Using Local LLMs for Generating Custom Interactive Recursive Textbooks on the Fly

A r/LocalLLaMA user demonstrates generating custom interactive recursive textbooks on-the-fly using local LLMs. The project leverages models' ability to dynamically adapt educational content based on learner needs in real-time.

Open source Tools RAG

SIG

HYP