Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory

Paper formalizing AI agent memory as a distinct data-management workload. Proposes GEM (Governed Evolving Memory) with four state-level operators (ingestion, revision, forgetting, retrieval) and six correctness conditions. Proves that record-level systems cannot satisfy these conditions. A prototype, MemState, is built on a property-graph backend.

AI Agents Papers Infrastructure

SIG

HYP

arXiv cs.LG·May 27

The Constraint Tax: Measuring Validity-Correctness Tradeoffs in Structured Outputs for Small Language Models

Study of the cost of structured-output constraints for small language models (under 3B parameters). Tests on Qwen2.5-0.5B/1.5B and SmolLM2-1.7B show that enforcing JSON schema validity (61.5% → 100%) reduces answer accuracy (19.7% → 11.0%) and increases semantically invalid outputs (49.5% → 88.9%). Recommendation: report schema validity, answer accuracy, and semantic error rates separately.
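The recommendation to report the three rates separately can be sketched as below. This is an illustration only: the schema (a JSON object with a string `answer` field), the `ALLOWED` answer space, and the rule for what counts as "semantically invalid" are assumptions made for the example, not the paper's definitions.

```python
import json

# Hypothetical answer space; the paper's tasks and schema are not given in the summary.
ALLOWED = {"A", "B", "C", "D"}

def evaluate(raw_outputs, gold_answers):
    """Return schema validity, answer accuracy, and semantic error rate as separate numbers."""
    n = len(raw_outputs)
    schema_valid = correct = semantic_invalid = 0
    for raw, gold in zip(raw_outputs, gold_answers):
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not parseable: counts against schema validity and accuracy
        if not (isinstance(obj, dict) and isinstance(obj.get("answer"), str)):
            continue  # parseable but does not match the assumed schema
        schema_valid += 1
        if obj["answer"] not in ALLOWED:
            semantic_invalid += 1  # well-formed JSON carrying a meaningless answer
        elif obj["answer"] == gold:
            correct += 1
    return {
        "schema_validity": schema_valid / n,
        "answer_accuracy": correct / n,
        "semantic_error_rate": semantic_invalid / n,
    }

outputs = ['{"answer": "A"}', '{"answer": "maybe"}', 'not json', '{"answer": "B"}']
gold = ["A", "B", "C", "C"]
report = evaluate(outputs, gold)
```

Keeping the three counters apart is what lets a rise in schema validity show up alongside a drop in accuracy instead of hiding it in one aggregate score.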

Qwen Code generation Evals

SIG

HYP

arXiv cs.CL·May 27

Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations

Mechanistic analysis of LLM hallucinations on linearized structured knowledge (graphs, tables). Hallucinations stem from systematic internal dynamics: attention concentrates disproportionately on shortcut structural cues, feed-forward representations fail to ground the provided knowledge, and the model reverts to parametric memory. The patterns generalize to multi-hop graphs and tabular data.

Reasoning Papers AI safety

SIG

HYP

arXiv cs.LG·May 27

Unified Neural Scaling Laws

The Unified Neural Scaling Law (UNSL) is a functional form that models how performance varies simultaneously with model parameters, training dataset size, training steps, inference steps, compute, and hyperparameters. Validated across vision, language, math, and reinforcement learning, with more accurate extrapolations than existing scaling laws.

Benchmarks Papers Reasoning

SIG

HYP

arXiv cs.LG·May 27

AirCast-SR: A Foundation Model for Kilometer-Scale Atmospheric Super-Resolution via Latent Consistency Diffusion

AirCast-SR is an atmospheric super-resolution foundation model that downscales global AI weather forecasts from 28 km to 1 km horizontal resolution. Built on a 3D U-Net conditioned within a Latent Consistency Model diffusion framework, trained on GraphCast forecasts and NOAA data, it produces 67-hour forecasts with near-zero bias and demonstrates zero-shot global transferability to India and Germany.

Papers Benchmarks Open source

SIG

HYP

arXiv cs.AI·May 27

JobBench: Aligning Agent Work With Human Will

JobBench evaluates 36 AI models (including Claude Opus at 45.9%) on 130 real professional tasks across 35 occupations. Unlike existing benchmarks focused on economic value, JobBench prioritizes workflows experts identify as high-priority for delegation, favoring human augmentation over replacement.

AI Agents Benchmarks Claude

SIG

HYP

arXiv cs.AI·May 27

MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning

MedGuideX transforms clinical practice guideline (CPG) recommendations into executable decision logic to generate question-answering training data. Post-training a medical LLM on this data improves accuracy by 10.28% across four clinical reasoning benchmarks and produces physician-preferred rationales in faithfulness, validity, completeness, and clarity.

Fine-tuning Reasoning Evals

SIG

HYP

arXiv cs.LG·May 27

MechRL: Reinforcement Learning Agents Perform Circuit Discovery for Mechanistic Interpretability

MechRL uses a PPO agent operating over 144 attention heads of GPT-2 small to automatically discover mechanistic circuits. Trained on induction and IOI tasks, the agent identifies causally relevant heads via zero-ablation and contrastive rewards, generalizing to docstring completion (96% of oracle with best-of-five planning).

Reinforcement learning Evals Papers

SIG

HYP

arXiv cs.LG·May 27

TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models

TSFMAudit is presented as the first contamination auditing method for time series foundation models (TSFMs). It detects whether evaluation datasets were exposed during pretraining by analyzing fine-tuning adaptation dynamics: contaminated data exhibits unusually fast loss reduction. Evaluated on 6 TSFMs and 187 datasets.

Benchmarks Evals Papers

SIG

HYP

arXiv cs.LG·May 27

ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling

ARBITER corrects majority vote failures in test-time sampling. Reasoning trajectories cluster into stable basins that aren't necessarily accurate. ARBITER uses hidden states and model-derived evidence to add conservative signals to consensus, recovering ~22% of oracle gap on Llama-3.1-8B MMLU-HS-Math (78%→82%).
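A toy sketch of the two ideas in this summary: a majority vote that settles on a stable but wrong cluster of answers, and the "fraction of oracle gap recovered" arithmetic. The sample answers are invented, the oracle is assumed to mean "at least one sample is correct", and the 96% oracle accuracy is a hypothetical value chosen for illustration; the summary does not report it.

```python
from collections import Counter

def majority_vote(samples):
    """Return the most frequent final answer among sampled trajectories."""
    return Counter(samples).most_common(1)[0][0]

def oracle_hit(samples, gold):
    """Assumed oracle: the question counts as solved if any sample is correct."""
    return gold in samples

def gap_recovered(base_acc, method_acc, oracle_acc):
    """Share of the distance between majority vote and the oracle that a method closes."""
    return (method_acc - base_acc) / (oracle_acc - base_acc)

# Invented example: three trajectories agree on a wrong answer, two find the right one.
samples = ["12", "12", "12", "15", "15"]
gold = "15"
voted = majority_vote(samples)       # the stable but wrong basin wins
solvable = oracle_hit(samples, gold)  # yet a correct sample exists

# 78% -> 82% is from the summary; 0.96 is a hypothetical oracle accuracy.
fraction = gap_recovered(0.78, 0.82, 0.96)
```

With these numbers the fraction comes out near 0.22, which shows how a 4-point gain can correspond to roughly a fifth of the oracle gap when the oracle sits far above the majority-vote baseline.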

Reasoning Evals Benchmarks

SIG

HYP

arXiv cs.LG·May 27

GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training

GAC is an adaptive controller for hybrid SFT-RL post-training that dynamically adjusts mixing weights based on online estimates of gradient variance and of disagreement between the two training signals. Tested on math, code, science, and logic benchmarks, GAC improves over fixed-weight mixing baselines with less than 1% computational overhead.

Reinforcement learning Fine-tuning Benchmarks

SIG

HYP

arXiv cs.AI·May 27

Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

arXiv paper showing that models with statistically indistinguishable atomic knowledge still fail systematically to chain those facts in multi-hop reasoning (>40 percentage point gap). Aggregate metrics mask this 'composition collapse'. The authors introduce a double-gate protocol that decomposes post-training gains into three independent channels: atomic stability, residual composition, and critical depth.

Reasoning Benchmarks Evals

SIG

HYP

arXiv cs.LG·May 27

Curriculum Learning for Safety Alignment

Staged-Competence, a curriculum learning framework, improves robustness of DPO-based safety alignment. Across three model families, it reduces out-of-distribution harmful response rates by 16% and jailbreak attack success rates by 20%, while preserving general capabilities. The framework achieves baseline safety with 75% of training data.

AI safety Alignment Reinforcement learning

SIG

HYP

arXiv cs.AI·May 27

It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

Study of 432 experiments across 6 models (4 capability tiers) testing whether higher-capability models need less structural guidance. Results refute monotone relationship: Gemini 2.5 Flash performance drops 29-38pp with increased harness verbosity. Qwen3.5-122B (reasoning) achieves 91.7% VTSR with strict harness. Six-label failure taxonomy introduced.

AI Agents Evals Reasoning

SIG

HYP

arXiv cs.CL·May 27

FAB-Bench: A Framework for Adaptive RAG Benchmarking in Semiconductor Manufacturing

FAB-Bench is an adaptive benchmarking framework for evaluating RAG systems in semiconductor manufacturing. It defines 6 diagnostic metrics (factual accuracy, contextual utilization, completeness, retrieval relevance, technical depth, reasoning consistency) across context windows of 4K-32K tokens. Benchmark of 200 query-answer pairs tested on 4 LLMs and 4 RAG frameworks.

RAG Benchmarks Evals

SIG

HYP

arXiv cs.AI·May 27

LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

LiveK12Bench is a dynamic multi-disciplinary benchmark evaluating reasoning capabilities of multimodal models on 2K+ real exam questions (Math, Physics, Chemistry, Biology). Tests reveal major performance degradation: GPT-5 drops from 79 to 53/100 under realistic exam constraints. Framework includes automated anti-contamination pipeline and end-to-end 'Mock Exam' evaluation scheme.

Benchmarks Vision Reasoning

SIG

HYP

arXiv cs.LG·May 27

Stateful Inference for Low-Latency Multi-Agent Tool Calling

Stateful inference architecture for multi-agent tool calling with persistent KV cache across turns, reducing cost from O(n_t) to O(Δ_t). 2.1× speedup on 6-turn workflows, 4.2× on 35-turn median vs vLLM/SGLang.
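The O(n_t) versus O(Δ_t) claim can be illustrated by counting prompt tokens that must be encoded per turn. This is a simplified cost model, assuming n_t is the full history length at turn t and Δ_t is the number of new tokens in that turn; the function names and the 200-token turn length are made up for the example, and token counts are not the same thing as the end-to-end speedups quoted above.

```python
def prefill_tokens_stateless(turn_lengths):
    """Every turn re-encodes the whole history, so turn t costs n_t tokens."""
    total = 0
    history = 0
    for delta in turn_lengths:
        history += delta   # n_t grows with each turn
        total += history   # the full history is processed again
    return total

def prefill_tokens_stateful(turn_lengths):
    """With a KV cache kept across turns, turn t only encodes its Δ_t new tokens."""
    return sum(turn_lengths)

# Hypothetical 6-turn workflow with 200 new tokens per turn.
turns = [200] * 6
stateless = prefill_tokens_stateless(turns)
stateful = prefill_tokens_stateful(turns)
ratio = stateless / stateful
```

In this model the stateless total grows quadratically with the number of turns while the stateful total grows linearly, which is consistent with the summary reporting a larger speedup on longer workflows.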

AI Agents Multi-agent Infrastructure

SIG

HYP

arXiv cs.LG·May 27

Provably Communication-Efficient and Privacy-Preserving Federated Graph Neural Networks

CE-FedGNN is a federated framework for graph neural networks that reduces communication by infrequently exchanging aggregated node representations instead of per-round embeddings. A moving-average estimator handles cross-client dependencies and staleness. The framework provides privacy guarantees via metric-DP and achieves O(1/√T) convergence with O(T^(3/4)) communication complexity.

SIG

HYP

arXiv cs.LG·May 27

A PAC-Bayesian View of Generalisation for Physics-Informed Machine Learning

PAC-Bayesian framework for physics-informed machine learning (PIML) integrating partial differential equations. Provides high-probability generalisation guarantees with unbounded losses via multi-task perspective. Non-vacuous bounds validated on standard PDE benchmarks.

Papers Reasoning Benchmarks

SIG

HYP

arXiv cs.LG·May 27

Online Learning on Hidden-Convex Losses via Algorithmic Equivalence: Optimal Regret, Geometric Barrier, and Bandit Feedback

Study of adversarial online learning on hidden-convex losses (nonconvex losses that become convex after reparameterization). The authors prove that online gradient descent (OGD) achieves optimal Θ(√T) regret, improving on the prior O(T^(2/3)) result. They characterize a necessary-and-sufficient Hessian compatibility condition and extend the analysis to bandit feedback with O(T^(3/4)) regret.

Papers Reinforcement learning Benchmarks

SIG

HYP

arXiv cs.CL·May 27

Probing Minimalist Phase Structure in LLMs: What Universal Dependencies Cannot Represent

Researchers test whether LLMs encode formal syntactic structures (Minimalist Program phase boundaries) invisible to Universal Dependencies. Across 13 models (4 families), 12/13 show a phase-count gradient, and 13/13 display an asymmetry predicted by phase-internal cohesion. Activation patching confirms these representations are causally active in 12/13 models.

Papers Reasoning Evals

SIG

HYP

arXiv cs.AI·May 27

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

OmniToM is a benchmark evaluating theory of mind in LLMs through explicit belief modeling. Built on 895 stories (22,343 annotated belief propositions), it tests extraction and labeling of mental states across 7 dimensions. Results show current LLMs struggle to transform narrative facts into actors' beliefs and shared mental states.

Benchmarks Reasoning Evals

SIG

HYP

arXiv cs.CL·May 27

Verilog-Evolve: Feedback-Driven and Skill-Evolving Verilog Generation

Verilog-Evolve is a feedback-driven framework that iteratively refines LLM-generated Verilog. The system evaluates candidates via functional simulation, Yosys synthesis, ABC timing proxy, and GEMM metrics, then evolves modular skills across tasks. Results on VerilogEval show improved functional success and downstream RTL quality.

Code generation Reinforcement learning Evals

SIG

HYP

arXiv cs.CL·May 27

MicroSpec: Accelerating Speculative Decoding with Lightweight In-Context Vocabularies

MicroSpec reduces active vocabulary by 40x (under 3k tokens) during speculative decoding without additional training. The technique exploits temporal locality in language generation and integrates asynchronous GPU memory management. End-to-end speedup of 1.12-1.32x vs EAGLE-2.

Code generation Infrastructure Benchmarks

SIG

HYP

arXiv cs.AI·May 27

What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation

Study on chain-of-thought (CoT) mechanisms at probe time. Authors show performance gains arise primarily from lexical activation and short-range token co-occurrence (2-3 tokens), not global logical derivation. Even word-shuffled rationales substantially outperform no-rationale baselines.

Reasoning Prompt engineering Papers

SIG

HYP

arXiv cs.AI·May 27

MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration

MobileExplorer accelerates on-device inference for mobile GUI agents through online exploration. The framework uses VLM reasoning time to probe UI elements in parallel, recording exploration traces as structured memory. With a two-level rollback mechanism, it reduces reasoning steps and end-to-end latency by 23% on AndroidWorld.

AI Agents Vision Reasoning

SIG

HYP

arXiv cs.LG·May 27

QAM-W: Joint 2D Codebook Quantization for LLM Weights via Hadamard Rotation and Activation-Aware Scaling

QAM-W is a 2D quantization codec for LLM weights using Hadamard rotation and activation-aware scaling. Across 5 models (1.1B–13B), the activation-aware variant at ~5.5 bpw maintains ±0.4% BF16 perplexity, matching SmoothQuant W8A8 quality with 32% fewer weight bits. 2D coding outperforms polar coding by 2–15 pp.
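A minimal sketch of the Hadamard-rotation step alone, showing why rotating a weight row before quantization can help: an orthonormal rotation preserves the norm while spreading a single outlier across all coordinates. The Sylvester construction and the toy weight row are choices made for this example; QAM-W's 2D codebook and activation-aware scaling are not reproduced here.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n must be a power of two)."""
    if n <= 0 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two")
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # scale so that H @ H.T is the identity

H = hadamard(8)

# Toy weight row with one large outlier, which is hard to quantize on a uniform grid.
w = np.zeros(8)
w[0] = 8.0

w_rot = w @ H          # rotate along the input dimension
w_back = w_rot @ H.T   # the rotation is exactly invertible

peak_before = np.abs(w).max()
peak_after = np.abs(w_rot).max()
```

Because H is orthonormal, applying the inverse rotation to the activations leaves the layer output unchanged, so the rotation only changes how the values are distributed before they are quantized.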

Fine-tuning Benchmarks Papers

SIG

HYP

arXiv cs.AI·May 27

Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

Large reasoning models (LRMs) jointly encode refusal in residual stream activations and in the chain-of-thought (CoT). On DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in 39% of cases with fixed CoT, but 70% without CoT. Regenerating CoT under steering achieves 94% success, revealing that refusal is distributed across activations and CoT.

Reasoning AI safety Alignment

SIG

HYP

Reddit r/LocalLLaMA·May 26

SkillOpt treats markdown skill files as trainable parameters with proper optimization machinery

SkillOpt treats markdown skill files as trainable parameters, optimized through bounded edits (add/delete/replace) that a frontier model proposes and that are validated against a held-out test set. Best skills converge with 1–4 accepted edits from ~920 tokens. A skill optimized on Codex transfers to Claude Code (+59.7 SpreadsheetBench) without modification.
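The propose-and-validate loop described here can be sketched as a simple accept-if-better search over line edits. Everything below is a stand-in: the edit tuple format, the fixed proposal list (in place of a frontier model), and the keyword-count `score` function (in place of a held-out evaluation) are assumptions for illustration, not SkillOpt's actual machinery.

```python
def apply_edit(lines, edit):
    """Apply one bounded edit (op, index, text) to a copy of the skill file's lines."""
    op, idx, text = edit
    new_lines = list(lines)
    if op == "add":
        new_lines.insert(idx, text)
    elif op == "delete":
        del new_lines[idx]
    elif op == "replace":
        new_lines[idx] = text
    else:
        raise ValueError(f"unknown edit op: {op}")
    return new_lines

def optimize_skill(lines, proposals, score):
    """Keep an edit only if it strictly improves the validation score."""
    best, best_score, accepted = list(lines), score(lines), 0
    for edit in proposals:
        candidate = apply_edit(best, edit)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score, accepted = candidate, candidate_score, accepted + 1
    return best, best_score, accepted

# Hypothetical validation signal: how many required keywords the skill mentions.
KEYWORDS = ("formulas", "ranges", "verify")

def score(lines):
    text = " ".join(lines).lower()
    return sum(keyword in text for keyword in KEYWORDS)

skill = ["# Skill", "Use formulas."]
proposals = [
    ("add", 2, "Check cell ranges."),
    ("delete", 0, None),
    ("replace", 1, "Use formulas and verify."),
]
best, best_score, accepted = optimize_skill(skill, proposals, score)
```

Rejecting edits that do not improve the score keeps the file close to its starting point, which fits the summary's observation that good skills need only a handful of accepted edits.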

AI Agents Prompt engineering Code generation

SIG

HYP

arXiv cs.AI·May 26

Accelerating Long-Tail Generation in Synchronous RLHF Training via Adaptive Tensor Parallelism

PAT, an adaptive tensor parallelism method, optimizes the generation stage in synchronous RLHF. It dynamically reconfigures parallelization during decoding to compensate for response-length skew. Implemented on SGLang/VeRL, PAT reduces generation latency by up to 34.6% on LLaMA3.1-8B and Qwen3-14B.

Reinforcement learning Infrastructure Benchmarks

SIG

HYP

arXiv cs.CL·May 26

Measuring the Depth of LLM Unlearning via Activation Patching

New UDS (Unlearning Depth Score) metric to evaluate whether knowledge is truly erased in LLMs. Via activation patching, UDS measures mechanistic depth of unlearning layer-by-layer. Evaluation on 150 models and 8 methods: UDS outperforms 20 existing metrics in faithfulness and robustness.

AI safety Alignment Evals

SIG

HYP

arXiv cs.AI·May 26

DRIVE: Modeling Skills at the Reasoning and Interaction Levels for Web Agents under Continual Learning

DRIVE is a dual-level skill modeling framework for web agents under continual learning. It separates experiences into reasoning skills (transferable task logic across websites) and interaction skills (executable page-specific operations). On WebArena, DRIVE achieves 52.8% task success rate, +7.3pp over skill-free baseline.

AI Agents Reasoning Papers

SIG

HYP

arXiv cs.LG·May 26

Iterative Refinement Neural Operators are Learned Fixed-Point Solvers: A Principled Approach to Spectral Bias Mitigation

IRNO (Iterative Refinement Neural Operator) enhances neural operators with an iterative refinement module using fixed-point iteration theory. A progressive spectral loss explicitly targets high-frequency errors. Results: 56% improvement on turbulent flow, error reduction to 1.48-2.04% in high frequencies on Active Matter.

Papers Benchmarks Reasoning

SIG

HYP

arXiv cs.LG·May 26

Verified SHAP: Provable Bounds for Exact Shapley Values of Neural Networks

Algorithm to compute exact bounds on SHAP values for neural networks by leveraging neural network verification. Reduces exponential complexity and scales to orders of magnitude larger search spaces than existing exact methods.

Evals Papers Reasoning

SIG

HYP

arXiv cs.LG·May 26

Knowledge Graph Modulated Deep Learning for Limited-Sample Clinical Data Analysis

Graph-in-Graph (GiG) integrates biological knowledge graphs into deep learning for clinical analysis with limited data. Tested on ~9,700 patients across 5 tasks (cancer detection, prostate diagnosis, pan-cancer classification), GiG outperforms existing methods with gains up to 49 macro-F1 points in limited-sample settings.

Papers Benchmarks RAG

SIG

HYP

arXiv cs.LG·May 26

Interdomain Attention: Beyond Token-Level Key-Value Memory

Interdomain Attention merges transformers and state space models via kernel methods: attention features are projected onto basis functions maintained by an SSM, enabling query-conditioned attention over fixed-size state. On FineWeb-Edu (125M–1.3B), outperforms softmax baselines at 1.3B on validation perplexity and commonsense tasks, with length-flat behavior up to 3.5× training context.

Reasoning Benchmarks Papers

SIG

HYP

arXiv cs.CL·May 26

An Interactive Paradigm for Deep Research

SteER is a framework for interactive deep research using LLMs. It introduces interpretable control points allowing users to correct course mid-process via cost-benefit formulation. Results: +22.80% alignment improvement vs baselines, preferred by human readers in 85%+ of pairwise judgments.

AI Agents Reasoning RAG

SIG

HYP

arXiv cs.CL·May 26

SEAL: Synergistic Co-Evolution of Agents and Learning Environments

SEAL is a closed-loop co-evolution framework for tool-use LLM agents. It collects verifiable trajectories, diagnoses turn-level failures, and uses these signals to jointly adapt the learning environment and agent policy. With 400 training samples, SEAL achieves +8.25 to +26.25 point gains across three backbones and shows positive out-of-distribution transfer.

AI Agents Reinforcement learning Reasoning

SIG

HYP

arXiv cs.CL·May 26

Found in Conversation: LLMs Teach Themselves to Close the Multi-Turn Gap

Found in Conversation (FiC) is a training framework where LLMs self-teach to close the multi-turn gap (Lost-in-Conversation). Via View-Asymmetric Self-Distillation, the model distills between single-turn (teacher) and multi-turn (student) views. Tested on Llama, Qwen, Phi, OLMo (3B-14B), FiC recovers 92-100% of single-turn performance.

Llama Qwen Fine-tuning

SIG

HYP

arXiv cs.CL·May 26

EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

EchoDistill introduces an alignment-based noisy-to-clean self-distillation framework to improve Audio LLM robustness against real-world noise. A noisy student is optimized via GRPO using a frozen clean-audio teacher as semantic reference. Results: +4.18% GSR improvement under strong noise vs strongest baseline, +3.02% Acc on Qwen-Omni.

Reinforcement learning Fine-tuning

SIG

HYP