Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning

Study of 16 language models (1.5B–72B parameters) showing representational convergence does not extend to reasoning processes. Models align more on collectively failed problems (CKA=0.897) than solved ones (CKA=0.830). Post-decision representations diverge sharply (CKA=0.274), and shared information exerts minimal causal influence (1.5–5.5% flip rate).

Papers · Reasoning · Evals
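
For readers unfamiliar with the CKA scores quoted in this entry, here is a minimal sketch of linear centered kernel alignment between two activation matrices (rows are examples, columns are features). It is an illustration only: the summary does not say which CKA variant the paper uses, and the function names are hypothetical.

```python
import math


def center_columns(matrix):
    """Subtract the per-column mean so every feature has zero mean."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - mu for v, mu in zip(row, means)] for row in matrix]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b, for row-major matrices a and b."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return numerator / denominator
```

The score lies in [0, 1] and is unchanged by rotating or uniformly rescaling either representation, which is why it is commonly used to compare layers across different models.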

arXiv cs.CL·May 25

Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals

MaR (Metacognition-as-Reward) is an RL framework improving LLM reasoning via two dimensions: metacognitive knowledge (identifying task-relevant information) and metacognitive regulation (planning the reasoning process). Tested on 22 benchmarks, Qwen3.5-9B + MaR achieves up to 7.7% gain over base model and 11.0% over vanilla DAPO, surpassing GPT-OSS-120B on average.

Reinforcement learning · Reasoning · Qwen

arXiv cs.LG·May 25

Reading Calibrated Uncertainty from Language Model Trajectories

Method to quantify uncertainty in language models by analyzing layer-wise representation trajectories. Eleven geometric features extracted from MLP updates outperform maximum softmax probability (MSP) by up to 21 AURC points, revealing where and how errors emerge across depth.

Evals · Reasoning · AI safety
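
The AURC metric mentioned in this entry (area under the risk-coverage curve) can be sketched as follows. This is a generic version under stated assumptions, not the paper's evaluation code: predictions are ranked by a confidence score (for the MSP baseline, the maximum softmax probability), and the error rate is averaged over every coverage level. The function name is hypothetical, and the paper's exact normalization may differ.

```python
def aurc(confidences, correct):
    """Average selective risk over all coverage levels (lower is better).

    confidences: one confidence score per prediction.
    correct: matching booleans, True where the prediction was right.
    """
    # Rank predictions from most to least confident.
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    errors = 0
    risks = []
    for kept, idx in enumerate(order, start=1):
        if not correct[idx]:
            errors += 1
        # Risk when only the `kept` most confident predictions are answered.
        risks.append(errors / kept)
    return sum(risks) / len(risks)
```

A confidence score that ranks wrong answers last drives the value down, so a better uncertainty estimate shows up as a lower AURC.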

arXiv cs.LG·May 25

The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models

On 1-3B models, CoT in arithmetic relies on a positional shortcut: the model simply copies the number in the final position before the answer delimiter, regardless of intermediate reasoning. This strategy accounts for 54-92 pp of accuracy on GSM8K. Replacing that number with an incorrect value collapses performance even with correct steps.

Reasoning · Evals · Benchmarks

arXiv cs.LG·May 25

When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions

Study showing explicit reasoning (CoT) benefits only specific tasks. Authors propose EDRM, a training-free framework using early-stage entropy dynamics to adaptively route to CoT or direct inference. Across 15 benchmarks and 4 LLMs, EDRM achieves 41–55% token reduction while improving accuracy up to 4.7%.

Reasoning · Evals · Benchmarks

arXiv cs.LG·May 25

Tensor Cache: Eviction-conditioned Associative Memory for Transformers

Tensor Cache introduces a two-level cache for transformers: sliding-window local attention (L1) plus fixed-size outer-product fast-weight memory (L2) storing evicted KV pairs as a matrix. A learned gate fuses outputs. Improves memory-quality tradeoff on long-context models.

Reasoning · Infrastructure · Benchmarks
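
To make the two-level idea in this entry concrete, here is a toy sketch of a sliding window that folds evicted key-value pairs into a fixed-size outer-product memory. It is an assumption-laden illustration, not the paper's architecture: the class and method names are hypothetical, the learned gate is omitted, and the memory is a plain sum of value-key outer products read back with a matrix-vector product.

```python
from collections import deque


class TwoLevelCache:
    """Toy L1 sliding window plus L2 outer-product memory (hypothetical names)."""

    def __init__(self, window, dim_k, dim_v):
        self.capacity = window
        self.window = deque()  # L1: the most recent (key, value) pairs
        # L2: a dim_v x dim_k matrix whose size never grows with context length.
        self.memory = [[0.0] * dim_k for _ in range(dim_v)]

    def append(self, key, value):
        self.window.append((key, value))
        if len(self.window) > self.capacity:
            old_key, old_value = self.window.popleft()
            # Fold the evicted pair into memory as the outer product value * key^T.
            for i, v in enumerate(old_value):
                for j, k in enumerate(old_key):
                    self.memory[i][j] += v * k

    def read_memory(self, query):
        """Retrieve from L2 by multiplying the memory matrix with the query."""
        return [sum(m * q for m, q in zip(row, query)) for row in self.memory]
```

With orthonormal keys, querying the memory with an evicted key returns its value exactly; with correlated keys the retrieval becomes a blend, which is the usual trade-off of fast-weight memories.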

arXiv cs.AI·May 25

BOHM: Zero-Cost Hierarchical Attribution for Compound AI Systems

BOHM is a hierarchical attribution method for compound AI systems that extracts component contributions directly from routing weights without evaluating arbitrary subsets. It has zero marginal cost and is compatible with opaque third-party APIs. On 18 LLMs (880 LiveCodeBench problems), BOHM reaches Kendall tau=0.928, versus tau=0.980 for SHAP, which needs 9,000x more evaluations.

AI Agents · Evals · Reasoning
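
The Kendall tau figures in this entry measure how closely an attribution ranking agrees with a reference ranking. A minimal sketch of the tau-a variant is below; the summary does not state which variant or reference ranking the paper uses, so treat the function as a generic illustration with a hypothetical name.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a: (concordant - discordant) / total pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        # A positive product means both score lists order items i and j the same way.
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / total_pairs
```

A value of 1.0 means the two rankings agree on every pair of components, and -1.0 means they are exact reverses.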

arXiv cs.LG·May 25

Learned Relay Representations for Forward-Thinking Discrete Diffusion Models

Learned Relay Representations (Relay) enables Masked Diffusion Models to propagate latent information across denoising steps via a differentiable per-token channel trained with truncated BPTT. Applied to Fast-dLLM v2, it outperforms supervised finetuning on coding tasks and reduces inference latency by 32%.

Code generation · Reasoning · Papers

arXiv cs.AI·May 25

LFRAG: Layout-oriented Fine-grained Retrieval-Augmented Generation on Multimodal Document Understanding

LFRAG introduces a multimodal RAG system that retrieves at block level instead of page level. A semantic-layout fusion encoder integrates local semantics with global context. On the LFDocQA benchmark, LFRAG improves answer accuracy by 7.20% and reduces token consumption by 73.07%.

RAG · Vision · Benchmarks

arXiv cs.LG·May 25

Anytime Training with Schedule-Free Spectral Optimization

SF-NorMuon, a schedule-free spectral optimizer, matches or exceeds tuned AdamW on 125M and 772M parameter language models without requiring a predefined learning-rate schedule. The paper also proves a stationarity guarantee and identifies weight decay as essential for long-horizon stability.

Reinforcement learning · Benchmarks · Papers

arXiv cs.LG·May 25

GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

GEMQ introduces global expert-level mixed-precision quantization for MoE LLMs. The method uses a global linear-programming formulation to estimate expert importance and fine-tunes routers to adapt routing to the quantized experts. Reported results: significant memory reduction and inference acceleration with minimal accuracy loss.

arXiv cs.LG·May 25

Test-Time Training Undermines Safety Guardrails

An arXiv study finds that Test-Time Training (TTT) creates security vulnerabilities. The researchers identify three threat models that enable safety-filter bypass; with LoRA, reported attack success rates reach 95% and 93%. The vulnerabilities transfer to production fine-tuning APIs.

AI safety · Alignment · Fine-tuning

arXiv cs.AI·May 25

One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents

PCSP, a single RL policy conditioned on frozen LLM embeddings, controls 300 NPCs with distinct personas. Achieves 17x above-chance zero-shot identification, ρ=0.73 semantic-behavioral alignment, 22x faster than LLM-as-policy baseline. Deployed in UE5 on 64 agents with low failure rate.

Reinforcement learning · AI Agents · Multi-agent

arXiv cs.LG·May 25

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

ThriftAttention uses mixed precision (FP16/FP4) for long-context attention on Blackwell GPUs. By computing a selected 5% of query-key blocks, those identified as critical, in FP16 and the remaining blocks in FP4, the method recovers 89.1% of FP16 performance while maintaining FP4 efficiency. Code released.

Benchmarks · Infrastructure · Reasoning

arXiv cs.AI·May 25

MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection

MemAudit is a post-hoc auditing framework for detecting poisoned memories in LLM agents. It combines a counterfactual influence score with a memory consistency graph to identify malicious records injected through normal interactions. Evaluated against the MINJA attack, it reduces attack success rates from 70% to 0% in QA and from 83.3% to 0% in reasoning tasks.

AI Agents · AI safety · RAG

arXiv cs.CL·May 25

Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving

Fast-dDrive is a block-diffusion VLA (Vision-Language-Action) model for autonomous driving. It combines bidirectional refinement within semantic units with strict causal ordering, handles structured JSON outputs, and achieves 12× throughput speedup with SGLang. On nuScenes, L2 error reduced to 0.32m (22% improvement), SOTA on WOD-E2E.

Vision · Code generation · Reasoning

arXiv cs.CL·May 25

Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection

Sparse autoencoders (SAEs) trained on multilingual data improve language control in LLMs. Authors propose a principled layer-selection rule based on multilingual alignment and language separability, validated on LLaMA-3.1-8B and Gemma-2-9B for machine translation and cross-lingual summarization.

Benchmarks

arXiv cs.AI·May 25

From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

Systematic study of the full lifecycle of model-generated agent skills (extraction, consumption, and transfer), with an evaluation framework spanning 5 agentic task domains. Findings: skills are beneficial on average but exhibit non-trivial negative transfer, and extractor/consumer performance is independent of model scale. The authors introduce a meta-skill to improve skill quality and reduce negative transfer.

AI Agents · Multi-agent · Reinforcement learning

arXiv cs.AI·May 25

Agentic Proving for Program Verification

Claude Code evaluated on CLEVER (Lean 4 benchmark) generates valid specifications for 98.8% of problems, certifies 87.5% of implementations, and achieves 98.1% success on end-to-end program generation and verification. Study reveals mismatch between current benchmark difficulty and modern agentic prover capabilities.

Claude Code · AI Agents · Reasoning

arXiv cs.CL·May 25

A Reproducible Universal Dependencies-Style Pipeline for Katharevousa Greek Parliamentary Text

Reproducible pipeline for building a Universal Dependencies parsing resource for Katharevousa Greek (post-junta parliamentary texts). Dataset of 1,697 annotated sentences, comparison of 6 architectures (spaCy, Stanza, XLM-R, mBERT, etc.). Best model (XLM-R): 0.5162 LAS vs 0.4183 baseline. Code and annotations released open-access.

Papers · Benchmarks · Open source

Reddit r/MachineLearning·May 25

Sponsio: Deterministic Contract Layer for LLM Agents [P]

Sponsio introduces a deterministic contract layer for LLM agents in production. Operators declare invariants in YAML, which are compiled to an AST and evaluated on every tool call. On the ODCV-Bench benchmark (12 frontier LLMs × 80 trajectories), unguarded models violate invariants in 11.5%-66.7% of runs; with Sponsio, 95.6% of misalignment is avoided on average.

AI Agents · AI safety · Tools

The Decoder·May 24

Researchers let Claude Code discover AI scaling algorithms that humans probably wouldn't have designed

Researchers from UMD, Google, and Meta use AutoTTS to let Claude Code independently discover control algorithms for AI reasoning. The discovered algorithm reduces compute by 70% versus standard self-consistency while matching accuracy. The search cost $40 and took 160 minutes.

Claude Code · AI Agents · Reasoning

Reddit r/LocalLLaMA·May 23

Command A+ (218B MoE) running on Apple Silicon — MLX port, PR open

Cohere released Command A+ (218B MoE, 25B active) on the 20th. An MLX port for Apple Silicon is in an open PR: a cohere2_moe implementation with sigmoid routing, 128 experts (top-8), and a 3:1 sliding window. Validated on an M3 Max (128GB): 22.9 tok/s generation, 57.6 tok/s prompt in BF16→Q8.

Open source · Infrastructure · Code generation

Reddit r/LocalLLaMA·May 23

Benchmarked Needle 26M vs Qwen3-0.6B on CPU function calling, 50 queries across 5 difficulty tiers. The 23x smaller model wins on accuracy and is 4.4x faster.

CPU benchmark of Needle (26M) vs Qwen3-0.6B on function calling: 50 queries across 5 difficulty tiers. Needle wins on accuracy (72% vs 56% tool_match) and latency (10.9s vs 47.9s). Needle fails on tool selection, Qwen3 on tag emission. Qwen3 dominates on multilingual queries (Hindi, French).

Qwen · Benchmarks · Code generation

Reddit r/LocalLLaMA·May 23

Apex-Testing: real-world, real repos, agentic coding benchmark (Update)

Apex-Testing, an agentic coding benchmark built on 65-70 real GitHub repos, has been updated and is now about 95% complete for recent models. Its 70 tasks across 8 categories test AI agents on production codebases. An ELO leaderboard, cost/time metrics, and model comparisons are available. Runs for Qwen 3.7 Max, Deepseek v4, and other models are still in progress.

AI Agents · Code generation · Benchmarks

Reddit r/LocalLLaMA·May 23

I added native MTP to exo for Qwen3.6 MLX models; here are the exactness and speed results

Contribution to exo: native multi-token prediction (MTP) support for Qwen 3.6 MLX models. Benchmarks on 27B (2x speedup at K=2/K=3) and 35B-A3B (1.16x at K=1). Exactness verified: identical token IDs to greedy path, speculative probability-ratio acceptance in sampling.

Qwen · Open source · Code generation
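
The "probability-ratio acceptance" mentioned in this entry most likely refers to the standard speculative-sampling rule, in which a drafted token is kept with probability min(1, p_target / p_draft). The sketch below shows only that rule; it is not taken from the exo patch, and the function and parameter names are hypothetical.

```python
def accept_draft_token(p_target, p_draft, u):
    """Standard speculative-sampling acceptance test.

    p_target: probability the main model assigns to the drafted token.
    p_draft: probability the draft (MTP) head assigned to that token.
    u: a uniform random draw in [0, 1), e.g. from random.random().
    """
    if p_draft <= 0.0:
        raise ValueError("draft probability must be positive")
    # Always accept when the main model likes the token at least as much as the draft.
    return u < min(1.0, p_target / p_draft)
```

On rejection, the standard scheme resamples from the normalized positive part of the difference between the target and draft distributions, which keeps the output distribution identical to sampling from the main model alone.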

Reddit r/LocalLLaMA·May 22

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

CODA is a GPU kernel abstraction that rewrites Transformer blocks as GEMM-epilogue programs. It fuses memory-bound operations (normalization, activations, residuals) with GEMM output before writing to memory, reducing data movement. Covers nearly all non-attention computation in forward/backward pass.

Infrastructure · Benchmarks · Code generation

Reddit r/LocalLLaMA·May 22

ztok — a fast multithreaded tokenizer in Zig that loads tiktoken / HF / SentencePiece and is 2–5× faster

ztok is a multithreaded tokenizer library in Zig, 2–5× faster than tiktoken/HF/SentencePiece. Loads tiktoken, HF tokenizer.json, SentencePiece, TokenMonster, Mistral Tekken formats. Bit-identical to reference implementations, 8 language bindings, optimized for RAG and dataset tokenization.

Tools · RAG · Open source

arXiv cs.AI·May 22

PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

PlanningBench is a framework for generating scalable and verifiable planning data. It abstracts 30+ task types and difficulty factors from real scenarios, then synthesizes problems with adaptive control and automatic verification. RL training on verified data improves performance on unseen benchmarks.

Benchmarks · Reasoning · Reinforcement learning

arXiv cs.LG·May 22

Embedding-Based Federated Learning with Runtime Governance for Iron Deficiency Prediction

Real-world deployment of a federated learning pipeline for iron deficiency prediction from full blood count data. Uses DeepCBC (a frozen haematology foundation model) plus FedMAP (personalised aggregation). Tested across two clinical sites (AUMC, NHSBT) with non-IID data. FedMAP improves ROC-AUC from 0.947 to 0.959 (AUMC) and from 0.856 to 0.867 (NHSBT) versus local-only training.

Embeddings · Benchmarks

arXiv cs.CL·May 22

TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation

TransitLM: 13M+ transit route records from 4 Chinese cities (120k stations, 13.6k lines) for training LLMs to generate routes without map infrastructure. Models learn to ground GPS coordinates to stations and produce structurally valid routes without explicit mapping.

Benchmarks · Papers · Code generation

arXiv cs.LG·May 22

Provable Joint Decontamination for Benchmarking Multiple Large Language Models

JECS (Joint Envelope Conformal Selection) decontaminates LLM evaluation benchmarks by controlling the global contamination rate (GCR) across multiple models. It aggregates per-model conformal p-values and applies an adaptive Benjamini-Hochberg procedure to select a benchmark with provable fairness guarantees and higher power than baseline approaches.

Benchmarks · Evals · AI safety
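
As background for this entry, the plain (non-adaptive) Benjamini-Hochberg step-up procedure is sketched below. The adaptive variant and the p-value aggregation that JECS uses are not reproduced, and the summary does not specify how selections map to benchmark items, so the function name and interface are assumptions.

```python
def benjamini_hochberg(p_values, alpha):
    """Return the sorted indices selected by the BH step-up procedure at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p-value clears alpha * rank / m.
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank
    # Everything at or below that rank is selected, even if it missed its own threshold.
    return sorted(order[:cutoff])
```

For example, with p-values 0.02, 0.03, 0.035, and 0.5 at alpha=0.05, only the third-smallest value clears its own threshold (0.0375), yet the step-up rule selects the first three items.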

arXiv cs.LG·May 22

Don't Collapse Your Features: Why CenterLoss Hurts OOD Detection and Multi-Scale Mahalanobis Wins

GOEN (Geometry-Optimised Epistemic Network) combines multi-scale features, L2 normalisation, Mahalanobis distance, and calibration to detect out-of-distribution inputs. Key finding: CenterLoss degrades OOD detection (AUROC 0.9366 vs 0.9483 without), despite improving classification accuracy. GOEN-NoCenterLoss achieves 0.9483 AUROC on CIFAR-10, outperforming deep ensembles (0.8827), KNN (0.8967), and ODIN (0.8870).

AI safety · Evals · Benchmarks

arXiv cs.LG·May 22

From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment

P2D, an LLM alignment framework, couples data selection with parameter-efficient fine-tuning by identifying task-critical attention heads. It mines high-affinity data and prunes 90% of parameters using these heads as a functional filter. Result: +8.3pp performance gain and 7.0x end-to-end speedup using only 10% of data and 10% of heads.

Fine-tuning · Reasoning · Alignment

arXiv cs.LG·May 22

When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

Authors show teacher-token reliability in reasoning self-distillation depends on position within trajectory, not local entropy. They propose Position-Weighted OPSD (PW-OPSD), applying increasing position weights to token supervision. On Qwen3-4B, AIME 2024/2025 improve by +1.0/+1.1 points; validation on DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think confirms gains.

Reasoning · Fine-tuning · Benchmarks

arXiv cs.LG·May 22

EntmaxKV: Support-Aware Decoding for Entmax Attention

EntmaxKV introduces a sparse decoding framework for entmax attention, exploiting exact zeros produced by entmax versus softmax's dense tails. Combines query-aware page scoring, support-aware candidate selection, and sparse entmax attention. Achieves 3.36× speedup (softmax) and 5.43× (entmax) on 1M context using reduced KV cache fraction.

Reasoning · Benchmarks · Infrastructure

arXiv cs.LG·May 22

X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

X-Token introduces cross-tokenizer knowledge distillation via two complementary loss formulations (P-KL and H-KL) that use a projection matrix W. On Llama-3.2-1B, the method outperforms GOLD by +3.82 points with Qwen3-4B as teacher and by +0.5 with Phi-4-mini; a two-teacher setup (Phi-4-mini + Llama-3B) gains +1.3 points.

Fine-tuning · Benchmarks · Llama

arXiv cs.AI·May 22

OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

OSCToM combines RL and surrogate models to generate observer-agent conflicts in Theory of Mind tasks. On FANToM (information-asymmetric benchmark), OSCToM-8B reaches 76% accuracy vs 0.2% for ExploreToM. Data synthesis is 6x more efficient.

Reasoning · Reinforcement learning · Benchmarks

arXiv cs.AI·May 22

Open-World Evaluations for Measuring Frontier AI Capabilities

New evaluation approach for frontier AI: 'open-world evaluations' complement benchmarks by testing complex real-world tasks over long horizons. CRUX project demonstrates an AI agent developing and publishing an iOS app to Apple App Store with only one avoidable manual intervention, revealing emerging capabilities.

Evals · AI Agents · Benchmarks

arXiv cs.AI·May 22

AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

AgentAtlas proposes multidimensional evaluation of LLM agents beyond simple success rates. The study introduces a 6-state control taxonomy, a 9-category error taxonomy, and audits 15 existing benchmarks. On 8 models (4 closed, 4 open-weight), removing explicit labels drops accuracy by 14-40 pp, revealing strong prompt dependency.

AI Agents · Benchmarks · Evals
