7679 articles

KVarN: Variance-Normalized KV-Cache Quantization [R]

KVarN is a KV-cache quantization method that combines Hadamard rotations with variance normalization on the K and V matrices. It achieves 3-4x compression with a 0-1% accuracy drop on AIME24 and a speedup over the fp16 baseline in vLLM. It is optimized for decode-heavy settings (reasoning, code-gen, agents).

Code generation Reasoning AI Agents
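The two ingredients named in the KVarN summary, a Hadamard rotation followed by variance normalization, can be sketched on a single vector as below. This is only an illustration of the general idea, not KVarN's actual algorithm: the function names, the zero-mean assumption, the 4-bit symmetric grid, and the 3-standard-deviation clipping range are all assumptions made for the example.

```python
import math

def hadamard(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def quantize(vec, bits=4):
    """Rotate, scale to unit variance (zero mean assumed), round to a signed grid."""
    rotated = hadamard(vec)
    std = math.sqrt(sum(x * x for x in rotated) / len(rotated)) or 1.0
    qmax = 2 ** (bits - 1) - 1
    clip = 3.0  # hypothetical clipping range, in standard deviations
    step = clip / qmax
    codes = [max(-qmax, min(qmax, round(x / std / step))) for x in rotated]
    return codes, std, step

def dequantize(codes, std, step):
    """Undo the scaling, then rotate back (the orthonormal transform is self-inverse)."""
    return hadamard([c * step * std for c in codes])
```

The rotation spreads a single large outlier across all coordinates, so one shared scale fits the whole vector better than it would before rotation.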

Reddit r/LocalLLaMA·Jun 4

nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 · Hugging Face

NVIDIA releases Nemotron-3-Ultra-550B, a frontier-scale model with 550B parameters (55B active) that uses a LatentMoE hybrid architecture combining Mamba-2, MoE, and Attention layers. It supports up to 1M tokens of context and a configurable reasoning mode, and is optimized for complex agents and high-stakes RAG. OpenMDW license, 11 languages.

Open source AI Agents Reasoning

arXiv cs.AI·Jun 4

MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

MapAgent is a multi-agent architecture for city-scale lane-level map generation. The system couples visual perception, specification verification, and deterministic editing via a Judge-Planner-Worker loop. Integrated into Baidu Maps for 360+ cities, it achieves 95% production automation.

AI Agents Multi-agent Vision

arXiv cs.CL·Jun 4

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

SparDA introduces a decoupled sparse attention architecture for efficient long-context LLM inference. A fourth per-layer projection (Forecast) predicts KV blocks needed by the next layer, overlapping CPU-to-GPU prefetch with current execution. On 8B models, SparDA achieves 1.25× prefill speedup and 1.7× decode speedup, reaching up to 5.3× higher decode throughput.

Reasoning Infrastructure Benchmarks

arXiv cs.CL·Jun 3

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

A systematic audit of the FOLIO and MALLS benchmarks finds error rates of 39% and 36%, respectively, in their FOL formalizations. The authors release corrected annotations and an LLM-based framework to guide manual relabeling, reaching 90% dataset accuracy by reviewing <24% of instances versus >70% for unguided review. Testing on Gemma 31B, Qwen3-30B, and GPT-4o-mini shows accuracy gains of +9 to +22 percentage points.

Benchmarks Evals Reasoning

arXiv cs.AI·Jun 3

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

MedCUA-Bench is an interactive benchmark for evaluating computer-use agents in clinical interfaces. It covers 18 medical scenarios across 10 domains with authentic interfaces. The best closed-source models reach 54.2% strict success and open-source agents average 2.5%, exposing a major gap relative to the reliability required.

AI Agents Benchmarks AI safety

arXiv cs.CL·Jun 3

The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment

Geometric study showing inter-LLM agreement on subjective evaluations does not reflect human alignment. Across 41 LLM judges and 8 Indic languages, models use 30-50% of human score range, with evaluation axis nearly orthogonal to humans (87-89° vs 78-81°). LLM-LLM agreement (r≈0.35) exceeds LLM-human (r≈0.27-0.32). Only post-hoc calibration improves all rubrics.

Evals Alignment Benchmarks

arXiv cs.AI·Jun 3

DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

DeskCraft is a desktop GUI benchmark for agents on long-horizon professional workflows (>50 steps) in design, video, audio, and 3D with human-agent collaboration. 18 agents tested on 538 tasks: GPT-5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks. Reveals persistent failures in proactive clarification and long-horizon workflow delivery.

AI Agents Benchmarks Evals

Reddit r/LocalLLaMA·Jun 2

Tiny LLM Benchmark: Jetson Orin Nano Super 8GB - Four Power Modes × Eight Models

Comprehensive benchmark of 8 tiny LLMs (135M–1B) on Jetson Orin Nano Super 8GB with llama.cpp CUDA across 4 power modes (7W–MAXN). 25W mode optimal: SmolLM2-135M achieves 165 tok/s and 22.6 tok/J; LFM2.5-1.2B best in ~1B class (54.1 tok/s). 384 benchmark cells, raw datasets published.

Benchmarks Open source Infrastructure
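The two SmolLM2-135M figures in the Jetson summary can be related by simple arithmetic, assuming tok/J is defined as throughput divided by average power draw (an assumption; the summary does not state how energy was measured). Under that reading, the implied average power follows directly. The function name is hypothetical.

```python
def implied_power_watts(tokens_per_second, tokens_per_joule):
    """Average power implied by throughput and efficiency: W = (tok/s) / (tok/J)."""
    if tokens_per_joule <= 0:
        raise ValueError("tokens_per_joule must be positive")
    return tokens_per_second / tokens_per_joule

# Figures quoted in the summary for SmolLM2-135M in 25W mode.
power = implied_power_watts(165, 22.6)
print(f"implied average draw: {power:.2f} W")
```

If the assumption holds, the implied draw is well below the 25 W mode cap, which would be consistent with a very small model not saturating the board.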

arXiv cs.CL·Jun 2

A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models

Multi-domain red teaming framework evaluating 11 LLMs across 690 clinical scenarios. Results: substantial variance (scores 0.791–0.984), safety-critical failures masked by aggregate accuracy, 10-20% error amplification on equity tasks. Hybrid evaluation (automated + human validation) essential.

AI safety Evals Benchmarks

arXiv cs.CL·Jun 2

CSRP: Chain-of-Thought Reasoning for Chinese Text Correction via Reinforcement Learning with Efficiency-Aware Rewards

CSRP, a three-stage framework for Chinese grammatical error correction, combines continual pre-training (5.9M samples), Chain-of-Thought fine-tuning, and policy optimization with efficiency-aware rewards. Achieves 50.99 F₀.₅ on NACGEC and outperforms GPT-4 on spelling correction (59.61 F1).

Reinforcement learning Reasoning Fine-tuning

arXiv cs.AI·Jun 2

MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution

Delayed per-step reward attribution method for training LLM agents in multi-agent strategic interaction. An 8-billion-parameter open-source model trained with this approach matched or surpassed GPT-5 and won both Open and Efficient tracks at MindGames Arena benchmark (NeurIPS 2025).

AI Agents Multi-agent Reinforcement learning

arXiv cs.LG·Jun 2

LithoGRPO: Fast Inverse Lithography via GRPO Reinforced Flow Matching

LithoGRPO combines flow matching with GRPO-based reinforcement learning to optimize lithography masks in semiconductor manufacturing. The framework integrates explicit physics-based reward functions and proposes a fast shot-counting algorithm achieving 130x speedup. State-of-the-art results over optimization and learning-based methods.

Reinforcement learning Papers Benchmarks

arXiv cs.LG·Jun 2

BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization

BitsMoE introduces spectral-energy-guided bit allocation for MoE LLM quantization. Using SVD, it keeps the shared basis unquantized and quantizes the expert-specific factors at fine granularity via integer linear programming. On Qwen3-30B at 2-bit, it improves accuracy by 27.83 percentage points and increases decoding speed 1.76× over GPTQ.

Benchmarks Open source

Reddit r/LocalLLaMA·Jun 1

ICYM: llama.cpp b9455 --SM Tensor KV Cache Fix is MERGED

llama.cpp b9455 merges a major fix for KV cache quantization in tensor mode on multi-GPU. The solution extends the meta backend to properly handle tensor flattening without losing shape information, avoiding changes to compute graphs.

Llama Open source Infrastructure

Reddit r/LocalLLaMA·Jun 1

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100

mistral.rs v0.8.2 achieves up to 2.8x faster CUDA inference than llama.cpp on Gemma 4 (dense and MoE) across GB10, B200, and H100. Reproducible results published with Q4K and eQ8_0 support, includes OpenAI-compatible server.

Mistral Benchmarks Code generation

arXiv cs.LG·Jun 1

VeriGate: Verifier-Gated Step-Level Supervision for GRPO

VeriGate extends GRPO by combining verifier rewards with step-level supervision. The method uses a Process Reward Model (PRM) to assign fine-grained credit to tokens, avoiding gradient collapse when all trajectories receive identical rewards. On MATH with Qwen2.5-Instruct (1.5B/7B), VeriGate improves accuracy by ~20% and ~12% respectively.

Reasoning Reinforcement learning Papers
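The "gradient collapse" mentioned in the VeriGate summary follows from how group-relative advantages are usually computed in GRPO: each reward is centered on the group mean and divided by the group standard deviation. The sketch below shows only that failure case, not VeriGate's step-level fix; the function name and the `eps` stabilizer are assumptions for the example.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: center on the group mean, scale by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Mixed outcomes give a usable learning signal.
print(group_advantages([1, 0, 0, 1]))

# Identical rewards give all-zero advantages, so the policy gradient vanishes.
print(group_advantages([1, 1, 1, 1]))
```

When every trajectory in a group is scored the same, a sequence-level reward alone carries no information about which tokens helped, which is the gap that step-level credit from a PRM is meant to fill.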

arXiv cs.CL·Jun 1

Eywa: Provenance-Grounded Long-Term Memory for AI Agents

Eywa is a provenance-grounded memory architecture for persistent AI agents, storing immutable source evidence before deriving facts and validating memories against typed signals. Retrieval uses a deterministic multi-route read path with zero LLM calls. Results: 90.19% judge accuracy on LoCoMo C1-C4, 88.2% on LongMemEval-S, 81.45% mean nugget score on BEAM.

AI Agents Benchmarks Papers

arXiv cs.LG·Jun 1

When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception

Multi-model study (Pythia-1.4B, Gemma-2, Qwen2.5-7B, Llama-3.1-8B) on linear representations of synthetic dishonesty. Linear probes detect deception with AUC ≥0.99 as early as layers 1-3. Dishonesty representations consolidate progressively in deeper layers, with implications for activation-based monitoring.

Papers AI safety Alignment

arXiv cs.AI·Jun 1

Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation

GLIDE is an open-source Python library unifying prediction-powered inference methods (PPI++, Stratified PPI, Predict-Then-Debias) for evaluating agentic systems. It combines human annotations and LLM judgments into unbiased estimates with valid confidence intervals, reducing annotation costs while maintaining precision.

AI Agents Evals Open source
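The basic prediction-powered estimate of a mean, the idea underlying the methods named in the GLIDE summary, can be sketched in a few lines: average the LLM judge's scores on a large unlabeled pool, then correct by the average human-minus-judge gap on a small labeled set. This is not GLIDE's API; the function and argument names are hypothetical, and the interval is a plain normal approximation.

```python
import math
import statistics

def ppi_mean(judge_unlabeled, judge_labeled, human_labeled, z=1.96):
    """Basic prediction-powered mean estimate with a normal-approximation interval."""
    if len(judge_labeled) != len(human_labeled):
        raise ValueError("labeled judge and human scores must be paired")
    # Rectifier: how far the judge is from the human label, on average.
    gaps = [h - j for h, j in zip(human_labeled, judge_labeled)]
    estimate = statistics.fmean(judge_unlabeled) + statistics.fmean(gaps)
    variance = (statistics.variance(judge_unlabeled) / len(judge_unlabeled)
                + statistics.variance(gaps) / len(gaps))
    half_width = z * math.sqrt(variance)
    return estimate, (estimate - half_width, estimate + half_width)

# Toy pass/fail scores: the judge is slightly too lenient on the labeled set.
estimate, interval = ppi_mean(
    judge_unlabeled=[1, 0, 1, 1, 0, 1, 1, 1],
    judge_labeled=[1, 1, 1, 0],
    human_labeled=[1, 0, 1, 0],
)
print(estimate, interval)
```

The correction term removes the judge's average bias, and the interval narrows as the unlabeled pool grows, provided the judge's errors are not too noisy.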

arXiv cs.CL·Jun 1

Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs

Researchers reveal that statistical watermarks in LLMs are vulnerable to linear ensembles. Averaging probability distributions across 3-5 models cancels out watermark perturbations. WASH (Watermark Attenuation via Statistical Hybridisation) defeats detection across 6 watermarking schemes, reducing z-scores from 5-300 to <2 (threshold: 4), while improving output quality by 27.5%.

AI safety Alignment Papers
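The averaging step described in the WASH summary can be illustrated with a toy next-token distribution. The sketch assumes each model's watermark boosts a different token, which is an assumption made for the example and not a detail from the paper; the function names are hypothetical.

```python
def average_distributions(distributions):
    """Linear ensemble: uniform average of next-token probability vectors."""
    vocab = len(distributions[0])
    if any(len(d) != vocab for d in distributions):
        raise ValueError("all distributions must share one vocabulary")
    k = len(distributions)
    return [sum(d[t] for d in distributions) / k for t in range(vocab)]

def max_shift(dist, reference):
    """Largest per-token deviation from the reference distribution."""
    return max(abs(p - q) for p, q in zip(dist, reference))

base = [0.25, 0.25, 0.25, 0.25]   # unwatermarked distribution
models = [                        # each watermark favors a different token
    [0.40, 0.20, 0.20, 0.20],
    [0.20, 0.40, 0.20, 0.20],
    [0.20, 0.20, 0.40, 0.20],
]
mixed = average_distributions(models)
print(max_shift(mixed, base), [max_shift(m, base) for m in models])
```

Because the perturbations point in different directions, the ensemble sits closer to the unwatermarked distribution than any single model does, which is what weakens a detector's statistic.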

arXiv cs.LG·Jun 1

Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents

A new counterfactual evaluation metric (CSS) reveals that six frontier models ranked similarly on traditional coverage-based metrics rank in nearly opposite order when assessed on their ability to update clinical recommendations in response to oncology case mutations. All models fail on surgery-status interventions, a safety blind spot invisible to coverage metrics.

Benchmarks Evals AI Agents

arXiv cs.LG·Jun 1

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

LongDS-Bench evaluates AI agents' ability to maintain analytical context over long horizons. The benchmark contains 68 multi-turn data analysis tasks (2,225 turns) from real Kaggle notebooks. Best models reach only 48.45% accuracy, with a 47-point performance drop from early to late turns. Long-horizon errors account for 52–69% of failures.

AI Agents Benchmarks Evals

Reddit r/LocalLLaMA·May 31

I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python

NVIDIA Parakeet speech-to-text ported to C++/ggml without Python or PyTorch. Byte-for-byte identical output to NeMo, up to 5x faster on GPU for larger models, 600x realtime on audio clips. Quantized GGUFs (f16, q8_0, q6_k, q5_k, q4_k), flat C API, integrated in LocalAI with OpenAI-compatible endpoint.

Voice Open source Tools

Reddit r/LocalLLaMA·May 31

Flash Attention for llama.cpp on RDNA3: 47% less KV VRAM than Vulkan f16 K, KLD almost lossless on F16 K / q4_0 V. Part 1.

Flash Attention optimization for llama.cpp on RDNA3 GPUs: 47% less KV-cache VRAM than Vulkan with f16 K. Packs four 8-bit K-values into native sudot4 instructions without lossy quantization. At 128k context with MTP draft: 21.76 GiB vs 23.18 GiB (1.42 GiB savings). Quality preserved: mean KLD 0.00455 (q4_0 V), 97.06% identical top tokens.

Llama Code generation Benchmarks

Reddit r/MachineLearning·May 29

Building a monokernel for LLM inference on AMD MI300X - up to 3,300 output tokens/s per request [P]

Optimized monokernel for LLM inference on AMD MI300X: 3,300 output tokens/s per request (batch 1, no speculative decoding). Architecture mapped to GPU physical topology. Initial support for 2B model, frontier MoE planned.

Infrastructure Code generation Benchmarks

Reddit r/MachineLearning·May 29

Making LLMs tell you how confident they really are through probe-targeted fine tuning.[R]

Research on probe-targeted fine-tuning (LoRA) for verbal confidence calibration in LLMs. Models internally detect correct answers (0.76–0.88 AUROC) but output 99% confidence uniformly. Fine-tuning across 8 models (7B–70B) with causal activation patching (ρ=0.976). Code and pre-registration available.

Fine-tuning Reasoning Alignment

arXiv cs.AI·May 29

BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents

BenchTrace is a benchmark for evaluating self-evolution ability in LLM agents. Built on 1,821 annotated episodes across six tasks, it measures reflection quality and tests whether agents avoid past failures. Experiments on Qwen3-32B and GPT-4.1 show a <30% pass rate on reflection evaluation; agents forget early lessons and fail to generalize reflections.

AI Agents Benchmarks Reasoning

arXiv cs.LG·May 29

Sequential Physics-Constrained Neural Operator Forward Modeling for the $\textit{Norne}$ Reservoir System

Mathematical framework for surrogate modeling of oil reservoirs (Norne, 46×112×22 grid) using Fourier Neural Operators (FNO) and physics-informed variant (PINO). Empirical validation: R²>0.99 (oil), R²>0.90 (gas), R²≈0.80 (pressure) over 3298 days. 10⁴× speedup vs OPM simulator, 1000-member ensemble in <1 min on B200 GPU.

Benchmarks Papers

arXiv cs.CL·May 29

MechELK: A Mechanistic Interpretability Framework for Eliciting Latent Knowledge in Large Language Models

MechELK is a mechanistic interpretability framework for extracting latent knowledge from LLMs. Through three stages (localization via SAE, verification by causal probing, elicitation via representation engineering), it achieves 84.7% accuracy on TruthfulQA, outperforming CCS by 6.2%, and identifies 78.3% of hidden knowledge when the model output is incorrect.

Reasoning AI safety Alignment

arXiv cs.CL·May 29

Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation

Comprehensive evaluation of 14 open-source safety guard models on 79,331 samples across 8 NIST AI Risk Framework categories. Qwen Guard (4B) achieves highest recall (83.97%), outperforming Llama Guard (12B) and GPT-OSS Safeguard (20B). Model size does not correlate with safety detection performance.

Benchmarks AI safety Open source

arXiv cs.LG·May 28

Learn from your own latents and not from tokens: A sample-complexity theory

Theoretical paper on the sample complexity of models that predict their own latent representations (data2vec, JEPA). Proves that latent prediction has sample complexity constant in the depth L, whereas token prediction is exponential in L. Validated on probabilistic grammars and neural networks.

Papers Reasoning Evals

arXiv cs.AI·May 28

Laguna M.1/XS.2 Technical Report

Laguna M.1 (225.8B parameters, 23.4B activated) and Laguna XS.2 (33.4B total, 3B activated) are two MoE foundation models trained end-to-end for agentic coding. Competitive on SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0. XS.2 released under Apache 2.0.

AI Agents Code generation Benchmarks

arXiv cs.LG·May 28

Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression

HQMQ, a calibration-free KV cache compression method for LLMs, quantizes each 4-element chunk as a Hurwitz quaternion. Tested on Mistral-7B, Llama-3-8B, Qwen2.5/3-8B, and gpt-oss-20b: matches fp16 quality at ~5 bits, achieves up to 5.05× compression (Llama-3-70B: 43 GB → 8.5 GB), outperforms naive int4 by 3–1900×.

Benchmarks Infrastructure Papers

arXiv cs.LG·May 28

A Simple State Space Model Excels at Multivariate Time Series Classification

Systematic study comparing state space models (SSM) for time series classification. S4D outperforms Mamba variants in accuracy and efficiency. Authors introduce MS4 and MS4N, lightweight S4D variants with linear input projection and channel-mixing. Evaluation on 59 datasets (MONSTER, UEA): MS4N matches models 10× larger in parameters.

Benchmarks Papers Reasoning

arXiv cs.CL·May 27

Memory Architectures for Multi-Turn Text-to-SQL: A Benchmark and Empirical Study

EnterpriseMem-Bench, a multi-turn Text-to-SQL benchmark with 1,400 turns across 300 sessions, evaluates GPT-5 mini, GPT-5.2, Claude Sonnet 4.5/4.6, and Opus 4.6. Key findings: without memory, accuracy collapses by Turn 3; working memory dominates complex architectures; Sonnet 4.6 regresses 17-33pp on SEC EDGAR vs Sonnet 4.5.

Benchmarks Code generation GPT

arXiv cs.CL·May 27

SPEAR: Code-Augmented Agentic Prompt Optimization

SPEAR is an agentic prompt optimizer integrating a Python sandbox for structural error analysis (confusion matrices, clustering). Evaluated on 13 industrial LLM-as-judge tasks and BBH-7, it outperforms GEPA and TextGrad (κ 0.857 vs 0.359 on tool-selection; F1-macro 0.815 vs 0.763). Python tool contributes +0.79κ on complex judge tasks.

Prompt engineering AI Agents Code generation

arXiv cs.CL·May 27

Self-Verified Distillation: Your Language Model Is Secretly Its Own Synthetic Data Pipeline

Qwen3 improves reasoning via Self-Verified Distillation, a post-training algorithm requiring no external data. The model generates solutions, filters them through self-verification (cycle-consistency, factuality, correctness), then trains on self-curated data. Gains: +16.7 points math (AIME26/HMMT), +11.1 science (GPQA), +8.3 coding for Qwen3-4B.

Qwen Fine-tuning Reasoning

arXiv cs.AI·May 27

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

ScientistOne, an autonomous research system, introduces Chain-of-Evidence (CoE) to trace every claim to its source. Evaluation across 75 papers: baseline systems show 21% hallucinated references and a 42% score-verification pass rate. ScientistOne achieves 0 hallucinations, perfect verification, and matches or exceeds human expert performance on five tasks.

AI Agents Reasoning Evals

arXiv cs.AI·May 27

Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

AgingBench, a longitudinal reliability benchmark, measures how deployed AI agents degrade over time. Study across 14 models and ~400 runs shows reliability depends on four mechanisms: compression, interference, revision, and maintenance aging. Agents lose factual precision even when behavioral tests remain clean.

AI Agents Evals Benchmarks
