Pruning Deep Neural Networks via the Marchenko–Pastur Distribution

Neural network pruning method using Marchenko–Pastur random-matrix theory with minimal post-pruning fine-tuning. On ImageNet-1k, ViT-B/16 achieves 83.41% top-1 accuracy with a 59.81% MAC reduction after 3 distillation epochs; ResNet50 with 8:16 sparsity reaches 75.87% with a 1.62× speedup on an A40 GPU.

Benchmarks Papers Vision
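
For readers unfamiliar with the Marchenko–Pastur law named in the card above, the standard statement is reproduced here. The summary does not say how the paper applies it; a common use in pruning, assumed here only for illustration, is to treat eigenvalues inside the bulk $[\lambda_-, \lambda_+]$ as noise and those above $\lambda_+$ as signal. The symbols $X$, $n$, $p$, $\gamma$ and $\sigma^2$ are notation introduced for this note, not taken from the paper.

```latex
% X is an n x p matrix with i.i.d. zero-mean entries of variance \sigma^2,
% and p/n \to \gamma \in (0, 1] as n, p \to \infty.
% The eigenvalues of (1/n) X^\top X then follow the density
\rho(\lambda) = \frac{\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{2\pi \sigma^2 \gamma \lambda},
\qquad \lambda \in [\lambda_-, \lambda_+],
\qquad \lambda_\pm = \sigma^2 \left(1 \pm \sqrt{\gamma}\right)^2 .
```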

arXiv cs.CL·Jun 3

EURO-5K: When Does Domain Pretraining Matter? Benchmarking Transformers for EU Reporting Obligation Extraction

EURO-5K is a 5K-sentence corpus for extracting reporting obligations from EU legislation (136 legislative acts). It compares fine-tuned BERT models with LLMs tuned via QLoRA: generic and legal BERT achieve a similar 0.89 F1, and legal pretraining helps mainly for parameter-efficient tuning. Performance converges at about 3K samples.

Benchmarks Fine-tuning Papers

arXiv cs.LG·Jun 3

Are we really tilting? The mechanics of reward guidance in flow and diffusion models

Reward guidance algorithms steer generative processes toward reward-tilted measures. The paper shows reward hacking stems from finite-particle plug-in estimation of the Doob h-function in practical implementations. Authors propose a closed-form reward damping schedule and validate on Gaussian targets, 2D checkerboard, and FLUX.1 text-to-image generation.

Reinforcement learning Reasoning Papers

arXiv cs.AI·Jun 3

Handoff Debt: The Rediscovery Cost When Coding Agents Take Over Interrupted Tasks

Study of 'handoff debt': the rediscovery cost when a coding agent resumes an interrupted task. Across 75 tasks and 724 runs, structured notes reduce median agent events by 20–59% and tokens by 42–63% vs. repository-only takeover. Agent benchmarks should evaluate resumption efficiency, not just task resolution.

AI Agents Code generation Benchmarks

arXiv cs.CL·Jun 3

AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making

Factorial study of 4 open-source LLMs rating clinical decisions in type 2 diabetes pharmacotherapy. As AI raters, the LLMs award 74–78 points under a rubric-free protocol vs 7.69–49.64 points under an anchored Gold Rubric. The rubric amplifies discrimination between CDSS models (1.76–5.10×) and reveals behavioral variation that is suppressed without it.

Evals Benchmarks AI safety

arXiv cs.CL·Jun 3

Predicting Inference-Time Scaling Gains from Labeled Validation-Set Output Statistics

Method to predict best-of-N inference scaling gains without running the full procedure. Ridge predictor identifies 3 stable features (prompt-level agreement spread, label-assisted first-correct-sample position, completion-length variance) plus entropy, reaching Spearman ρ=0.90 correlation with actual gains across model families and math/reasoning tasks.

Reasoning Evals Reinforcement learning
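
The Spearman ρ quoted in the card above is the Pearson correlation of rank vectors. A minimal sketch of that metric follows, using only the standard library. The function names and the sample numbers are hypothetical and are not taken from the paper.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical numbers: predicted vs. measured best-of-N gains for five setups.
predicted = [0.02, 0.10, 0.05, 0.30, 0.18]
measured = [0.01, 0.12, 0.04, 0.25, 0.20]
rho = spearman_rho(predicted, measured)
```

Because ρ depends only on ordering, a predictor can score highly while being miscalibrated in absolute terms, so the figure is best read as a ranking-quality measure.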

arXiv cs.LG·Jun 3

ReLoRA: Knowledge-Reusing Adaptation for Fast Rollout of Evolving LLM Services

ReLoRA is an efficient re-adaptation framework for continuously evolving LLM services. It uses Bayesian optimization to initialize LoRA adapters compatible with base-model updates, then fine-tunes with scheduled regularization. Results: up to 8.9× reduction in time-to-readiness and up to 4.6% accuracy improvement.

Fine-tuning Reasoning Benchmarks

arXiv cs.CL·Jun 3

MemTrain: Self-Supervised Context Memory Training

MemTrain introduces a self-supervised training framework to enhance context-memory capabilities of LLM agents. Two coupled proxy tasks on Wikipedia (masked entity reconstruction and intermediate memory recall) are jointly optimized using GRPO. Achieves gains up to 17.67 points on long-text QA and search-based QA benchmarks.

AI Agents Reinforcement learning Papers

arXiv cs.AI·Jun 3

AURA: Action-Gated Memory for Robot Policies at Constant VRAM

AURA-Mem introduces a constant-size recurrent memory (4,224 bytes) for robot policies, with a learned gate that writes only when observations would change the next action. On LIBERO-Long with OpenVLA-OFT 7B, it matches baseline policy (0.233 success) while reducing memory writes by 7× and VRAM by 6,061× versus KV-cache.

Robotics AI Agents Reasoning

arXiv cs.AI·Jun 3

Inducing Reasoning Primitives from Agent Traces

Method to extract reasoning primitives from ReAct agent traces. Recurrent reasoning moves are clustered and converted into typed pseudo-tools. Induced libraries outperform the source agent: +44pp on RuleArena NBA, +30pp on MuSR, +22pp on NatPlan.

AI Agents Reasoning Prompt engineering

arXiv cs.AI·Jun 3

When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning

Study across 6,000 task-condition pairs shows multi-agent debate degrades generation (-1.6 to -15.5pp) via critique-induced confusion, yet improves error detection (+27.4pp F1). Adversarial separation with code-execution grounding and evidence-gated generation achieves +5.3pp on generative tasks.

Multi-agent AI Agents Evals

arXiv cs.CL·Jun 3

Experience-Driven Dynamic Exits for LLMs with Reinforcement Learning

LEDE, an offline reinforcement learning framework, optimizes LLM inference by dynamically selecting exit layer and speculation length based on local sequence context. On Llama-2 and Llama-3, it achieves 2.0×–2.7× speedup over autoregressive decoding, +17% over static speculative baselines.

Llama Reinforcement learning Code generation

arXiv cs.CL·Jun 3

The Deliberative Illusion: Diagnosing Factual Attrition and Stance Homogenization in Multi-Agent LLM Deliberation

Multi-agent LLM systems lose up to 72% of issue-critical facts during deliberation, creating a 'deliberative illusion'. DelibTrace measures factual attrition and stance homogenization. Agents converge toward consensus while forgetting essential elements needed to interpret the problem.

Multi-agent AI Agents Evals

arXiv cs.CL·Jun 3

Regret Pre-training: Bridging Prior and Posterior Views for Enhanced Knowledge Grounding

Regret Pre-training introduces a self-supervised framework based on LUPI using dual-view architecture generating Student (causal) and Teacher (future-conditioned) distributions. On OLMoE-1B-7B after 4B tokens, GlobalRegret and LocalRegret achieve 33.9% and 32.2% average accuracy vs 30.2% baseline, with 18.1pp gain on BoolQ. No additional parameters.

Papers Reasoning Fine-tuning

arXiv cs.CL·Jun 3

IdiomX: A Multilingual Benchmark for Idiom Understanding, Retrieval, and Interpretation

IdiomX is a large-scale multilingual benchmark for idiom understanding, containing 190K+ contextualized examples across 12K+ idioms in English, Arabic, and French. The dataset includes idiomatic/literal usage labels and linguistic metadata. Four tasks evaluate idiom detection, retrieval, and interpretation.

Benchmarks

arXiv cs.LG·Jun 3

RESCAST-100K: A Comprehensive Dataset for Cross-Domain Residential Load and Indoor Temperature Forecasting

RESCAST-100K is a benchmark of 100,000 U.S. homes simulated via EnergyPlus/ResStock for evaluating cross-domain generalization in residential energy load and indoor temperature forecasting. 15-minute time series dataset with 40+ static building covariates, integrating 5 real-world datasets. Cross-attention and MLP-mixer models outperform classical transformers under domain shift.

Benchmarks Fine-tuning Papers

arXiv cs.LG·Jun 3

GRZO: Group-Relative Zeroth-Order Optimization for Large Language Model Fine-Tuning

GRZO is a zeroth-order optimizer for memory-efficient LLM fine-tuning. It draws one perturbation per mini-batch example and aggregates losses via group-relative normalization, increasing effective gradient directions from one to batch size at no additional forward cost. On Llama3-8B, GRZO achieves +3.0 accuracy over MeZO with 23% lower peak GPU memory.

Fine-tuning Papers Benchmarks
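
The card above mentions one perturbation per mini-batch example and group-relative normalization of losses. The sketch below shows only that generic idea: z-score the per-example losses within the batch, then use them to weight the perturbations. It is not GRZO's actual update rule, which the summary does not specify, and every name and number here is hypothetical.

```python
from statistics import mean, pstdev

def group_relative(losses, eps=1e-8):
    """Z-score each loss against the mean and standard deviation of its group."""
    mu = mean(losses)
    sigma = pstdev(losses)
    return [(loss - mu) / (sigma + eps) for loss in losses]

def zo_direction(perturbations, losses):
    """Average the perturbations, each weighted by its normalized loss.

    Stepping against the returned vector moves parameters away from
    perturbations that produced above-average loss.
    """
    weights = group_relative(losses)
    dim = len(perturbations[0])
    return [
        sum(w * z[d] for w, z in zip(weights, perturbations)) / len(losses)
        for d in range(dim)
    ]

# Hypothetical toy batch: three examples, two parameters.
perturbations = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
losses = [2.0, 1.0, 0.0]
weights = group_relative(losses)
direction = zo_direction(perturbations, losses)
```

In this toy batch the loss rises along the first coordinate and is average along the second, so the estimated direction is positive in the first coordinate and zero in the second.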

arXiv cs.LG·Jun 3

RRISE: Robust Radius Inference via a Surrogate Estimator

RRISE compresses randomized smoothing certification into a single forward pass via a learned surrogate, replacing up to 10⁴ Monte Carlo evaluations per query. Conformal calibration ensures conservative certified radii. On CIFAR-100 and Tiny ImageNet, 1.23–1.91× higher certified accuracy than prior offline-surrogate methods.

Benchmarks AI safety Evals
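
The conformal calibration mentioned in the card above can be illustrated with a standard split-conformal margin: take a high quantile of calibration errors and subtract it from the surrogate's prediction. This is a generic sketch, not RRISE's procedure. The residual definition, function names and numbers are assumptions made for illustration.

```python
import math

def conformal_margin(residuals, alpha):
    """Split-conformal quantile of calibration residuals at miscoverage alpha."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-based order statistic
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(residuals)[k - 1]

def conservative_radius(predicted_radius, margin):
    """Shrink the surrogate's radius by the margin, never below zero."""
    return max(0.0, predicted_radius - margin)

# Hypothetical calibration residuals: surrogate radius minus Monte Carlo radius.
residuals = [-0.05, 0.02, 0.10, -0.01, 0.04, 0.07, 0.00, 0.03, -0.02]
margin = conformal_margin(residuals, alpha=0.25)
radius = conservative_radius(0.50, margin)
```

Under exchangeability of calibration and test points, the shrunken radius overshoots the Monte Carlo radius with probability at most `alpha`. That is the sense in which such a radius is conservative.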

Simon Willison·Jun 2

Microsoft's new MAI models

Microsoft announces MAI-Thinking-1 (35B, reasoning) and MAI-Code-1-Flash (5B, code). The former outperforms Claude Sonnet 4.6 in blind human evaluation. Both trained on commercially licensed data without third-party distillation.

Code generation Reasoning Benchmarks

Reddit r/MachineLearning·Jun 2

Backpropagation destroys V1 brain alignment in one epoch, tracking RSA alignment to fMRI across training for BP, FA, predictive coding, and STDP [R]

Comparative study of four learning rules (backpropagation, feedback alignment, predictive coding, STDP), measured by RSA alignment with human V1 fMRI. Backprop destroys about 90% of V1 alignment after 1 epoch (r: 0.102→0.011), while predictive coding (PC) and STDP lose only 25–31%. At epoch 40, PC and STDP remain far better aligned than backprop (BP) and feedback alignment (FA). Suggests a fundamental trade-off between global error signals (higher layers) and early-layer alignment.

Alignment Benchmarks Papers

Reddit r/MachineLearning·Jun 2

LLM agents patch security bugs, pass all tests, but still leave the vulnerability open [R]

CVE-Bench evaluates 5 frontier models on 20 real-world CVEs (Pillow, GitPython, urllib3, etc.) across 300 runs. Max solve rate 50% (60% under advisory). Agents patch syntactically but leave vulnerabilities open. Significant cross-family gaps (OpenAI vs Laguna, p<0.05), within-family noise. Failure modes: wrong-search drift, hallucinations, context loss.

AI Agents Benchmarks AI safety

arXiv cs.LG·Jun 2

Agentic Transformers Provably Learn to Search via Reinforcement Learning

Theoretical study showing how transformers learn to implement tree search (DFS) via RL. A two-head transformer naturally emerges from policy gradient training on stochastic trees without expert demonstrations. The model generalizes to unseen depths and adapts its strategy based on goal distributions.

AI Agents Reinforcement learning Reasoning

arXiv cs.AI·Jun 2

Probe Before You Edit: Probing-Guided Molecular Optimization for LLM Agents in Structure-Based Drug Design

PROBE, an optimization framework for LLM agents in drug design, resolves the conflict between binding affinity and druggability. Through controlled edit probing and pocket-specific site mapping, it guides a multi-agent loop (affinity, druggability, co-optimization) on CrossDocked2020 with SOTA results.

AI Agents Multi-agent Reasoning

arXiv cs.LG·Jun 2

Quantized Reasoning Models Think They Need to Think Longer, but They Do Not

Post-training quantization (PTQ) reduces reasoning model accuracy and increases chain-of-thought length. 52% of failures involve correct intermediate answers not output as final answers. A training-free logit penalty on overthinking markers ("wait", "but", "alternatively") reduces CoT length by 12-23% while preserving accuracy across 5 models (1.5B-32B) and 5 benchmarks.

Reasoning Fine-tuning Benchmarks
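
A training-free logit penalty of the kind described in the card above can be sketched as subtracting a constant from the logits of selected tokens before sampling. The toy vocabulary, token ids and penalty value below are hypothetical. The paper's exact penalty form and marker list beyond the three quoted words are not given in the summary.

```python
def penalize_markers(logits, marker_ids, penalty):
    """Return a copy of `logits` with `penalty` subtracted at each marker id."""
    adjusted = list(logits)
    for token_id in marker_ids:
        adjusted[token_id] -= penalty
    return adjusted

# Hypothetical toy vocabulary; real token ids depend on the tokenizer.
vocab = ["the", "wait", "answer", "but", "alternatively"]
marker_ids = [vocab.index(w) for w in ("wait", "but", "alternatively")]

logits = [1.0, 2.5, 2.0, 0.5, 0.1]
adjusted = penalize_markers(logits, marker_ids, penalty=1.0)

# Greedy choice before and after the penalty is applied.
greedy_before = vocab[logits.index(max(logits))]
greedy_after = vocab[adjusted.index(max(adjusted))]
```

Here the penalty flips the greedy choice from "wait" to "answer" while leaving non-marker logits untouched, which is why this style of intervention needs no retraining.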

arXiv cs.AI·Jun 2

From "Weak" Signals to Strong Models: Preference Delta Aggregation with LoRA Merging

Preference Delta Aggregation (PDA) aggregates weak preference signals from model pairs (e.g., Qwen3 4B vs 1.7B) via LoRA merging. Geometric Alignment Merging (GAM) aligns adapter subspaces before aggregation. On knowledge reasoning and agentic search benchmarks, PDA+GAM improves Qwen3 8B by +6.8 and +7.3 points respectively.

Qwen Fine-tuning Reinforcement learning

arXiv cs.LG·Jun 2

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

Study of effectiveness and efficiency of tool-calling in LLM agents. Authors show evaluation pipelines are sensitive to minor choices (random seed, system prompt, multi-turn templates) affecting leaderboard reliability. They identify two sources of computational waste in RL and propose two acceleration techniques without performance degradation.

AI Agents Reinforcement learning Evals

arXiv cs.AI·Jun 2

Robust Shielding for Safe Reinforcement Learning

Novel shielding framework for RL agents ensuring formal safety guarantees in MDPs with unknown transition dynamics. Uses robust MDPs (RMDPs) with sets of transition probabilities and LTL formulas. Combines shielding with PAC-learning methods to construct minimally restrictive shields while guaranteeing safety.

Reinforcement learning AI safety Reasoning

arXiv cs.AI·Jun 2

Capability Self-Assessment: Teaching LLMs to Know Their Limits

Modern LLMs systematically overestimate their competence and attempt unsolvable queries. Researchers propose Capability Self-Assessment (CSA), formulated as a policy-learning problem using reinforcement learning, to teach models to recognize their limits. RL significantly outperforms supervised fine-tuning, preserves original capabilities, and generalizes out-of-distribution.

Reinforcement learning Alignment Evals

arXiv cs.CL·Jun 2

Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

UniKE, the first benchmark for cross-modality knowledge editing in unified multimodal models (UMMs), reveals a critical gap: text-side efficacy reaches 92% but VQA accuracy in image generation drops to 18.5%. A reasoning-augmented parameter editing method improves results by up to +18.6 percentage points.

Benchmarks Vision Fine-tuning

arXiv cs.CL·Jun 2

SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors

SALSA adapts speech-aware LLMs via layer-wise learned steering vectors optimized with supervised objectives. Tested on children's speech, multilingual, and Mandarin-English code-switching, it achieves up to 46.8% relative improvement over zero-shot. Steering the encoder's later layers outperforms steering the LLM backbone.

Voice Fine-tuning Reasoning

arXiv cs.CL·Jun 2

Which Institutional Frameworks Do Chatbots Assume? Auditing Jurisdictional Defaults in Multilingual LLMs

Audit of 7 LLMs (US/China) on 2,520 responses to 60 legal-administrative prompts in English and Mandarin. Models default to the institutional framework of input language: 74.5% of English responses adopt US framework, 53.3% of Chinese responses adopt China framework. Risk of jurisdictional misselection when preferred language differs from applicable jurisdiction.

Benchmarks AI safety Regulation

arXiv cs.CL·Jun 2

Graph-Augmented Retrieval for Cross-Entity Financial Sentiment Analysis: A Comparative Study

Comparative study of a two-hop Graph-RAG architecture versus standard vector-only RAG for cross-entity financial sentiment analysis. On 100 queries (30 direct, 70 relational), Graph-RAG improves entity recall (+6.4%, p<0.001) and answer relevancy for complex queries (+11.7%) with no quality degradation, a modest 22.6% latency increase, and an 80% variance reduction.

RAG Benchmarks Papers

arXiv cs.LG·Jun 2

KG-Guard: Graph-Based Hallucination Detection for Knowledge Base Question Answering

KG-Guard detects hallucinations in knowledge base question answering (KBQA) systems using an augmented graph and lightweight encoder. The model achieves F1 scores of 82.0–87.4 on WebQSP/ComplexWebQuestions with 305× fewer parameters than baselines, and improves downstream KBQA F1 by 13–14.5 points through iterative refinement feedback.

Reasoning Evals RAG

arXiv cs.LG·Jun 2

FLaG: Fine-Grained Latent Grouping for Hallucination Detection

FLaG is a lightweight hallucination detection framework for LLMs that models correctness through latent evidence groups. Using energy-based routing and log-marginal aggregation, it captures heterogeneous hallucination patterns without modifying the underlying model. SOTA results across multiple benchmarks with robust transfer across datasets.

AI safety Evals Reasoning

arXiv cs.LG·Jun 2

RAFT: Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting

RAFT is a two-stage domain fine-tuning method that mitigates catastrophic forgetting. It refines data via self-conditioned rewriting and answer fusion, then applies on-policy distillation where the original model provides soft targets on student-generated trajectories. Across five domains, RAFT improves domain accuracy by 23.2% over standard SFT and recovers 18.2% of degradation on MS-Bench.

Fine-tuning Reinforcement learning Papers

arXiv cs.LG·Jun 2

BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding

BudgetDraft trains a sparse drafter for speculative decoding in long-context inference (4K–16K tokens). The method exposes the model to multiple KV budgets during training and aligns each sparse view with a shared full-cache teacher target. Results: 6.55×, 4.46×, and 2.10× speedups vs autoregressive decoding at 4K, 8K, and 16K tokens.

Reasoning Benchmarks Infrastructure

arXiv cs.AI·Jun 2

KACE: Knowledge-Adaptive Context Engineering for Mathematical Reasoning

KACE decouples storage from usage in context for mathematical reasoning. An epistemic tree stratified by difficulty and domain is built offline via self-reflective loop. At evaluation, tiered self-consistency dynamically classifies problems and selectively retrieves matching cards. On AIME 2025: 62.2% accuracy (+10.4 points vs Best-of-5).

Reasoning Prompt engineering Benchmarks
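
The self-consistency step named in the card above is, in its basic form, a majority vote over sampled answers. The sketch below adds a hypothetical tiering rule on top, with agreement thresholds chosen only for illustration. How KACE actually defines its tiers and retrieves cards is not stated in the summary.

```python
from collections import Counter

def self_consistency(answers):
    """Return the majority answer and the fraction of samples agreeing with it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

def tier(agreement, easy_at=0.8, hard_below=0.5):
    """Map agreement to a difficulty tier; thresholds are illustrative only."""
    if agreement >= easy_at:
        return "easy"
    if agreement < hard_below:
        return "hard"
    return "medium"

# Hypothetical sampled final answers for one problem.
samples = ["42", "42", "17", "42", "42"]
answer, agreement = self_consistency(samples)
difficulty = tier(agreement)
```

A router built this way could skip retrieval for high-agreement problems and reserve extra context for low-agreement ones, which is one plausible reading of "selectively retrieves".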

arXiv cs.AI·Jun 2

Threshold-Based Exclusive Batching for LLM Inference

arXiv paper on LLM inference batching optimization. Authors demonstrate mixed batching (MB) is suboptimal on bandwidth-constrained GPUs: exclusive batching (EB) achieves 41.9% higher throughput on RTX PRO 6000 (1.792 TB/s). They propose EB+, a hybrid scheduler that dynamically switches between EB and MB based on GPU bandwidth, model size, and workload composition, reaching 36.4% gains under non-stationary traffic.

Infrastructure Benchmarks Papers

arXiv cs.AI·Jun 2

TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding

TAPS introduces a target-aware prefix selection method for diffusion-drafted speculative decoding. By converting diffusion marginals into path-conditioned acceptance estimates, TAPS selects a compact prefix-closed subtree under a fixed verification budget. Results: 7.9× lossless speedup vs vanilla autoregressive decoding, and 1.36× and 1.74× over DFlash and DDTree.

Code generation Reasoning Benchmarks

arXiv cs.AI·Jun 2

SDR: Set-Distance Rewards for Radiology Report Generation

New set-distance reward method for reinforcement learning of vision-language models on chest X-ray report generation. Tested on Qwen3-VL and Gemma3 with GRPO, it improves over supervised fine-tuning by 6.80% (BERTScore), 7.82% (RadGraph F1), and 4.45% (CheXbert F1). It also enables test-time best-of-N selection and mid-generation pruning that reduces tokens by 50%.

Reinforcement learning Vision Code generation
