Assessing the Carbon Emissions and Energy Consumption of U.S. Hyperscale Data Centers

Study of 403 US hyperscale data centers (May 2024–April 2025): estimated consumption 68–99 TWh, emissions 37–54 Mt CO2. Represent 1.8% of US electricity consumption. Carbon intensity 545 gCO2/kWh, 48% above national average (370 gCO2/kWh). 54% of electricity from fossil-fuel sources.
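As a rough consistency check on the figures above, the emissions range can be recomputed from the consumption range and the carbon intensity. This is only a sketch using the rounded numbers quoted in the summary; the function and constant names are illustrative and do not come from the study.

```python
# Rounded figures as quoted in the summary (not the study's raw data).
CONSUMPTION_TWH = (68.0, 99.0)
INTENSITY_G_PER_KWH = 545.0
NATIONAL_AVG_G_PER_KWH = 370.0


def emissions_mt(twh: float, g_per_kwh: float) -> float:
    """Convert electricity use (TWh) and intensity (gCO2/kWh) to Mt CO2."""
    kwh = twh * 1e9           # 1 TWh = 1e9 kWh
    grams = kwh * g_per_kwh   # total grams of CO2
    return grams / 1e12       # 1 Mt = 1e12 g


low, high = (emissions_mt(t, INTENSITY_G_PER_KWH) for t in CONSUMPTION_TWH)
excess = INTENSITY_G_PER_KWH / NATIONAL_AVG_G_PER_KWH - 1.0

print(f"emissions: {low:.1f} to {high:.1f} Mt CO2")
print(f"intensity above national average: {excess:.1%}")
```

With these rounded inputs the emissions range matches the quoted 37–54 Mt, while the intensity gap comes out slightly above 47%; the quoted 48% presumably reflects unrounded inputs.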

Infrastructure Regulation AI safety

arXiv cs.AI·Jun 6

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

LLM judges used to evaluate AI models are unstable under post-decision interaction. On MT-Bench and AlpacaEval, researchers show initial judgments can be reversed through targeted challenges, degrading agreement with human preferences and shifting benchmark rankings. They introduce the Evaluation Robustness Score (ERS) to measure this fragility.

Evals Benchmarks AI safety

arXiv cs.AI·Jun 6

What Should Agents Say? Action-state Communication for Efficient Multi-Agent Systems

PACT is a protocol for inter-agent communication that compresses messages between LLM agents into compact action-state records. Tested on two MAS topologies, it reduces token usage by 10-50% while maintaining or improving performance on OpenHands and SWE-agent.

Multi-agent AI Agents Code generation

arXiv cs.AI·Jun 6

Evaluation of LLMs for Mathematical Formalization in Lean

Comparison of LLMs for generating formal proofs in Lean 4. Gemini 3.1 Pro and Claude Opus 4.7 achieve best performance (92% and 86% success rates respectively via refine@32). NVIDIA Nemotron 3 Super and GPT-OSS 120B offer best cost-efficiency (<$0.01 per correct proof).

Benchmarks Claude Gemini

arXiv cs.AI·Jun 6

Multilingual Fine-Tuning via Localized Gradient Conflict Resolution

Novel multilingual fine-tuning method using multi-objective optimization (MOO) applied locally on parameter buckets. Resolves gradient conflicts across languages without communication overhead. Demonstrates improved performance on seen and unseen languages across 4 base LLMs.

Fine-tuning Reinforcement learning Papers

arXiv cs.AI·Jun 6

Mutation Without Variation: Convergence Dynamics in LLM-Driven Program Evolution

arXiv study on LLM-driven program mutation: mutation chains converge rapidly toward restricted regions of program space. 87% of chains revisit 93% of previously seen structural forms. Phenomenon is robust across models and prompts, revealing systematic bias toward structural homogeneity incompatible with open-ended exploration.

Papers Code generation Reinforcement learning

arXiv cs.AI·Jun 6

Answer Presence Drives RAG Rewriting Gains

Controlled intervention study shows RAG rewriting gains are driven by gold answer presence in the rewritten context, not curation quality. Tests across Qwen2.5/3.5, GLM-4 and HotpotQA/2WikiMultihopQA: removing the answer drops F1 by 28–64 points, while injecting it raises F1 by 0.7 to 9.7 points. Authors release an intervention runner and sentinel panel for reproducible evaluation.

RAG Evals Benchmarks

Reddit r/LocalLLaMA·Jun 5

dots.tts 2B🎙️ SOTA TTS from RedNote

RedNote (Xiaohongshu) releases dots.tts, an open-source TTS model with 2B parameters under Apache 2.0. Fully continuous architecture without codec tokens, 48 kHz synthesis, zero-shot voice cloning, direct text-to-speech pipeline.

Voice Open source Tools

Reddit r/MachineLearning·Jun 5

Benchmark: ONNX Runtime vs HF Transformers vs GGUF for Parakeet TDT 0.6B on CPU-only hardware [D]

CPU inference benchmark for Parakeet TDT 0.6B on 2 x86-64 vCPUs (7.7GB RAM). ONNX Runtime FP32 achieves RTF 0.328 (37% faster than HF Transformers bfloat16 at 0.519) but peaks at 2.7GB memory. GGUF Q6_K reduces to 928MB but doubles RTF to 0.708. Methodological note: espeak-ng inflates WER to 20.9% vs gTTS 4.65%.
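The comparison above is easier to read once the real-time factor (RTF) is spelled out: processing time divided by audio duration, so lower is faster. The sketch below recomputes the quoted percentages from the three RTF values in the summary; the dictionary keys and helper names are illustrative only.

```python
# RTF values as quoted in the summary; lower means faster.
RTF = {"onnx_fp32": 0.328, "hf_bf16": 0.519, "gguf_q6_k": 0.708}


def time_saved(new_rtf: float, base_rtf: float) -> float:
    """Fraction of processing time saved relative to a baseline RTF."""
    return 1.0 - new_rtf / base_rtf


def seconds_to_transcribe(audio_seconds: float, rtf: float) -> float:
    """Processing time implied by an RTF for a clip of the given length."""
    return audio_seconds * rtf


onnx_vs_hf = time_saved(RTF["onnx_fp32"], RTF["hf_bf16"])
gguf_vs_onnx = RTF["gguf_q6_k"] / RTF["onnx_fp32"]

print(f"ONNX Runtime saves {onnx_vs_hf:.0%} of HF Transformers' time")
print(f"GGUF Q6_K takes {gguf_vs_onnx:.2f}x as long as ONNX Runtime")
print(f"60 s clip on ONNX Runtime: {seconds_to_transcribe(60, RTF['onnx_fp32']):.1f} s")
```

Read this way, "37% faster" means 37% less processing time per second of audio, and "doubles RTF" is relative to the ONNX Runtime figure, not to the HF Transformers baseline.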

Benchmarks Code generation Voice

arXiv cs.CL·Jun 5

AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

AdaPlanBench is an interactive benchmark evaluating LLM agents' ability to adaptively plan and replan under progressively revealed world and user constraints. Built on 307 household tasks, it tests 10 leading models: best achieves 67.75% accuracy. Performance degrades with constraint accumulation, particularly for user constraints.

AI Agents Reasoning Benchmarks

arXiv cs.CL·Jun 5

Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training

arXiv paper demonstrating that optimal hyperparameters for LLM continued pre-training follow predictable scaling laws. Two-stage framework: empirical law discovery via proxy models, then state-aware prediction using validation loss and equivalent pre-training compute. Reduces hyperparameter search overhead by 90% while maintaining performance.

Benchmarks Fine-tuning Papers

arXiv cs.LG·Jun 5

What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning

A4D introduces a functional latent space organized around object affordances rather than appearance to improve robot planning generalization. The system achieves 94% accuracy on existing affordances (+15 points vs SOTA) and 90% on new affordances using only 10% of original training data, with 100x faster inference.

Robotics Reasoning Vision

arXiv cs.CL·Jun 5

Localizing Prompt Ambiguity in Large Language Models with Probe-Targeted Attribution

PRIG, a gradient attribution method, localizes ambiguity in LLM prompts by training a linear probe to distinguish clear from ambiguous prompts, then attributing the probe score to token representations. Evaluated on synthetic datasets (coding, math, writing) and a human-written gold benchmark, PRIG achieves 0.840 AUROC on the combined synthetic benchmark and 0.891 AUROC on the gold set.

Prompt engineering Evals Papers

arXiv cs.LG·Jun 5

LEVANTE-bench: Multi-Scale Comparison of VLMs to Children Using Cognitive Tasks (or, "Is Your VLM Smarter Than a 5th Grader?")

LEVANTE-bench benchmarks vision-language models against children aged 5-12 (N=1547) on cognitive tasks across 3 countries. Larger models show better overall alignment, but smaller models match younger children's error patterns better. VLMs struggle on matrix reasoning and mental rotation tasks.

Vision Benchmarks Evals

arXiv cs.LG·Jun 5

SHALA-LLM: Smartly Handling Ambiguous Labels in Aligning LLMs

SHALA-LLM is a reinforcement learning framework that treats label ambiguity as useful information rather than noise. On NLI and emotion recognition tasks, it reduces Jensen-Shannon Distance by 62.1% on ChaosNLI and improves F1 by 16.7% by learning directly from annotator distributions.

Reinforcement learning Alignment Evals

arXiv cs.LG·Jun 5

Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents

Agentic Monte Carlo (AMC) optimizes black-box LLM agents without parameter access. The method uses Sequential Monte Carlo to sample from the optimal policy by learning a value function to steer the agent, leaving the underlying model unchanged. Validated on AgentGym, AMC outperforms prompting baselines and GRPO.

AI Agents Reinforcement learning Reasoning

arXiv cs.CL·Jun 5

Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics

Epidemiological study of model collapse from synthetic data training. Bilayer SIR/SIRS framework models cross-contamination between data corpora and AI models. GPT-2 experiments on WikiText and Shakespeare (192 runs) confirm dose-response degradation; R₀ > 1 indicates supercritical dynamics. Synthetic-text detection and filtering identified as highest-leverage interventions.

Papers AI safety Benchmarks

arXiv cs.CL·Jun 5

TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework

TensorBench is a benchmark of 199 coding tasks (feature additions and refactoring) on an open-source compiler-based tensor framework extending PyTorch. Evaluation of 7 coding agents: pass rates from 64.8% (strongest) to 22.1% (weakest), with low inter-agent agreement (κ=0.05 for top two agents).

Benchmarks Code generation AI Agents

arXiv cs.CL·Jun 5

ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time?

ArcANE is an automatically constructed benchmark evaluating whether role-playing language agents maintain character psychological consistency across narrative phases. Built on 17 novels and 80 principal characters, it tests responses across story phases and unseen scenarios. Conditioning on Character Arc outperforms all other context strategies across 6 models and 6 context modes.

Benchmarks AI Agents Evals

arXiv cs.CL·Jun 5

Self-supervised User Profile Generation for Personalization

BUMP is a self-supervised framework generating textual user profiles for LLM personalization. Trained via GRPO on bidirectional ranking objectives (profile vs interactions, interactions vs profiles), it requires no labeled supervision. Evaluated on LaMP, BUMP matches or outperforms closed-source APIs without task labels.

Reinforcement learning RAG

arXiv cs.LG·Jun 5

Less is MoE: Trimming Experts in Domain-Specialist Language Models

Fisher-MoE proposes compressing Mixture-of-Experts models by targeting intermediate FFN dimensions rather than entire experts. On Qwen1.5-MoE, removing just 12 of 1.35M critical dimensions (identified via Fisher importance) preserves performance while reducing memory by ~45% and improving inference throughput by 21%.

Qwen Benchmarks Fine-tuning

arXiv cs.LG·Jun 5

The Evaluation Blind Spot: A Stereological Theory of Benchmark Coverage for Large Language Models

Stereological theory of LLM benchmark coverage. For d_eff ∈ [2.86, 4.80], structural blind spot exceeds runner-up score gap by two orders of magnitude. Submodular greedy algorithm identifies 4 stable benchmarks; 7 of 12 suffice for 90% coverage. Validation across 12 internal benchmarks and 27 Chatbot Arena categories.

Benchmarks Evals

arXiv cs.LG·Jun 5

Pattern Selectivity is Not Task-Causal Structure: A Cross-Architecture Mechanistic Study of Composed-Task Circuits in 1B-Class Language Models

Mechanistic cross-architecture study of three 1B-class models (Pythia, OLMo, OLMoE) testing whether circuit identification via pattern selectivity plus causal ablation yields reproducible findings. Result: same task, same behavioral capability, different implementations across models. Introduces a five-category taxonomy (primary cause, secondary cause, correlate, interferer, null) with quantitative thresholds.

Benchmarks Papers

arXiv cs.LG·Jun 5

Multimarginal flow matching with optimal transport potentials

Novel approach combining flow matching and dynamic optimal transport to model temporal evolution with observed intermediate marginals. Simulation-free algorithm (OTP-FM) validated on RNA-seq, oceanographic, and meteorological datasets.

Reasoning Papers

arXiv cs.LG·Jun 5

Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference

PRECISE extends Prediction-Powered Inference for ranking evaluation by combining small human-labeled sets with large LLM-judged sets using Claude 3 Sonnet. Reduces Precision@4 standard error from 4.45 to 3.50 (−21% relative). In production, correctly identifies best system variant from 100 human labels; A/B testing confirms +407 bps daily sales lift.

Evals Claude Benchmarks

arXiv cs.LG·Jun 5

State commitment learning: training language models to distinguish computation from memory

New training method to distinguish temporary computation from persistent state in language models. Counterfactual Erasure RL (CERL) rewards models when answers remain correct after erasing intermediate thoughts. Evaluation on mathematics, logic, and scientific QA shows reduced dependence on hidden computations without accuracy loss.

Reasoning Reinforcement learning Papers

Reddit r/LocalLLaMA·Jun 5

proveKV – Honest 36× lossless (vs f32, 18x vs fp16) KV‑cache compression for LLMs (zero PPL regression)

proveKV: open-source KV-cache compression technique for LLMs. Results: 36× lossless memory reduction vs f32, 18× vs fp16 on SmolLM2-1.7B + WikiText-2 (0% PPL regression). Automated audit pipeline with reproducible validation.

Open source Infrastructure Benchmarks

Reddit r/LocalLLaMA·Jun 4

cyankiwi AWQ 4-bit — 26.05 update, NVFP4 + FP8 Dynamic quantization and benchmarks across Qwen3.6 4-bit quants

cyankiwi releases AWQ 4-bit update with NVFP4 and FP8 Dynamic quantization support. KL divergence benchmarks on Qwen3.6 27B and 35B-A3B: cyankiwi/Qwen3.6-27B-AWQ-INT4 achieves 0.020443 KLD (best dense), cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit achieves 0.017126 KLD (best MoE).

Qwen Benchmarks Open source

Reddit r/MachineLearning·Jun 4

We built a source-available LLM reliability library (free for research / personal / internal eval) that can cut inference cost by half at matched quality, and you adopt it by changing one import [P] [R]

Agentcodec unifies 28 LLM reliability techniques (retries, ensembling, verification, adaptive routing) under a single API. Adoption via one-line import change. On Nemotron + Devstral + GLM-5.1, adaptive router achieves 56% cost reduction at matched quality, or 7% quality gain at matched cost. Single λ parameter controls the trade-off.

Reasoning Evals Open source

Reddit r/LocalLLaMA·Jun 4

Qwen3.6-27B on 2x3090s: llama.cpp vs vLLM, all the flags, and the MTP acceptance/inference speed/context

Detailed benchmark of Qwen3.6-27B on 2x RTX 3090 comparing llama.cpp (Q6_K/Q8_0) and vLLM (INT4/INT8). Real measurements: throughput 43-54 tok/s, MTP acceptance rates 27-77% per backend. Setup with OpenAI-compatible proxy hot-swapping 4 configs, no PCIe P2P (Threadripper 1950X).

Qwen Code generation Benchmarks

Reddit r/LocalLLaMA·Jun 4

KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag)

Huawei open-sources KVarN, a KV-cache quantization method (Apache 2.0, vLLM single-flag integration). 3–5× compression vs FP16, throughput up to 1.4× FP16, maintains reasoning quality unlike TurboQuant (Google). No retraining, no calibration required.

Open source Infrastructure Benchmarks

arXiv cs.LG·Jun 4

Spectral Scaling Laws of Muon

Systematic study of spectral behavior in Muon, an orthonormalization-based optimizer using Newton-Schulz iteration. Across 77M-2.8B parameter models, singular value quantiles of momentum buffers stabilize following power laws dependent on layer depth (exponents M^-0.25 to M^-0.96). Implications for NS configuration at frontier scale.

Benchmarks Papers Infrastructure

arXiv cs.LG·Jun 4

Training-Free Lexical-Dense Fusion for Conversational-Memory Retrieval

Training-free lexical-dense fusion study for long-term conversational memory retrieval. Score-level fusion of late-interaction dense + BM25 improves Hit@1 by +8.8 to +17.2 points across six encoders (Hit@1 0.752 with e5-large-v2). Web search cross-encoder reranker degrades results (-6.9 pp). Analysis shows division of labor: dense excels on multi-hop/temporal questions, BM25 on adversarial ones.
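Score-level fusion, as described above, can be illustrated with a small sketch. The summary does not say how the paper normalizes or weights the two score lists, so the min-max normalization, the weight `alpha`, and the toy memory IDs and scores below are all assumptions made for illustration, not the authors' method.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so dense and BM25 values are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(dense: dict[str, float], lexical: dict[str, float],
         alpha: float = 0.5) -> list[str]:
    """Rank items by a weighted sum of normalized dense and lexical scores."""
    d, l = min_max(dense), min_max(lexical)
    ids = set(d) | set(l)
    fused = {i: alpha * d.get(i, 0.0) + (1 - alpha) * l.get(i, 0.0) for i in ids}
    # Sort by fused score (descending), breaking ties by ID for determinism.
    return sorted(fused, key=lambda i: (-fused[i], i))


# Hypothetical scores for three stored conversation turns.
dense_scores = {"m1": 0.82, "m2": 0.80, "m3": 0.40}
bm25_scores = {"m1": 2.0, "m2": 9.0, "m3": 7.5}

ranking = fuse(dense_scores, bm25_scores)
print(ranking)
```

In this toy case the dense retriever alone would rank `m1` first, but `m2` is nearly as close semantically and has much stronger lexical overlap, so fusion moves it to the top. No training is involved, only a combination of scores the two retrievers already produce.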

RAG Embeddings Benchmarks

arXiv cs.AI·Jun 4

Learning Admissible Heuristics via Cost Partitioning

Novel framework for learning admissible heuristics in optimal planning via cost partitioning. A neural network with axial self-attention predicts cost weights guaranteed admissible by construction, leveraging Lagrangian dual equivalence. Results show reduced node expansions versus suboptimal baselines while preserving optimality.

Reasoning Papers

arXiv cs.AI·Jun 4

Scaling Self-Evolving Agents via Parametric Memory

TMEM introduces a self-evolving parametric memory framework for LLM agents. Instead of storing experience solely as textual context, the agent absorbs distilled supervision into lightweight LoRA weights (Δ_t) via online updates, genuinely altering behavior within a single episode. Evaluated on LoCoMo, LongMemEval-S, and CL-Bench, TMEM consistently outperforms summary-based and retrieval-based baselines.

AI Agents Fine-tuning Reinforcement learning

arXiv cs.AI·Jun 4

StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis

StepPRM-RTL combines stepwise trajectory modeling, Process Reward Models, and retrieval-augmented fine-tuning to improve LLM-based RTL code generation. The framework uses MCTS to explore alternative reasoning paths and achieves >10% improvement in functional correctness on Verilog/VHDL benchmarks.

Reinforcement learning Code generation Reasoning

arXiv cs.LG·Jun 4

RUBAS: Rubric-Based Reinforcement Learning for Agent Safety

RUBAS is a rubric-based reinforcement learning framework for LLM agent alignment. It decomposes behavior into four dimensions (tool-use safety, argument safety, response safety, helpfulness) and generates structured rewards over complete trajectories. Experiments show improved safety and reduced hallucinations while maintaining utility.

AI Agents Reinforcement learning AI safety

arXiv cs.LG·Jun 4

EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms

EvalStop detects and stops RLHF jobs that overoptimize the reward model at the expense of real-world metrics. On 80% RLHF workloads (64 GPUs), the system achieves 98% precision and cuts wasted compute by 22% while improving JCT by 9% over SRTF-Est.

Reinforcement learning Evals Alignment

arXiv cs.LG·Jun 4

dMX: Differentiable Mixed-Precision Assignment for Low-Precision Floating-Point Formats

dMX is a differentiable mixed-precision quantization framework for learnable floating-point bit-width assignment across LLM layers. Tested on Llama, Qwen3, and SmolLM2 using the MXFP standard (Open Compute Project), it optimizes layer formats continuously then discretizes via annealing, outperforming KL-divergence heuristics on WikiText-2 and zero-shot reasoning benchmarks.

Llama Qwen Benchmarks

arXiv cs.CL·Jun 4

A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

Systematic study of positional bias in multi-video summarization with MLLMs. Benchmark on ActivityNet and News videos (2-4 inputs). Evaluation of 9 models (open-source and proprietary) using three metrics: Coverage, Directional Positional Bias, Middle-Edge Gap. Finding: positional effects are domain- and model-dependent; increasing visual budget does not uniformly remove imbalance.

Vision Benchmarks Evals
