Page 12 of 192

AllHigh signalRecent

7679 articles

DiffusionGemma

Google releases DiffusionGemma-26B, an open-weight Gemma model (Apache 2 license) based on its May 2024 Gemini Diffusion research. The model generates text at 500+ tokens/second. NVIDIA hosts it free on NIM cloud API.

Gemini Open source Code generation

SIG

HYP

Reddit r/LocalLLaMA·Jun 10

DeepMind Just Dropped "DiffusionGemma" — Text Generation via Image-Style Diffusion Model

DeepMind releases DiffusionGemma, a 26B MoE model (3.8B active) under Apache 2.0. Instead of sequential token generation, it uses diffusion to refine 256 tokens simultaneously. Achieves 1000+ tokens/s on H100, 700+ on RTX 5090. Native integration with vLLM, Unsloth, HF Transformers.

DeepMind Code generation Open source

SIG

HYP

The Decoder·Jun 10

Claude Fable 5: The first Mythos model is powerful, expensive, and heavily filtered

Anthropic releases Claude Fable 5, first Mythos-class model. Leads benchmarks (SWE-bench Verified 95%), but costs 2× more than Opus 4.8 (10-50$/M tokens). Strict safety filters block 9% of requests; mandatory 30-day data retention policy.

Claude Benchmarks AI safety

SIG

HYP

Reddit r/LocalLLaMA·Jun 10

1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM

Comprehensive benchmark of Bonsai LM models (1-bit and 1.58-bit, 1.7B–8B) on Jetson Orin Nano Super ($250) using llama.cpp CUDA across 4 power modes. Key findings: 25W is efficiency sweet spot for ≤4B models (47–48% faster than 15W), no thermal throttling observed, Bonsai-1.7B Q1_0 achieves 5.84 tok/J in 237 MB with 26 tok/s.

Open source Benchmarks Infrastructure

SIG

HYP

arXiv cs.CL·Jun 10

Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models

New Program-based Posterior Training (PPT) method to train LLMs for inductive reasoning. Uses LLM-generated probabilistic programs to create 10,000 training scenarios with distributional labels. Significantly improves accuracy, human alignment, and calibration independent of temperature scaling.

Reasoning Fine-tuning Papers

SIG

HYP

arXiv cs.CL·Jun 10

Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark

arXiv study measuring VLM reliance on textual priors over visual content. 540-image benchmark with 4 question variants per image. 11 models tested: all degrade on hardest variant, open-source models drop furthest. No-image ablation reduces open models to 1–9% performance. GRPO post-training improves image-dependence across variants.

Vision Evals Benchmarks

SIG

HYP

arXiv cs.CL·Jun 10

MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents

Researchers identify a shared low-dimensional encoding subspace in LLM residual streams that detects when agents covertly encode sensitive data (Base64, ROT13, etc.). MIRAGE, a real-time monitor leveraging two mechanistic signals, achieves AUC=0.918 on 126 exfiltration scenarios, substantially outperforming output-only detection (AUC=0.518).

AI safety Alignment Reasoning

SIG

HYP

arXiv cs.CL·Jun 10

Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models

Prefilling-dLLM optimizes diffusion language model inference by partitioning context into chunks, caching their KV representations, and selecting relevant chunks with intra-chunk token sparsity. Achieves 9.1–28.0x speedup on 8K–32K contexts without full prefix re-encoding.

Reasoning Benchmarks Infrastructure

SIG

HYP

arXiv cs.LG·Jun 10

Two to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning

DualSelect, a fine-tuning method for LLMs, jointly selects safety references and compatible task samples to preserve safety alignment during adaptation. Tested on 1B-8B models, it improves Safety Avg. by at least 5.10 points over strongest baselines while maintaining task utility.

Fine-tuning AI safety Alignment

SIG

HYP

arXiv cs.AI·Jun 10

Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts

Visual-SDPO, a self-distillation framework using visual feedback, improves code generation for visual artifacts (charts, web pages, slides). The method traces detected visual defects back to responsible code statements and amplifies distillation signals on those statements. On ChartMimic, Design2Code, and AeSlides benchmarks, it gains +10 absolute points over zero-shot and +2.4 over GRPO.

Code generation Vision Reinforcement learning

SIG

HYP

arXiv cs.AI·Jun 10

STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios

STAGE-Claw is an automated framework for building and evaluating AI agents in realistic scenarios. It automatically generates tasks, environments, and state-based metrics. A benchmark of 40 tasks evaluates 11 frontier models on tool-call reliability and failure patterns.

AI Agents Benchmarks Evals

SIG

HYP

arXiv cs.AI·Jun 10

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

RealMath-Eval, a benchmark of 224 real exam responses, reveals that state-of-the-art LLM judges fail to evaluate authentic human reasoning (MSE ~2.96 vs ~1.17 on synthetic solutions). Analysis shows human errors form a more diverse error space than synthetic errors, with higher information-theoretic surprisal.

Evals Benchmarks Reasoning

SIG

HYP

arXiv cs.AI·Jun 10

One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

Latent Memory replaces each memory item (text/image) with a single compressed latent token, reducing generator token consumption by 3-10x. Trained with reconstruction, contrastive, and distillation objectives, the system achieves competitive performance on HotpotQA and multimodal benchmarks while lowering memory pressure.

RAG Embeddings Vision

SIG

HYP

arXiv cs.CL·Jun 10

LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake

LakeQA is a QA benchmark over 9.5 TB of heterogeneous data (Wikipedia + government sources) requiring search and multi-hop reasoning. GPT-4.5 achieves 18.37% exact-match. Evaluates LLM agents' ability to discover and analyze documents in massive data lakes.

Benchmarks Reasoning RAG

SIG

HYP

arXiv cs.LG·Jun 10

Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teaming

Co-GLANCE is a real-time onboard perception system for heterogeneous robot teams. It distills vision-language model semantic reasoning into an end-to-end model for occlusion segmentation and robot allocation, with statistical coverage guarantees via conformal prediction. Outperforms cloud-based baselines by 25-36% in accuracy while reducing inference latency 350x.

Vision Robotics Reasoning

SIG

HYP

arXiv cs.AI·Jun 10

Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents

Study on context optimization for autonomous LLM agents in enterprise workflows. Testing 4 GPT-5 configurations on 50 expense itemization tasks (Microsoft Dynamics 365). Pruning context to last 5 tool calls + summarization achieves 91.6% completion with 553k tokens (vs 1.48M full context), reducing runtime from 14.56h to 5.79h.

GPT AI Agents MCP

SIG

HYP

arXiv cs.LG·Jun 10

Conformal Risk Prediction for Non-Alcoholic Fatty Liver Disease Using Gradient Boosting with Distribution-Free Coverages

ML framework coupling gradient boosting and conformal prediction for NAFLD risk. Evaluated on 2,599 patients (Guangzhou), achieves AUROC 0.912 internally and 0.891 external validation. Conformal coverage 91.3% at 90% nominal level. Three-tier stratification: high-risk group shows 4.7× progression rate vs. low-risk tier.

Benchmarks Evals AI safety

SIG

HYP

arXiv cs.LG·Jun 10

TENP: Trapezoidal Expert Neuron Pruning For Mixture-of-Experts

TENP proposes a structured pruning framework for Mixture-of-Experts LLMs. The method identifies important experts and applies neuron-level pruning to less important experts in a trapezoidal pattern across layers. On DeepSeek with 40% routing sparsity and 63.76% activated expert parameters, accuracy drop is limited to 1 point, with +10% improvement on code generation tasks.

DeepSeek Qwen Benchmarks

SIG

HYP

arXiv cs.LG·Jun 10

SHAPE: Coalition-Aware Expert Pruning for Sparse Mixture-of-Experts LLMs

SHAPE is a pruning method for sparse MoE models that evaluates experts via observed coalitions rather than independently. Using Shapley attribution over top-k routings, it identifies experts essential to collaborations. Tested on Qwen3-30B-A3B, GPT-OSS-20B, and DeepSeek-V2-Lite, SHAPE maintains accuracy with 20-40% expert pruning without retraining and reduces peak GPU memory.

Open source Benchmarks Infrastructure

SIG

HYP

arXiv cs.CL·Jun 10

BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts

BenSyc is the first benchmark for evaluating conversational sycophancy in Bengali social contexts. Built from 170k Reddit comments, it tests 15+ LLMs on alignment classification and response generation. Best models achieve only 61.8% Macro-F1 on binary detection, revealing difficulty distinguishing empathetic support from excessive validation.

Benchmarks Alignment AI safety

SIG

HYP

arXiv cs.LG·Jun 10

TRAPS: Therapeutic Response Analysis via Pathway-informed Stratification

Unified benchmark for cancer therapy response prediction using pathway-informed deep learning architectures (BINN, GraphPath, PATH). Evaluation on 2,622 patients from The Cancer Genome Atlas across three clinical tasks: targeted molecular therapy, radiation therapy, and 6-month survival. GraphPath achieves AUROC 0.92 on prostate targeted therapy prediction.

Benchmarks Papers Reasoning

SIG

HYP

arXiv cs.CL·Jun 10

UniSVQ: 2-bit Unified Scalar-Vector Quantization

UniSVQ introduces unified 2-bit quantization for LLMs combining scalar and vector quantization benefits through affine transforms of integer lattices. Block-wise fine-tuning strategy minimizes reconstruction error. Experiments across LLM families show outperformance vs state-of-the-art SQ methods and comparable performance to advanced VQ, with higher inference throughput.

SIG

HYP

arXiv cs.CL·Jun 10

LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

LC-QAT introduces a 2-bit weight-only vector quantization framework for LLMs using learned affine mappings over discrete vectors. Eliminates explicit codebook lookup during training via fully differentiable optimization. Outperforms state-of-the-art QAT methods using only 0.1%–10% of training data across diverse LLMs.

Fine-tuning Benchmarks Papers

SIG

HYP

arXiv cs.LG·Jun 10

Mix, Don't Pick: Why Synthetic Corpus Composition Matters for Time Series Foundation Model Pretraining

Study on synthetic corpus composition for time-series foundation model pretraining. Equal-weight mixture of 11 generator families outperforms best individual generators on Chronos-T5-Mini and Moirai-Small, reducing forecasting error gap up to 2×. Generator rankings vary across architectures.

Benchmarks Papers Fine-tuning

SIG

HYP

arXiv cs.LG·Jun 10

Time Series as Language: A Universal Tokenizer for General-Purpose Time Series Foundation Models

UniTok is a universal tokenizer converting continuous time series into discrete tokens. UniTok-FM, a foundation model pretrained via next-token prediction, supports zero-shot forecasting, generation and classification via training-free in-context inference without task-specific modifications.

Reasoning Benchmarks Papers

SIG

HYP

arXiv cs.CL·Jun 10

WebChallenger: A Reliable and Efficient Generalist Web Agent

WebChallenger is an autonomous web agent using PageMem, a structured DOM representation, to navigate efficiently without costly proprietary models. The system combines selective attention, persistent memory, and compound action workflows. Results: 56.3% on WebArena, 48.7% on VisualWebArena, 51.0% on Online-Mind2Web, 70.9% on WorkArena.

AI Agents Benchmarks Open source

SIG

HYP

arXiv cs.LG·Jun 10

From Confident Closing to Silent Failure: Characterizing False Success in LLM Agents

Study of 'false success' in LLM agents: they claim task completion when environment state contradicts it. Analysis of 9,876 tau2-bench trajectories and 1,879 AppWorld traces. LLM judges fail (max AUROC 0.65 on tau2-bench, 0.54 on AppWorld), while lightweight TF-IDF detectors reach 0.83–0.95 AUROC with 3,300x lower latency.

AI Agents Evals AI safety

SIG

HYP

Reddit r/LocalLLaMA·Jun 10

Fine-tuned Qwen2.5-7B to 96% of Claude Haiku on a domain-specific task using ~$3 of API calls and zero human labelers

Fine-tuned Qwen2.5-7B reaches 96% of Claude Haiku performance on domain-specific decision-reasoning task using ~$3 API spend and zero human annotators. DV-DPO method: 3-voice council + adversarial cross-examination generates 1,040 training pairs. Latency 11s vs 3s (T4 4-bit). Autonomous loop in production with failure detection and auto red-teaming.

Qwen Fine-tuning Reinforcement learning

SIG

HYP

arXiv cs.AI·Jun 9

Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy

AI-MASLD, a stress-audit framework, evaluates 7 medical LLMs on 240 clinical cases with narrative perturbations. All perform well at baseline but diverge under realistic stress. Quantized models hide functional collapse; medical fine-tuning degrades logical stability and fairness. An open-weight model matches or exceeds proprietary alternatives on all safety dimensions.

Benchmarks AI safety Evals

SIG

HYP

arXiv cs.AI·Jun 9

When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

PPV (Propagational Proxy Voting) outperforms majority voting on MMLU-Pro (+1.5 pp, +2.24 pp on non-trivial subset, p~1.0e-14). This unsupervised aggregator leverages letter entropy and reasoning geometry to weight 128 sampled generations partitioned into 16 groups, requiring no gold labels or auxiliary training.

Benchmarks Reasoning Evals

SIG

HYP

arXiv cs.CL·Jun 9

ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning

ThinkBooster is a unified framework for test-time compute (TTC) scaling of LLM reasoning. It includes a modular Python library, a benchmark evaluating performance and computational efficiency, and an OpenAI-compatible proxy service. Results on mathematical and coding tasks demonstrate performance-compute trade-offs of TTC strategies.

Reasoning Benchmarks Code generation

SIG

HYP

arXiv cs.AI·Jun 9

Overcoming the Regulatory Bottleneck via Agent-to-Agent Protocols: A Nuclear Case Study

An agent-to-agent communication protocol (RCP) automates exchanges between regulators and applicants in advanced nuclear reactor review. Tested on 1,236 NRC documents, it reduces costs by 50-77% (21-44M USD vs 89M USD) and timelines by 65% (15 months vs 42 months). Applicable to other regulated sectors, potential savings reach 210-330 billion USD/year.

AI Agents Multi-agent MCP

SIG

HYP

arXiv cs.AI·Jun 9

Land cover and flood type govern the detection limits of satellite-based flood mapping across diverse global flood events

Prithvi-EO-2.0, a geospatial foundation model, tested on 19 flood events (2017-2025) across 6 continents. Accuracy varies by land cover: cropland 52% IoU, tree cover 4%. Riverine detection strong (F1=0.69). 23 failure modes identified; pipeline engineering dominates initial errors over model capacity.

Vision Benchmarks Papers

SIG

HYP

arXiv cs.LG·Jun 9

Enabling KV Caching of Shared Prefix for Diffusion Language Models

Diffusion language models (DLMs) use bidirectional attention, invalidating standard KV caching techniques for shared prefixes. Researchers propose bicache, a method that dynamically identifies safe layer depth for reusing shared prefix KVs. Result: 36–98% throughput improvement without accuracy collapse.

Reasoning Benchmarks Infrastructure

SIG

HYP

arXiv cs.LG·Jun 9

UNIQ: Conformal Calibration for Adaptive Conservatism in Offline Reinforcement Learning

UNIQ introduces conformal calibration for adaptive conservatism in offline reinforcement learning. Built on IQL, the method uses a multi-expectile ensemble and split conformal prediction for distribution-free uncertainty estimation, dynamically adjusting penalties based on local data coverage. On D4RL MuJoCo, UNIQ outperforms IQL with 10× lower memory than EDAC.

Reinforcement learning Papers Benchmarks

SIG

HYP

arXiv cs.LG·Jun 9

Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them

Pre-training data mixture experiments fail to scale because repetition rates of high-quality data shift with training budget. A subsampling procedure matching target repetition rates recovers optimal mixtures using only 1/16 of target tokens (757M model), reducing error from 0.75 to 0.05 compared to uncontrolled baselines.

Benchmarks Papers

SIG

HYP

arXiv cs.LG·Jun 9

The Routing Plateau: Understanding and Breaking the Accuracy Limits of LLM Routers

Study of 21 LLM routing methods across 5 benchmarks reveals a routing plateau: most converge to similar accuracy far below oracle performance. The bottleneck is predictability—routers learn global averaged trends rather than query-specific signals. Larger training datasets, stronger encoders, and end-to-end fine-tuning improve routing accuracy.

AI Agents Benchmarks Evals

SIG

HYP

arXiv cs.LG·Jun 9

Reachability and asymptotics of Gaussian Transformer dynamics

Theoretical study formalizes data propagation through Transformers as a nonlinear control system. For mean-field Transformer with self-attention and affine feed-forward layers, Gaussian distributions remain exactly Gaussian. This reduces dynamics to finite-dimensional bilinear control system governing mean and covariance evolution, connecting Transformer expressivity to Riccati-type equations.

Papers Reasoning Benchmarks

SIG

HYP

arXiv cs.AI·Jun 9

A Multi-modal Agentic Co-pilot for Evidence Grounded Computational Pathology

PathPocket is a multimodal AI agentic co-pilot for evidence-grounded pathology. The system integrates a corpus of 110,472 medical documents and a hypergraph of 4.55 million entities to provide traceable diagnostics. Evaluated on 200,000 real-world cases, it outperforms existing approaches and improves pathologist diagnostic accuracy.

AI Agents Multi-agent Vision

SIG

HYP

arXiv cs.AI·Jun 9

PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents

PACE is an anytime-valid acceptance method for self-evolving agents. It replaces the naive "keep if score increases" rule with sequential hypothesis testing via betting e-process. On Qwen2.5 (0.5B-3B) on GSM8K/SVAMP/ARC-Challenge, PACE reduces false commits from 30-42% to near-zero and cuts evaluation costs by 18%.

AI Agents Prompt engineering Evals

SIG

HYP