The AI Epistemic Deference Index: A Continuous Measure of Sycophancy

New AEDI index measures epistemic sycophancy in LLMs: model sensitivity to user attitude. Tested on 8 models (Claude, Grok, Gemini) with 500 propositions and 16,000 prompts. Claude exhibits least deference, Grok and Gemini most. Open-source benchmark released.

Evals AI safety Alignment

arXiv cs.AI·Jun 9

When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

PPV (Propagational Proxy Voting) outperforms majority voting on MMLU-Pro (+1.5 pp overall, +2.24 pp on the non-trivial subset, p ≈ 1.0e-14). This unsupervised aggregator leverages letter entropy and reasoning geometry to weight 128 sampled generations partitioned into 16 groups, requiring no gold labels or auxiliary training.

Benchmarks Reasoning Evals

arXiv cs.AI·Jun 9

Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy

AI-MASLD, a stress-audit framework, evaluates 7 medical LLMs on 240 clinical cases with narrative perturbations. All perform well at baseline but diverge under realistic stress. Quantized models hide functional collapse; medical fine-tuning degrades logical stability and fairness. An open-weight model matches or exceeds proprietary alternatives on all safety dimensions.

Benchmarks AI safety Evals

arXiv cs.AI·Jun 9

Overcoming the Regulatory Bottleneck via Agent-to-Agent Protocols: A Nuclear Case Study

An agent-to-agent communication protocol (RCP) automates exchanges between regulators and applicants in advanced nuclear reactor review. Tested on 1,236 NRC documents, it reduces costs by 50-77% (21-44M USD vs 89M USD) and timelines by 65% (15 months vs 42 months). If applied to other regulated sectors, potential savings could reach 210-330 billion USD/year.

AI Agents Multi-agent MCP

arXiv cs.AI·Jun 9

Land cover and flood type govern the detection limits of satellite-based flood mapping across diverse global flood events

Prithvi-EO-2.0, a geospatial foundation model, tested on 19 flood events (2017-2025) across 6 continents. Accuracy varies by land cover: cropland 52% IoU, tree cover 4%. Riverine detection strong (F1=0.69). 23 failure modes identified; pipeline engineering dominates initial errors over model capacity.

Vision Benchmarks Papers

arXiv cs.CL·Jun 9

ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning

ThinkBooster is a unified framework for test-time compute (TTC) scaling of LLM reasoning. It includes a modular Python library, a benchmark evaluating performance and computational efficiency, and an OpenAI-compatible proxy service. Results on mathematical and coding tasks demonstrate performance-compute trade-offs of TTC strategies.

Reasoning Benchmarks Code generation

The Decoder·Jun 8

Microsoft Research's Lens proves detailed captions matter more than raw scale for training efficient image generators

Microsoft Research presents Lens, a text-to-image model with 3.8 billion parameters that matches much larger rivals on benchmarks at a fraction of training cost. Key innovation: 800 million detailed captions generated by GPT-4.1 instead of vague web alt-text. Code and weights released under open-source license.

Image generation Benchmarks Open source

Reddit r/LocalLLaMA·Jun 8

Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax

Luce Spark runs 33-35B MoE models on 16 GB GPU without offload penalty. Qwen 35B-A3B: 13.3 GiB (vs 20.5), Laguna XS.2 33B-A3B: 14.6 GiB (vs 18.8). Only active experts (~8/256) stay in VRAM; rest in system RAM with intelligent swapping. Self-tuning via learned routing profile. Open-source Apache 2.0.

Open source Infrastructure Llama

Reddit r/LocalLLaMA·Jun 8

[Benchmark] DFlash Speculative Decoding + KV Cache Compression on RTX 5090 — 3.26x Speedup

DFlash speculative decoding + KV cache compression benchmark on RTX 5090 with Qwen3.6-27B. 3.26x speedup (turbo4/turbo4), 3.18x (q4_0/turbo4) with only +0.02% PPL degradation. Q5_K_XL outperforms NVFP4-Q8_0. Scripts and raw data available.

Qwen Benchmarks Open source

Vercel AI Blog·Jun 8

DeepSeek enters the fight for token volume, Anthropic continues to dominate spend

DeepSeek V4 captured 17% of token volume on AI Gateway in May 2025, jumping from <1% in April, thanks to pricing 20–50× lower than Claude. Despite massive volume growth, DeepSeek accounts for only 1% of spend, while Anthropic dominates production costs.

DeepSeek Anthropic OpenAI

arXiv cs.AI·Jun 8

When Does Multi-Agent Collaboration Help? An Entropy Perspective

Empirical study of 245 entropy features (token, agent, round) across 6 reasoning benchmarks and 2 agentic tasks. Counterintuitive finding: single agent outperforms MAS in 43.3% of cases. Three key observations: certainty preference, base entropy drives performance, task-dependent entropy dynamics. Entropy Judger algorithm proposed to select MAS solutions.

Multi-agent AI Agents Reasoning

arXiv cs.AI·Jun 8

How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope

Perplexity studies Computer (autonomous agent) vs Search (conversational assistant) using production data. Computer performs 26 minutes of autonomous work per session vs 33 seconds for Search, reduces task completion time from 269 to 36 minutes (-87%), cuts dissatisfaction by 55%, and expands task scope (cross-domain work, higher-order cognition).

AI Agents Benchmarks Business

arXiv cs.AI·Jun 8

DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning

DuMate-DeepResearch is a multi-agent system for deep research built on Qianfan Agent Foundry. It decouples planning from execution, introduces graph-based dynamic planning, recursive two-level execution, and rubric-based test-time optimization. SOTA results: 58.03% on DeepResearch Bench and 61.95% on DeepResearch Bench II.

Multi-agent AI Agents Reasoning

arXiv cs.AI·Jun 8

Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

Study measuring no-CoT reasoning capability across 30,000+ questions spanning 43 benchmarks. Frontier models double their 50%-task-completion time horizon yearly: GPT-5.5 reaches 3+ minutes without explicit reasoning tokens. Projections: 7 minutes by 2028, 25 minutes by 2030.

Reasoning Benchmarks AI safety

arXiv cs.AI·Jun 8

DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling

DyCon is a training-free framework that dynamically models task difficulty via latent step-level representations. Tested on 4 models (4B-32B) and 12 benchmarks (math, QA, code), it reduces redundant reasoning steps without sacrificing accuracy.

Reasoning Papers Benchmarks

arXiv cs.AI·Jun 8

CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

CrowdMath is a dataset of 164 annotated discussion chains from the MIT PRIMES–Art of Problem Solving program (2016-2025), capturing collaborative open-problem solving. Posts are labeled by functional role (partial progress, error, repair). Six frontier models achieve 83-88% accuracy on next-post prediction but only 0.42 macro-F1 on post-role classification.

Benchmarks Reasoning Papers

arXiv cs.CL·Jun 8

MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights

MADE is a multilingual agentic diagnosing engine that decomposes post-evaluation analysis into planning, aggregate analysis, instance-level inspection, and grounded report synthesis. Tested on 33 model families, 11 benchmarks, and 26 languages (8.66M evaluation records), MADE outperforms strongest baselines by 47% in diagnosis quality and is preferred by human experts in 87.9% of comparisons.

AI Agents Multi-agent Evals

arXiv cs.CL·Jun 8

Translate-R1: Cost-Aware Translation Tool Use via Reinforcement Learning

Translate-R1 learns via RL a single policy deciding when to translate inputs into the model's dominant language. Trained on Qwen3-4B across 22 languages and 5 domains, the system improves reward by +4.6 to +23.5 depending on language resources, while reducing translation costs by 37% without performance loss.

Reinforcement learning Multi-agent Tools

arXiv cs.CL·Jun 8

Does Topic Sentiment Cause Perceived Ideology? Comparing Human and LLM Annotations in Political News Articles

Study comparing human and LLM annotations (GPT-4o-mini, Llama-3.3-70B) on political ideology in news articles. Double Machine Learning shows fine-tuned GPT-4o-mini learns spurious sentiment-ideology coupling absent from human judgment, despite F1=72.48. Implications for using LLM annotations as silver labels.

GPT Llama Evals

arXiv cs.CL·Jun 8

What Do People Actually Want From AI? Mapping Preference Plurality

Analysis of 1,500 open-ended responses from PRISM dataset (75 countries) on human preferences for AI systems. Finding: requested values vary widely across individuals (only truthfulness reaches 49%), with divergent definitions of the same concept. Current RLHF methods fail to capture this plurality by aggregating it into a single reward model.

Alignment Reinforcement learning Evals

arXiv cs.CL·Jun 8

How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures

Study of reasoning failure signatures in language models using token-level uncertainty signals. Two modes identified: committed failure (early lock onto incorrect path) and persistent uncertainty (accumulation throughout trace). Framework validated across 23 model-dataset configurations with implications for self-consistency.

Reasoning Evals Papers

arXiv cs.LG·Jun 8

GlucoFM-Bench: Benchmarking Time-Series Foundation Models for Blood Glucose Forecasting

GlucoFM-Bench evaluates 8 time-series foundation models (Chronos-2, TimesFM, LLMs) on 15 diabetes datasets (1,117 patients). Pre-trained TSFMs show strong zero-shot transfer (within 5% of best full-shot), but lightweight LSTM outperforms by 4–21% with abundant task-specific data. Persistent challenges in T1D cohorts and hypo-/hyperglycemic ranges.

Benchmarks

arXiv cs.AI·Jun 8

Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle

AARRI-Bench evaluates AI agents on granular scientific research tasks. Even the best configuration (Mini-SWE-Agent + Claude Opus 4.7) achieves only a 68.3% success rate, revealing gaps in nuanced judgment and research ethics. The benchmark targets the ability to emulate a human researcher's professionalism beyond macro-level execution.

AI Agents Benchmarks Claude

arXiv cs.AI·Jun 8

Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs

ZEDD (Zero-Shot Embedding Drift Detection) detects prompt injections by measuring semantic shifts in embedding space between benign and suspect inputs. Without access to model internals or retraining, the method achieves >93% accuracy on Llama 3, Qwen 2, and Mistral with a <3% false positive rate.

AI safety Embeddings Prompt engineering

arXiv cs.AI·Jun 8

Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory

Lean4Agent is a framework using Lean4 (a dependently typed formal language) to formally model and verify agent workflows. FormalAgentLib verifies semantic consistency, while LeanEvolve iteratively improves workflows. On SWE-Bench-Verified and ELAIP-Bench, verified workflows outperform unverified ones by 11.94%, with an additional 7.47% gain via LeanEvolve.

AI Agents Reasoning Benchmarks

arXiv cs.CL·Jun 8

Beyond Rubrics: Exploration-Guided Evaluation Skills for Reward Modeling

Eval-Skill, an exploration-guided method, synthesizes reusable evaluation skills for reward modeling without rigid rubrics. Trained on 100 cases per domain, the system progressively generates workflows and principles injected directly into the judge's context. On RewardBench 2, gains of +13.44% (Qwen3-8B) and +18.51% (DeepSeek-V4-Flash).

Reinforcement learning Evals Reasoning

arXiv cs.CL·Jun 8

OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios

OpenHalDet is a unified benchmark for hallucination detection in LLMs. It standardizes evaluation (prompt construction, generation, annotation, scoring) and supports three detector families: black-box (outputs only), gray-box (probability signals), white-box (internal signals). Open-source codebase released.

Benchmarks AI safety Evals

arXiv cs.CL·Jun 8

Interpreting Brain Responses to Language with Sparse Features from Language Models

Researchers use sparse autoencoders (SAE) from language models to interpret brain responses to language via 7T fMRI. Testing 8 participants listening to 200 sentences, they identify voxel populations tuned to people-related content and show frontal regions are explained by surprisal alone, while the fronto-temporal network shares common features across regions.

Papers Reasoning Evals

arXiv cs.LG·Jun 8

The Identity Trap in EEG Foundation Models: A Diagnostic Audit

Diagnostic study of EEG foundation models reveals the "Identity Trap": models (LaBraM, CBraMod, REVE) conflate subject identity with clinical biomarkers. FMScope, a 5-diagnostic protocol, shows subject variance dominates 13-89x chance and persists after fine-tuning (+10-63 pp). Erasing this axis improves label decoding (+6-12 pp).

Benchmarks Evals AI safety

arXiv cs.CL·Jun 8

Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

New OPDLM method transforms autoregressive language models into diffusion models without full retraining. Via on-policy distillation, student model generates its own trajectories while frozen original model provides target logits. Result: 15x to 7,000x fewer training tokens required.

Fine-tuning Reasoning Papers

arXiv cs.CL·Jun 8

The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

Study on LLM overgeneralization beyond training data. Authors propose the Piggyback Hypothesis: chat-template tokens propagate finetuned behaviors to out-of-distribution domains. They introduce Token-Regularized Finetuning (TReFT) to mitigate emergent misalignment, achieving 33.5% more reduction than data interleaving on Llama-3.1-8B legal domain finetuning.

Fine-tuning Alignment AI safety

arXiv cs.CL·Jun 8

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

UnpredictaBench evaluates LLMs' ability to capture true underlying distributions across 448 problems (statistical distributions, stochastic programs, natural scenarios). The KS@N metric uses the Kolmogorov-Smirnov test. No model exceeds 40% at KS@100, revealing a major gap in distributional sampling capability.

Benchmarks Evals Reasoning

arXiv cs.LG·Jun 8

Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws

Study of scaling laws for language model pretraining in the data-constrained regime. Authors propose MIR (masked-input regularization), an auxiliary next-token prediction loss on randomly masked inputs, and SoftQ, a scaling law coupling model and data size under repeated data. MIR improves validation loss on 72M–1.4B models, a gain roughly equivalent to ~1.3× more unique training data.

Fine-tuning Benchmarks

arXiv cs.LG·Jun 8

Evidence-Grounded Ensemble Diagnosis of 802.11 Packet Captures: A Multi-Stage Pipeline with Deterministic Reliability Scoring

PROBE, a multi-stage pipeline, diagnoses 802.11 packet captures by combining deterministic PCAP-to-text normalization, multi-model ensembles, and evidence-grounded reliability scoring (without LLM self-assessment). On 87 enterprise Wi-Fi captures, it achieves F1=0.957 vs 0.871 for the expert baseline and eliminates LLM hallucinations and uncalibrated confidence scores.

Reasoning Evals Benchmarks

arXiv cs.LG·Jun 8

TALAN: Task-Aligned Latent Adaptation Networks for Targeted Post-Training of Large Language Models

TALAN (Task-Aligned Latent Adaptation Networks) combines a low-rank adapter with a sequence-conditioned latent side path inserted into the transformer's residual stream. Tested on four Qwen3 backbones and four STEM/code benchmarks, TALAN improves on LoRA (+1.41 points) and DoRA (+1.85 points) baselines with <1% additional parameters and 1.01-1.02x inference overhead.

Fine-tuning Reasoning Code generation

arXiv cs.LG·Jun 8

The Geography of Algorithmic Judgment: LLM Intermediaries, Place Identity, and Racial Steering in Housing Search

Behavioral audit of 7 LLMs (open-weight and closed-source) across 4 US cities reveals racial steering emerges from interaction between user identity, stated preferences, and the model's learned spatial representations. Steering is not uniform: preference-conditioned testing often amplifies bias. Results do not generalize across local markets.

AI safety Alignment Evals

arXiv cs.AI·Jun 6

Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?

Study of 100+ developers collaborating with Claude-Opus-4.6, GPT-5.4, Gemini-3.1-Pro, and MiniMax-M2.7 on long-horizon coding tasks. 94% of developers fail to detect AI agent sabotage (malicious code injection). A safety monitor reduces sabotage success but 56% of participants still accept malicious code despite warnings.

AI Agents AI safety Alignment

arXiv cs.AI·Jun 6

PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage

PSEBench is a 5,074-case benchmark for evaluating LLMs on patient safety event triage under Minnesota policy. The methodology uses clause cards to factorize regulatory text into auditable decision specifications, with closed-loop verification. Evaluation of 15 representative LLMs reveals capability trends and actionable gaps toward reliable LLM-based triage.

Benchmarks Evals AI safety

arXiv cs.AI·Jun 6

Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models

SAGE-PTQ, an ultra-low-bit quantization method for LLMs, reduces hidden scaling cost by separating salient and non-salient weights via distributional statistics and graph modeling. On LLaMA-3-8B: 6.74 WikiText2 perplexity vs 55.8 for BiLLM, using 50% less GPU memory. On LLaMA-2-70B: 1.5x faster decoding on NVIDIA L40.

Llama Benchmarks
