
TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding

TAPS introduces a target-aware prefix selection method for diffusion-drafted speculative decoding. By converting diffusion marginals into path-conditioned acceptance estimates, TAPS selects a compact prefix-closed subtree under a fixed verification budget. Results: 7.9x lossless speedup vs vanilla autoregressive decoding, and 1.36x and 1.74x over DFlash and DDTree, respectively.

Code generation Reasoning Benchmarks
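
A rough sketch of budgeted, prefix-closed subtree selection as described in the TAPS summary. It assumes each draft depth supplies candidate tokens with a per-step acceptance estimate and that a path's estimate is the product of its steps. The name `select_subtree`, the input layout, and the product rule are illustrative assumptions, not the TAPS estimator.

```python
import heapq
from typing import Dict, List, Tuple

Path = Tuple[str, ...]


def select_subtree(step_probs: List[Dict[str, float]], budget: int) -> List[Path]:
    """Pick up to `budget` token paths with the highest estimated acceptance.

    step_probs[d] maps candidate tokens at depth d to an assumed per-step
    acceptance estimate in [0, 1]. A child is only pushed after its parent
    has been selected, so the result is prefix-closed by construction.
    """
    selected: List[Path] = []
    if not step_probs:
        return selected
    # Max-heap via negated scores, seeded with the depth-0 candidates.
    heap = [(-p, (tok,)) for tok, p in step_probs[0].items()]
    heapq.heapify(heap)
    while heap and len(selected) < budget:
        neg_score, path = heapq.heappop(heap)
        selected.append(path)
        depth = len(path)
        if depth < len(step_probs):
            for tok, p in step_probs[depth].items():
                # Path score is the product of per-step estimates (assumption).
                heapq.heappush(heap, (neg_score * p, path + (tok,)))
    return selected


demo_probs = [{"a": 0.9, "b": 0.4}, {"c": 0.8, "d": 0.1}]
demo_tree = select_subtree(demo_probs, budget=3)
```

Because a product of values in [0, 1] never increases with depth, best-first expansion visits paths in non-increasing score order, which is what makes a simple heap sufficient here.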


arXiv cs.AI·Jun 2

SDR: Set-Distance Rewards for Radiology Report Generation

New set-distance reward method for reinforcement learning of vision-language models on chest X-ray report generation. Tested on Qwen3-VL and Gemma3 with GRPO: improvements of 6.80% (BERTScore), 7.82% (RadGraph F1), and 4.45% (CheXbert F1) over supervised fine-tuning. Enables test-time best-of-N selection and mid-generation pruning that reduces tokens by 50%.

Reinforcement learning Vision Code generation


arXiv cs.AI·Jun 2

The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary

Decoder-only models hit an information-theoretic limit in deterministic state-tracking tasks beyond ~25 steps. An Attention Bottleneck Theorem bounds capacity to O(H·log(L/H)·√dh). Across 12 models and 8 domains (SWE-Bench, WebArena, SQL), tool delegation achieves 86-94% vs 24-42% for pure neural reasoning. Fine-tuning yields <5% improvement, confirming an architectural ceiling.

Reasoning AI Agents Benchmarks


arXiv cs.CL·Jun 2

RealityTest: How People Probe AI Identity and Whether Models Disclose It

RealityTest evaluates whether AI systems disclose their identity when asked. Multimodal, multilingual benchmark based on 3,152 identity-probing queries from ~750 participants across 49 countries, 5 languages (text and speech). Findings: only 31% ask directly; a single suppression instruction reduces disclosure below 30% even in best-performing models.

AI safety Evals Benchmarks


arXiv cs.CL·Jun 2

DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models

DLLM-JEPA pairs JEPA with masked-diffusion language models for self-supervised representation learning. Eliminates need for explicit multi-view data and reduces training FLOPs by 33% vs LLM-JEPA. Achieves +18.7pp improvement on GSM8K (LLaDA-8B) and +11.4pp (Dream-7B) while preserving base model capabilities.

Papers Fine-tuning Reasoning


arXiv cs.CL·Jun 2

Model-Based Quality Assessment for Massively Multilingual Parallel Data

Study of automatic quality assessment for massively multilingual bitext, decomposed into parallelism evaluation via multilingual embeddings and reference-free quality estimation. Benchmarks 4 embedding models and 9 evaluators on FLORES-200, covering 6,654 language-pair directions. Key finding: no single model is universally reliable; direction-aware routing and calibration are required.

Benchmarks Embeddings Evals


arXiv cs.CL·Jun 2

SENSE: Semantic Embedding Navigation with Soft-gated Evaluation for Retrieval-based Speculative Decoding

SENSE improves retrieval-based speculative decoding by anchoring retrieval on target model hidden states for robust semantic alignment. A soft-gated evaluation module validates semantic equivalence rather than surface forms. On LLaMA and Qwen, SENSE achieves 4.09 mean acceptance length and 3.26x speedup.

Llama Qwen Reasoning


arXiv cs.CL·Jun 2

ProactiveLLM: Learning Active Interaction for Streaming Large Language Models

ProactiveLLM learns when to interact with streaming inputs without external signals. Through monotonic random masking and synchronized privileged self-distillation, the model perceives semantic sufficiency from partial inputs. Reduces interaction latency while maintaining quality on text and speech streaming tasks.

Reasoning Reinforcement learning


arXiv cs.CL·Jun 2

Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance

TOPD (Trajectory-aware On-Policy Distillation) improves LLM reasoning by using near-future trajectory information to identify truly divergent states. On AIME24/25, TOPD reaches 63.3%/53.3% vs 60.0%/46.7% for standard OPD. The analysis also finds that 30% of high-loss tokens are false positives.

Reasoning Reinforcement learning Papers


arXiv cs.CL·Jun 2

ProtStructQA: A Denotation Threshold in Protein Structural Reasoning

ProtStructQA is an executable benchmark for protein structural question answering. 382.2K questions generated from a hidden domain-specific language, evaluated on Qwen3 (0.6B–8B) and Gemma-3. Key finding: a capability threshold between Qwen3-1.7B and 4B, where models transition from failing to produce executable denotations to mastering chain-of-thought reasoning.

Benchmarks Reasoning Qwen


arXiv cs.CL·Jun 2

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

arXiv study on LLM adaptation limits for annotation tasks. Toxicity detection experiments across diverse datasets show roughly two-thirds of zero-shot errors resist correction via prompting (rescue rate 34.8%). Models follow misaligned definitions while maintaining confidence. Definition-Specific Familiarity (DSF) metric correlates with performance (r=+0.41), outperforming memorization metrics.

Prompt engineering Evals Benchmarks


arXiv cs.AI·Jun 2

TIGER: Traceable Inference with Graph-Based Evidence Routing for Mitigating Hallucinations in Multimodal Generation

TIGER is an inference-time framework to mitigate hallucinations in multimodal generation. It independently extracts an observation graph from input and a claim graph from output, then assigns risk scores to claims based on support and conflict. The model repairs high-risk claims while keeping the backbone frozen. Convergence analysis shows geometric risk reduction to an explicit asymptotic bound.

Reasoning Vision Papers


arXiv cs.AI·Jun 2

A Multi-AI-agent Framework Enabling End-to-end Finite Element Analysis for Solid Mechanics Problems

AbaqusAgent is a multi-agent LLM-based framework for finite element analysis (FEA) in solid mechanics. Composed of six agents (interpreter, architect, input writer, runner, reviewer, visualizer), it converts natural-language instructions into executed analyses in Abaqus. Validated on 50 problems with an 86% success rate.

AI Agents Multi-agent Code generation


arXiv cs.AI·Jun 2

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

Interactive reasoning evaluation benchmark with 474 executable games. LLMs receive only task rules, must query a hidden environment, integrate partial observations, and decide when to submit answers. Evaluates contextual robustness, metacognitive adaptation, and interaction efficiency across frontier models.

Reasoning Evals Benchmarks


arXiv cs.CL·Jun 2

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

Study of agreement metrics for LLM-as-Judge evaluation. Analysis of 24 recent papers shows that for binary criteria (MET/UNMET), Pearson r, Spearman ρ, Kendall τ_b, and phi are redundant. Cohen's κ alone adds information. Authors propose a reporting checklist including judgment scale, abstention handling, and confusion matrix.

Evals Benchmarks Papers
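
A small check of the redundancy claim in the agreement-metrics summary, on made-up binary judgments (1 = MET, 0 = UNMET). The helper names and the toy data are illustrative assumptions; the formulas are the standard definitions. For binary labels, Pearson r, Spearman ρ, Kendall τ_b, and phi coincide, while Cohen's κ can differ when the two raters' marginals differ.

```python
from math import sqrt
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def average_ranks(v: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    return pearson(average_ranks(a), average_ranks(b))


def kendall_tau_b(a: Sequence[int], b: Sequence[int]) -> float:
    conc = disc = ties_a = ties_b = 0
    n = len(a)
    for i in range(n):
        for j in range(i + 1, n):
            da, db = a[i] - a[j], b[i] - b[j]
            ties_a += da == 0
            ties_b += db == 0
            if da * db > 0:
                conc += 1
            elif da * db < 0:
                disc += 1
    n0 = n * (n - 1) // 2
    return (conc - disc) / sqrt((n0 - ties_a) * (n0 - ties_b))


def table(a: Sequence[int], b: Sequence[int]):
    """2x2 counts (n11, n10, n01, n00) for binary labels."""
    pairs = list(zip(a, b))
    return tuple(pairs.count(p) for p in [(1, 1), (1, 0), (0, 1), (0, 0)])


def phi(a: Sequence[int], b: Sequence[int]) -> float:
    n11, n10, n01, n00 = table(a, b)
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return (n11 * n00 - n10 * n01) / sqrt(denom)


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    n11, n10, n01, n00 = table(a, b)
    n = n11 + n10 + n01 + n00
    p_obs = (n11 + n00) / n
    p_exp = ((n11 + n10) * (n11 + n01) + (n00 + n01) * (n00 + n10)) / n**2
    return (p_obs - p_exp) / (1 - p_exp)


# Toy data (assumed): judge A marks MET more often than judge B.
judge_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
judge_b = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
scores = {
    "pearson": pearson(judge_a, judge_b),
    "spearman": spearman(judge_a, judge_b),
    "kendall_tau_b": kendall_tau_b(judge_a, judge_b),
    "phi": phi(judge_a, judge_b),
    "cohen_kappa": cohen_kappa(judge_a, judge_b),
}
```

On this toy data the four correlation-style values agree to floating-point precision, and κ comes out lower because it penalizes the raters' differing base rates.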


arXiv cs.CL·Jun 2

Revisiting Parameter-Based Knowledge Editing in Large Language Models: Theoretical Limits and Empirical Evidence

Theoretical and empirical study of parameter-based knowledge editing limits in LLMs. Authors show, under a dimensional collapse hypothesis, that localized modifications propagate global interference, degrading model capabilities. Retrieval-based methods consistently outperform parameter-editing approaches.

Fine-tuning Reasoning Papers


Reddit r/MachineLearning·Jun 1

Real-time multilingual ASR using rolling buffers and monolingual models [P]

Real-time multilingual ASR system that routes audio between specialized monolingual models (~100M parameters each) instead of using one large model. Detects language switches via SpeechBrain and re-transcribes with the correct model. Achieves 13% WER on inter-utterance code-switching, outperforming cloud APIs. Open-source repo released.

Voice Code generation Open source


Reddit r/LocalLLaMA·Jun 1

A lightweight, real-time multilingual ASR router that runs on local hardware

Lightweight multilingual ASR routing system for local hardware using Zipformer, Silero VAD, and SpeechBrain. Routes audio between specialized monolingual models (~100M parameters) instead of one large model. Achieves 13% WER on inter-utterance code-switching, outperforming cloud APIs. Known limitation: 41% WER on intra-utterance switching. Open-source repo available.

Voice Open source Tools


arXiv cs.CL·Jun 1

Cross-Lingual Steering for Figurative Language Generation

Activation steering study across four multilingual LLMs (5 figurative categories, 6 languages). Directions learned in one language transfer effectively to others, particularly German. Composite cross-lingual directions match or exceed native directions, providing direct evidence of reusable but target-dependent figurative signals across languages.

Reasoning Multi-agent Papers


arXiv cs.AI·Jun 1

EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

EHRBench is an automated and reliable benchmark for evaluating LLMs on clinical decision-making tasks. Built via an EHR-LLM-KB pipeline, it generates ~960k QA items covering diagnosis, treatment, and prognosis. 30+ LLMs benchmarked reveal persistent gaps toward clinical reliability.

Benchmarks Evals Reasoning


arXiv cs.AI·Jun 1

PhyDrawGen: Physically Grounded Diagram Generation from Natural Language

PhyDrawGen is a neuro-symbolic pipeline generating physics diagrams from text while respecting physical laws. An LLM extracts a typed scene graph, a deterministic solver converts it to a planar straight-line graph, and a fine-tuned Qwen-VL runs a propose-verify loop. Evaluated on 1,449 problems (mechanics, optics, electromagnetism), it outperforms GPT-5-image and Gemini.

Qwen Reasoning Vision


arXiv cs.AI·Jun 1

HypoAgent: An Agentic Framework for Interactive Abductive Hypothesis Generation over Knowledge Graphs

HypoAgent is a multi-agent framework for interactive abductive hypothesis generation over knowledge graphs. Three coordinated agents (intent recognition, hypothesis generation, root cause analysis) enable multi-turn dialogue and fine-grained diagnosis of failed hypotheses. SOTA on commonsense and biomedical KGs.

AI Agents Multi-agent Reasoning


arXiv cs.AI·Jun 1

TRINE: A Token-Aware, Runtime-Adaptive FPGA Inference Engine for Multimodal AI

TRINE is an FPGA accelerator and compiler for end-to-end multimodal inference (ViT, CNN, GNN, transformers) without reconfiguration. It unifies layers as matrix operations, switches between systolic and SIMD architectures at runtime, and applies in-stream token pruning. On Alveo U50 and ZCU104, it achieves 22.57x latency reduction vs RTX 4090 while consuming 20-21 W.

Vision Code generation Infrastructure


arXiv cs.LG·Jun 1

AMNESIA: A Large Scale Medical Unlearning Benchmark Suite with Disease-Informed Analysis

AMNESIA is the first large-scale open-source benchmark for machine unlearning in medical LLMs. It contains 70,560 question-answer pairs from 8,820 patient notes across 11 disease categories. The authors evaluate 4 unlearning methods and show that forgetting individual patients erodes knowledge of others with the same condition.

Benchmarks Papers AI safety


arXiv cs.CL·Jun 1

CanLegalRAGBench: Evaluating Retrieval-Augmented Generation on Canadian Case Law

CanLegalRAGBench is an evaluation benchmark for RAG systems applied to Canadian law, based on realistic queries and expert-annotated answers. The study shows open-source embedding models are competitive with closed-source alternatives, but finds that 8-29% of generated answers contain hallucinations unsupported by the retrieved documents.

RAG Embeddings Evals


arXiv cs.LG·Jun 1

DisasterLex: An Expert Concept-to-Schema Knowledge Graph for Geospatial Reasoning in Disaster Analytics

DisasterLex is a knowledge-graph-mediated text-to-SQL framework for querying geospatial disaster-analytics databases. It uses an Expert Knowledge Graph (107 concepts, 117 causal edges) to route natural-language queries across 36 heterogeneous tables. On 75 test queries, it outperforms 4 baselines (LightRAG, HippoRAG 2, ReFoRCE, CHESS) by 1.4x to 2.75x.

RAG Reasoning Benchmarks


arXiv cs.LG·Jun 1

The Long-Term Effects of Data Selection in LLM Fine-Tuning

Study on long-term effects of data selection during multi-stage LLM fine-tuning. Authors show that short-term optimal strategies (loss-based, gradient-based, diversity-based) can slow future learning and increase catastrophic forgetting. They propose LHAS (Long-Horizon Aware Selection) to evaluate selection as a global training intervention.

Fine-tuning Benchmarks Papers


arXiv cs.LG·Jun 1

SubsurfaceGen: Procedural Generation of Field-Scale Earth Models and Seismic Data

SubsurfaceGen is a GPU-accelerated generator for 3D velocity models and seismic data at field scale. Authors release a dataset of 4,276 2D slices covering 6 geological settings (10 km × 10 km × 6.19 km at 10 m resolution). Evaluation of neural operators on wavefield prediction and end-to-end velocity inversion with out-of-distribution testing.

Benchmarks Papers Open source


arXiv cs.CL·Jun 1

SAGE: A Novelty Gate for Efficient Memory Evolution in Agentic LLMs

SAGE is an adaptive gate using von Mises-Fisher density estimation to control memory evolution in agentic LLMs. It classifies candidate facts as ADD (novel), NOOP (redundant), or MERGE (uncertain), reducing expensive LLM calls. On LoCoMo, SAGE cuts API cost by 3.4× and latency by 2.5× with GPT-4o-mini.

AI Agents Reasoning Benchmarks
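
A minimal sketch of a three-way novelty gate in the spirit of the SAGE summary. It swaps in plain cosine-similarity thresholds for the von Mises-Fisher density estimate; for a single vMF component on unit vectors the log-density is an increasing function of cosine similarity to the mean direction, so the two orderings agree in that simple case. The function names and the `low`/`high` thresholds are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def gate(
    candidate: Sequence[float],
    memory: List[Sequence[float]],
    low: float = 0.3,   # hypothetical threshold: below this, treat as novel
    high: float = 0.9,  # hypothetical threshold: above this, treat as redundant
) -> str:
    """Return "ADD", "NOOP", or "MERGE" for a candidate fact embedding."""
    if not memory:
        return "ADD"  # nothing stored yet, so everything is novel
    best = max(cosine(candidate, m) for m in memory)
    if best >= high:
        return "NOOP"  # near-duplicate of a stored fact
    if best <= low:
        return "ADD"  # far from everything stored
    return "MERGE"  # uncertain band: the only case that needs an LLM call


memory_bank = [[1.0, 0.0], [0.0, 1.0]]
decisions = [
    gate([1.0, 0.01], memory_bank),
    gate([-1.0, -1.0], memory_bank),
    gate([1.0, 1.0], memory_bank),
]
```

The cost saving in such a design comes from the first two branches, which resolve without any model call.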


arXiv cs.LG·Jun 1

Scalable Constrained Multi-Agent Reinforcement Learning via State Augmentation and Consensus for Separable Dynamics

Distributed approach for constrained multi-agent reinforcement learning combining state-augmented policy learning with consensus over Lagrange multipliers. Agents learn offline policies and coordinate via local communication. Linear scalability to thousands of agents, demonstrated on smart grid demand response.

Multi-agent Reinforcement learning Papers


arXiv cs.CL·Jun 1

Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

Evaluation of semantic stability in 16 LLMs (general-purpose and medical) under clinically equivalent prompt reformulations. Proposes NLI-based verification framework and three sensitivity metrics (MVS, ΔC, WCI). Finding: domain specialization does not consistently improve robustness to meaning-preserving variations.

Evals AI safety Reasoning


arXiv cs.LG·Jun 1

Discovering a Zeta Map Algorithm on Dyck Paths via Mechanistic Interpretability

Researchers train a small encoder-decoder transformer on the zeta map, a classical bijection in q,t-Catalan combinatorics. Mechanistic interpretability tools (cross-attention analysis, linear probing, causal intervention) reveal a level-based mechanism. They translate it into an explicit peak-centered traversal algorithm (the scaffolding map), which is proven equivalent to the zeta map.

Reasoning Papers


arXiv cs.CL·Jun 1

Probing the Prompt KV Cache: Where It Becomes Dispensable

Study of prompt KV cache redundancy during decoding. Researchers show the upper-layer prompt cache can be replaced with chat template scaffolds without significant accuracy loss, indicating that the redundancy is structural rather than semantic. Results validated across Qwen3, Gemma 3, and Llama 3 families.

Reasoning Benchmarks Papers


arXiv cs.LG·Jun 1

MAAT: Multi-phase Adapter-Aware Targeted Unlearning

5WBENCH, a balanced 5,000-sample benchmark across 5W categories, reveals unlearning methods fail on causal (Why) questions. MAAT, a three-phase framework operating on LoRA weights, combines gradient-projected ascent, SVD rank pruning, and KL-hidden-state repair to simultaneously achieve high forgetting and retention on causal knowledge.

Fine-tuning AI safety Alignment


arXiv cs.AI·Jun 1

GraphARC: A Comprehensive Benchmark for Graph-Based Abstract Reasoning

GraphARC is an AI benchmark for abstract reasoning on graph-structured data, generalizing the ARC paradigm to graph transformations. Current language models fail on full graph transformation tasks despite understanding graph properties, revealing a comprehension-execution gap.

Benchmarks Reasoning Papers


arXiv cs.CL·Jun 1

Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents

MASA (Model-Aware Skill Alignment) adapts procedural skills for LLM agents to each model backbone without weight modification. A hierarchical evolution pipeline rewrites skills via hill climbing and UCB-driven tree search, then a lightweight rewriter trained on trajectories reproduces adaptation in a single forward pass. Gains up to 25.8 points across three interactive environments and four backbones.

AI Agents Prompt engineering Reasoning
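
For readers unfamiliar with the "UCB-driven" part of the MASA summary, here is a generic UCB1 selection step over candidate skill rewrites. This is the textbook bandit rule, not MASA's pipeline; the `stats` layout, the function name, and the exploration constant `c` are illustrative assumptions.

```python
from math import log, sqrt
from typing import Dict, Tuple


def ucb1_pick(stats: Dict[str, Tuple[float, int]], c: float = 1.4) -> str:
    """Choose which candidate to evaluate next.

    stats maps a candidate id to (total_reward, visits). Unvisited
    candidates are tried first; otherwise the candidate with the highest
    mean reward plus exploration bonus wins.
    """
    for name, (_, visits) in stats.items():
        if visits == 0:
            return name
    total_visits = sum(visits for _, visits in stats.values())

    def score(item: Tuple[str, Tuple[float, int]]) -> float:
        _, (reward, visits) = item
        return reward / visits + c * sqrt(log(total_visits) / visits)

    return max(stats.items(), key=score)[0]


# Hypothetical tallies: "a" looks good and is well explored, "b" is barely tried.
rewrite_stats = {"a": (9.0, 10), "b": (0.5, 1)}
explore_choice = ucb1_pick(rewrite_stats)          # bonus favors the rarely tried one
exploit_choice = ucb1_pick(rewrite_stats, c=0.0)   # pure mean reward
```

Setting `c` to zero reduces the rule to greedy hill climbing on mean reward, which is one way to see how the two search modes mentioned in the summary relate.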


arXiv cs.CL·Jun 1

ElasticMem: Latent Memory as a Learnable Resource for LLM Agents

ElasticMem introduces a learnable latent memory framework for LLM agents with adaptive retrieval and elastic budget allocation via a learned policy. On Qwen2.5-3B and 7B backbones, it achieves 26.2% and 24.6% QA accuracy gains and 66.3% and 27.2% ALFWorld success improvements, with the lowest token cost.

AI Agents Reasoning Reinforcement learning


arXiv cs.AI·Jun 1

TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories

TraceGraph is a graph-based framework that transforms multi-model agent trajectories into shared decision landscapes. It builds graphs over state-action-observation spaces, identifies productive cores and trap regions, then proposes a trap-aware recovery pipeline. On SWE-bench, this approach improves resolution rate from 40.4% to 43.5%.

AI Agents Benchmarks Evals


arXiv cs.LG·Jun 1

Measuring, Localizing, and Ablating Alignment Signatures in LLMs

Study of stylistic signatures introduced by LLM alignment. Researchers show post-training creates a detectable AI-like style. They propose PASTA, a training-free method that localizes and ablates this signature during decoding, reducing detection rates across 11 aligned models and 6 AI detectors.

Alignment Evals AI safety


arXiv cs.CL·Jun 1

Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages

Comparative study of generic vs domain-specific embeddings for multilingual clinical search (ICD-10-CM). A bi-encoder fine-tuned on Gemini-generated synthetic data (6 languages) outperforms BioBERT-ST: R@5=0.822 vs 0.790, with major gains in Portuguese (+0.115). Open recipe for LLM-based medical retrievers.

Embeddings RAG Benchmarks
