7679 articles

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

PhysAssistBench is an interactive medical assistance benchmark with 1,296 physician-validated turns built from real MIMIC-IV cases. It evaluates LLMs' ability to coordinate clinical knowledge, patient communication, and EHR system interaction within single dialogues. Experiments show current models remain unreliable in this setting.

Benchmarks AI Agents Multi-agent

arXiv cs.LG·Jun 18

Ghost Attractor Networks: Basin-Structured Dynamical Decoders for Closed-Loop Sequential Generation

Ghost Attractor Networks introduce an efficient dynamical decoder for sequential generation in robotics. With 2.3M parameters, the decoder matches the offline accuracy of a 1.07B-parameter Diffusion Transformer (462× fewer parameters, 32× lower latency). On LIBERO-10, phase conditioning improves success rate by 13.5 percentage points over an MLP baseline.

Code generation Robotics Reasoning

arXiv cs.LG·Jun 18

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

A new LLM inference scheduler that replaces explicit length prediction with lightweight statistical signals and dynamic priority boosting. It reduces P99 TTLT by 35-50% vs SRPT with perfect length knowledge, and TTFT by 34-47%, across production and open-source traces.

Benchmarks Infrastructure Reasoning

arXiv cs.CL·Jun 18

TW-LegalBench: Measuring Taiwanese Legal Understanding

TW-LegalBench evaluates 13 LLMs on Taiwanese law using 16,000+ multiple-choice questions, 117 open-ended essays, and 14,000+ legal judgment prediction cases. Top models exceed lawyer qualification threshold (11%) but fall short for judges/prosecutors (1-2%). Models struggle to cite exact legal articles.

Benchmarks Evals Reasoning

arXiv cs.CL·Jun 18

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

ScholarSum introduces a hierarchical knowledge graph framework for abstractive scientific summarization. The system organizes documents into semantically coherent units, generates an initial draft, then refines it through iterative verification and rewriting to ensure logical coherence and factual faithfulness.

Papers RAG Reasoning

arXiv cs.CL·Jun 18

VISUALSKILL: Multimodal Skills for Computer-Use Agents

VISUALSKILL introduces hierarchical multimodal skills for computer-use agents. Combining authored documentation with live UI exploration, the system improves Claude Opus 4.6 performance by +15.3 points on CUA-World and OSExpert-Eval (0.456 vs 0.303 baseline). Visual figures outperform text-only descriptions (+8.3 points).

Claude AI Agents MCP

arXiv cs.CL·Jun 18

MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

MCompassRAG improves RAG systems by using topic-level metadata as a semantic compass for paragraph-level retrieval. The method enriches chunk representations with topic signals in the same embedding space and trains a lightweight retriever via LLM-teacher distillation. Across six benchmarks, it gains 8.24% in information efficiency with 5× lower latency than efficient RAG baselines.

RAG Embeddings Benchmarks

arXiv cs.LG·Jun 18

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

PROPEL is a framework that trains task generators via RL to create optimally difficult problems for agent learning. A lightweight probe predicts solver pass rate without repeated rollouts, reducing evaluation to a single forward pass. On code and SWE tasks, learnable-frontier generation increases from 10.1% to 20% (Qwen2.5-3B) and 9.8% to 19.6% (Qwen3.5-27B).

Reinforcement learning AI Agents Code generation

arXiv cs.AI·Jun 18

X+Slides: Benchmarking Audience-Conditioned Slide Generation

X+Slides is a benchmark for evaluating audience-conditioned slide generation. Built on 113 topics and 8,133 probes, it measures four metrics: Audience Coverage, Domain-wise Coverage, Efficiency, and Correctness. Tests on DeepPresenter, SlideTailor, and NotebookLM show Audience Coverage scores between 0.594 and 0.853.

Benchmarks Code generation

arXiv cs.CL·Jun 18

Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text

A benchmark of 1,200 clinical documents with 9,184 uncertainty annotations across five levels. LLMs preserve uncertainty expressions in fewer than 50% of cases and struggle with nuanced distinctions between adjacent levels. The benchmark reveals a failure mode missed by standard metrics.

Benchmarks AI safety Evals

arXiv cs.CL·Jun 18

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

CoreMem introduces a memory architecture for personalized dialogue agents on edge devices (8 GB VRAM). Replaces cosine similarity with Fisher-Rao metric for retrieval and uses Fisher-guided token distillation for compression. Achieves +4.51 pp gains in open-domain reasoning and +4.17 pp in temporal reasoning on LOCOMO and LongMemEval-S benchmarks.

AI Agents RAG Embeddings

arXiv cs.CL·Jun 18

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

The paper improves long-context reasoning through a data-centric approach rather than reward engineering. Its data recipe (~14K examples) targets retrieval, multi-evidence synthesis, and reasoning. On Qwen3 (4B/8B/30B) it gains +7.2/+3.2/+6.4 points across 7 long-context benchmarks and transfers to agentic tasks (+4.8 GAIA, +7.0 BrowseComp).

Reinforcement learning Reasoning AI Agents

arXiv cs.LG·Jun 18

What Does the Weight Norm Control in Grokking? Logit-Scale Mediation under Cross-Entropy

Study on grokking (delayed transition from memorization to generalization). Authors show weight norm doesn't directly control grokking delay but acts through logit scale. Fixing norm and varying output temperature, they recover 85% of delay by matching logit scale. Effect is loss-dependent (cross-entropy vs MSE). Logit scale and softmax saturation are the proximal variables.

Papers Reasoning Evals

arXiv cs.LG·Jun 18

Self-CTRL: Self-Consistency Training with Reinforcement Learning

Self-CTRL optimizes consistency between language models' self-explanations and behavior via reinforcement learning. On probabilistic reasoning tasks, the method improves R² correlation from 0.24 to 0.64. In constitutional AI, it increases refusal prediction from 36% to 92% and reduces HarmBench failure rate from 15.0% to 0.5%.

Reinforcement learning Alignment AI safety

arXiv cs.CL·Jun 18

Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

DICE improves long-document retrieval by splitting documents into chunks, encoding each independently, then aggregating vectors into a single representation. On LongEmbed, gains reach 90.0 for Dream Passkey >4k (vs 30.0) and 74.0 for Needle >4k (vs 23.3). The approach reduces Evidence Dilution Index (EDI) in 92.8% of cases.
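
The split, encode, aggregate pipeline described above can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's method: `toy_encode` is a hypothetical hashed bag-of-words stand-in for a real embedding model, and mean pooling is only a placeholder aggregator, since the summary does not say how DICE combines chunk vectors.

```python
import math
import zlib

DIM = 16  # toy embedding size (assumption, chosen for illustration)


def split_into_chunks(text, chunk_words):
    """Split text into consecutive chunks of at most `chunk_words` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]


def toy_encode(chunk):
    """Hypothetical stand-in encoder: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for word in chunk.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def aggregate(chunk_vectors):
    """Placeholder aggregator: mean-pool the chunk vectors, then L2-normalize."""
    n = len(chunk_vectors)
    pooled = [sum(col) / n for col in zip(*chunk_vectors)]
    norm = math.sqrt(sum(v * v for v in pooled)) or 1.0
    return [v / norm for v in pooled]


document = "alpha beta gamma delta epsilon zeta eta theta iota kappa"
chunks = split_into_chunks(document, chunk_words=4)
# Each chunk is encoded independently before aggregation.
doc_vector = aggregate([toy_encode(c) for c in chunks])
print(len(chunks), len(doc_vector))
```

The point of the structure is that each chunk gets its own vector before anything is merged, so a short passage of evidence is not averaged away inside one encoder pass over the whole document.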

RAG Embeddings Vector search

arXiv cs.AI·Jun 18

ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection

ARIADNE is a training-free framework for dynamic adapter selection at inference time. It represents each adapter through centroids computed from embeddings of its training set. Tested on Llama 3.2 1B across 23 NLP tasks, it recovers 97.44% of upper-bound performance and achieves 89.7% average selection accuracy on 44 tasks.
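
Centroid-based adapter routing of this kind can be sketched in a few lines. This is a minimal sketch under assumptions: it uses a single mean centroid per adapter and cosine similarity, while the summary mentions "centroids" without giving their number or the similarity measure. The adapter names, the toy embeddings, and the helpers `build_centroids` and `route` are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_centroids(adapter_examples):
    """Map each adapter name to the mean embedding of its training examples."""
    return {name: mean_vector(vecs) for name, vecs in adapter_examples.items()}


def route(query_embedding, centroids):
    """Return the adapter whose centroid is most similar to the query."""
    return max(centroids, key=lambda name: cosine_similarity(query_embedding, centroids[name]))


# Toy embeddings standing in for encoder outputs (assumption).
adapter_examples = {
    "summarization": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "translation": [[0.1, 0.9, 0.0], [0.0, 0.8, 0.2]],
}
centroids = build_centroids(adapter_examples)
selected = route([0.7, 0.3, 0.0], centroids)
print(selected)
```

Nothing here is trained: centroids are computed once from existing embeddings, which is what makes this style of routing training-free.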

Fine-tuning Llama Benchmarks

arXiv cs.LG·Jun 18

SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

Sparse Autoencoders (SAEs) decompose activations into interpretable features, but this study shows that clamping a 'harmful' feature does not eliminate the behavior—it can recover via other residual pathways. Even with active intervention, 95.8% behavior recovery is achievable in refusal-steering, exposing a gap between feature-level control and behavioral completeness.

AI safety Alignment Evals

arXiv cs.LG·Jun 18

DRIFT: Refining Instruction Data via On-Policy Data Attribution

DRIFT refines the SFT training data distribution using on-policy influence functions. The method uses model rollouts as validation targets to minimize the proximity gap and correct gradient-norm bias. Experiments on 7B instruction and reasoning models show consistent performance ceiling improvements over existing curation baselines.

Fine-tuning Reinforcement learning Evals

arXiv cs.LG·Jun 18

CODEBLOCK: Learning to Supervise Code at the Right Granularity

CodeBlock is a structure-aware sparse supervision framework for code LLM fine-tuning. It selects syntactically coherent code blocks rather than isolated tokens, estimating utility via generalized cross-entropy and data-flow signals. On 6 code-generation benchmarks, CodeBlock outperforms full-token SFT while using only 1.9% of supervised response tokens.

Code generation Fine-tuning Papers

arXiv cs.LG·Jun 18

Measurement noise limits the advantage of nonlinear models over linear models in biomedical prediction

arXiv paper demonstrates that on biomedical tabular data, measurement noise limits the advantage of nonlinear models (deep networks, gradient boosting) over linear regression. Degree-k interactions are attenuated by the k-th power of feature reliability, while linear components are attenuated only once. Analysis of 140 UK Biobank tasks confirms this noise signature.
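
A quick arithmetic illustration of the attenuation claim, assuming every feature shares the same reliability `r`: a degree-k term then shrinks by `r ** k`. The value `r = 0.8` and the helper name `attenuation` are hypothetical and chosen only for illustration.

```python
def attenuation(reliability, degree):
    """Shrinkage factor for a degree-`degree` term under a shared feature reliability."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    if degree < 1:
        raise ValueError("degree must be at least 1")
    return reliability ** degree


reliability = 0.8  # hypothetical value, not taken from the paper
factors = {k: attenuation(reliability, k) for k in (1, 2, 3)}
for k, factor in factors.items():
    print(f"degree {k}: factor {factor:.3f}")
```

Under this assumption a linear term keeps 80% of its signal while a three-way interaction keeps roughly half, which is the gap that erodes the edge of nonlinear models.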

Benchmarks Evals

arXiv cs.AI·Jun 18

Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

Decoupled Search Grounding (DSG) decouples search from reasoning via an MCP-compatible gateway. On SimpleQA, FreshQA, and HotpotQA, DSG achieves 86.1% accuracy (vs 87.7% native) with 91% lower search cost and 68% lower latency. In a production e-commerce workload, DSG cuts search cost by 98% while maintaining accuracy.

AI Agents MCP RAG

arXiv cs.LG·Jun 18

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

Study shows SFT overtraining can invert model rankings during RLVR fine-tuning. On Qwen2.5-Coder-3B, increasing SFT depth raises pre-RL pass@1 but reduces GRPO pass@10 from 0.806 to 0.481. Pre-RL entropy positively correlates with RLVR outcomes (ρ=+0.69). Two-stage entropy-based diagnostic identifies high-risk checkpoints.

Reinforcement learning Fine-tuning Reasoning

Reddit r/LocalLLaMA·Jun 17

Multilingual-Multimodal-NLP/LoopCoder-V2 · Hugging Face

LoopCoder-V2 is a 7B code model based on Parallel Loop Transformer (PLT) that improves test-time performance through two passes of shared Transformer blocks. Trained on 18T tokens of mixed text/code data, it reaches 64.4 on SWE-bench Verified (vs 43.0 baseline), with two loops as the optimal gain-cost setting.

Code generation Reasoning Benchmarks

arXiv cs.CL·Jun 17

The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer

Pruned models pass multiple-choice benchmarks but fail in open generation. Multilingual study shows that under high-sparsity pruning (Wanda), correct answers are demoted rather than erased: they reappear with beam search or sampling. Multiple-choice benchmarks overstate the usability of compressed LLMs.

Benchmarks Evals Fine-tuning

arXiv cs.LG·Jun 17

CheckMIABench: Firm Foundations For Membership Inference Attacks on Language Models

CheckMIABench introduces a benchmark for principled evaluation of membership inference attacks (MIAs) on language models. Leveraging intermediate checkpoints from open-source models (Pythia, OLMo, 70M–7B parameters), the authors construct reliable testbeds where training data before and after a fixed point share the same distribution. They evaluate six published attacks and release a modular library (pandora_llm).

Papers Benchmarks AI safety

arXiv cs.CL·Jun 17

Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

Study on agent routing at scale: with 110 agents and 584 tools, F1 drops 16–23 percentage points on under-specified requests. Analysis decomposes degradation into retrieval gap and confusion gap (10pp oracle ceiling loss). Embedding-based shortlisting recovers +10–11pp F1 at full scale across models.

AI Agents Multi-agent Evals

arXiv cs.LG·Jun 17

Operator Boosting Produces Pareto-Efficient PDE Surrogates

Operator Boosting constructs compact neural-operator surrogates for PDEs via stagewise residual learning. Tested on FNO, DeepONet, and CNO across 30 benchmarks (PDEBench, APEBench), the method reduces parameters by 72–95% while improving accuracy on 21 dataset-architecture pairs and achieves Pareto gains on 7/10 PDE benchmarks.

Papers Benchmarks Code generation

arXiv cs.AI·Jun 17

MemTrace: Probing What Final Accuracy Misses in Long-Term Memory

MemTrace is a benchmark evaluating long-term memory in LLM agents across three dimensions: memory age, question type (current state, earlier state, trajectory), and evidence conditions. Testing 13 configurations, the study finds that evidence use is the primary bottleneck (10× more often retrievable than missing), not retrieval itself.

AI Agents Evals Benchmarks

arXiv cs.LG·Jun 17

When the Next Step Is Not One Step: Distribution-Aware Execution Modeling for Concurrent Go Programs

A 7B model is fine-tuned to predict the next step in concurrent Go programs by learning event distributions rather than single labels. On 798 predictions from real bugs (CockroachDB, Kubernetes, gRPC, etcd), it achieves 36.2% accuracy with <1000 traces, outperforming zero-shot Gemini 3.5 Flash (34.8%). Dataset, adapters, and tooling are released.

Code generation Benchmarks Fine-tuning

arXiv cs.AI·Jun 17

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

SpeechDx is a multi-task benchmark for clinical speech AI covering 12 datasets and 27 tasks across diverse health conditions. Tasks are structured by speech production stages (conceptualization, formulation, articulation). Evaluation of 12 audio encoders shows large-scale speech models outperform domain-specific ones, but none generalize reliably across clinical speech.

Benchmarks Voice Evals

arXiv cs.AI·Jun 17

Surrogate Assisted Pedestrian Protection Design via a Foundation Model Orchestrated Workflow

Foundation model-orchestrated workflow for pedestrian protection design. Integrates ML surrogate (R²=0.87), multi-objective evolutionary search, geometry generator, and LLM interface. Reduces evaluation time from hours to seconds; generates 35 safety-compliant alternatives in automotive bumper case study.

AI Agents Vision Reasoning

arXiv cs.CL·Jun 17

Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

STATEWITNESS, an activation explainer, detects deception in reasoning LLMs by reading the target model's hidden states and answering natural-language queries. It achieves 0.916 mean AUROC, an 11.6% relative gain over the best black-box text monitor and 25.0% over the best activation-probe baseline. It also provides token- and sentence-level evidence traces for human inspection.

Reasoning AI safety Alignment

arXiv cs.LG·Jun 17

PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

PowerOPD stabilizes on-policy distillation for LLMs by replacing unbounded log-ratio rewards with Box-Cox power transformation. On 6 mathematical reasoning benchmarks with Qwen3, achieves +6.37 Avg@8/+5.71 Pass@8 gains vs vanilla OPD, reduces wall-clock time by 59.2% and peak GPU memory by 23.1%.
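
To see why a power transform tames the log-ratio, compare `log(x)` with the standard Box-Cox transform `(x**lam - 1) / lam`. For `lam > 0` the Box-Cox value never falls below `-1 / lam` as the ratio approaches zero, whereas the log diverges. This is a minimal sketch: `lam = 0.5` is a hypothetical choice, and the summary does not state which ratio or exponent PowerOPD uses, or whether it also bounds the transform from above.

```python
import math


def box_cox(x, lam):
    """Box-Cox power transform of x > 0; equals log(x) when lam == 0."""
    if x <= 0:
        raise ValueError("x must be positive")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


lam = 0.5  # hypothetical exponent, chosen for illustration
for ratio in (1e-12, 1e-6, 1e-3, 1.0):
    # The log column keeps falling; the Box-Cox column stays above -1 / lam.
    print(f"ratio={ratio:g}  log={math.log(ratio):.3f}  box_cox={box_cox(ratio, lam):.3f}")
```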

Fine-tuning Reinforcement learning Benchmarks

arXiv cs.AI·Jun 17

Beyond Parallel Sampling: Diverse Query Initialization for Agentic Search

DivInit improves test-time scaling for agentic search by diversifying initial queries. Instead of sampling k independent queries in parallel, the method generates n candidates then selects k diverse seeds. Gains of 5-7 points on multi-hop QA at matched compute, validated across 5 open-weight models and 8 benchmarks.
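
The generate-n-then-select-k step can be illustrated with a greedy max-min (farthest-point) selection over query embeddings. This is a sketch under assumptions: the summary does not say how DivInit measures or enforces diversity, so the cosine distance, the greedy rule, and the names `cosine_distance` and `select_diverse` are hypothetical, and the toy vectors stand in for real query embeddings.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def select_diverse(embeddings, k):
    """Greedily pick k indices, each maximizing its distance to the nearest pick so far."""
    if not 0 < k <= len(embeddings):
        raise ValueError("k must be between 1 and the number of candidates")
    chosen = [0]  # assumption: start from the first candidate
    while len(chosen) < k:
        best_idx, best_score = None, -1.0
        for i, emb in enumerate(embeddings):
            if i in chosen:
                continue
            # Score a candidate by its distance to the closest already-chosen seed.
            score = min(cosine_distance(emb, embeddings[j]) for j in chosen)
            if score > best_score:
                best_idx, best_score = i, score
        chosen.append(best_idx)
    return chosen


# n = 4 toy candidate embeddings; candidate 1 nearly duplicates candidate 0.
candidates = [
    [1.0, 0.0],
    [0.99, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
]
seeds = select_diverse(candidates, k=3)
print(seeds)
```

With these toy vectors the near-duplicate candidate is skipped, which is the behavior that independent parallel sampling cannot guarantee.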

AI Agents Reasoning Benchmarks

arXiv cs.AI·Jun 17

DecoSearch: Complexity-Aware Routing and Plan-Level Repair for Text-to-SQL

DecoSearch is a training-free framework for text-to-SQL translation that routes queries by complexity. A schema selector prunes the database, an LLM judger decides if decomposition is needed, and a DAG solves atomic sub-questions. Achieves 70.53% on BIRD and 88.31% on Spider with DeepSeek, outperforming training-free baselines.

Code generation Reasoning RAG

arXiv cs.LG·Jun 17

MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs

MODE is an expert-level mixed-precision quantization framework for MoE multimodal LLMs. It decomposes expert selection frequency by modality (vision/text) and filters redundant vision tokens to correct estimation biases. Results: <2.9% performance loss at W3A16.

Vision Benchmarks Papers

arXiv cs.CL·Jun 17

An expressivity analysis of hierarchical modelling in deep transformers via bounded-depth grammars

Theoretical analysis of deep transformer expressiveness through bounded-depth, non-recursive context-free grammars. Authors explicitly construct transformers with positional attention whose depth scales linearly with grammar depth, demonstrating these architectures can encode abstract grammatical states into linearly separable subspaces within the residual stream.

Papers Reasoning Benchmarks

arXiv cs.CL·Jun 17

ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

ChLogic is an English-Chinese aligned benchmark evaluating the robustness of logical reasoning in LLMs. Built from formal logical templates, it contains 100 aligned propositions and 15 Chinese-specific phenomena. Experiments on Qwen3, Ministral, and GLM reveal a persistent English-Chinese performance gap, with back-translation producing mixed effects.

Benchmarks Reasoning Evals

arXiv cs.CL·Jun 17

Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models

RL-trained reasoning models often generate unnecessary reasoning after finding the correct answer (overthinking). This paper introduces Dynamic Rollout Editing (DRE), a training-time intervention during GRPO that edits successful trajectories continuing after answer emergence, preserving the verified prefix and weakening preference signals for unnecessary thinking.

Reinforcement learning Reasoning

arXiv cs.CL·Jun 17

The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

Study of 450 chest X-ray reports showing LLM rewriting for standardization preserves image-text alignment (2.5% degradation) but erodes 26.8–29.3% of clinical entities and 14.9–16.5% of uncertainty language. The paradox: tasks producing 'cleaner' text pull content away from images.

Vision RAG Evals
