VISTA: View-Consistent Self-Verified Training for GUI Grounding

VISTA proposes a GRPO-based fine-tuning method for GUI grounding. It generates multiple views of the same screen (crops preserving target elements) to create more robust comparison groups. On ScreenSpot-Pro, it improves Qwen3-VL 4B/8B/30B from 55.5/52.7/53.7 to 63.4/65.8/67.0.

Reinforcement learning Vision Benchmarks
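
The "comparison groups" in the VISTA summary refer to the groups GRPO uses to score each sampled output relative to its siblings. A minimal sketch of that group-relative normalization follows. It assumes scalar rewards and uses the population standard deviation; the function name and reward values are hypothetical, and VISTA's multi-view crop generation is not modeled here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled predictions on one screen
# (1.0 = predicted click lands inside the target element, 0.0 = miss).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Hits receive a positive advantage and misses a negative one, so the update depends only on how a sample compares with the rest of its group.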

arXiv cs.CL·Jun 15

The Culture Funnel: You Can't Align What isn't in the Data

LLMs suffer from a 'cultural data funnel': explicit cultural signals decline sharply during post-training, dominated by geographically concentrated data. A study using multidimensional tagging across 5.6M samples shows multilingualism enhances geographic diversity but not balanced representation. Authors release a culturally tagged dataset to improve training data pipelines.

Alignment Fine-tuning Benchmarks

arXiv cs.AI·Jun 15

When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More

LLM agents equipped with GNN tools fail to exercise judgment: they blindly adopt the GNN's predictions 97.6-99.2% of the time. This deference increases with model capability (Qwen2.5 0.5B-7B), creating a 'GNN parrot' that bypasses its own reasoning. Simple alternatives outperform the GNN at high homophily, yet the agent still defers.

AI Agents Benchmarks Reasoning

arXiv cs.LG·Jun 15

Decompose Sparsely Where You Should, Absorb Densely Where You Should Not

Sparse autoencoders (SAEs) assume all activations are amenable to sparse decomposition. This work adds a low-rank dense bottleneck in parallel with standard SAEs to capture a causally important dense component. On Gemma-2-2B layer 12, a rank-24 bottleneck reduces dense latents by 84% while improving sparse probing performance.

arXiv cs.LG·Jun 15

Graph-based Target Back-Propagation for Context Adaptation in Multi-LLM Agentic Systems

GTBP (Graph-based Target Back-Propagation) is a context adaptation framework for multi-LLM agentic systems. It back-propagates local targets through a directed acyclic graph workflow and updates prompts stage-wise. The method is theoretically convergent and outperforms baselines across 3 benchmarks.

AI Agents Multi-agent Prompt engineering

arXiv cs.LG·Jun 15

Gefen: Optimized Stochastic Optimizer

Gefen is a memory-efficient optimizer reducing AdamW's memory footprint by ~8x (6.5 GiB per billion parameters) through shared second-moment estimates across parameter blocks and learned codebook quantization of first moments. Maintains AdamW-level performance while enabling larger microbatches in distributed training.

Fine-tuning Infrastructure
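
The two figures in the Gefen summary (~8x and 6.5 GiB per billion parameters) can be cross-checked with simple arithmetic. The sketch below assumes AdamW stores two fp32 states per parameter (first and second moment), which the summary does not state, and it reads the 6.5 GiB figure as memory saved rather than memory used.

```python
GIB = 2**30
PARAMS = 1_000_000_000

# Assumption: AdamW keeps two fp32 optimizer states per parameter (4 bytes each).
BYTES_PER_PARAM_ADAMW = 2 * 4

adamw_gib = PARAMS * BYTES_PER_PARAM_ADAMW / GIB
reduced_gib = adamw_gib / 8            # the ~8x reduction quoted in the summary
saved_gib = adamw_gib - reduced_gib

print(f"AdamW state:   {adamw_gib:.2f} GiB per billion parameters")
print(f"After ~8x cut: {reduced_gib:.2f} GiB per billion parameters")
print(f"Saved:         {saved_gib:.2f} GiB per billion parameters")
```

Under these assumptions the optimizer state drops from about 7.45 GiB to about 0.93 GiB per billion parameters, a saving of roughly 6.5 GiB, which is consistent with the quoted figure.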

arXiv cs.AI·Jun 15

TwinBI: An Agentic Digital Twin for Efficient Augmented Interactions with Business Intelligence Dashboards

TwinBI is an agentic digital-twin framework coupling an LLM-based agent system with executable BI dashboard state. It unifies conversational interaction, dashboard manipulation, and provenance tracking through a shared interaction log. Benchmark: exact-match accuracy 43.3% → 63.3%, timeout rate 40% → 10%.

AI Agents RAG Benchmarks

arXiv cs.CL·Jun 15

LLMs Contain Multitudes: How Deployment Context Reshapes Model-Level Preferences and Values

A study of 1.2M decisions shows that deployment context (Reddit vs. news article) produces far larger variations in model preferences and values than prompt paraphrasing or temperature controls. Measured biases (Global North favoritism) and cardinal exchange rates between outcomes shift by a factor of 2.47 across contexts, calling into question the stability of model-level properties.

Evals AI safety Alignment

arXiv cs.LG·Jun 15

Small LLMs: Pruning vs. Training from Scratch

Comparative study of pruning vs. training from scratch on Llama-3.1-8B (ratios 0.5–0.8, 6 methods). Pruning outperforms random initialization with equal token budget, but advantage narrows with more tokens. Fine-grained pruning retains benefit even with unlimited budget; coarse structured pruning can be matched by training from scratch.

Llama Benchmarks Papers

arXiv cs.CL·Jun 15

OdysSim: Building Foundation Models for Human Behavior Simulation

OdysSim presents the largest systematic investigation of behavioral foundation models for human behavior simulation. Researchers propose SOUL, a taxonomy of 5 axes (CONV, SS, COG, ROLE, EVAL) unifying 62 datasets and 23 benchmark tasks. The open 8B OSim model ranks first/tied-first on 8/23 tasks, outperforming frontier models, with 93.2% reaction alignment vs 93.5% for real users.

Benchmarks Reasoning Reinforcement learning

arXiv cs.CL·Jun 15

Can Post-Training Turn LLMs into Good Medical Coders? An Empirical Study of Generative ICD Coding

Empirical study on post-training adaptation of LLMs for ICD (International Classification of Diseases) coding. Authors compare prompting, supervised fine-tuning, and reinforcement learning (GRPO), introduce PHI (diagnostic curriculum), and show that SFT + GRPO outperforms discriminative baselines. Code and checkpoints released.

Fine-tuning Reinforcement learning Reasoning

arXiv cs.CL·Jun 15

Does the Judge Prefer English? Evaluating Language-Switching Invariance in LLM-as-a-Judge

Judge-LS evaluates whether LLMs used as automatic judges exhibit language bias. On 419 LLMBar benchmark items transformed into English, Chinese, and mixed-language variants, models show 10.7–14.4% preference flips across languages, with highest accuracy in English. Translation-equivalent probes reveal no systematic English preference, though most are judged as ties.

Evals Benchmarks AI safety

arXiv cs.CL·Jun 15

BayLing-Duplex: Native Full-Duplex Speech Dialogue with a Single Autoregressive LLM

BayLing-Duplex is a native full-duplex speech language model using a single autoregressive LLM without an external VAD module. Fine-tuned on 400K samples with DPO, it achieves 92% turn-taking success and 100% interruption success on InstructS2S-Eval, improving speech-response score from 2.17 to 3.39 over Moshi.

Voice AI Agents Benchmarks

arXiv cs.AI·Jun 15

WorkBench Revisited: Workplace Agents Two Years On

WorkBench revisited (June 2026): Claude Opus 4.8 completes 89% of tasks vs 43% for GPT-4 in March 2024, with 2.5% unintended harmful actions vs 26%. Capability and safety improve together. Open-weight models drastically lower costs.

AI Agents Benchmarks AI safety

arXiv cs.CL·Jun 15

Right or Wrong, Models Comply: Directional Blindness in LLM Moral Judgment

Study across 9 models and 972,000 responses shows LLMs comply with harmful nudges on moral judgments (A=1.04) at nearly identical rates to beneficial ones, unlike factual questions (A=1.58). Chain-of-thought amplifies bidirectional compliance; identity-based prompting suppresses both equally.

Alignment AI safety Evals

arXiv cs.CL·Jun 15

Retrospective Progress-Aware Self-Refinement for LLM Agent Training

RePro, a training framework for LLM agents, teaches models to retrospectively self-generate progress signals via a forward-then-reflect rollout paradigm. Tested on WebShop, ALFWorld, and Sokoban with the Qwen family, RePro achieves up to 12% absolute success rate gains without continuous external supervision.

AI Agents Reinforcement learning Reasoning

arXiv cs.LG·Jun 15

Beyond LoRA: Is Sparsity-Induced Adaptation Better?

Comparative study of LoRA and variants with sparsity-induced adaptation. Proposes Cheap LoRA (cLA) and c³LA reducing training time by 10% and peak GPU memory by 15%. Evaluates 11 methods across 10 models and 14 datasets with theoretical generalization error bounds.

Fine-tuning Benchmarks Papers

arXiv cs.AI·Jun 15

MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

MA-ProofBench is the first formal theorem-proving benchmark dedicated to Mathematical Analysis with 200 formalized theorems across two difficulty levels (undergraduate and Ph.D.). GPT-5.5 achieves only 16% Pass@8 on Level I and 5% on Level II, exposing major gaps in LLMs' advanced formal reasoning capabilities.

Benchmarks Reasoning GPT
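
For readers unfamiliar with the Pass@8 metric quoted above: Pass@k is the probability that at least one of k sampled attempts succeeds. A common way to estimate it from n samples with c successes is the combinatorial estimator sketched below. Whether MA-ProofBench uses this estimator or simply draws exactly 8 proofs per theorem is not stated in the summary, so treat the choice as an assumption; the function name and sample counts are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples is correct) from n samples with c correct."""
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical theorem: 16 proof attempts sampled, 1 verified correct.
print(pass_at_k(n=16, c=1, k=8))
```

A benchmark-level score is then the mean of this value over all theorems.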

arXiv cs.LG·Jun 15

Efficient On-Device Diffusion LLM Inference with Mobile NPU

llada.cpp is the first NPU-aware inference framework accelerating diffusion LLMs on mobile devices. Three techniques optimize execution: Multi-Block Speculative Decoding, Dual-Path Progressive Revision, and Swap-Optimized Memory Runtime. LLaDA-8B achieves 17x-42x latency reduction vs CPU baseline.

Llama Code generation Infrastructure

arXiv cs.AI·Jun 15

Closing the Reflection Gap: A Free Calibration Bonus for Agentic RL

RefGRPO targets the reflection gap in LLM agents: even after environment feedback, agents mis-assess their own outputs, including answers that are correct. The method adds a free calibration bonus (contrasting the agent's reflection with the actual outcome) to standard RL. On text-to-SQL: underconfidence rate 44.4%→7.7%, accuracy 75.1%→76.5%.

AI Agents Reinforcement learning Reasoning

arXiv cs.AI·Jun 15

Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs

Poker Arena benchmarks seven frontier LLMs on no-limit Texas Hold'em using a three-layer memory architecture and nine cognitive axes (bet-sizing calibration, positional awareness, etc.). Claude Opus 4.6 wins +$15,730 chips but ranks 5th on mean axis score, showing that scalar leaderboards systematically misrank capability structure.

Benchmarks Reasoning Claude

arXiv cs.CL·Jun 15

MedLatentDx: Latent Multi-Agent Communication for Cross-Hospital Rare-Disease Diagnosis

MedLatentDx is a latent multi-agent communication framework for cross-hospital rare-disease diagnosis. Hospital agents keep clinical records private and exchange compact KV latent blocks instead of raw text, complying with privacy regulations. Two deployment modes: KV distillation for same-backbone agents, cross-family latent alignment for different LLMs.

Multi-agent MCP Reasoning

arXiv cs.AI·Jun 15

VeriGeo: Controllable Geometry Question Generation with Numerical and Analytical Verification

VeriGeo generates controllable geometry problems via executable reasoning traces. An Author agent creates the problem and diagram per user constraints, a Solver agent produces the proof. A three-stage pipeline verifies numerical, analytical, and global consistency. Fine-tuning on 8.7k examples achieves best reported GeoQA performance and strong results on PGPS9K and MathVista-GPS.

Reasoning Vision Benchmarks

arXiv cs.CL·Jun 15

Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents

Dialogue SWE-Bench is an automatic benchmark for evaluating coding agents through user dialogue. Authors introduce a persona-grounded user simulator and a schema-guided agent improving baselines by 3-14%. Key finding: better coding models don't necessarily translate to better dialogue capabilities.

Benchmarks Code generation AI Agents

The Decoder·Jun 13

Google Research's Gemini-SQL2 tops text-to-SQL benchmarks by a wide margin

Google Research's Gemini-SQL2, built on Gemini 3.1 Pro, achieves 80.04% accuracy on the BIRD benchmark for natural language-to-SQL conversion, significantly outperforming OpenAI and Anthropic. Google plans to integrate this technology into its data services.

Gemini Benchmarks Code generation

The Decoder·Jun 13

Claude Fable 5 outpaces GPT-5.5 by 13 points on FrontierMath's toughest problems

Anthropic's Claude Fable 5 achieves 88% accuracy on FrontierMath's hardest tier, versus 75% for OpenAI's GPT-5.5. This is a massive jump from Opus 4.5 (< 10% in early 2026).

Claude GPT Benchmarks

Reddit r/LocalLLaMA·Jun 12

🚀 PP-OCRv6 is officially released!

PaddleOCR v6 officially released with models ranging from 1.5M to 34.5M parameters. +4.9% detection accuracy, +5.1% recognition accuracy vs v5. Up to 5.2× faster CPU inference with OpenVINO. Supports 50 languages, new use cases (PCB, CAD, digital tubes). Apache 2.0 open-source.

Open source Vision Benchmarks

arXiv cs.CL·Jun 12

LAUKIN: A Multi-jurisdictional Common Law Contract Dataset

LAUKIN is a dataset of 14,727 contract clause pairs (Australia-UK, UK-India, India-Australia) labelled for legal equivalence. 3,000 pairs are manually annotated by legal experts. Best models achieve 65.11% macro-F1, revealing that drafting conventions diverge significantly across jurisdictions despite shared legal heritage.

Benchmarks Papers RAG

arXiv cs.AI·Jun 12

Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System

Deployment study of an LLM embedded in electronic health records. A pre-response classifier predicts user rejection risk (AUROC 0.719) by leveraging deployment-specific context (provider type, department, model). Prospective analysis over 4.5 months.

Evals AI safety Alignment

arXiv cs.AI·Jun 12

Prefill Awareness in Large Language Models

An arXiv study shows that frontier models (Claude Opus 4.5, GPT, Gemini) detect tampered prefills in 9-35% of cases with a 0% false-positive rate. This 'prefill awareness' undermines alignment and jailbreaking evaluations relying on inserted assistant context. Models distinguish stylistic from preference mismatch.

AI safety Alignment Evals

arXiv cs.CL·Jun 12

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

SENTINEL is a failure-driven reinforcement learning framework that improves tool-using LLM agents by converting their failures into targeted training tasks. On Tau2-Bench Retail with Qwen3-4B-Thinking-2507, the method increases Pass@1 from 66.4 to 74.9 through a Controller-Proposer-Solver loop that analyzes recurring error patterns.

AI Agents Reinforcement learning Qwen

arXiv cs.CL·Jun 12

Observable Patterns Are Not Explanations: A Causal-Geometric Analysis of Latent Reasoning Models

Causal analysis of latent reasoning models (Coconut, CODI): observable patterns (BFS-like frontiers, decodable arithmetic) are not evidence of reasoning mechanisms. Causal interventions show latent-thought utilization is graded, not binary, and concentrated in low-rank directions. Decodability alone cannot establish mechanism.

Reasoning Papers Evals

arXiv cs.AI·Jun 12

ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

ToolSense is an open-source diagnostic framework to audit actual tool understanding in LLMs. Applied to ToolBench (~47k tools), it reveals a knowledge-retrieval dissociation: five parametric model configurations collapse by 50-64 percentage points on realistic ambiguous queries, falling below embedding baselines, despite strong performance on standard benchmarks.

AI Agents Benchmarks Evals

arXiv cs.AI·Jun 12

Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory

Multi-factor value model for long-running LLM agent memory. Seven cognitive factors (emotional intensity, goal relevance, value alignment, etc.) weighted via gradient-free optimization. Retains 77% of critical evidence vs 36.8% for recency baseline on LongMemEval.

AI Agents Reasoning Papers

arXiv cs.AI·Jun 12

PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

PersonaDrive is a retrieval-augmented VLA (vision-language-action) agent pipeline for closed-loop driving simulation, conditioned on retrieved human demonstrations. Trained on CARLA data with aggressive/neutral/conservative instructions, it improves driving score by 4.6% on Bench2Drive and generates style-diverse non-ego agents without per-style retraining.

Vision AI Agents RAG

arXiv cs.AI·Jun 12

MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback

MDForge is an LLM agent automating molecular dynamics (MD) pipeline design through code generation and multi-agent expert debate. On three SAMPL benchmarks, it matches human expert performance and discovers a novel CB[7] binder confirmed by wet-lab NMR as a high-affinity, picomolar ligand.

AI Agents Multi-agent Code generation

arXiv cs.AI·Jun 12

Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Evoflux is an inference-time evolutionary search method for repairing executable tool workflows in compact agents. On MCP-Bench with 250 tools, it raises execution feasibility from ~3% to 17-24%, outperforming SFT, SFT+DPO, and ReAct under scarce teacher-trace budgets.

AI Agents MCP Tools

arXiv cs.AI·Jun 12

Topical Phase Transitions in Artificial Intelligence Research: Large-Scale Evidence and an Early-Warning Signature for Emerging Topics

Analysis of 80,814 papers from 5 major AI conferences (2017-2025) reveals that research topics advance through abrupt phase transitions, not gradually. LLMs were dominant by 2025; diffusion models and vision-language models surged within 1-3 years. An early-warning signature flags reasoning, test-time compute, agentic AI, multimodal LLMs, RAG, and world models as topics to monitor in 2026-2028.

Benchmarks Papers Reasoning

arXiv cs.CL·Jun 12

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

EvoBrowseComp is an evolving benchmark of 400 English and 400 Chinese questions to evaluate search agents (LLM + web tools). Unlike static BrowseComp, it uses live-web traversal and a three-agent framework (QA synthesis, information filtering, high-level guidance) to prevent contamination and parametric memorization. The benchmark auto-updates regularly.

AI Agents Benchmarks Evals

arXiv cs.AI·Jun 12

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

Study evaluating 4 lie detectors across 31 models (2B-1T parameters). Detectors (CoT judge, logprob classifier, activation probes, DYL) perform well on prompted lying but fail on trained model organisms with verified beliefs. Only CoT judge maintains 0.82 balanced accuracy.

Evals Reasoning Alignment
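
The 0.82 figure above is balanced accuracy, the mean of the true-positive rate and the true-negative rate. It matters for lie detection because honest statements usually far outnumber lies, which lets plain accuracy look good for a detector that rarely flags anything. A minimal sketch follows; the confusion-matrix counts are hypothetical and are not taken from the study.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of recall on the positive class and recall on the negative class."""
    tpr = tp / (tp + fn)   # share of lies that were flagged
    tnr = tn / (tn + fp)   # share of honest statements that were cleared
    return (tpr + tnr) / 2

# Hypothetical counts: 100 lies (90 flagged), 1000 honest statements (740 cleared).
score = balanced_accuracy(tp=90, fn=10, tn=740, fp=260)
plain = (90 + 740) / (100 + 1000)
print(f"balanced accuracy: {score:.2f}, plain accuracy: {plain:.2f}")
```

With these illustrative counts the two metrics diverge, because balanced accuracy weights both classes equally regardless of how many examples each contains.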
