Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Evoflux is an inference-time evolutionary search method for repairing executable tool workflows in compact agents. On MCP-Bench with 250 tools, it raises execution feasibility from ~3% to 17-24%, outperforming SFT, SFT+DPO, and ReAct under scarce teacher-trace budgets.

AI Agents MCP Tools

arXiv cs.CL·Jun 12

Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review

PaperGuard, a multimodal benchmark, evaluates the vulnerability of LLMs and MLLMs to adversarial attacks in scientific peer review. Researchers test prompt injections and perturbations (GCG, PGD) on text and figures, and propose a defense that uses chunk-based embedding search to localize harmful instructions.

AI safety Alignment Vision

arXiv cs.AI·Jun 12

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

Study evaluating 4 lie detectors across 31 models (2B-1T parameters). Detectors (CoT judge, logprob classifier, activation probes, DYL) perform well on prompted lying but fail on trained model organisms with verified beliefs. Only CoT judge maintains 0.82 balanced accuracy.

Evals Reasoning Alignment

arXiv cs.CL·Jun 12

LAUKIN: A Multi-jurisdictional Common Law Contract Dataset

LAUKIN is a dataset of 14,727 contract clause pairs (Australia-UK, UK-India, India-Australia) labelled for legal equivalence. 3,000 pairs are manually annotated by legal experts. Best models achieve 65.11% macro-F1, revealing that drafting conventions diverge significantly across jurisdictions despite shared legal heritage.

Benchmarks Papers RAG

arXiv cs.AI·Jun 12

Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

SciAgentArena is a systematic benchmark evaluating ~200 real-world scientific tasks with stepwise verification. Current AI agents perform well on structured data-analysis workflows but struggle to generate novel insights, sustain self-directed exploration, and solve open-ended research questions.

AI Agents Benchmarks Evals

arXiv cs.CL·Jun 12

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

SENTINEL is a failure-driven reinforcement learning framework that improves tool-using LLM agents by converting their failures into targeted training tasks. On Tau2-Bench Retail with Qwen3-4B-Thinking-2507, the method increases Pass@1 from 66.4 to 74.9 through a Controller-Proposer-Solver loop that analyzes recurring error patterns.

AI Agents Reinforcement learning Qwen

arXiv cs.CL·Jun 12

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

LoHoSearch is a 544-question benchmark for evaluating long-horizon search agents, built via an automated pipeline on a knowledge graph of 7 million Wikipedia entities. The strongest model achieves only 34.74% accuracy, versus >90% on prior saturated benchmarks.

Benchmarks AI Agents Reasoning

arXiv cs.CL·Jun 12

Small LLMs for Biomedical Claim Verification: Cost-Effective Fine-Tuning, Structural Dataset Shortcuts, and Cross-Domain Generalization

Three small LLMs (Phi-3-mini 3.8B, Qwen2.5-3B, Mistral-7B) are fine-tuned via QLoRA for biomedical claim verification. Mistral-7B outperforms GPT-4o and GPT-5 (+12% F1) with 1,008 training examples. The study identifies a structural artifact in SciFact and demonstrates robust cross-domain generalization.

Mistral Qwen Fine-tuning

arXiv cs.AI·Jun 12

Prefill Awareness in Large Language Models

arXiv study showing frontier models (Claude Opus 4.5, GPT, Gemini) detect tampered prefills in 9-35% of cases with 0% false positive rate. This 'prefill awareness' undermines alignment and jailbreaking evaluations relying on inserted assistant context. Models distinguish stylistic from preference mismatch.

AI safety Alignment Evals

arXiv cs.AI·Jun 12

Topical Phase Transitions in Artificial Intelligence Research: Large-Scale Evidence and an Early-Warning Signature for Emerging Topics

Analysis of 80,814 papers from 5 major AI conferences (2017-2025) reveals that research topics advance through abrupt phase transitions, not gradually. LLMs are dominant by 2025; diffusion models and vision-language models surged within 1-3 years. An early-warning signature flags reasoning, test-time compute, agentic AI, multimodal LLMs, RAG, and world models as topics to monitor in 2026-2028.

Benchmarks Papers Reasoning

arXiv cs.AI·Jun 12

Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System

Deployment study of an LLM embedded in electronic health records. A pre-response classifier predicts user rejection risk (AUROC 0.719) by leveraging deployment-specific context (provider type, department, model). Prospective analysis over 4.5 months.

Evals AI safety Alignment

arXiv cs.AI·Jun 12

DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

DailyReport is an open-source benchmark evaluating search agents on 150 real-world daily tasks with 3,546 evaluation rubrics. Tasks are decomposed into subtasks, with cascade evaluation across disentangled dimensions. Testing 17 agentic systems reveals significant gaps versus user expectations.

AI Agents Benchmarks Evals

arXiv cs.AI·Jun 12

MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling

MARS is an adversarial stopping rule for parallel LLM test-time scaling. It probes partial traces at intermediate checkpoints to estimate which traces will change answers, enabling early stopping once the leading vote is safe. Across three reasoning models and three competition-math benchmarks, MARS saves 25-47% of self-consistency tokens while maintaining accuracy.

Reasoning Evals Benchmarks

Reddit r/MachineLearning·Jun 11

Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting [R]

Adaptive video tokenisation method exploiting temporal redundancy in frozen tokeniser latent space via fixed threshold on per-position temporal-L1 differences. Latent Inpainting Transformer (LIT) reconstructs dropped positions. Single encoder + one LIT pass pipeline: 31× speedup over ElasticTok-CV, 2× over InfoTok on TokenBench and DAVIS benchmarks.

Video generation Benchmarks Papers

arXiv cs.CL·Jun 11

AI Coding Agents in Social Science: Methodologically Diverse, Empirically Consistent, Interpretively Vulnerable

Comparative study of Claude Code and Codex on social science data analysis. AI agents produce methodological diversity equal to or exceeding that of humans, but remain vulnerable to interpretive bias at the verdict level. A biased prompt does not shift aggregate estimates, unlike biased human analysts.

Claude Code AI Agents Code generation

arXiv cs.AI·Jun 11

PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents

ProjectMem is a local-first, open-source memory layer for AI coding agents based on an immutable event log (issues, attempts, fixes, decisions). It reduces token consumption (5,000-20,000 per session) and adds preventive governance: agents are warned before repeating failed fixes or editing fragile files. Runs fully offline via MCP.

AI Agents Code generation MCP

arXiv cs.CL·Jun 11

NightFeats @ MMU-RAGent NeurIPS 2025: A Context-Optimized Multi-Agent RAG System for the Text-to-Text Track

NightFeats, a structured multi-agent RAG system, wins Best Dynamic Evaluation at NeurIPS 2025 (text-to-text track). The pipeline decomposes knowledge synthesis into three phases: retrieval, curation, composition, with temporal-semantic reranking and contradiction reconciliation. Outperforms Claude-SonnetV2 and Nova-Pro on human evaluations.

Multi-agent RAG AI Agents

arXiv cs.LG·Jun 11

Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention

Activation steering to reduce sycophancy in Llama-3-8B-Instruct also suppresses agreement with factually correct statements. Dual-stance evaluation reveals sycophantic and factual agreement occupy distinct geometric subspaces, yet the steering direction projects equally onto both, illustrating a gap between readable and writable representations.

Llama Alignment AI safety

arXiv cs.AI·Jun 11

Can AI Agents Synthesize Scientific Conclusions?

SciConBench, a benchmark of 9.11K questions from systematic reviews, evaluates AI agents' ability to synthesize scientific conclusions. Among 8 frontier models tested in controlled settings, the best agent achieves only 0.337 factual F1. Consumer-facing agents (Google AI Overview, OpenEvidence) frequently generate incomplete or contradictory conclusions.

AI Agents Benchmarks Reasoning

arXiv cs.AI·Jun 11

Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents

HORMA organizes LLM agent experience into a file-system-like hierarchical structure to improve long-horizon tasks. It has two modules: structured memory construction and RL-based navigation retrieval. It reduces token usage by 22% on ALFWorld/LoCoMo/LongMemEval while improving performance.

AI Agents Reasoning Reinforcement learning

arXiv cs.AI·Jun 11

Lung-R1: A Knowledge Graph-Guided LLM for Pulmonary Diagnostic Reasoning

Lung-R1 is a pulmonary diagnostic LLM guided by LungKG, a structured knowledge graph containing 59,038 nodes and 164,308 edges. Trained via KG-constrained reasoning-chain construction and reinforcement learning, Lung-R1-14B achieves 4.3583 on EMR Diagnosis, outperforming baselines by 0.1476 points.

Reasoning RAG Reinforcement learning

arXiv cs.CL·Jun 11

The Structural Attention Tax: How Retrieval Format Hijacks In-Context Learning Independent of Content

RAG systems suffer from a structural bias: knowledge graph triples capture 2-3x more attention per token than semantically equivalent natural language, compressing demonstration attention by up to 42%, regardless of relevance. Authors formalise this 'structural attention tax' and propose five mitigation strategies, validated on Mistral-7B and LLaMA-3-8B.

RAG Reasoning Evals

arXiv cs.CL·Jun 11

One Jailbreak, Many Tongues: Learning Language-Insensitive Intention Representations for Multilingual Jailbreak Detection

MLJailDe, a multilingual jailbreak detection framework, uses back-translation data augmentation across 11 languages (2,232 benign, 1,239 jailbreak samples) and relative-distance constraints to reduce cross-lingual representation dispersion. Achieves F1=98.5% and F1=97.1% on unseen languages.

AI safety Alignment Benchmarks

arXiv cs.LG·Jun 11

Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

Evaluation framework for LLM adversarial robustness based on computational cost (FLOPs) rather than query count. A study of 10 models with 3 attack strategies finds that alignment has non-monotonic effects, scaling reduces the success of gradient attacks but not template attacks, attacks can transfer across models, and cost varies up to 5× across harm categories.

AI safety Alignment Evals

arXiv cs.LG·Jun 11

SirenFNO: Efficient and Full Frequency Learning of Fourier Neural Operators

SirenFNO combines Fourier neural operators (FNO) with sinusoidal representation networks (SIREN) to learn the full spectrum without frequency truncation. The framework reduces parameters by 4–15× on PDE benchmarks while eliminating spectral bias toward low frequencies; functional tensor decomposition variants achieve 73× parameter reduction.

Papers Benchmarks Reasoning

arXiv cs.LG·Jun 11

Counterexample Guided Learning in the Large using Reasoning Agents

Study on counterexample-guided learning to improve LLMs on regex induction tasks. Researchers propose refinement strategies (regularization, symbolic counterexample clustering) and reflection/repair loops. Results: success rates improve from 3.2% to 38.1% and 38.9% to 74.1% on hardest task groups.

AI Agents Reasoning Code generation

arXiv cs.LG·Jun 11

FlowBank: Query-Adaptive Agentic Workflows Optimization through Precompute-and-Reuse

FlowBank optimizes LLM-based multi-agent workflows through a reusable bank of complementary workflows. The three-stage framework (DiverseFlow for diversification, CuraFlow for compression, adaptive matching) improves over baselines by 4.26% to 14.92% across five benchmarks while remaining cost-competitive.

AI Agents Multi-agent Reasoning

arXiv cs.LG·Jun 11

Energy-Conserved Neural Pipelines: Attenuating Error Propagation in Modular Neural Networks via Physical Conservation Constraints

Researchers propose enforcing energy conservation as a hard physical constraint in modular neural network pipelines to mitigate error propagation across module boundaries. On CIFAR-10, this approach retains 77.4% accuracy at sigma=0.2 versus 35.1% for baselines. The advantage generalizes to robotic systems (Franka Panda, MuJoCo) with +18.9 pp gain.

Papers Benchmarks AI safety

arXiv cs.LG·Jun 11

Least-Action-Guided Diffusion for Physical Extrapolation

LAPG combines conditional score-based diffusion with action-derived physical guidance to improve extrapolation in computational physics. Tested on ODE/PDE systems (free fall, springs, vortices, airfoil flows), the method reduces phase drift and preserves physical consistency outside training distribution.

Papers Reasoning

arXiv cs.LG·Jun 11

SwiftCTS: Fast Cross-Design Prediction and Pareto Optimization of Clock Tree Metrics via Few-Shot Calibration

SwiftCTS is a physics-informed surrogate framework for Clock Tree Synthesis. Trained in under 5 seconds on CPU with sub-millisecond inference, it uses K-shot multiplicative calibration to adapt predictions to unseen architectures without retraining. Evaluates 100,000 CTS configurations in 10 seconds with <0.5% errors on power and wirelength.

Benchmarks Papers Open source

arXiv cs.LG·Jun 11

Signed Compression Progress on a Sealed Audit is Goodhart-Resistant

Theoretical paper proves compression-based reward (sealed-audit loss decrease) is Goodhart-resistant: if r_t = E(θ_{t-1}) - E(θ_t), cumulative reward telescopes exactly to true audit improvement. For finite panels, empirical deviation bounded by 2Δ_n(F,δ). Authors mechanize proof in Lean 4 and validate on ARC-TGI grid-transformation tasks.
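
The telescoping claim in the summary can be written out in one line. This is a sketch assuming rewards are summed over steps t = 1, ..., T starting from initial parameters θ_0; the horizon T and the starting index are notation chosen here, not taken from the paper.

```latex
\sum_{t=1}^{T} r_t
  = \sum_{t=1}^{T} \bigl( E(\theta_{t-1}) - E(\theta_t) \bigr)
  = E(\theta_0) - E(\theta_T)
```

Every intermediate term E(θ_t) for 0 < t < T appears once with each sign and cancels, so the cumulative reward depends only on the first and last audit loss.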

Reinforcement learning Alignment Evals

arXiv cs.LG·Jun 11

GraphInfer-Bench: Benchmarking LLM's Inference Capability on Graphs

GraphInfer-Bench is a benchmark of 42,000 samples across 6 real-world graphs assessing whether LLMs can perform complex graph inference (open-ended answers requiring multiple nodes). Four method families tested: graph-token alignment, frontier LLMs, Graph2Text fine-tuning, GNNs. None closes the gap; GNNs outperform LLMs on most tasks.

Benchmarks Reasoning RAG

arXiv cs.CL·Jun 11

When More Documents Hurt RAG: Mitigating Vector Search Dilution with Domain-Scoped, Model-Agnostic Retrieval

RAG degrades on heterogeneous collections: dense search loses discriminative power (Wyoming DOT: 75% → 40% when scaling from 54 to 1,128 docs). MASDR-RAG proposes domain scoping via organizational metadata, improving P@10 from 0.77 to 0.86 (p<0.05). Multi-agent orchestration creates a precision-faithfulness paradox.

RAG Vector search Multi-agent

arXiv cs.CL·Jun 11

Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality

WebGraphMix selects pretraining data by analyzing Common Crawl web graph topology. The method computes centrality scores without model training or labeled data, then mixes central and peripheral documents. At 400M–1B parameters, 1:1 ratio achieves 41.4% average (+1.6pp vs uniform sampling), 43.8% combined with quality scores.

Benchmarks Papers Fine-tuning

arXiv cs.CL·Jun 11

Beyond Compaction: Structured Context Eviction for Long-Horizon Agents

Context Window Lifecycle (CWL) manages long-horizon LLM agent memory via structured semantic eviction. The agent annotates its trajectory as typed, dependency-linked episodes; a deterministic policy evicts content by priority when token budget is exceeded. CWL completes 89 sequential tasks across 80M tokens with no accuracy degradation, avoiding summarization compaction limitations.
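
The idea of deterministic, priority-based eviction over dependency-linked episodes can be illustrated with a small sketch. It is not the CWL implementation: the `Episode` fields, the `evict` function, the integer priority scale, and the rule that an episode is retained while a kept episode depends on it are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """Hypothetical episode record; field names are illustrative only."""
    id: str
    kind: str                 # episode type, e.g. "plan" or "tool_output"
    tokens: int               # tokens this episode occupies in the context
    priority: int             # higher means keep longer (assumed scale)
    depends_on: set = field(default_factory=set)  # ids this episode needs


def evict(episodes, budget):
    """Drop lowest-priority episodes until the token total fits the budget.

    Deterministic: candidates are visited in (priority, id) order.
    An episode is skipped while any still-kept episode depends on it.
    This is a single pass, so the result may still exceed the budget
    if dependencies block every remaining candidate.
    """
    kept = {e.id: e for e in episodes}
    total = sum(e.tokens for e in episodes)
    for ep in sorted(episodes, key=lambda e: (e.priority, e.id)):
        if total <= budget:
            break
        needed = any(
            ep.id in other.depends_on
            for other in kept.values()
            if other.id != ep.id
        )
        if needed:
            continue
        del kept[ep.id]
        total -= ep.tokens
    # Preserve the original trajectory order for the survivors.
    return [e for e in episodes if e.id in kept]


episodes = [
    Episode("a", "plan", 100, priority=3),
    Episode("b", "tool_output", 400, priority=1),
    Episode("c", "fix", 200, priority=2, depends_on={"b"}),
]
kept = evict(episodes, budget=500)
print([e.id for e in kept])  # ['a', 'b']
```

In this toy run, "b" has the lowest priority but is skipped because "c" still depends on it when it is visited; "c" is then evicted, which brings the total to the 500-token budget. A single pass does not revisit "b" afterwards, which is one of several design choices a real policy would need to settle.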

AI Agents Reasoning Papers

arXiv cs.CL·Jun 11

When Roleplaying, Do Models Believe What They Say?

Study distinguishing what language models say from what they internally believe. Using linear truth probes on Claude, Qwen, and Llama role-playing historical personas, authors show persona adoption changes outputs more than internal truth representations. Contrasts with Emergent Misalignment where false claims shift toward true belief space.

Reasoning Alignment Evals

arXiv cs.CL·Jun 11

BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts

BioDivergence is a benchmark and evaluation framework for hidden contextual contradictions in biomedical abstracts. It proposes a six-class conflict taxonomy, a 13-axis divergence ontology, and four structured outputs per claim pair. The silver benchmark contains 11,865 claim pairs across five biomedical domains. Mistral-7B-Instruct-v0.3 achieves 0.5523 accuracy and 0.3894 contextual-F1.

Benchmarks Papers Mistral

arXiv cs.LG·Jun 11

Beyond the Golden Teacher: Enhancing Graph Learning through LLM-GNN Co-teaching

LLM-GNN Co-Teaching introduces a bidirectional co-teaching framework for few-shot learning on text-attributed graphs. Instead of designating one model as teacher, GNN and LLM exchange their most confident pseudo-labels and update mutually. RPL-PO mines DPO preference pairs from convergence trajectories. Achieves 7.86% 3-shot gains on Cora and 7.73% on ogbn-arxiv.

RAG Reasoning Fine-tuning

arXiv cs.CL·Jun 11

SOMA-SQL: Resolving Multi-Source Ambiguity in NL-to-SQL via Synthetic Log and Execution Probing

SOMA-SQL resolves multi-source ambiguity in NL-to-SQL translation via synthetic query logs and execution probing. The method constructs synthetic logs to ground schema interpretation, generates SQL candidates, then executes targeted probing queries driven by an ambiguity taxonomy. Results: +13.0% execution accuracy on average across 6 benchmarks, up to +16.7% on ambiguous questions.

Code generation Reasoning Benchmarks

arXiv cs.LG·Jun 11

APEX: A Network-Native Time-Series Foundation Model for Forecasting and Anomaly Detection for Wireless Edge Operations

APEX is a transformer foundation model specialized for wireless network telemetry. Pre-trained on 10-channel multivariate data from ~4,500 production networks (100K time series), it reduces MAE by 18% vs Toto and 38% vs SARIMA on 192-step DHCP forecasting. Two versions: APEX-Large (269M, cloud) and APEX-Edge (10.5M, edge) with F1=0.93 for anomaly detection.

Papers Benchmarks Reasoning
