ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents

ASPI is a benchmark of 728 task-attack scenarios measuring how clarification amplifies prompt injection vulnerability. Testing on 10 frontier LLMs shows attack success rates rise from 1.8% to 34.0% for o3 and 2.2% to 35.7% for Gemini-3-Flash in clarification mode. Code and data released.

AI Agents AI safety Benchmarks

arXiv cs.AI·May 19

OProver: A Unified Framework for Agentic Formal Theorem Proving

OProver is a unified framework for agentic formal theorem proving in Lean 4. The system iteratively revises failed proof attempts using retrieved compiler-verified proofs and Lean compiler feedback. Trained via continued pretraining and iterative post-training, OProver-32B achieves 93.3% Pass@32 on MiniF2F and 58.2% on ProverBench.

AI Agents Reasoning Reinforcement learning

arXiv cs.AI·May 19

ContractBench: Can LLM Agents Preserve Observation Contracts?

ContractBench benchmarks LLM agents' ability to preserve observation contracts (temporally valid, byte-level intact artifacts) in API calls. Of 38 models tested, none exceeds 80%; Claude-Opus-4.6 leads at 77.8%. Integrity and validity failures are uncorrelated with model size, and the GPT-5 family regresses non-monotonically despite larger scale.

AI Agents Benchmarks Claude

arXiv cs.AI·May 19

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

Unified study of LLM distillation showing SFT, DAgger, offline RL, and OPD decouple two orthogonal axes: prefix source and token-level KL direction. Authors propose KL mixing and entropy-gated length curriculum, improving Pass@k by 5.8 points and reducing average response length by 3x on math reasoning.

Fine-tuning Reinforcement learning Reasoning

arXiv cs.AI·May 19

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

CHI-Bench evaluates AI agents' ability to automate complex healthcare workflows (prior authorization, utilization management, care management) across 87 MCP tools and 20 applications. Best agent resolves only 28% of tasks; none exceed 20% on strict pass. Performance drops to 3.8% in single-session mode.

AI Agents MCP Benchmarks

arXiv cs.AI·May 19

Weak-to-Strong Elicitation via Mismatched Wrong Drafts

Injecting mathematically wrong drafts from a smaller model (Qwen2.5-Math-1.5B) into the GRPO training of a stronger learner (Mathstral-7B) improves performance on MATH-500 (+1.62pp) and AIME 2025/2026 (+14.2pp at pass@1024). Intentionally mismatching problems and drafts is critical: the resulting 71.98% on MATH-500 is reported as the highest published result for this model.

Reinforcement learning Reasoning Benchmarks

arXiv cs.AI·May 19

How Do Electrocardiogram Models Scale?

Systematic study of scaling laws for ECG models: 120 models (20K–200M parameters) pre-trained on CODE (2.3M records). SSL models outperform SL on out-of-distribution generalization; ResNets 1.3–2.5× more parameter-efficient than Transformers; SSL 16× more data-efficient. Architecture and paradigm choice matter more than brute-force scaling.

Benchmarks

arXiv cs.AI·May 19

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

TOBench is a benchmark for omni-modal tool-using agents in real-world workflows. It comprises 100 executable tasks (customer service, intelligent creation), 27 MCP servers, and 324 tools. Verification is closed-loop and multimodal: agents execute, inspect, and self-correct. Claude Opus 4.6 achieves 32% success vs a 94% human baseline.

AI Agents MCP Benchmarks

arXiv cs.LG·May 19

ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference

ProxyKV introduces a cross-model proxy pruning framework to accelerate long-context LLM inference. A lightweight in-family small model evaluates KV cache importance asynchronously via HybridAxialMapper and Multi-Granularity Hybrid Loss. On Llama-3.1, Qwen-2.5, and Qwen-3, recovers 98.7% of KVZip accuracy with up to 3.21× prefilling speedup (Llama-3.1-8B, dual-GPU) and sustains speedup at contexts up to 170k tokens.

Llama Qwen Reasoning

arXiv cs.AI·May 19

The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure

Study on multi-agent systems: 'semantic hijacking' attacks exploit agent confidence. Paradox identified: increasing Worker capability raises attack success rate from 18.4% to 63.9%. Mediation analysis reveals 'linguistic certainty' of stronger agents drives vulnerability. Proposed solution: heterogeneous ensemble verification reduces attack success rate to 2%.

Multi-agent AI Agents AI safety

arXiv cs.CL·May 19

Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models

Researchers localize 'entity cells' in MLP neurons across language models (Qwen2.5-7B, etc.). These selectively activated neurons encode entity-specific facts. Suppressing one cell erases recall for that entity alone; activating it recovers knowledge even without context. Cells remain stable across aliases, acronyms, and multilingual forms.

Reasoning Papers Benchmarks

arXiv cs.CL·May 19

Tongyi DeepResearch Technical Report

Tongyi DeepResearch is an agentic LLM with 30.5 billion parameters (3.3 billion activated per token) designed for long-horizon deep research tasks. Trained via agentic mid-training and post-training with automatic data synthesis, it achieves state-of-the-art on 7 benchmarks including Humanity's Last Exam and BrowseComp. Model and framework are open-sourced.

AI Agents Reasoning Benchmarks

arXiv cs.CL·May 19

SEDD: Scalable and Efficient Dataset Deduplication with GPUs

SEDD is a GPU-accelerated deduplication framework using MinHash LSH. It outperforms SlimPajama's CPU tool by 158× and NVIDIA NeMo Curator's GPU tool by 7.8× on 30M documents, with MinHash signature generation 375× faster. It deduplicates 1.2T tokens in 3 hours on a 32-GPU V100 cluster.

Benchmarks Infrastructure Open source

arXiv cs.CL·May 19

MemRepair: Hierarchical Memory for Agentic Repository-Level Vulnerability Repair

MemRepair is a memory-augmented agentic framework for repository-level vulnerability repair. It combines three memory layers (History-Fix, Security-Pattern, Refinement-Trajectory) with an iterative refinement loop. Evaluated on SEC-Bench, PatchEval, and Multi-SWE-bench, MemRepair achieves 58.0%, 58.2%, and 30.58% resolution rates, outperforming OpenHands, SWE-agent, and InfCode-C++.

AI Agents Code generation AI safety

arXiv cs.CL·May 19

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

Study of 38 models on 8,900 scholarly references: factual recall quality follows a sigmoid combining model size and topic frequency in training data. These two variables explain 60-94% of variance. The authors propose that recall is gated by a signal-to-noise ratio that scales with concept frequency and model capacity.

Benchmarks Papers Reasoning

arXiv cs.CL·May 19

OProver: A Unified Framework for Agentic Formal Theorem Proving

OProver is a unified framework for agentic formal theorem proving in Lean 4. The 32B model achieves 93.3% Pass@32 on MiniF2F and 58.2% on ProverBench. Training combines pretraining, SFT on repair trajectories, and RL on hard cases. OProofs contains 1.77M Lean statements and 6.86M compiler-verified proofs.

AI Agents Reasoning Reinforcement learning

arXiv cs.CL·May 19

The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning

Study on when language models commit to deception. Using counterfactual localization across 5 environments (bluffing, mazes, financial advice, used-car sales, negotiation), authors analyze 1.46M sentences and 91.5B tokens. Lexical cues don't generalize, but attention-based features transfer across domains.

Reasoning AI safety Alignment

arXiv cs.CL·May 19

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

CHI-Bench evaluates AI agent automation of complex healthcare workflows. Benchmark spans 3 domains (prior authorization, utilization management, care management) with 87 MCP tools and 1,290+ policy documents. Best result: 28% task resolution, 3.8% in single session.

AI Agents Multi-agent MCP

arXiv cs.AI·May 19

SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition

SurgicalMamba, a Mamba2-based model, performs online surgical phase recognition with O(d) per-frame cost. Three components address domain-specific challenges: dual-path SSD separating long/short-term regimes, intensity-modulated stepping adapting effective rate, and state regramming enabling cross-channel mixing. SOTA results: 94.6%/82.7% on Cholec80, 89.5%/68.9% on AutoLaparo, 238.74 fps on single GPU.

Reasoning Benchmarks Vision

arXiv cs.AI·May 19

WriteSAE: Sparse Autoencoders for Recurrent State

WriteSAE introduces the first sparse autoencoder decomposing and editing matrix cache writes in Gated DeltaNet, Mamba-2, and RWKV-7 recurrent models. Factored atoms expose closed-form logit shifts per token, achieving 92.4% successful substitutions across 4,851 firings on Qwen3.5-0.8B and 88.1% on Mamba-2-370M.

Papers Reasoning Evals

arXiv cs.AI·May 19

Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models

Researchers localize 'entity cells'—selective MLP neurons encoding entity-specific facts—across seven language models. On Qwen2.5-7B, suppressing a cell selectively erases recall for its matched entity while activating a single cell recovers knowledge even without context. These cells remain stable under aliases, acronyms, and multilingual forms.

Benchmarks

arXiv cs.AI·May 19

OxyGen: Unified KV Cache Management for VLA Inference under Multi-Task Parallelism

OxyGen proposes unified KV cache management for Vision-Language-Action (VLA) model inference under multi-task parallelism. Implemented on π₀.₅, the system achieves 3.7× speedup on RTX 4090 and Jetson AGX Thor, delivering 200+ tokens/s and 70 Hz simultaneously without quality degradation.

Vision AI Agents Robotics

arXiv cs.CL·May 19

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

Soohak is a 439-problem research-level math benchmark authored by 64 mathematicians. Gemini-3-Pro reaches 30.4%, GPT-5 26.4%, Claude-Opus-4.5 10.4%. The benchmark introduces a refusal subset evaluating the ability to recognize ill-posed problems: no model exceeds 50%.

Benchmarks Reasoning GPT

arXiv cs.AI·May 19

Reverse-Engineering Model Editing on Language Models

Researchers reveal a critical vulnerability in locate-then-edit model editing methods: the parameter updates let attackers recover the edited data via the KSTER attack, which exploits the updates' low-rank structure. A defense using subspace camouflage is proposed to obfuscate these fingerprints without compromising editing utility.

AI safety Alignment Papers

arXiv cs.CL·May 19

Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution

Med-V1 is a family of 3-billion-parameter language models trained on synthetic data for biomedical evidence attribution and fact verification. It outperforms base models by +27% to +71% on five benchmarks and rivals GPT-5 while being far more efficient. The study quantifies hallucinations in LLM-generated answers under different citation instructions.

Benchmarks Fine-tuning Evals

Reddit r/LocalLLaMA·May 18

llama.cpp MTP support landed - Qwen3.6 27B at 2.44× on a Strix Halo, 2.17× on a RTX 3090 rig

MTP (multi-token prediction) support for speculative decoding merged into llama.cpp (PR #22673, May 16). Qwen 3.6 27B benchmarks: 1.81×–2.44× speedup on Strix Halo (ROCm), 1.54×–2.17× on RTX 3090. MoE 35B-A3B shows smaller gains (1.24×–1.40×). Enable with --spec-type draft-mtp --spec-draft-n-max N.

Llama Code generation Benchmarks

OpenAI Blog·Sep 30

Sora 2 System Card

OpenAI releases Sora 2, a video-audio generation model with improved physics accuracy, sharper realism, synchronized audio, enhanced steerability, and expanded stylistic range. Direct successor to Sora addressing longstanding challenges in video synthesis.

OpenAI Video generation Image generation

OpenAI Blog·Aug 28

Introducing gpt-realtime and Realtime API updates

OpenAI releases gpt-realtime, an advanced speech-to-speech model, with new API capabilities: MCP server support, image input, and SIP phone calling. Major Realtime API update enabling voice and multimodal integrations.

OpenAI GPT Voice

OpenAI Blog·May 16

Introducing Codex

OpenAI introduces Codex, a cloud-based software engineering agent powered by codex-1, a version of o3 optimized for software engineering. It can work on multiple tasks in parallel, such as writing features, fixing bugs, and proposing pull requests. Available as a research preview to ChatGPT Pro, Enterprise, and Team users.

OpenAI Code generation GPT

OpenAI Blog·Oct 1

Introducing vision to the fine-tuning API

OpenAI adds vision to the fine-tuning API. Developers can now fine-tune GPT-4o with images and text to improve the model's visual capabilities.

GPT OpenAI Fine-tuning

OpenAI Blog·May 31

Improving mathematical reasoning with process supervision

OpenAI trains a model using process supervision (rewarding each correct reasoning step) instead of outcome supervision (rewarding final answers). This approach achieves state-of-the-art mathematical problem solving and improves alignment by directly training models to produce human-endorsed chain-of-thought reasoning.

OpenAI Reasoning Reinforcement learning

OpenAI Blog·Oct 31

Reinforcement learning with prediction-based rewards

OpenAI introduces Random Network Distillation (RND), a prediction-based reinforcement learning method that encourages exploration through curiosity. RND exceeds average human performance on Montezuma's Revenge for the first time.

OpenAI Reinforcement learning Reasoning

OpenAI Blog·Jul 4

Learning Montezuma’s Revenge from a single demonstration

OpenAI trains an agent to achieve 74,500 on Montezuma's Revenge from a single human demonstration, surpassing all published results. The algorithm replays sequences from key states in the demo and optimizes score using PPO.

Reinforcement learning AI Agents Benchmarks

arXiv cs.CL·Jun 18

MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

MCompassRAG improves RAG systems by using topic-level metadata as a semantic compass for paragraph-level retrieval. The method enriches chunk representations with topic signals in the same embedding space and trains a lightweight retriever via LLM-teacher distillation. Across six benchmarks, it gains 8.24% in information efficiency with 5× lower latency than efficient RAG baselines.

RAG Embeddings Benchmarks

arXiv cs.CL·Jun 18

Output Vector Editing for Memorization Mitigation in Large Language Models

Memorization suppression method in LLMs via output vector editing of MLP neurons. Tested on 4 models (360M-7B parameters), achieves 87.9% suppression on OLMo-7B with 6831 memorized sequences. Complementary approach to existing neuron ablation methods.

AI safety Alignment Papers

arXiv cs.CL·Jun 18

Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

Local de-identification framework for educational dialogues. Two-stage cascade: union proposer (lightweight encoders + deterministic rules) generates PII candidates, then binary Redact/Keep reviewer uses dialogue context and speaker role. Achieves 0.958 macro F1 on math tutoring transcripts, outperforms commercial API (0.706) and local LLM baseline (0.767), runs on single laptop.

RAG AI safety Papers

arXiv cs.CL·Jun 18

VISUALSKILL: Multimodal Skills for Computer-Use Agents

VISUALSKILL introduces hierarchical multimodal skills for computer-use agents. Combining authored documentation with live UI exploration, the system improves Claude Opus 4.6 performance by +15.3 points on CUA-World and OSExpert-Eval (0.456 vs 0.303 baseline). Visual figures outperform text-only descriptions (+8.3 points).

Claude AI Agents MCP

arXiv cs.CL·Jun 18

TW-LegalBench: Measuring Taiwanese Legal Understanding

TW-LegalBench evaluates 13 LLMs on Taiwanese law using 16,000+ multiple-choice questions, 117 open-ended essays, and 14,000+ legal judgment prediction cases. Top models exceed lawyer qualification threshold (11%) but fall short for judges/prosecutors (1-2%). Models struggle to cite exact legal articles.

Benchmarks Evals Reasoning

arXiv cs.CL·Jun 18

Dual Dimensionality for Local and Global Attention

Researchers propose Distance-Adaptive Representation (DAR): reduce key/value dimensionality beyond a local window in decoder-only Transformers. Nearby tokens require full representations for next-token prediction, while distant tokens can use 1/4 original dimensionality without performance loss. Tested on 70M–410M models and 1B fine-tuning.

Reasoning Infrastructure Benchmarks

arXiv cs.CL·Jun 18

RedactionBench

RedactionBench is a manually annotated benchmark of 200 documents across 11 domains for evaluating PII redaction in context. Introduced with R-Score, a character-level metric, it shows 35 models (NER, SLM, frontier models) fail on contextual redactions: human consensus 89.4% for mandatory redactions, 47.7% for contextual ones.

Benchmarks AI safety Evals
