arXiv cs.AI

The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure

A study of 'semantic hijacking' attacks on multi-agent systems, which exploit agent confidence. It identifies a paradox: increasing Worker capability raises the attack success rate from 18.4% to 63.9%. Mediation analysis indicates that the 'linguistic certainty' of stronger agents drives the vulnerability. The proposed defense, heterogeneous ensemble verification, reduces the attack success rate to 2%.

Multi-agent · AI Agents · AI safety
SIG 82 · HYP 15
arXiv cs.LG

ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference

ProxyKV introduces a cross-model proxy pruning framework to accelerate long-context LLM inference. A lightweight in-family small model evaluates KV cache importance asynchronously via HybridAxialMapper and Multi-Granularity Hybrid Loss. On Llama-3.1, Qwen-2.5, and Qwen-3, recovers 98.7% of KVZip accuracy with up to 3.21× prefilling speedup (Llama-3.1-8B, dual-GPU) and sustains speedup at contexts up to 170k tokens.

Llama · Qwen · Reasoning
SIG 82 · HYP 18
arXiv cs.AI

FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

FML-bench is a benchmark of 18 ML tasks across 10 domains that evaluates 6 AI research agents. Key findings: strategy complexity alone does not ensure performance (a greedy hill-climber matches tree search); effectiveness depends on the structure of the available improvement opportunities; and an adaptive agent that detects stagnation outperforms the others. The benchmark includes 12 process-level behavioral metrics.

AI Agents · Benchmarks · Reasoning
SIG 82 · HYP 15
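
One way to picture the "adaptive agent that detects stagnation" finding in the FML-bench summary above is a rule that switches from greedy refinement to broader exploration once recent scores stop improving. This is an illustrative sketch only: `stagnated`, `next_strategy`, `patience`, and `min_delta` are hypothetical names and thresholds, not the benchmark's or any evaluated agent's actual mechanism.

```python
from typing import Sequence


def stagnated(scores: Sequence[float], patience: int = 3,
              min_delta: float = 1e-3) -> bool:
    """Return True if the last `patience` scores did not beat the
    earlier best by at least `min_delta` (higher score = better)."""
    if len(scores) <= patience:
        return False  # not enough history to judge
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return best_recent - best_before < min_delta


def next_strategy(scores: Sequence[float]) -> str:
    """Hypothetical policy: keep exploiting while progress continues,
    switch to exploration once the search has stalled."""
    return "explore" if stagnated(scores) else "exploit"
```

For example, a score history that plateaus (`[0.1, 0.2, 0.3, 0.3, 0.3, 0.3]`) would trigger exploration, while one that is still climbing would not.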
arXiv cs.AI

MemRepair: Hierarchical Memory for Agentic Repository-Level Vulnerability Repair

MemRepair is a memory-augmented agentic framework for repository-level vulnerability repair. It combines three memory layers (History-Fix, Security-Pattern, Refinement-Trajectory) with an iterative refinement loop. Evaluated on SEC-Bench, PatchEval, and Multi-SWE-bench, MemRepair achieves 58.0%, 58.2%, and 30.58% resolution rates, outperforming OpenHands, SWE-agent, and InfCode-C++.

AI Agents · Code generation · AI safety
SIG 82 · HYP 18
arXiv cs.AI

Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents

Large-scale study of 64,380 SWE-bench runs across 126 agent configurations (43 frameworks × LLMs). Behavioral rules derived from single frameworks do not transfer: the same signal (e.g., error rate) correlates positively with issue resolution in 47 configs and negatively in 48. Framework identity explains 64% of variance vs. 10% for LLM family.

AI Agents · Benchmarks · Code generation
SIG 82 · HYP 15
arXiv cs.AI

Prompts Don't Protect: Architectural Enforcement via MCP Proxy for LLM Tool Access Control

LLMs used as autonomous agents select unauthorized tools despite explicit instructions. Study across Qwen 2.5 7B, Llama 3.1 8B, and Claude Haiku 3.5 shows an MCP proxy with attribute-based access control (ABAC) reduces unauthorized invocation rate to 0%, versus 11-18% for prompt-based restrictions. Architectural enforcement, not prompting, is required for reliable tool access control.

AI Agents · MCP · AI safety
SIG 82 · HYP 15
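
A minimal sketch of the architectural-enforcement idea in the MCP proxy summary above, assuming a default-deny policy check that sits between the agent and its tools. The names `Policy`, `is_allowed`, and `proxy_call`, and the attributes `role` and `sensitivity`, are hypothetical illustrations; they are not taken from the paper or from the MCP specification.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence


@dataclass(frozen=True)
class Policy:
    """Hypothetical ABAC rule: which role may call which tool,
    up to what resource sensitivity level."""
    tool: str
    required_role: str
    max_sensitivity: int


def is_allowed(policies: Sequence[Policy], agent: Mapping[str, str],
               tool: str, sensitivity: int) -> bool:
    """Default-deny: permit only if some policy matches every attribute."""
    return any(
        p.tool == tool
        and agent.get("role") == p.required_role
        and sensitivity <= p.max_sensitivity
        for p in policies
    )


def proxy_call(policies: Sequence[Policy], agent: Mapping[str, str],
               tool: str, sensitivity: int,
               backend: Callable[[str], str]) -> str:
    """Forward the call only after the policy check passes.
    The model's prompt plays no part in the decision."""
    if not is_allowed(policies, agent, tool, sensitivity):
        raise PermissionError(f"{tool!r} denied for role {agent.get('role')!r}")
    return backend(tool)


# Example policy set (assumed values, for illustration only).
POLICIES = [Policy(tool="read_file", required_role="analyst", max_sensitivity=2)]
```

Because the check runs in the proxy, a call to a tool outside the policy set raises `PermissionError` regardless of what the prompt told the model.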
arXiv cs.AI

Can LLM Agents Be CFOs? Benchmarking Long-Horizon Resource Allocation in an Uncertain Enterprise Environment

EnterpriseArena, a 132-month CFO simulator, benchmarks LLM agents' ability to allocate resources over long horizons under uncertainty. Across 23 models and 4 frameworks, only 15.4% of trials complete the full horizon, and larger models do not reliably outperform smaller ones. The results reveal a critical capability gap in managing binding commitments under partial observability.

AI Agents · Benchmarks · Reasoning
SIG 82 · HYP 18
arXiv cs.CL

PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts

PARALLAX reveals that 4 of 6 major hallucination detection benchmarks embed the ground-truth answer in the prompt, allowing a naive baseline (TxTemb) to achieve near-perfect detection without access to model internals. Evaluation of 22 methods across 12 open-source models: most fail under controlled conditions, except SAPLMA and DRIFT (supervised probes on upper-layer hidden states).

Benchmarks · Evals · AI safety
SIG 82 · HYP 15