arXiv cs.CL · 19 May 2026

Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback


In three lines: A study of 'cognitive poisoning', in which malicious tools accumulate trust through benign feedback before turning harmful. The paper introduces TRUST-Bench (1,970 episodes) and VISTA-Guard, a defense based on final-action risk scoring from the interaction trajectory. Prompt-centric heuristics fail; trajectory-aware scoring achieves 84.2% in-domain performance.


AI Agents · AI safety · Benchmarks · Papers

Summary generated by Claude — human-verified
