arXiv cs.CL

Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback

Signal: 78
Hype: 15
In three lines
A study of "cognitive poisoning", in which malicious tools accumulate trust through benign feedback before turning harmful.
The paper presents TRUST-Bench (1,970 episodes) and VISTA-Guard, a defense that scores the risk of the agent's final action from the interaction trajectory.
Prompt-centric heuristics fail, while trajectory-aware scoring achieves 84.2% in-domain performance.
Tags: AI Agents, AI safety, Benchmarks, Papers

Summary generated by Claude — human-verified