arXiv cs.CL

Universal Adversarial Triggers

Signal: 72
Hype: 25
In three lines: A study of universal adversarial attacks in NLP. The authors propose a method that combines part-of-speech (POS) filtering with a perplexity-based loss to generate natural-sounding triggers. On SST (sentiment analysis), the triggers reduce model accuracy to 0.04-0.12. Adversarial training improves model robustness from 0.12 to 0.48.
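
The summary does not say how the POS filter and the perplexity term are combined, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes each candidate trigger token already carries a part-of-speech tag, an adversarial loss (higher means a stronger attack), and a language-model negative log-likelihood (lower means more natural; perplexity is its exponential). The names `Candidate`, `ALLOWED_POS`, `score`, `pick_trigger_token`, and the `weight` parameter are all hypothetical, and the numbers are toy values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate trigger token with precomputed toy scores."""
    token: str
    pos: str          # coarse part-of-speech tag
    adv_loss: float   # classifier loss under attack (higher = stronger attack)
    lm_nll: float     # language-model NLL (lower = more natural; ppl = exp(nll))


# Assumed filter set: keep only content-word tags.
ALLOWED_POS = frozenset({"NOUN", "VERB", "ADJ", "ADV"})


def score(c: Candidate, weight: float = 0.5) -> float:
    """Trade attack strength against naturalness with a linear penalty."""
    return c.adv_loss - weight * c.lm_nll


def pick_trigger_token(
    candidates: Iterable[Candidate],
    allowed_pos: frozenset = ALLOWED_POS,
    weight: float = 0.5,
) -> Optional[Candidate]:
    """Drop candidates with disallowed POS tags, then return the best scorer."""
    kept = [c for c in candidates if c.pos in allowed_pos]
    if not kept:
        return None
    return max(kept, key=lambda c: score(c, weight))


# Toy values, invented for illustration only.
candidates = [
    Candidate("zxq", "X", adv_loss=3.0, lm_nll=9.0),        # removed by POS filter
    Candidate("dreadful", "ADJ", adv_loss=2.4, lm_nll=2.0),
    Candidate("movie", "NOUN", adv_loss=0.6, lm_nll=1.0),
    Candidate("vomitous", "ADJ", adv_loss=2.6, lm_nll=6.0),  # strong but unnatural
]

best = pick_trigger_token(candidates)
```

With the penalty switched off (`weight=0.0`), the same toy data selects the less natural token with the highest raw adversarial loss, which is the behaviour the naturalness term is meant to discourage.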
AI safety · Alignment

Summary generated by Claude — human-verified