arXiv cs.AI

When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models

Signal: 78
Hype: 22
In three lines
An arXiv study finds that aligned language models fail to adapt their safety behavior when the context flips ("brittle safety"). Tests on 12 models show a safety-commonsense gap of +17.4 pp. Current guardrails miss consequence flips; a state-aware validator catches all of them without false alarms.
AI safety, Alignment, Evals, Benchmarks

Summary generated by Claude — human-verified