Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling
Signal: 75
Hype: 35
In three lines: Babel is a black-box jailbreak method that exploits a vulnerability in LLM safety alignment. Safety relies on sparse attention heads, leaving the representational space weakly monitored. Through optimized obfuscation and iterative refinement, Babel achieves 82.67% success on GPT-4o and 78.33% on Claude-3-5-haiku within ~40 queries.
Summary generated by Claude — human-verified