arXiv cs.AI

Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling

Signal: 75
Hype: 35
In three lines
Babel is a black-box jailbreak method that exploits a weakness in LLM safety alignment: safety behavior relies on a sparse set of attention heads, which leaves the rest of the representational space weakly monitored. Using optimized obfuscation and iterative refinement, Babel achieves an 82.67% success rate on GPT-4o and 78.33% on Claude-3-5-haiku within about 40 queries.
AI safety, Alignment, GPT, Claude, Evals

Summary generated by Claude — human-verified