arXiv cs.CL·28 May 2026

PAST2HARM: A Simple Adaptive Past Tense Attack for Jailbreaking Multimodal AI

Signal

Hype

In three linesPAST2HARM is an adaptive jailbreak attack exploiting past tense reformulation to bypass safeguards in multimodal text-to-image models. Tested on Gemini Nano, GPT Image 2, and SD XL, it achieves 83%, 67%, and 100% success rates. The attack generates explicit sexual content, political disinformation, and hate speech.

Read source

Your take?

AI safety Alignment Vision Evals Benchmarks

Summary generated by Claude — human-verified

PAST2HARM: A Simple Adaptive Past Tense Attack for Jailbreaking Multimodal AI

Other angles on this story