arXiv cs.CL · 5 June 2026

CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning

In three lines: CHASE is a co-evolutionary red-blue teaming framework that trains an attacker and a defender with GRPO (Group Relative Policy Optimization) to make LLMs more robust to prompt-rewriting attacks such as persona modulation and fictional framing. Evaluated on BeaverTails and JailbreakBench, it reduces the StrongREJECT score by 43.2% with 0% false refusals on benign prompts.

AI safety · Alignment · Reinforcement learning · Evals

Summary generated by Claude — human-verified
