CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning
Signal: 82
Hype: 18
In three lines
CHASE is a co-evolutionary red-blue teaming framework that trains an attacker and a defender via GRPO to improve LLM robustness against prompt-rewriting attacks such as persona modulation and fictional framing. Evaluated on BeaverTails and JailbreakBench, it reduces the StrongREJECT score by 43.2% with 0% false refusals on benign prompts.
Summary generated by Claude — human-verified