Curriculum Learning for Safety Alignment
Signal: 78
Hype: 25
In three lines
Staged-Competence, a curriculum learning framework, improves robustness of DPO-based safety alignment. Across three model families, it reduces out-of-distribution harmful response rates by 16% and jailbreak attack success rates by 20%, while preserving general capabilities. The framework achieves baseline safety with 75% of training data.
Summary generated by Claude — human-verified