arXiv cs.LG

Curriculum Learning for Safety Alignment

Signal: 78, Hype: 25
In three lines: Staged-Competence, a curriculum learning framework, improves robustness of DPO-based safety alignment. Across three model families, it reduces out-of-distribution harmful response rates by 16% and jailbreak attack success rates by 20%, while preserving general capabilities. The framework achieves baseline safety with 75% of training data.
AI safety, Alignment, Reinforcement learning, Papers

Summary generated by Claude — human-verified