Configurable Reward Model for Balanced Safety Alignment
Signal
78
Hype
22
In three lines
CSRM (Configurable Safety Reward Model) jointly optimizes calibrated safety compliance and reward modeling to adapt LLMs to heterogeneous and evolving safety requirements. It achieves 94.6% F1 on CoSApien and 75.8% F1 on DynaBench without additional human annotation.
Summary generated by Claude — human-verified