arXiv cs.CL

Configurable Reward Model for Balanced Safety Alignment

Signal: 78 · Hype: 22
In three lines: CSRM (Configurable Safety Reward Model) jointly optimizes calibrated safety compliance and reward modeling to adapt LLMs to heterogeneous and evolving safety requirements. It achieves 94.6% F1 on CoSApien and 75.8% F1 on DynaBench without additional human annotation.
AI safety · Alignment · Reinforcement learning · Benchmarks

Summary generated by Claude — human-verified