arXiv cs.AI

BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization

Signal
72
Hype
25
In three lines
BiasGRPO applies Group Relative Policy Optimization (GRPO) to social-bias mitigation in LLMs. By normalizing rewards across the completions sampled for each prompt, the method trains more stably than DPO and PPO. The authors also release a compute-efficient bias reward model and an extended dataset for multi-objective RLHF.
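
For readers unfamiliar with reward normalization across sampled completions, here is a minimal sketch of the generic group-relative advantage used in GRPO-style training. It is not taken from the BiasGRPO paper: the function name, the `eps` constant, the choice of population standard deviation, and the example reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sampled completions.

    Each reward is centered on the group mean and scaled by the group
    standard deviation, so the advantage reflects how a completion
    compares with its siblings rather than its absolute score.
    Assumption: population std and a small eps to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical bias-reward scores for four completions of a single prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
```

Because the advantages are relative to the group, a prompt whose completions all receive the same reward contributes zero advantage, which is one intuition for why this kind of normalization can dampen the effect of a noisy reward scale.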
Reinforcement learning, Alignment, AI safety, Benchmarks

Summary generated by Claude — human-verified