BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization
Signal: 72
Hype: 25
In three lines: BiasGRPO applies Group Relative Policy Optimization (GRPO) to mitigate social bias in LLMs. By normalizing rewards across the completions sampled for each prompt, the method trains more stably than DPO and PPO. The authors release a compute-efficient bias reward model and an extended dataset for multi-objective RLHF.
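To make the reward normalization step concrete, here is a minimal sketch of the group-relative advantage commonly used in GRPO: each completion's reward is centered and scaled by the mean and standard deviation of its own group. The summary does not give the paper's exact formula, so the function name, the `eps` term, the choice of population standard deviation, and the sample reward values are all illustrative assumptions, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero; the paper may use a different variant.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical bias-reward scores for four completions of a single prompt.
rewards = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(rewards)
```

Because each advantage is measured relative to the group, a prompt whose completions all score high or all score low contributes little gradient signal, which is one plausible reason this kind of normalization helps when rewards are high-variance across prompts.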
Summary generated by Claude — human-verified