BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization
BiasGRPO applies Group Relative Policy Optimization (GRPO) to mitigate social bias in LLMs. By normalizing rewards across the completions sampled for each prompt, the method trains more stably than DPO and PPO. The authors release a compute-efficient bias reward model and an extended dataset for multi-objective RLHF.
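
The reward normalization mentioned above can be illustrated with a minimal sketch of the group-relative advantage commonly used in GRPO: each completion's reward is centered on the mean of its group and scaled by the group's standard deviation. The function name, the `eps` term, the use of the population standard deviation, and the example reward values are all assumptions made for illustration, not details taken from BiasGRPO.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every completion receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical bias-reward scores for four completions of a single prompt.
rewards = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group rather than from a learned value function, a prompt whose completions all score alike contributes advantages near zero, which is one plausible reason such normalization helps when rewards are high-variance across prompts.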