arXiv cs.CL

Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment

Signal: 72
Hype: 25
In three lines: A new arXiv paper proposes HRC (Hybrid Reward-Cyclic), a reward model that uses game theory to decompose human preferences into a transitive (scalar) component and a cyclic (vector) component. It also introduces DSPPO (Dynamic Self-Play Preference Optimization) for dynamic alignment. Reported results: +1.23% on RewardBench 2 with Gemma-2B-it, and 44.75% on AlpacaEval 2.0.
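The summary does not give the paper's actual formulation, so the following is only a generic sketch of what splitting a preference into a transitive scalar part and a cyclic vector part can look like. Everything here is an assumption for illustration: the function names, the 2-D vectors, and the choice of a cross-product term as the antisymmetric (cyclic) part are hypothetical and are not taken from HRC.

```python
import math

# Illustrative only: NOT the HRC model from the paper.
# Assumption: preference logit = scalar reward gap (transitive part)
#             + antisymmetric 2-D cross product (cyclic part).

def preference_logit(r_a, r_b, v_a, v_b):
    """Logit that response a is preferred over response b."""
    transitive = r_a - r_b                       # Bradley-Terry style scalar gap
    cyclic = v_a[0] * v_b[1] - v_a[1] * v_b[0]   # flips sign when a and b swap
    return transitive + cyclic

def prob_prefers(r_a, r_b, v_a, v_b):
    """Sigmoid of the logit: probability that a beats b."""
    return 1.0 / (1.0 + math.exp(-preference_logit(r_a, r_b, v_a, v_b)))

def unit(angle_deg):
    """Unit vector at the given angle, used to place items on a circle."""
    t = math.radians(angle_deg)
    return (math.cos(t), math.sin(t))

# Rock-paper-scissors: equal scalar rewards, so only the cyclic part matters.
# Under this cross-product convention each item beats the one 120 degrees ahead.
REWARD = 0.0
ROCK, SCISSORS, PAPER = unit(0), unit(120), unit(240)

if __name__ == "__main__":
    print(prob_prefers(REWARD, REWARD, ROCK, SCISSORS) > 0.5)   # rock over scissors
    print(prob_prefers(REWARD, REWARD, SCISSORS, PAPER) > 0.5)  # scissors over paper
    print(prob_prefers(REWARD, REWARD, PAPER, ROCK) > 0.5)      # paper over rock
```

A scalar reward alone cannot represent this loop, since any scalar ranking is transitive; the antisymmetric vector term is what lets the toy model express a cycle. When both vectors are zero, the sketch reduces to an ordinary scalar-gap preference model.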
Reinforcement learning · Alignment · Papers · Benchmarks

Summary generated by Claude — human-verified