arXiv cs.AI · 19 May 2026

Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment

In three lines

A new arXiv paper proposes HRC (Hybrid Reward-Cyclic), a reward model that uses game theory to decompose human preferences explicitly into a transitive (scalar) component and a cyclic (vector) component. It also introduces DSPPO (Dynamic Self-Play Preference Optimization) for alignment. Reported results: +1.23% on RewardBench 2 over GPM, and a 44.75% win rate on AlpacaEval 2.0 with Gemma-2B-it.
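To make the transitive/cyclic split concrete, here is a minimal sketch in Python. It is not the paper's HRC parameterization, which this summary does not specify. It assumes a hypothetical form in which each response carries a scalar reward `r` and a 2-D vector `v`, and the preference logit is the scalar difference plus a skew-symmetric (cross-product) term. All function and variable names are illustrative.

```python
import math


def preference_logit(r_a, v_a, r_b, v_b):
    """Logit that response a is preferred over response b (hypothetical form)."""
    # Transitive part: a plain difference of scalar rewards.
    transitive = r_a - r_b
    # Cyclic part: 2-D cross product, which is skew-symmetric in (a, b).
    cyclic = v_a[0] * v_b[1] - v_a[1] * v_b[0]
    return transitive + cyclic


def preference_prob(r_a, v_a, r_b, v_b):
    """Sigmoid of the logit, so P(a > b) + P(b > a) = 1."""
    return 1.0 / (1.0 + math.exp(-preference_logit(r_a, v_a, r_b, v_b)))


def unit(theta):
    """Unit vector at angle theta (radians)."""
    return (math.cos(theta), math.sin(theta))


# Three responses with equal scalar rewards and vectors 120 degrees apart:
# the scalar term cancels, and the vector term alone produces a cycle.
a, b, c = [(0.0, unit(k * 2 * math.pi / 3)) for k in range(3)]

# Three responses with zero vectors: only the scalar term remains,
# so preferences follow a strict, transitive ranking.
x, y, z = (2.0, (0.0, 0.0)), (1.0, (0.0, 0.0)), (0.0, (0.0, 0.0))

if __name__ == "__main__":
    print("cyclic:", preference_prob(*a, *b), preference_prob(*b, *c), preference_prob(*c, *a))
    print("transitive:", preference_prob(*x, *y), preference_prob(*y, *z), preference_prob(*x, *z))
```

In this toy setup a beats b, b beats c, and c beats a, which no single scalar reward can represent, while x, y, z are ranked consistently by their scalar rewards alone.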

Reinforcement learning Alignment Papers Benchmarks

Summary generated by Claude — human-verified
