Token-weighted Direct Preference Optimization with Attention
Signal: 78
Hype: 25
In three lines
Token-weighted DPO (TwDPO) and AttentionPO introduce preference optimization that weights tokens by importance. AttentionPO uses the model's own attention to estimate weights without a separate reward model. Results: improvements on AlpacaEval, MT-Bench, ArenaHard.
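
To make the token-weighting idea concrete, here is a minimal sketch of what a token-weighted DPO loss could look like. It is one plausible reading of the summary, not the paper's exact objective: the function names, the dictionary layout, the weight normalization, and the default `beta` are all assumptions made for illustration. With every weight set to 1, the loss reduces to the standard DPO loss on summed token log-probabilities.

```python
import math


def normalize_weights(scores):
    """Scale non-negative importance scores so they average to 1.

    Assumption: the scores could be, for example, attention mass received
    by each response token; the summary does not say how they are scaled.
    """
    total = sum(scores)
    n = len(scores)
    return [s * n / total for s in scores]


def weighted_logratio(policy_logps, ref_logps, weights):
    """Sum over tokens of weight * (policy log-prob - reference log-prob)."""
    return sum(w * (p - r) for w, p, r in zip(weights, policy_logps, ref_logps))


def token_weighted_dpo_loss(chosen, rejected, beta=0.1):
    """Hypothetical token-weighted DPO loss for a single preference pair.

    `chosen` and `rejected` are dicts with per-token lists under the keys
    'policy', 'ref', and 'weights' (an illustrative layout).
    """
    margin = beta * (
        weighted_logratio(chosen["policy"], chosen["ref"], chosen["weights"])
        - weighted_logratio(rejected["policy"], rejected["ref"], rejected["weights"])
    )
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


chosen = {"policy": [-1.0, -2.0], "ref": [-1.5, -2.5], "weights": [1.0, 1.0]}
rejected = {"policy": [-2.0], "ref": [-1.0], "weights": [1.0]}
uniform_loss = token_weighted_dpo_loss(chosen, rejected)

# Up-weighting a chosen token whose log-ratio is positive widens the margin.
chosen_weighted = dict(chosen, weights=normalize_weights([3.0, 1.0]))
weighted_loss = token_weighted_dpo_loss(chosen_weighted, rejected)
```

In this sketch the weights only rescale each token's contribution to the implicit reward margin. Where the weights come from (a reward model, or the model's own attention as the summary says of AttentionPO) is a separate choice.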
Summary generated by Claude — human-verified