arXiv cs.CL

Token-weighted Direct Preference Optimization with Attention

Signal: 78
Hype: 25
In three lines
Token-weighted DPO (TwDPO) and AttentionPO are preference-optimization methods that weight each token by its importance.
AttentionPO estimates those weights from the model's own attention, without a separate reward model.
The paper reports improvements on AlpacaEval, MT-Bench, and ArenaHard.
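To make the token-weighting idea concrete, here is a minimal sketch in plain Python. It is not the paper's formulation, which this summary does not give. It assumes the weights simply scale each token's policy-versus-reference log-probability ratio inside a standard DPO loss, and that raw importance scores (for example, attention mass per token) are rescaled to average 1. The names `normalize_weights`, `weighted_logratio`, and `token_weighted_dpo_loss` are hypothetical.

```python
import math

def normalize_weights(scores):
    """Rescale non-negative importance scores so they average to 1.

    Assumption: the scores could come from attention mass per token;
    how AttentionPO actually derives them is not stated in the summary.
    """
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    n = len(scores)
    return [s * n / total for s in scores]

def weighted_logratio(policy_logps, ref_logps, weights):
    """Sum over tokens of weight * (policy log-prob - reference log-prob)."""
    if not (len(policy_logps) == len(ref_logps) == len(weights)):
        raise ValueError("per-token lists must have equal length")
    return sum(w * (p - r) for p, r, w in zip(policy_logps, ref_logps, weights))

def token_weighted_dpo_loss(chosen, rejected, beta=0.1):
    """DPO-style loss with per-token weights (illustrative assumption).

    chosen / rejected: tuples of (policy_logps, ref_logps, weights).
    """
    margin = beta * (weighted_logratio(*chosen) - weighted_logratio(*rejected))
    # -log(sigmoid(margin)), computed stably as softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical per-token log-probabilities for a two-token response pair.
chosen = ([-1.0, -2.0], [-1.5, -2.5], [1.0, 1.0])
rejected = ([-3.0, -1.0], [-2.0, -1.0], [1.0, 1.0])
loss = token_weighted_dpo_loss(chosen, rejected, beta=0.1)
```

With every weight equal to 1, as in the usage lines above, the loss reduces to the standard sequence-level DPO objective; non-uniform weights shift the margin toward the tokens judged more important.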
Tags: Reinforcement learning, Alignment, Benchmarks, Papers

Summary generated by Claude — human-verified