EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA
Signal: 72
Hype: 18
In three lines
EAPO is an adaptive policy optimization method for training reasoning models on open-ended QA. It dynamically adjusts the weights of positive and negative samples according to the ratio of current to initial entropy, with the aim of preserving exploration and stability. Tests on two medical QA datasets show improvements in diversity and stability over fixed-weight baselines.
Summary generated by Claude — human-verified