arXiv cs.AI

EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA

Signal: 72 · Hype: 18
In three lines
EAPO is an adaptive policy optimization method for training reasoning models on open-ended QA. It dynamically adjusts the weights of positive and negative samples based on the ratio of current to initial entropy, with the aim of preserving exploration and training stability. Tests on two medical QA datasets show improvements in diversity and stability over fixed-weight baselines.
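To make the weighting idea concrete, here is a minimal Python sketch. The summary does not give EAPO's actual formula, so everything here is an assumption for illustration: the function names, the clamping bounds `lo` and `hi`, and the rule itself (when entropy falls below its initial value, shrink the weight on positive samples and grow the weight on negative ones). It should not be read as the paper's method.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sample_weights(current_entropy, initial_entropy, lo=0.5, hi=1.5):
    """Hypothetical rule mapping the entropy ratio to (w_pos, w_neg).

    ASSUMPTION: this mapping is illustrative only and is not taken from
    the EAPO paper. A ratio of 1 gives equal weights; a ratio below 1
    (entropy has dropped) down-weights positives and up-weights negatives.
    """
    if initial_entropy <= 0:
        raise ValueError("initial_entropy must be positive")
    ratio = current_entropy / initial_entropy
    w_pos = min(max(ratio, lo), hi)
    w_neg = min(max(2.0 - ratio, lo), hi)
    return w_pos, w_neg


def weighted_advantage(advantage, w_pos, w_neg):
    """Scale an advantage by the weight matching its sign."""
    return w_pos * advantage if advantage > 0 else w_neg * advantage
```

For example, `sample_weights(1.0, 2.0)` corresponds to entropy having halved since the start of training, and under this assumed rule the negative-sample weight exceeds the positive-sample weight.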
Reinforcement learning · Reasoning · Evals

Summary generated by Claude — human-verified