arXiv cs.AI·28 May 2026

EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA

Signal

Hype

In three linesEAPO is an adaptive policy optimization method for training reasoning models in open-ended QA. It dynamically adjusts positive/negative sample weights based on current-to-initial entropy ratio to preserve exploration and stability. Tests on two medical QA datasets show improvements in diversity and stability versus fixed-weight baselines.

Read source

Your take?

Reinforcement learning Reasoning Evals

Summary generated by Claude — human-verified

EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA

Other angles on this story