arXiv cs.CL

DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

Signal: 78
Hype: 15
In three lines
DISA is an offline RL method for LLMs that decouples partition-function estimation (via importance sampling) from policy optimization. On 9 benchmarks (math and code), it matches or exceeds FlowRL, outperforms GRPO/GSPO, and retains substantially more strategy-level diversity than reward-maximization baselines.
Reinforcement learning · Reasoning · Code generation · Papers · Benchmarks

Summary generated by Claude — human-verified