arXiv cs.CL · 19 May 2026

DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

In three lines: DISA is an offline RL method for LLMs that decouples partition-function estimation (via importance sampling) from policy optimization. On 9 benchmarks (math and code), it matches or exceeds FlowRL, outperforms GRPO/GSPO, and retains substantially more strategy-level diversity than reward-maximization baselines.

Reinforcement learning, Reasoning, Code generation, Papers, Benchmarks

Summary generated by Claude — human-verified
