arXiv cs.AI

Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward

Signal: 75 · Hype: 25
In three lines: DecomposeR, a deep research framework, trains Qwen3-8B in two RL stages. Planner RL first learns typed DAG structures and query decomposition; answerer RL then learns branch execution and synthesis. It achieves 5.1-8.0 point improvements on long-form benchmarks through explicit planning and structured rewards.
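To make "typed DAG structures" concrete, here is a minimal sketch of what a planner's output and a structural check might look like. It is not the paper's implementation: the node types, the `PlanNode` fields, and the binary `structure_reward` are all assumptions made for illustration, and the summary does not say how the actual reward is computed.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter

# Hypothetical node types; the summary only says the DAG is "typed".
NODE_TYPES = {"search", "reason", "synthesize"}


@dataclass
class PlanNode:
    node_id: str
    node_type: str
    query: str
    depends_on: list[str] = field(default_factory=list)


def execution_order(plan: list[PlanNode]) -> list[str]:
    """Return node ids so that every node comes after its dependencies.

    Raises ValueError if the plan is not a well-formed typed DAG.
    """
    ids = {node.node_id for node in plan}
    if len(ids) != len(plan):
        raise ValueError("duplicate node id")
    for node in plan:
        if node.node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node.node_type}")
        missing = set(node.depends_on) - ids
        if missing:
            raise ValueError(f"unknown dependencies: {sorted(missing)}")
    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {node.node_id: set(node.depends_on) for node in plan}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError("plan contains a cycle") from exc


def structure_reward(plan: list[PlanNode]) -> float:
    """Toy structural reward (assumption): 1.0 for a valid plan, else 0.0."""
    try:
        execution_order(plan)
    except ValueError:
        return 0.0
    return 1.0


# Example: two independent search branches feeding one synthesis node.
valid_plan = [
    PlanNode("q1", "search", "What is method A?"),
    PlanNode("q2", "search", "What is method B?"),
    PlanNode("s", "synthesize", "Compare A and B", depends_on=["q1", "q2"]),
]

# A malformed plan whose two nodes depend on each other.
cyclic_plan = [
    PlanNode("a", "reason", "step a", depends_on=["b"]),
    PlanNode("b", "reason", "step b", depends_on=["a"]),
]
```

In this sketch the two search branches have no dependency on each other, so an answerer could run them in either order before the synthesis node; a plan with a cycle or an unknown node type simply scores zero.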
Qwen, Reinforcement learning, Reasoning, RAG, Benchmarks

Summary generated by Claude — human-verified