arXiv cs.AI·1 June 2026

Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward

In three lines: DecomposeR is a deep research framework that trains Qwen3-8B in two RL stages. Planner RL first learns typed DAG structures and query decomposition; answerer RL then learns branch execution and synthesis. The framework reports gains of 5.1 to 8.0 points on long-form benchmarks, attributed to explicit planning and structured rewards.

Qwen · Reinforcement learning · Reasoning · RAG · Benchmarks

Summary generated by Claude — human-verified
