arXiv cs.AI·19 May 2026

Learning Reasoning Rewards from Expert Demonstrations with Inverse Reinforcement Learning

In three lines

R-AIRL (Reasoning Adversarial Inverse Reinforcement Learning) infers process-level reward functions from expert chain-of-thought demonstrations, without an explicitly defined reward. Evaluated on GSM8K, MMLU-Pro, and MedReason, it improves pass@1 by 17.4 points through inference-time reranking, outperforms SFT when used for post-training, and localizes reasoning failures with 86.1% accuracy.
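To make the reranking and failure-localization claims concrete, here is a minimal sketch of how a step-level reward could be used at inference time. It is not the paper's implementation: the mean-over-steps aggregation, the threshold, and the names `rerank`, `first_low_reward_step`, and `toy_reward` are all assumptions for illustration, and the toy reward stands in for a learned reward model.

```python
from typing import Callable, Optional, Sequence

StepReward = Callable[[str], float]


def rerank(candidates: Sequence[Sequence[str]], step_reward: StepReward) -> int:
    """Return the index of the candidate chain with the highest mean step reward.

    Mean aggregation is an assumption; the source does not say how
    step-level rewards are combined into a score for a whole chain.
    """
    def score(steps: Sequence[str]) -> float:
        if not steps:
            return float("-inf")  # an empty chain can never win
        return sum(step_reward(s) for s in steps) / len(steps)

    return max(range(len(candidates)), key=lambda i: score(candidates[i]))


def first_low_reward_step(
    steps: Sequence[str], step_reward: StepReward, threshold: float = 0.0
) -> Optional[int]:
    """Return the index of the first step scored below the threshold, else None.

    A hypothetical way to localize a reasoning failure with a process reward.
    """
    for i, step in enumerate(steps):
        if step_reward(step) < threshold:
            return i
    return None


def toy_reward(step: str) -> float:
    """Hypothetical stand-in for a learned reward: penalize steps marked WRONG."""
    return -1.0 if "WRONG" in step else 1.0


candidates = [
    ["2 + 3 = 5", "WRONG: 5 * 4 = 25"],
    ["2 + 3 = 5", "5 * 4 = 20"],
]
best = rerank(candidates, toy_reward)
bad_step = first_low_reward_step(candidates[0], toy_reward)
```

In this toy run, `best` selects the second chain and `bad_step` points at the second step of the first chain. A real setup would replace `toy_reward` with the learned reward model's score for each step.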

Reinforcement learning · Reasoning · Evals

Summary generated by Claude — human-verified
