arXiv cs.AI

Learning Reasoning Rewards from Expert Demonstrations with Inverse Reinforcement Learning

Signal: 78
Hype: 25
In three lines: R-AIRL (Reasoning Adversarial Inverse Reinforcement Learning) infers process-level reward functions from expert chains of thought, with no explicitly defined reward. It is evaluated on GSM8K, MMLU-Pro, and MedReason. The learned reward improves pass@1 by 17.4 points through inference-time reranking, outperforms SFT when used for post-training, and localizes reasoning failures with 86.1% accuracy.
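
To make "inference-time reranking" and "failure localization" concrete, here is a minimal sketch of how a step-level reward could be used for both. It is not the paper's implementation: the `step_reward(prefix, step)` signature, the mean aggregation in `chain_score`, and the keyword-based `toy_step_reward` are all assumptions made for illustration. In practice the reward would be the learned model.

```python
from typing import Callable, List, Sequence

# Hypothetical signature: score one reasoning step given the steps before it.
StepReward = Callable[[Sequence[str], str], float]


def step_rewards(steps: Sequence[str], step_reward: StepReward) -> List[float]:
    """Score every step of a chain of thought, conditioning on its prefix."""
    return [step_reward(steps[:i], step) for i, step in enumerate(steps)]


def chain_score(steps: Sequence[str], step_reward: StepReward) -> float:
    """Aggregate step rewards into one score (mean is an assumed choice)."""
    rewards = step_rewards(steps, step_reward)
    if not rewards:
        return float("-inf")  # an empty chain should never win the rerank
    return sum(rewards) / len(rewards)


def rerank(candidates: Sequence[Sequence[str]], step_reward: StepReward) -> Sequence[str]:
    """Best-of-N reranking: return the sampled chain with the highest score."""
    return max(candidates, key=lambda chain: chain_score(chain, step_reward))


def weakest_step(steps: Sequence[str], step_reward: StepReward) -> int:
    """Failure localization: index of the lowest-reward step in a chain."""
    rewards = step_rewards(steps, step_reward)
    return min(range(len(rewards)), key=rewards.__getitem__)


def toy_step_reward(prefix: Sequence[str], step: str) -> float:
    """Stand-in reward for the demo only: penalize steps that admit guessing."""
    return -1.0 if "guess" in step.lower() else 1.0
```

Calling `rerank` on several sampled chains returns the one the reward prefers, and `weakest_step` on a single chain points at the step most likely to be at fault. A real system might aggregate with a minimum or a sum instead of a mean.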
Reinforcement learning · Reasoning · Evals

Summary generated by Claude — human-verified