Improving mathematical reasoning with process supervision
OpenAI trains a model using process supervision (rewarding each correct reasoning step) instead of outcome supervision (rewarding final answers). This approach achieves state-of-the-art mathematical problem solving and improves alignment by directly training models to produce human-endorsed chain-of-thought reasoning.