arXiv cs.AI·19 May 2026

CodeScaler: Scaling Code LLM Training and Test-Time Inference via Reward Models

In three lines: CodeScaler is a reward model for training and inference scaling of code LLMs. Trained on verified preference data, it outperforms execution-based RL by +1.55 points on Qwen3-8B and +4.23 on Qwen3-14B. At inference, it reduces latency 10× while maintaining performance comparable to unit-test approaches.

Code generation · Reinforcement learning · Qwen · Benchmarks · Evals

Summary generated by Claude — human-verified
