arXiv cs.CL

Scaling Laws for Code: A More Data-Hungry Regime

In three lines: Empirical study of 117 experiments (0.2B–3.8B parameters, 2B–128B tokens) on scaling laws for code LLMs. Code requires a higher data-to-parameter ratio than natural language. The Farseer law outperforms Chinchilla. In code-NL mixtures, adding NL helps under resource constraints but hurts at higher compute budgets.

## Scaling Laws for Code: Why Chinchilla Falls Short

### 1. What Was Measured and How

The study covers 117 training runs, with models from 0.2B to 3.8B parameters trained on 2B to 128B tokens, a systematic grid that was previously missing for code-specific scaling analysis. The authors fit two scaling laws. The Chinchilla law (DeepMind, 2022) predicts loss as a function of parameter count N and training tokens D via L(N,D) = E + A/N^α + B/D^β, where E is the irreducible loss and the other two terms shrink as the model and the dataset grow; the compute-optimal allocation then follows from minimizing L under a fixed compute budget. The Farseer law is a more expressive formulation that adds an interaction term between N and D. Across the runs, Farseer achieves better predictive accuracy than Chinchilla. This matters directly: compute allocation decisions for the next generation of code models depend on the quality of these fits.
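As a minimal sketch of what evaluating the Chinchilla form looks like, the snippet below plugs in the coefficients reported by Hoffmann et al. (2022) for natural language. These are an assumption for illustration only: they are not the code-specific values fitted in this study, and the function name `chinchilla_loss` is hypothetical.

```python
# Chinchilla parametric loss: L(N, D) = E + A / N**alpha + B / D**beta.
# Coefficients are the natural-language fit from Hoffmann et al. (2022),
# NOT the code-specific values fitted in the study summarized here.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


if __name__ == "__main__":
    # Corners of the study's grid: 0.2B-3.8B parameters, 2B-128B tokens.
    for n in (0.2e9, 3.8e9):
        for d in (2e9, 128e9):
            print(f"N={n:.1e} D={d:.1e} -> L={chinchilla_loss(n, d):.3f}")
```

Because the two correction terms are additive, N and D affect the loss independently in this form; that independence is what a law with an N-D interaction term relaxes.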

### 2. The Data-to-Parameter Ratio: The Core Finding

Chinchilla prescribes, for natural language, an optimal ratio of roughly 20 tokens per parameter (70B model → ~1.4T tokens). This study's results indicate that code requires a substantially higher ratio. The authors do not publish a single normalized figure, but the direction is unambiguous: at equal compute budget, a compute-optimal code model is smaller and trained on more tokens than Chinchilla would recommend for text.
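To see how a tokens-per-parameter ratio turns into a model size and a token count, here is a rough sketch. It assumes the common approximation C ≈ 6·N·D for training FLOPs, which is not taken from the study, and the ratio of 60 is a purely hypothetical stand-in for a code-specific value. The helper `compute_optimal_split` is an illustrative name.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6  # assumption: training compute C ≈ 6 * N * D


def compute_optimal_split(compute_flops: float, tokens_per_param: float) -> tuple[float, float]:
    """Return (N, D) with 6 * N * D == compute_flops and D / N == tokens_per_param."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    budget = 6 * 70e9 * 1.4e12  # FLOPs of the 70B / 1.4T-token example
    for ratio in (20, 60):  # 60 is hypothetical, not a figure from the study
        n, d = compute_optimal_split(budget, ratio)
        print(f"ratio={ratio}: N={n / 1e9:.1f}B params, D={d / 1e12:.2f}T tokens")
```

At a fixed budget, raising the ratio by a factor k shrinks the model by √k and grows the token count by √k, which is the "smaller model, more data" shift described above.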

Why? A plausible explanation is that code has strict syntax, formal semantics, and a token distribution radically different from natural text. Identifiers, control structures, and function calls create long-range dependencies the model must reproduce exactly, not approximate. On this reading, code offers less of the statistical redundancy that NL scaling laws exploit, so every token carries more weight.

Direct consequence: labs that calibrated their code runs on Chinchilla have likely under-trained their models on data, even if they hit their FLOPs target. CodeLlama, StarCoder 2, and DeepSeek-Coder, trained on corpora ranging from roughly 500B to several trillion tokens, could be retrospectively repositioned as suboptimal if their tokens-per-parameter ratios fall below the code-optimal one.

### 3. Code-NL Mixtures: The High-Budget Trap

Experiments on code + natural language mixtures produce a counterintuitive but actionable result. In resource-constrained regimes (low compute budget, limited tokens available), adding NL to the training corpus improves performance on code. Plausible reasons: NL acts as a regularizer, transfers general reasoning capabilities, and compensates for the scarcity of high-quality code data.

But at high compute budgets, the relationship reverses: NL becomes a drag, and the model would have performed better consuming more pure code tokens. This bears directly on data-mixture decisions for current models. Models in the 7B–70B range trained with large-scale code/NL mixtures, such as Llama 3 with code fine-tuning or Qwen2.5-Coder, likely operate in the regime where NL hurts code performance, unless their code corpus is large enough to compensate.

### 4. Potential Losers and Practical Implications

**Chinchilla-calibrated code runs**: any organization that sized its code training runs on the standard Chinchilla law has potentially misallocated compute. The cost is real. To take an illustrative case, a 1B model trained on 20B tokens when 60B+ tokens were needed represents either wasted FLOPs or suboptimal performance.
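The arithmetic behind that example can be worked through under two assumptions that are not from the study: the common approximation C ≈ 6ND for training FLOPs, and a purely illustrative code ratio of r = 60 tokens per parameter.

$$
\begin{aligned}
C &\approx 6ND = 6 \times 10^{9} \times 2 \times 10^{10} = 1.2 \times 10^{20}\ \text{FLOPs} \\
N^{*} &= \sqrt{\frac{C}{6r}} = \sqrt{\frac{1.2 \times 10^{20}}{360}} \approx 5.8 \times 10^{8} \\
D^{*} &= r N^{*} \approx 3.5 \times 10^{10}
\end{aligned}
$$

Under these assumptions, the same budget would be better spent on a model of about 0.58B parameters trained on about 35B tokens. Keeping the 1B model and raising the data to 60B tokens instead triples the compute to 3.6 × 10^20 FLOPs.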

**Current benchmarks**: if the dominant models on HumanEval, MBPP, or SWE-bench are under-trained on data according to this new law, current leaderboard rankings compare models that all sit short of their compute-optimal point. Rankings and performance gaps could shift significantly if those models were retrained under a recalibrated allocation.

**Synthetic data providers**: the finding that code is data-hungry reinforces demand for high-quality code data and code synthesis pipelines (OSS-Instruct, Magicoder-style approaches). Synthetic data generation for code becomes even more strategically valuable.

One limitation worth flagging: the study caps at 3.8B parameters and 128B tokens. Extrapolation to 70B or 405B assumes the fitted laws remain valid outside the observation window, a standard assumption but one left unverified here. Within that window, the results are robust enough to guide allocation decisions at mid-size lab scale.


Summary generated by Claude — human-verified