OpenAI Blog·23 January 2020

Scaling laws for neural language models


In three lines: OpenAI publishes research on scaling laws for neural language models, establishing predictable relationships between model size, training data, and performance. Results enable optimization of compute resource allocation.

## Scaling Laws: OpenAI Formalizes the Physics of LLM Training

### 1. What Is Established

Kaplan et al. (OpenAI) establish empirical relationships between three variables: parameter count N (excluding embeddings), training corpus size D (in tokens), and compute budget C (in FLOPs). The central finding is that cross-entropy loss follows stable power laws across several orders of magnitude, with some trends spanning more than seven. The key formulation: L(N) ∝ N^(-0.076), L(D) ∝ D^(-0.095), each holding when the other factors are not the bottleneck, with exponents that remain stable across the Transformer shapes tested.
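To make the exponents concrete, here is a minimal sketch of what they imply for a 10× increase in N or D. It assumes a pure power law with no other bottleneck; the function name `loss_ratio` is illustrative, not something from the paper.

```python
# Exponents quoted above: L(N) ∝ N^(-0.076), L(D) ∝ D^(-0.095).
ALPHA_N = 0.076
ALPHA_D = 0.095


def loss_ratio(scale_factor: float, alpha: float) -> float:
    """Multiplicative change in loss when a resource grows by scale_factor.

    Assumes a pure power law L ∝ x^(-alpha) and that nothing else
    (data, compute, model size) is limiting the run.
    """
    return scale_factor ** (-alpha)


for name, alpha in (("N", ALPHA_N), ("D", ALPHA_D)):
    ratio = loss_ratio(10.0, alpha)
    print(f"10x {name}: loss multiplied by {ratio:.3f} "
          f"({(1.0 - ratio) * 100:.1f}% lower)")
```

Under these assumptions, each 10× step buys a reduction of roughly 16% (for N) or 20% (for D) in loss, which is why the curves only look dramatic across many orders of magnitude.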

Before this paper, compute allocation was largely artisanal empiricism. Teams scaled model size or training data by intuition or exhaustive benchmarking. Scaling laws convert this into analytical optimization: for a fixed budget C, there exists an optimal N/D ratio that minimizes final loss.

### 2. Why the Signal Is High

This paper is not an incremental improvement — it is a predictive framework. Three concrete implications for practitioners:

**Compute-optimal allocation.** The relation C ≈ 6ND (roughly 6N FLOPs per token for a forward and backward pass, times D training tokens) combined with the power laws allows computing, before any training run, the point of maximum efficiency. Kaplan et al.'s counterintuitive result: for a fixed budget, it is more efficient to train a very large model and stop well short of convergence than to train a smaller model to convergence. Two years later, Hoffmann et al. (DeepMind) revisited this allocation in the Chinchilla paper and corrected it toward far more data-favorable ratios (≈20 tokens per parameter): at equal compute, a smaller model trained on more data outperforms a large under-trained one.
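As a rough illustration of this sizing step, the sketch below splits a compute budget using C ≈ 6ND and a fixed tokens-per-parameter ratio. The ≈20 default is the Chinchilla rule of thumb quoted above, not an exact optimum, and the helper names `training_flops` and `allocate` are hypothetical.

```python
import math


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, C ≈ 6 * N * D (in FLOPs)."""
    return 6.0 * n_params * n_tokens


def allocate(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a budget C into (N, D) under the constraint D = r * N.

    Substituting D = r * N into C = 6 * N * D gives N = sqrt(C / (6 * r)).
    The ratio r is an assumed rule of thumb, not a fitted value.
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


budget = 1e21  # example budget in FLOPs, chosen only for illustration
n, d = allocate(budget)
print(f"N ≈ {n:.2e} parameters, D ≈ {d:.2e} tokens")
```

Passing a different `tokens_per_param` shows how sensitive the recommended model size is to the assumed ratio, which is exactly where Kaplan's and Chinchilla's prescriptions diverge.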

**Predictability of gains.** Scaling curves allow extrapolating the performance of a 10× larger model from cheaper runs. In practice, this reduces the cost of architecture search: no need to run full training to evaluate a variant.
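A minimal sketch of that extrapolation: fit a straight line in log-log space to a few small runs, then evaluate it at a larger size. The data points below are synthetic, generated from an assumed power law purely to exercise the code; they are not measurements, and real loss curves are noisier.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log L = log a - alpha * log N.

    Returns (a, alpha) such that L ≈ a * N^(-alpha).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict(a: float, alpha: float, size: float) -> float:
    """Evaluate the fitted power law at a new model size."""
    return a * size ** (-alpha)


# Synthetic "small runs": generated from an assumed law, NOT real results.
TRUE_A, TRUE_ALPHA = 10.0, 0.076
small_sizes = [1e6, 1e7, 1e8]
small_losses = [TRUE_A * s ** (-TRUE_ALPHA) for s in small_sizes]

a_fit, alpha_fit = fit_power_law(small_sizes, small_losses)
print(f"fitted alpha ≈ {alpha_fit:.3f}")
print(f"predicted loss at 1e9 params ≈ {predict(a_fit, alpha_fit, 1e9):.3f}")
```

With noisy real data the same fit applies, but the extrapolation is only as trustworthy as the range over which the points actually fall on a line.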

**Partial architectural independence.** The laws hold across different depths, widths, and attention head counts, as long as N remains the control parameter. This suggests fine-grained architecture matters less than raw scale, a conclusion that helped justify massive investments in GPT-3 (175B parameters, 2020) and subsequent models.

### 3. Losers and Blind Spots

This framework steered the industry toward a parameter race that proved partially miscalibrated. GPT-3 (175B) was trained on approximately 300B tokens, under 2 tokens per parameter and far below the Chinchilla-optimal level of ≈20. In other words, the power-law structure held up, but the allocation ratios derived from it were off: the industry over-invested in parameters and under-invested in data for 2-3 years.
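The size of that gap can be checked with back-of-the-envelope arithmetic. The sketch below uses the GPT-3 figures quoted above, the C ≈ 6ND approximation, and the ≈20 tokens-per-parameter rule of thumb; all three are approximations, and the `rebalance` helper is hypothetical.

```python
import math

TOKENS_PER_PARAM = 20.0  # Chinchilla rule of thumb cited in this summary


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


def rebalance(n_params: float, n_tokens: float,
              ratio: float = TOKENS_PER_PARAM):
    """Keep C = 6 * N * D fixed and return (N, D) with D = ratio * N."""
    compute = 6.0 * n_params * n_tokens
    new_params = math.sqrt(compute / (6.0 * ratio))
    return new_params, ratio * new_params


gpt3_params, gpt3_tokens = 175e9, 300e9  # figures quoted in the text
print(f"GPT-3 tokens per parameter ≈ "
      f"{tokens_per_param(gpt3_params, gpt3_tokens):.2f}")

n_alt, d_alt = rebalance(gpt3_params, gpt3_tokens)
print(f"Same compute at 20 tokens/param: "
      f"N ≈ {n_alt:.2e}, D ≈ {d_alt:.2e}")
```

Under these assumptions, the same budget would correspond to a model of roughly 51B parameters trained on about 1T tokens, which gives a sense of how far the original allocation sat from the later recommendation.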

Direct losers are teams that sized their models on Kaplan's original exponents without waiting for the 2022 corrections. Models like Gopher (280B, DeepMind, 2021) or early Megatron-Turing NLG (530B, Microsoft/NVIDIA) illustrate this relative over-parameterization.

Another blind spot: Kaplan's scaling laws are measured on validation loss (cross-entropy), not on downstream task performance (MMLU, HumanEval, reasoning). Later work would show that certain capabilities appear discontinuously — the "emergent abilities" documented by Wei et al. (2022) — partially breaking the smooth predictability of power laws for applied tasks.

### 4. Structural Impact

This paper is the empirical foundation used to justify billion-dollar infrastructure investments. Without a credible predictive framework, convincing investors or executives to spend $10-100M in compute on a single training run is difficult. Scaling laws provide the expected ROI curve.

For practitioners today: Chinchilla (2022) exponents supersede Kaplan's for compute-optimal sizing. But the reasoning structure — identify power laws, compute the efficiency point, extrapolate — remains the industry standard for any serious LLM project.


Summary generated by Claude — human-verified
