OpenAI Blog·23 April 2019

Generative modeling with sparse transformers

In three lines: OpenAI introduces the Sparse Transformer, a deep neural network setting new records in sequence prediction (text, images, sound). Its improved attention mechanism processes sequences 30x longer than previously possible.

## Sparse Transformer: what the 30x figure actually means

### 1. The structural problem being solved

Standard Transformer attention (Vaswani et al., 2017) carries O(n²) memory and compute complexity relative to sequence length n. Doubling sequence length quadruples cost. At 1,024 tokens, manageable. At 8,000 tokens, prohibitive on standard hardware. This ceiling constrained generative models to short context windows for two years — with the long-range coherence degradation that implies for text, and more severely for audio and high-resolution images.

OpenAI's Sparse Transformer attacks this bottleneck directly by making attention sparse: each position attends to a structured subset of other positions rather than all of them. Complexity drops to O(n√n), enabling sequences up to 30x longer than dense attention at an equivalent compute budget.
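To make the gap between O(n²) and O(n√n) concrete, here is a minimal sketch that counts query-key pairs under each regime. The function names are hypothetical, constant factors and causal masking are ignored, and the sparse count assumes each position attends to roughly √n others, so the numbers are indicative only.

```python
import math


def dense_attention_pairs(n: int) -> int:
    """Query-key pairs scored by dense attention over n positions: n * n."""
    return n * n


def sparse_attention_pairs(n: int) -> int:
    """Approximate pairs when each position attends to about sqrt(n) others."""
    return n * math.isqrt(n)


if __name__ == "__main__":
    for n in (1024, 4096, 16384):
        dense = dense_attention_pairs(n)
        sparse = sparse_attention_pairs(n)
        # The ratio grows with sqrt(n): the longer the sequence, the larger the saving.
        print(f"n={n:>6}  dense={dense:>12,}  sparse={sparse:>10,}  ratio={dense / sparse:.0f}x")
```

The saving is not a fixed multiplier: under these assumptions it scales with √n, which is why the benefit is most visible on very long sequences.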

### 2. What this means on benchmarks

OpenAI claims new records in sequential prediction across three modalities:

- **Text**: state-of-the-art on language modeling benchmarks (exact perplexity figures not fully disclosed in the release)
- **Images**: modeling longer pixel sequences directly maps to higher resolvable resolution in autoregressive image generation; 30x longer sequences means substantially larger images without arbitrary tiling
- **Audio**: the most practically significant gain. One second of 24 kHz audio is 24,000 samples. Dense attention at that length was computationally intractable; sparse attention makes it feasible
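As a rough illustration of why raw audio strains dense attention, the sketch below estimates the memory needed to hold a single dense attention matrix at the 24 kHz rate mentioned above. It assumes 4-byte (fp32) scores, one head, one layer, and a fully materialized n × n matrix; the function name is hypothetical and real implementations differ.

```python
SAMPLE_RATE_HZ = 24_000  # sample rate quoted in the text


def dense_attention_bytes(seq_len: int, bytes_per_score: int = 4) -> int:
    """Bytes for one full seq_len x seq_len score matrix (assumed fp32 by default)."""
    return seq_len * seq_len * bytes_per_score


if __name__ == "__main__":
    for seconds in (1, 2, 5):
        n = SAMPLE_RATE_HZ * seconds
        gib = dense_attention_bytes(n) / 2**30
        # Memory grows with the square of the clip length.
        print(f"{seconds}s of audio -> {n:,} samples -> {gib:.1f} GiB per head per layer")
```

Under these assumptions, one second of audio already needs about 2.3 GB for a single matrix, before multiplying by heads, layers, and the activations kept for backpropagation.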

The improvement is both quantitative (longer sequences) and qualitative: long-range dependencies are better captured, translating to more coherent generated outputs.

### 3. The mechanism: what is actually new

Sparse attention is not a novel concept: prior work explored local or random attention patterns. What the Sparse Transformer introduces is a **factorized, structured** attention combining:

- **Local** attention patterns (nearby positions)
- **Strided** attention patterns (regularly spaced positions, capturing periodic structure in audio and images)

Together, the two patterns let information from any earlier position reach any later one within two attention steps, at O(n√n) cost, while the model still trains with stable gradients at depth, a non-trivial problem that naive sparse attention implementations do not solve.
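The local and strided patterns can be sketched as boolean masks in plain Python. This is an illustrative reconstruction under stated assumptions, not OpenAI's kernels: the stride is taken to be √n, attention is causal (positions only look backward), and the helper names `local_mask`, `strided_mask`, and `compose` are hypothetical.

```python
import math


def local_mask(n: int, stride: int) -> list[list[bool]]:
    """Position i attends to itself and the stride - 1 positions just before it."""
    return [[0 <= i - j < stride for j in range(n)] for i in range(n)]


def strided_mask(n: int, stride: int) -> list[list[bool]]:
    """Position i attends to earlier positions j where (i - j) is a multiple of stride."""
    return [[j <= i and (i - j) % stride == 0 for j in range(n)] for i in range(n)]


def compose(a: list[list[bool]], b: list[list[bool]]) -> list[list[bool]]:
    """Two-hop reachability: i reaches j if i attends to some k in a and k attends to j in b."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


n = 64
stride = math.isqrt(n)  # assumed stride of sqrt(n)
local = local_mask(n, stride)
strided = strided_mask(n, stride)
two_hop = compose(local, strided)

# Dense causal attention for comparison: every position sees all earlier ones.
causal = [[j <= i for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    per_row = max(sum(l) + sum(s) for l, s in zip(local, strided))
    print(f"max keys per position: {per_row} sparse vs {n} dense")
    print("two hops cover the full causal pattern:", two_hop == causal)
```

Each mask alone touches at most about √n keys per position, yet chaining a local step with a strided step reaches every earlier position, which is the property the factorization relies on.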

Efficient implementation requires custom CUDA kernels, creating a meaningful replication barrier for teams without low-level GPU infrastructure.

### 4. Winners and losers

**Immediate winners**: long-form audio generation (music, extended TTS), high-resolution autoregressive image generation, and any NLP use case requiring long context windows (document summarization, long-form code generation).

**Potential losers**: competing long-sequence modeling approaches, such as improved RNN/LSTM variants and Transformer-XL (which handles length via segment-level recurrence and was published by CMU and Google Brain researchers months earlier), see their comparative advantage eroded. Transformer-XL addressed the same problem differently; the Sparse Transformer offers a more general, architecturally less constrained alternative.

Teams invested in CNN-based image generation pipelines (PixelCNN and variants, which are themselves autoregressive) now face an attention-based competitor capable of operating at higher resolutions.

**Key caveat**: implementation complexity and dependence on custom kernels mean the real-world benefit is conditioned on specific infrastructure. This is not a drop-in replacement for standard Transformers in most frameworks at this stage.


Summary generated by Claude — human-verified
