OpenAI Blog · 12 September 2024

Introducing OpenAI o1

In three lines: OpenAI introduces o1, a reasoning model capable of solving complex problems in mathematics, coding, and science. The model uses internal reflection before responding, improving performance on difficult benchmarks.

## OpenAI o1: What "Internal Reasoning" Actually Changes

### 1. What's Being Announced

OpenAI is releasing o1, a model that departs from GPT-4's direct-response approach. Before producing output, o1 runs a hidden internal chain of thought: a reasoning process the user never sees, whose length scales with problem difficulty. This is not classic chain-of-thought prompting in the context window. The reasoning behavior is trained with reinforcement learning and is opaque by design.

The published numbers are specific. On AIME 2024 (a high-level US math competition), GPT-4o solves 12% of problems and o1 solves 74%. On Codeforces, o1 reaches the 89th percentile among human competitive programmers. On GPQA Diamond (PhD-level questions in chemistry, biology, and physics), o1 scores 78%, above the roughly 70% achieved by human experts with PhDs. On the MATH benchmark (competition-level problems), o1 reaches 94.8% vs. 60.3% for GPT-4o.
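To put the figures above on a common footing, here is a minimal sketch that computes the absolute gain (in percentage points) and the relative ratio for the two benchmarks where both a GPT-4o and an o1 score are quoted. The `SCORES` dictionary and the `gains` helper are illustrative names, not anything from OpenAI.

```python
# Benchmark scores (percent) as quoted in the paragraph above.
# Only benchmarks with both a GPT-4o and an o1 figure are included.
SCORES = {
    "AIME 2024": {"gpt-4o": 12.0, "o1": 74.0},
    "MATH": {"gpt-4o": 60.3, "o1": 94.8},
}


def gains(scores):
    """Return the absolute gain (points) and relative ratio per benchmark."""
    result = {}
    for name, s in scores.items():
        result[name] = {
            "points": round(s["o1"] - s["gpt-4o"], 1),
            "ratio": round(s["o1"] / s["gpt-4o"], 2),
        }
    return result


if __name__ == "__main__":
    for name, g in gains(SCORES).items():
        print(f"{name}: +{g['points']} points, {g['ratio']}x")
```

The two views tell different stories: the AIME jump is a roughly sixfold improvement from a low base, while the MATH jump is a smaller ratio because GPT-4o already solved most of that benchmark.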

### 2. Why This Is Structurally Different

The previous paradigm — scaling parameters and training data — showed diminishing returns on multi-step reasoning tasks. o1 introduces a second scaling axis: **test-time compute** (inference-time compute). The more internal reasoning tokens allocated, the better the performance. This means the performance ceiling is no longer fixed at training time: it's adjustable at runtime based on the compute budget granted.
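The idea that accuracy can be bought with inference compute can be illustrated with a deliberately simple toy model. The sketch below is not o1's mechanism (o1 uses a learned hidden chain of thought). It substitutes majority voting over independent samples, a different and much older technique, and assumes binary right-or-wrong answers, independent attempts, and a fixed per-attempt accuracy `p`. The function name is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct.

    Toy assumptions: each sample is right with probability p, answers are
    binary (right or wrong), and n is odd so there are no ties.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd to avoid ties")
    need = n // 2 + 1  # smallest number of correct samples that wins the vote
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


if __name__ == "__main__":
    # More samples means more inference compute spent on the same question.
    for n in (1, 3, 5, 9, 17, 33):
        print(n, round(majority_vote_accuracy(0.6, n), 3))
```

In this toy setting, accuracy rises with the number of samples only when a single attempt beats chance (`p > 0.5`). Below that threshold, extra compute makes the vote worse, which is a reminder that runtime scaling amplifies whatever capability training produced.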

OpenAI also releases o1-mini, a lighter version optimized for STEM reasoning at reduced inference cost, targeting use cases where speed matters more than depth. o1-mini is priced at $3/M input tokens and $12/M output tokens. o1-preview costs $15/M input and $60/M output, roughly 3–6× the price of GPT-4o depending on whether input or output tokens are compared.
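A rough per-request cost estimate follows directly from these prices. The sketch below uses only the figures quoted above. The token counts in the example are made up, and the treatment of hidden reasoning tokens as billed output tokens is an assumption that is not stated in this summary, so check it against current pricing documentation before relying on it.

```python
# USD per million tokens, as quoted in the paragraph above.
PRICES = {
    "o1-mini": {"input": 3.0, "output": 12.0},
    "o1-preview": {"input": 15.0, "output": 60.0},
}


def request_cost(model, input_tokens, output_tokens, reasoning_tokens=0):
    """Estimate the USD cost of a single request.

    Assumption: hidden reasoning tokens are billed at the output rate,
    in addition to the visible output tokens.
    """
    price = PRICES[model]
    billed_output = output_tokens + reasoning_tokens
    total = input_tokens * price["input"] + billed_output * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 2,000 prompt tokens, 500 visible output tokens,
    # 4,000 hidden reasoning tokens.
    for model in PRICES:
        print(model, round(request_cost(model, 2_000, 500, 4_000), 4))
```

Under that assumption, the hidden reasoning dominates the bill: in the hypothetical request above, most of the cost comes from tokens the user never sees.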

### 3. Potential Losers

**Anthropic and Claude 3.5 Sonnet**: Claude 3.5 had been positioned as the top coding and reasoning model since June 2024. o1 surpasses it on pure reasoning benchmarks, even if Claude retains advantages on long-context and extended document tasks. Anthropic's leadership window is closing faster than anticipated.

**Google DeepMind**: Gemini 1.5 Pro had MATH scores comparable to GPT-4o. The GPT-4o-to-o1 jump on that benchmark (+34.5 points, from 60.3% to 94.8%) puts Google on the defensive in scientific and academic segments.

**Reasoning augmentation startups**: Tools like Cognition (Devin) and multi-agent scaffolding frameworks (AutoGPT, LangGraph) that compensated for base LLM reasoning limitations see their value proposition partially absorbed. If the model reasons better natively, external orchestration layers become less differentiating.

**Users paying for speed**: o1 is significantly slower than GPT-4o. Response latency can reach tens of seconds on complex problems. For real-time conversational applications, it is not a drop-in replacement.

### 4. What to Watch

First critical point: **the opacity of internal reasoning is a deliberate safety choice**. OpenAI's stated rationale is that the raw chain of thought is only useful for monitoring if it is left unaltered: training it to be presentable to users could teach the model to conceal its intentions in the visible portion. This decision will have regulatory implications, particularly in Europe, where explainability of high-risk AI systems is a requirement under the AI Act.

Second: o1 is currently available in preview for ChatGPT Plus and Team subscribers with rate limits (30 messages/week for o1-preview, 50 for o1-mini). API access is open to Tier 5 developers. The final version — simply called "o1" without a suffix — has not yet been deployed.

Third: OpenAI references an upcoming o1 family, suggesting the inference-time scaling paradigm will extend to multimodal models and longer contexts. The real test will be whether reasoning gains transfer to real-world tasks (agents, production code, assisted scientific research) rather than remaining confined to academic benchmarks — a distinction the industry has learned to make since GPT-4.

Summary generated by Claude — human-verified
