OpenAI Blog·31 January 2025

OpenAI o3-mini

Signal

Hype

In three linesOpenAI releases o3-mini, a compact reasoning model optimized for efficiency. Designed for complex tasks with reduced latency and lower costs, it delivers o3-comparable performance on code and math benchmarks.

## o3-mini: High-Performance Reasoning Without the o3 Price Tag

### 1. What's Being Released OpenAI is shipping o3-mini, a reasoning model from the o3 family optimized for latency and cost. It replaces o1-mini in the compact model tier but inherits the chained reasoning architecture from o3. Three reasoning effort levels are exposed to users — low, medium, high — enabling an explicit tradeoff between speed and inference depth.

### 2. The Numbers That Matter On AIME 2024 (advanced math competition), o3-mini (high) reaches 87.3% vs. 63.6% for o1 and 60.0% for o1-mini — a +23.7 point gain over the model it effectively replaces. On Codeforces, o3-mini (high) posts an Elo rating of 2073, surpassing o1 (1891) and approaching full o3 (estimated >2100 in internal evals). On GPQA Diamond (expert scientific reasoning), o3-mini (high) scores 79.7% vs. 75.7% for o1. Median latency in low mode is below o1-mini on equivalent coding tasks, per OpenAI's published data. API pricing is set at $1.10/M input tokens and $4.40/M output — compared to $15/M and $60/M for o1. That's a ~13.6x reduction in output cost.

### 3. Why This Is Structurally Significant This isn't an incremental update. o3-mini validates a thesis OpenAI has been testing since o1: chained reasoning (internal chain-of-thought, not exposed) can be distilled into a smaller model without capability collapse on formal domains — math, code, science. Before this release, practitioners had to choose between o1-mini (fast, weaker on competitive math) or full o1/o3 (accurate, expensive, slow). o3-mini at medium effort now covers the majority of professional use cases at a marginal cost close to GPT-4o-mini.

Deployment is simultaneous across ChatGPT (Plus, Team, Pro users) and the API, with function calling, structured outputs, and streaming supported — features absent from o1-mini at launch. This signals the model is built for production integration, not just benchmark positioning.

### 4. Losers and Tensions **Anthropic / Claude 3.5 Sonnet**: on coding benchmarks, o3-mini (high) outperforms Sonnet on HumanEval and SWE-bench lite per published figures. Sonnet's core value proposition — strong reasoning at reasonable cost — is directly undercut.

**Google Gemini Flash Thinking**: similar positioning (economical reasoning model), but o3-mini publishes materially higher AIME and Codeforces numbers. Google will need to respond before Gemini 2.0 Pro reaches GA.

**Existing o1-mini users**: implicitly forced to migrate. OpenAI isn't deprecating o1-mini immediately, but the performance gap makes staying on o1-mini hard to justify for any new project.

**Internal OpenAI tension**: o3-mini partially cannibalizes o1 standard ($15/M input). If teams migrate to o3-mini high for reasoning tasks, o1's per-token revenue erodes. This is a deliberate bet on volume over unit margin — consistent with a mass-adoption strategy, but compressing near-term margins.

One critical caveat for practitioners: o3-mini has no vision capability, unlike GPT-4o. For multimodal pipelines, o3-mini does not replace GPT-4o — it complements it on the text/code/formal reasoning branch. Teams that built hybrid workflows will need to maintain two models in parallel.

Read source

Your take?

OpenAI GPT Reasoning Code generation Benchmarks

Summary generated by Claude — human-verified

OpenAI o3-mini

Other angles on this story