Back to feed
OpenAI Blog·

Fine-tuning now available for GPT-4o

Signal
85
Hype
15
In three linesOpenAI makes fine-tuning available for GPT-4o. Users can now customize the model for specific use cases through the API.

## GPT-4o Fine-tuning: What Actually Changes

### 1. Technical Context

Until this announcement, OpenAI fine-tuning was limited to the GPT-3.5 family (davinci-002, babbage-002, gpt-3.5-turbo) and a handful of GPT-4 variants in restricted access. GPT-4o — the multimodal model launched in May 2024, positioned as OpenAI's flagship with ~50% lower latency than GPT-4 Turbo — was previously only accessible via standard inference or system prompting. Fine-tuning opens a third path: modifying model weights on proprietary data, without exposing that data on every inference call.

The distinction matters. A system prompt injects context on every request (token cost, latency, prompt injection surface). Fine-tuning encodes that context into the weights — zero additional tokens at runtime, more stable behavior on strict output formats (JSON, code, business taxonomies).

### 2. What This Unlocks Operationally

For teams already running fine-tuned GPT-3.5-turbo, moving to fine-tuned GPT-4o is a meaningful quality jump: GPT-4o scores ~88.7% on MMLU versus ~70% for GPT-3.5, and its reasoning capabilities on complex tasks (code, structured extraction, multilingual) are in a different tier. Fine-tuning now lets you combine that base capability with domain specialization.

Use cases that become directly viable: - **Business entity extraction** on legal, medical, or financial corpora with tightly constrained output schemas, without paying the cost of a long system prompt per call - **Tone-of-voice and editorial style** for consumer applications where stylistic consistency must hold across thousands of requests - **Multi-label classification** on proprietary taxonomies (SKUs, diagnostic codes, internal categories) where base GPT-4o hallucinates out-of-taxonomy labels - **Specialized agents** that must follow strict action protocols without drifting on ambiguous instructions

### 3. Real Constraints and Likely Losers

GPT-4o fine-tuning is neither free nor trivial. Training costs are billed per token processed — OpenAI historically charges ~$0.008/1K tokens for GPT-3.5 fine-tuning; GPT-4o rates will be significantly higher (likely 3-8x). For datasets of 50K-100K examples, you're looking at several thousand dollars per training run, plus inference costs on a fine-tuned model that remain above the base model.

**Direct losers:** Third-party fine-tuning providers that had positioned themselves on the "fine-tuning powerful models" niche (Together AI, Fireworks AI, Anyscale) lose a key differentiator. Their likely response is to double down on Llama 3, Mistral, and open-weight models where OpenAI can't follow them on weight control.

**Indirect losers:** Teams that had invested in complex RAG pipelines to compensate for GPT-4o's limitations on specialized domains will need to reassess their architecture. Fine-tuning doesn't replace RAG (no dynamic knowledge updates, no access to large corpora), but it reduces RAG dependency for cases where the problem is behavioral rather than informational.

### 4. OpenAI's Strategic Positioning

This opening fits a clear enterprise retention logic. Anthropic (Claude 3.5 Sonnet) and Google (Gemini 1.5 Pro) already offer fine-tuning or advanced adaptation mechanisms. By keeping GPT-4o inference-only, OpenAI left a gap for enterprise teams needing deep customization. That gap is now closing.

The practical question for practitioners: at what request volume does fine-tuning become economically rational versus a long system prompt? The generally accepted rule of thumb is ~100K requests/month to amortize training costs. Below that threshold, system prompting remains more flexible and cheaper. Above it, fine-tuning wins on per-token cost and behavioral stability.

Read source
Your take?
GPTOpenAIFine-tuning

Summary generated by Claude — human-verified