OpenAI Blog·24 April 2024

Introducing ChatGPT and Whisper APIs

Signal

Hype

In three linesOpenAI releases ChatGPT and Whisper APIs, enabling developers to integrate conversational AI and speech recognition into applications. The APIs provide programmatic access to ChatGPT's conversation capabilities and Whisper's audio transcription features.

## ChatGPT and Whisper APIs: The Structural Shift

### 1. The core facts and immediate context

OpenAI is opening programmatic access to `gpt-3.5-turbo` — the model powering ChatGPT — at **$0.002 per 1,000 tokens**, roughly 10× cheaper than the previously available `text-davinci-003` endpoint. Simultaneously, Whisper v2-large becomes available via API at **$0.006 per minute** of audio transcription. Before this announcement, developers wanting ChatGPT-grade conversational capabilities had to either use expensive Davinci models, maintain their own fine-tunes, or negotiate enterprise access on a case-by-case basis.

### 2. Why pricing is the most important signal

A 90% cost reduction on the conversational model isn't a marginal adjustment — it crosses several critical economic thresholds. At $0.002/1k tokens, a 500-turn conversation (~750k tokens total) costs roughly **$1.50**. That's the territory where previously uneconomical use cases — automated customer support, personalized tutors, assistants embedded in B2C freemium apps — become viable without subsidy. At ~$0.02/1k tokens, `text-davinci-003` made those same scenarios economically fragile at scale.

For Whisper, the API removes the main operational friction: until now, running Whisper in production meant managing your own GPU infrastructure (cost, latency, maintenance). At $0.006/minute, one hour of transcription costs **$0.36** — competitive against Google Speech-to-Text (~$0.064/minute standard) and well below AWS Transcribe at mid-range volumes.

### 3. Technical architecture: ChatML format and context management

The `gpt-3.5-turbo` endpoint introduces a structured message format (`system`, `user`, `assistant`) that fundamentally differs from classic prompt completion. This `ChatML` format imposes design discipline: developers must explicitly manage conversational context, inject system instructions separately from user content, and manually construct history on each call. It's more verbose than a simple prompt, but enforces a clean separation between instruction and content — reducing naive prompt injection risks and making behavior auditing easier.

The context window remains **4,096 tokens** for `gpt-3.5-turbo` at launch, identical to GPT-3. For applications requiring long conversations or large documents, context management (sliding summaries, retrieval augmentation) remains an engineering constraint the API itself doesn't solve.

### 4. Potential losers and forced repositioning

**Cohere and AI21 Labs** see their core value proposition — accessible LLMs via API at reasonable cost — directly attacked. Cohere had been pricing its `command` models comparably to Davinci; the new OpenAI pricing creates immediate pressure on their rate cards.

**Audio transcription startups** (AssemblyAI, Rev AI, Deepgram) now face a competitor combining Whisper quality (better multilingual accuracy than most alternatives on public benchmarks) with aggressive pricing. AssemblyAI charges ~$0.015/minute standard — 2.5× more expensive than Whisper API.

**Integrators who built abstractions on top of GPT-3** (LangChain, Dust, etc.) must adapt their pipelines to the new message format. Not a catastrophic breaking change, but unplanned migration work.

Finally, **Microsoft** — partner and investor — benefits directly via Azure OpenAI Service, but direct availability through api.openai.com reduces the differentiated advantage Azure could pitch to hesitant enterprise customers. The coexistence of both channels creates a go-to-market tension both parties will need to manage carefully.

Read source

Your take?

OpenAI GPT Voice Tools

Summary generated by Claude — human-verified

Introducing ChatGPT and Whisper APIs

Other angles on this story