<svg aria-hidden="true" data-component="Octicon" height="16" viewBox="0 0 16 16" version="1.1" width="16" data-view-component="true" class="octicon octicon-repo mr-1 tmp-mr-1 color-fg-muted"> <path d="M2 2.5A2.5 2.5 0 0 1 4.5 0h8.75a.75.75 0 0 1 .75.75v12.5a.75.75 0 0 1-.75.75h-2.5a.75.75 0 0 1 0-1.5h1.75v-2h-8a1 1 0 0 0-.714 1.7.75.75 0 1 1-1.072 1.05A2.495 2.495 0 0 1 2 11.5Zm10.5-1h-8a1 1 0 0 0-1 1v6.708A2.486 2.486 0 0 1 4.5 9h8ZM5 12.25a.25.25 0 0 1 .25-.25h3.5a.25.25 0 0 1 .25.25v3.25a.25.25 0 0 1-.4.2l-1.45-1.087a.249.249 0 0 0-.3 0L5.4 15.7a.25.25 0 0 1-.4-.2Z"></path> </svg> <span data-view-component="true" class="text-normal"> openai /</span> whisper
In three linesOpenAI Whisper is a speech recognition model trained on 680,000 hours of multilingual weakly supervised data. The GitHub repository includes code, pre-trained models, and performance benchmarks across multiple languages and acoustic conditions.
## OpenAI Whisper: What 680,000 Hours of Weak Supervision Actually Means
### 1. The ASR landscape before Whisper
Before Whisper, state-of-the-art automatic speech recognition (ASR) split into two camps: models trained on tightly supervised corpora (LibriSpeech, Common Voice) and proprietary systems from Google, Microsoft, and Amazon with opaque training data. The best open-source models hit Word Error Rates (WER) of 2-3% on LibriSpeech clean, but degraded sharply outside ideal acoustic conditions or native English. Multilingual and cross-domain robustness was the sector's persistent weak point.
### 2. What Whisper changes structurally
Whisper is trained on 680,000 hours of web-sourced audio with transcriptions derived from automatic subtitles — hence "weak supervision." This volume exceeds classical supervised corpora by an order of magnitude: LibriSpeech is 960 hours, under 0.15% of Whisper's training set.
The model family spans five sizes: tiny (39M parameters), base (74M), small (244M), medium (769M), and large (1,550M). The large model achieves 2.7% WER on LibriSpeech test-clean, matching top supervised systems, but with substantially better generalization on out-of-distribution data.
Language coverage is a key differentiator: 99 languages supported, with documented but variable performance. French, Spanish, German, and Japanese remain competitive. Low-resource languages show more heterogeneous results — Whisper mitigates rather than solves the low-resource language problem.
The architecture is a standard Transformer encoder-decoder with no major architectural novelty. The core contribution is empirical: proof that weak supervision at massive scale outperforms strong supervision at small scale for ASR robustness.
### 3. Practical implications for practitioners
The repo ships pretrained weights installable via pip. The large model requires ~10 GB VRAM. The small model (244M) runs on CPU with acceptable latency for non-real-time use cases.
Automatic multilingual transcription (without specifying language) works via a built-in language detection mechanism. Native translation to English is supported for all 99 languages — a distinct use case from transcription alone.
For developers building ASR pipelines on Google Speech-to-Text or AWS Transcribe, Whisper offers an on-premise alternative with no per-request cost. The tiny or base model covers most low-latency use cases on standard hardware.
Robustness to accents, background noise, and specialized domains (medical, legal, technical) is meaningfully better than models trained on LibriSpeech alone. That is where the real operational advantage lies.
### 4. Losers and hard limits
Cloud ASR vendors — Google, Amazon, Microsoft, Rev.ai, Deepgram — face direct competition on standard use cases. Whisper large matches their API quality with zero marginal cost post-deployment.
Multilingual ASR startups (Speechmatics, Verbit, Sonix) are particularly exposed in the mid-market segment that can absorb self-hosted infrastructure costs.
Real limitations: Whisper is not optimized for real-time streaming. The encoder-decoder architecture processes fixed 30-second windows, making latency on the large model incompatible with conversational applications. Projects like faster-whisper (built on CTranslate2) have partially addressed this with ~4x speed gains, but the core architectural constraint remains.
The MIT license allows unrestricted commercial use, accelerating adoption but meaning OpenAI captures no direct revenue from this model — it functions as a technical credibility signal and ecosystem infrastructure supporting their paid products.
Summary generated by Claude — human-verified