Back to feed
GitHub Trending·

<svg aria-hidden="true" data-component="Octicon" height="16" viewBox="0 0 16 16" version="1.1" width="16" data-view-component="true" class="octicon octicon-repo mr-1 tmp-mr-1 color-fg-muted"> <path d="M2 2.5A2.5 2.5 0 0 1 4.5 0h8.75a.75.75 0 0 1 .75.75v12.5a.75.75 0 0 1-.75.75h-2.5a.75.75 0 0 1 0-1.5h1.75v-2h-8a1 1 0 0 0-.714 1.7.75.75 0 1 1-1.072 1.05A2.495 2.495 0 0 1 2 11.5Zm10.5-1h-8a1 1 0 0 0-1 1v6.708A2.486 2.486 0 0 1 4.5 9h8ZM5 12.25a.25.25 0 0 1 .25-.25h3.5a.25.25 0 0 1 .25.25v3.25a.25.25 0 0 1-.4.2l-1.45-1.087a.249.249 0 0 0-.3 0L5.4 15.7a.25.25 0 0 1-.4-.2Z"></path> </svg> <span data-view-component="true" class="text-normal"> openai /</span> whisper

Signal
85
Hype
15
In three linesOpenAI Whisper is a speech recognition model trained on 680,000 hours of multilingual weakly supervised data. The GitHub repository includes code, pre-trained models, and benchmarks for robust speech transcription across 99 languages.

## OpenAI Whisper: What 680,000 Hours of Training Actually Means

### 1. The ASR landscape before Whisper

Before Whisper's release (September 2022), automatic speech recognition was split between supervised models trained on expensive annotated corpora (Librispeech, Common Voice) and self-supervised approaches like Meta's wav2vec 2.0 or HuBERT, which learned representations without labels before fine-tuning. Top commercial systems — Google Speech-to-Text, AWS Transcribe, Azure Speech — remained black boxes with pay-per-minute APIs and degraded performance outside English or studio conditions.

The reference Word Error Rate (WER) on Librispeech test-clean hovered around 2-3% for the best models, but those numbers collapsed on noisy audio, non-standard accents, or low-resource languages.

### 2. What Whisper changes structurally

OpenAI trained Whisper on **680,000 hours** of multilingual audio collected from the web — roughly 117x the volume of Librispeech (585 hours). The methodological key is **weak supervision**: transcriptions paired with audio come from unverified sources (auto-generated captions, web pages), making collection massively scalable without human annotation.

The architecture is a standard **encoder-decoder Transformer**. Five sizes are available: tiny (39M parameters), base (74M), small (244M), medium (769M), large (1,550M). The large model achieves a WER of **2.7%** on Librispeech test-clean — on par with the best supervised systems — but critically, it generalizes across **99 languages** with documented performance on benchmarks like Fleurs.

**Automatic language detection**, **zero-shot translation to English**, and **word-level timestamping** are natively integrated without additional fine-tuning. This complete package is what separates Whisper from academic alternatives.

### 3. Why the signal is high right now

Whisper trending on GitHub in 2024-2025 is not accidental: it has become the de facto infrastructure for dozens of open-source projects — whisper.cpp for CPU inference in C++, faster-whisper built on CTranslate2 delivering ~4x latency reduction, WhisperX for forced alignment. The community has patched initial gaps — no speaker diarization, high CPU latency — through specialized wrappers.

Use cases that took off: local transcription without cloud data transfer (GDPR compliance), automatic video subtitling, RAG pipelines over audio content, embedded voice assistants. The `large-v3` model, released late 2023, reduced WER by an additional ~10-20% across several languages compared to `large-v2`.

### 4. The losers and real limitations

**Direct losers**: per-minute commercial ASR APIs. Whisper large-v3 runs locally on an A100 GPU at approximately 60x real-time speed — marginal transcription cost approaches zero for anyone with the hardware. Services like Rev.ai, Sonix, or Otter.ai see their value proposition shrink to human post-editing and UX integration.

**Documented limitations**: Whisper is not a streaming model — it operates on audio segments (typically 30 seconds), making it unsuitable for real-time applications without adaptation (whisper.cpp and faster-whisper offer chunking modes). WER degrades significantly on heavy accents and low-resource languages despite the advertised 99 languages — OpenAI's internal benchmarks show WER >20% on several African languages. **Hallucination** is a known issue: on silent or noisy segments, the model can generate non-existent text.

The MIT license allows free commercial use, but OpenAI has not released the training data — full reproducibility remains out of reach, and questions about the provenance of those 680,000 hours (copyright, consent) have not been publicly resolved.

For practitioners: faster-whisper + large-v3 is today's rational starting point for any offline ASR pipeline. For real-time use cases, alternatives like Deepgram Nova-2 or AssemblyAI Universal-2 retain a measurable latency advantage.

Read source
Your take?
OpenAIVoiceBenchmarksOpen source

Summary generated by Claude — human-verified