I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python
In three lines: NVIDIA Parakeet speech-to-text ported to C++/ggml, with no Python or PyTorch. Byte-for-byte identical output to NeMo, up to 5x faster on GPU for larger models, ~600x realtime on GPU for a short audio clip. Quantized GGUFs (f16, q8_0, q6_k, q5_k, q4_k), a flat C API, and LocalAI integration with an OpenAI-compatible endpoint.
## Parakeet.cpp: NVIDIA's STT without Python, without PyTorch, without trade-offs
### What was built
mudler (LocalAI maintainer) ported NVIDIA's Parakeet models — FastConformer TDT, CTC, RNNT, and hybrid — to pure C++ via ggml, the same runtime powering llama.cpp and whisper.cpp. The result: byte-for-byte identical output to NeMo on f32/f16 paths (WER 0 measured), with substantial speed gains and half the memory footprint.
Concrete numbers:

- **~5x faster** than the PyTorch/NeMo runtime on large TDT/hybrid models on GPU
- **~1.86x faster** on CPU with quantization
- **~2x less memory** across the board
- **~600x realtime** on GPU on a 23-second clip, which extrapolates to one hour of audio transcribed in roughly 6 seconds
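The "one hour in roughly 6 seconds" figure follows directly from the realtime factor. A minimal sketch of the arithmetic is below; the function name is illustrative, and the extrapolation assumes the throughput measured on the 23-second clip also holds for long recordings, which the published numbers do not by themselves establish.

```python
def transcription_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to transcribe audio at a given realtime factor."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# One hour of audio at ~600x realtime.
one_hour = transcription_seconds(3600, 600)

# The 23-second benchmark clip at the same factor (a few hundredths of a second).
clip = transcription_seconds(23, 600)

print(f"1 hour of audio: {one_hour:.1f} s")
print(f"23-second clip:  {clip * 1000:.0f} ms")
```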
GGUF files cover five precision levels: f16, q8_0, q6_k, q5_k, q4_k. Each file is self-contained: tokenizer and vocabulary are baked in, no external files required.
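"Self-contained" here rests on the GGUF container, which stores metadata key-value pairs alongside the tensors in a single file. As a rough illustration, the sketch below reads only the fixed-size GGUF header (magic, version, tensor count, metadata entry count) using the Python standard library. It assumes GGUF version 2 or later, where both counts are 64-bit; it does not decode the metadata entries themselves, and the specific keys parakeet.cpp uses for the tokenizer are not covered here.

```python
import io
import struct
from typing import BinaryIO, NamedTuple


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def read_gguf_header(stream: BinaryIO) -> GGUFHeader:
    """Parse the 24-byte little-endian GGUF header (assumes GGUF v2 or later)."""
    raw = stream.read(24)
    if len(raw) < 24:
        raise ValueError("stream too short to contain a GGUF header")
    # 4-byte magic, uint32 version, uint64 tensor count, uint64 metadata count.
    magic, version, tensor_count, kv_count = struct.unpack("<4sIQQ", raw)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    if version < 2:
        raise ValueError("GGUF v1 uses 32-bit counts; this sketch handles v2+")
    return GGUFHeader(version, tensor_count, kv_count)


# Demo on a synthetic header so the example runs without a model file.
# The counts below are made up for illustration.
fake_file = io.BytesIO(struct.pack("<4sIQQ", b"GGUF", 3, 10, 5))
header = read_gguf_header(fake_file)
print(header)
```

With a real file, the same function can be called on `open(path, "rb")`; the metadata entry count gives a quick sanity check that the file carries embedded metadata at all.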
### Why this matters
Before this port, deploying Parakeet required NeMo, which means Python, PyTorch, and an NVIDIA dependency stack that makes integration into non-Python production environments nearly impossible. whisper.cpp demonstrated in 2022 that this pattern works for Whisper; parakeet.cpp applies the same logic to a different, more recent architecture (FastConformer), with models that outperform Whisper large-v3 on several English benchmarks according to NVIDIA's own evaluations.
The flat C API is the most important engineering detail: it enables embedding STT in any language with FFI (Rust, Go, C#, Swift), in mobile applications or firmware, with no Python runtime. This is exactly what the ggml ecosystem delivered for LLMs over the past 18 months, now available for audio transcription.
The LocalAI integration adds an OpenAI-compatible `/v1/audio/transcriptions` endpoint, meaning existing code that targets OpenAI's Whisper API can switch to local Parakeet by pointing at a different base URL and model name, with no change to the request logic. It also offers cache-aware streaming with real-time end-of-utterance detection, plus word-level timestamps with confidence scores, features that whisper.cpp does not expose as cleanly out of the box.
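To make the "OpenAI-compatible" point concrete, here is a minimal sketch of the request an existing client would send, built with the Python standard library only. The `file` and `model` multipart fields follow the OpenAI transcription API; the base URL `http://localhost:8080` and the model name `parakeet-tdt` are assumptions for illustration and depend on how the LocalAI instance is configured.

```python
import urllib.request
import uuid


def build_transcription_request(
    base_url: str, audio_bytes: bytes, filename: str, model: str
) -> urllib.request.Request:
    """Build a multipart POST for an OpenAI-style /v1/audio/transcriptions endpoint."""
    boundary = uuid.uuid4().hex
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode()
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    closing = f"\r\n--{boundary}--\r\n".encode()
    body = model_part + file_header + audio_bytes + closing
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Placeholder bytes stand in for real audio; the URL and model name are hypothetical.
request = build_transcription_request(
    "http://localhost:8080", b"\x00\x01\x02", "clip.wav", "parakeet-tdt"
)
print(request.get_method(), request.full_url)

# Sending is left commented out because it needs a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode())
```

Only the base URL and model name differ from a request aimed at OpenAI's hosted endpoint, which is why clients built on that API shape should need little or no code change.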
### Potential losers
**AssemblyAI, Deepgram, Rev.ai**: their value proposition partly relies on the friction of self-hosting performant STT models. A q4_k quantized Parakeet pipeline running at 600x realtime on a consumer GPU directly erodes that argument. For high-volume use cases — call centers, medical transcription, captioning — the marginal cost per hour of locally transcribed audio becomes negligible.
**whisper.cpp itself**: Parakeet TDT 1.1B shows lower WER than Whisper large-v3 on English per NVIDIA benchmarks, despite Whisper large-v3 being the larger model (about 1.55B parameters). If parakeet.cpp reaches the same ecosystem maturity as whisper.cpp (bindings, integrations, documentation), it becomes the rational choice for English in local production.
**NeMo as a deployment runtime**: NeMo remains relevant for training and fine-tuning, but its production inference role is directly challenged. The PyTorch overhead (up to ~5x on GPU for large models) is hard to justify when an MIT-licensed alternative exists.
### What to watch
The port covers English Parakeet models. NVIDIA's multilingual models (Canary, notably) are not yet ported. Quantization quality on q4_k deserves independent evaluation on noisy corpora — current benchmarks use clean clips.
The MIT license covers the code only, not the models. The Parakeet weights are released by NVIDIA under CC-BY-4.0, which is permissive but distinct from MIT (it requires attribution), so it is worth checking for commercial use cases.
CUDA/HIP/Vulkan/Metal support is announced, but published benchmarks are primarily CUDA. Performance on Metal (Apple Silicon) and Vulkan (AMD, Intel Arc) remains to be validated by the community. That's precisely where open-source data will emerge over the coming weeks.
Summary generated by Claude — human-verified