I fine-tuned Parakeet 0.6B for medical ASR — open weights, local Mac/CUDA/CPU
In three lines: Fine-tuned Parakeet 0.6B for medical transcription, open weights (CC-BY-4.0). Omi Med STT v1 achieves 2.37% M-WER (clinical term errors) vs. 8.36% baseline, 145× RTFx. Multi-platform runtime (MLX/NeMo/GGUF). Benchmark on 1,513 medical clips: outperforms Whisper Large v3 Turbo and Qwen3 ASR on clinical accuracy.
## Omi Med STT v1: anatomy of a medical fine-tune that delivers
### What happened
Omi Health's founder released open weights for a medical ASR model under CC-BY-4.0: Omi Med STT v1, fine-tuned from NVIDIA's Parakeet TDT 0.6B v2. The benchmark covers 1,513 clips (7.18 hours) of held-out medical audio, with M-WER (Word Error Rate restricted to clinical terms) as the primary metric, the one that matters most for an automated scribe. Result: 2.37% M-WER vs. 8.36% for the base model, a 3.5× reduction. Spurious drug name mentions drop from 131 to 9 on the test set.
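To make the metric concrete, here is a minimal sketch of a WER restricted to clinical terms. It is an assumption about how such a metric can be computed, not the benchmark's actual scoring code: the function names, the whitespace tokenization, and the toy term set are all hypothetical. It aligns reference and hypothesis words with a standard Levenshtein alignment, scores only the reference words that appear in a clinical term set, and counts spurious term insertions separately.

```python
from typing import List, Set, Tuple


def align(ref: List[str], hyp: List[str]) -> List[Tuple[str, str]]:
    """Levenshtein-align two word lists into (ref_word, hyp_word) pairs.

    An empty string marks a gap: ("w", "") is a deletion, ("", "w") an insertion.
    """
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )
    # Walk back from the bottom-right corner to recover one optimal alignment.
    pairs: List[Tuple[str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        cost = 0 if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] else 1
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((ref[i - 1], ""))
            i -= 1
        else:
            pairs.append(("", hyp[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def medical_wer(ref_text: str, hyp_text: str, terms: Set[str]) -> float:
    """Share of reference clinical terms that were substituted or deleted."""
    pairs = align(ref_text.lower().split(), hyp_text.lower().split())
    total = sum(1 for r, _ in pairs if r in terms)
    errors = sum(1 for r, h in pairs if r in terms and r != h)
    return errors / total if total else 0.0


def spurious_terms(ref_text: str, hyp_text: str, terms: Set[str]) -> int:
    """Clinical terms the hypothesis contains where the reference has none."""
    pairs = align(ref_text.lower().split(), hyp_text.lower().split())
    return sum(1 for r, h in pairs if h in terms and r != h)


# Hypothetical two-term vocabulary, for illustration only.
TERMS = {"metformin", "lisinopril"}
```

A real scorer would also need text normalization and multi-word terms, which this sketch leaves out.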
### Why the size/performance ratio is the real signal
The only open-source model that beats Omi on M-WER is VibeVoice-ASR at 9B parameters: 1.78% vs. 2.37%. But VibeVoice is ~15× larger, runs at 11× RTFx vs. 145× for Omi on an A10, and posts 11.10% general WER vs. 8.30% for Omi. VibeVoice wins on clinical terms by 0.59 points; Omi wins on general WER by 2.8 points while being about 13× faster on GPU.
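For a sense of scale, the RTFx figures can be turned into wall-clock time. This is a back-of-the-envelope sketch that assumes RTFx means audio duration divided by processing time (the usual inverse real-time factor); the helper name is hypothetical, and real throughput depends on batching and hardware.

```python
def processing_seconds(audio_hours: float, rtfx: float) -> float:
    """Seconds needed to transcribe `audio_hours` of audio at a given RTFx.

    Assumes RTFx = audio duration / processing time.
    """
    return audio_hours * 3600.0 / rtfx


BENCHMARK_HOURS = 7.18  # size of the held-out benchmark set

omi = processing_seconds(BENCHMARK_HOURS, 145.0)
vibevoice = processing_seconds(BENCHMARK_HOURS, 11.0)

print(f"Omi: {omi:.0f} s, VibeVoice: {vibevoice:.0f} s, ratio {vibevoice / omi:.1f}x")
# prints "Omi: 178 s, VibeVoice: 2350 s, ratio 13.2x"
```

Under that assumption, the whole benchmark set takes about three minutes at 145× and about 39 minutes at 11×.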
The cloud comparison is telling: Omi edges out Deepgram Nova-3 Medical (2.44% M-WER) by 0.07 points and beats Corti Transcripts (5.12%) by a wide margin, while trailing AssemblyAI Universal-3 Pro Medical (1.81%). The 145× RTFx is not directly comparable to cloud figures (which include network round-trips), but the local latency advantage is real for on-device deployments.
### The technical detail worth noting
Not shipping the q4 quantization is an honest engineering call: 4-bit quantization degraded drug name accuracy too much. q8 is the default. The runtime auto-selects the backend: MLX on Apple Silicon, NeMo on CUDA, GGUF/parakeet.cpp on CPU. Training used ~127 hours of audio; the exact mix isn't fully disclosed but the author offers to discuss it publicly.
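The backend selection described above could look roughly like the sketch below. This is not the package's actual code or API: `pick_backend` and its parameters are hypothetical, and probing for `nvidia-smi` is a crude stand-in for a real CUDA availability check.

```python
import platform
import shutil
from typing import Optional


def pick_backend(
    system: Optional[str] = None,
    machine: Optional[str] = None,
    has_cuda: Optional[bool] = None,
) -> str:
    """Return "mlx", "nemo", or "gguf" for the current (or given) host.

    Hypothetical helper: arguments default to values probed from the host
    and can be overridden for testing.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    if has_cuda is None:
        # Assumption: an NVIDIA driver on PATH implies a usable CUDA device.
        has_cuda = shutil.which("nvidia-smi") is not None

    if system == "Darwin" and machine == "arm64":
        return "mlx"   # Apple Silicon
    if has_cuda:
        return "nemo"  # NVIDIA GPU
    return "gguf"      # CPU fallback (GGUF/parakeet.cpp)


print(pick_backend())
```

Keeping the probes overridable makes the selection logic testable on any machine, whatever hardware it actually has.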
The main weakness is documented without spin: 4.75% Drug M-WER, the model's worst axis, flagged as priority #1 for v2. This is consistent with the 131→9 drop in spurious drug mentions — progress was made, but complex molecule names remain the weak link.
### Potential losers
**Deepgram Nova-3 Medical** is directly challenged on its paid segment: a free local open-source model beats it on M-WER (2.37% vs. 2.44%) with structurally lower latency for on-premise deployments. **Corti** (5.12% M-WER, RTFx 0.9×) is in a difficult position — slower than real-time and less accurate than a free 0.6B model. **Google MedASR** posts 13.86% M-WER, 5.8× worse than Omi, raising questions about the maturity of that offering.
The Gemini case deserves separate mention: on 420 benign non-diagnostic clips, Gemini 3.5 Flash fabricates entire consultations in 87/420 cases (20.7%), Gemini 3.1 Pro in 33/420 (7.9%). None of the dedicated ASR models exhibits this behavior. This isn't a WER problem: it's structural clinical hallucination, which makes these models unusable for medical transcription without a downstream step that detects fabricated output.
### What this changes for practitioners
Before this release, local open-source medical ASR meant Whisper Large v3 Turbo (3.93% M-WER, 46× RTFx) or untuned Parakeet models (8%+ M-WER). Omi narrows the gap with specialized cloud APIs while staying on-device. For teams building medical scribes under patient data privacy constraints (HIPAA, GDPR, hospital policies), this is the first sub-1B model to come within genuinely competitive range of commercial solutions.
The CC-BY-4.0 license permits commercial use with attribution, distinguishing this release from the restrictive licenses typical in the medical AI space. A one-line install (`pip install omi-med-stt`) and a multi-platform runtime keep adoption friction low.
Summary generated by Claude — human-verified