arXiv cs.AI·

OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

In three lines: OpenMedQ is a medical vision-language model pretrained on 14 datasets (~3.35M samples) covering pathology, radiology, microscopy, and clinical QA. It achieves 75.9 BLEU-1 on PathVQA (outperforming Med-PaLM M 562B) and 0.757 average macro-F1 on 8 unseen medical classification benchmarks.

## OpenMedQ: How 3.35M Open Samples Beat 562B Closed Parameters

### 1. What Actually Happened

OpenMedQ is a medical vision-language model pretrained on a fully open-source mix of 14 datasets (~3.35M samples) spanning four modalities: pathology, radiology, microscopy, and text-only clinical QA. The team releases both code and an interactive demo, making it the most reproducible broad baseline in the field to date.

The headline result: **75.9 BLEU-1 on PathVQA**, outperforming all Med-PaLM M variants including the 562B-parameter version — roughly 80× larger. On VQA-MED, OpenMedQ hits 64.5 BLEU-1, matching the best reported score in the literature. These are not niche benchmarks: PathVQA and VQA-MED are the two canonical references for multimodal medical QA.

### 2. Why This Result Is Structurally Significant

The Med-PaLM M comparison deserves unpacking. Med-PaLM M is a Google model, closed-source, trained at massive scale (up to 562B parameters). It is inaccessible to most research teams and entirely non-reproducible. OpenMedQ beats it on PathVQA using a fully public data mix and available code — shifting the question from "which organization has the most GPUs" to "which data curation strategy is most effective."

Pretraining breadth is the key variable. Prior models like BiomedCLIP, PMC-CLIP, and PubMedCLIP focused primarily on image-text pairs from PubMed Central. OpenMedQ integrates 14 heterogeneous sources across distinct modalities, which likely explains the stronger generalization observed across 8 unseen classification benchmarks.

### 3. The Transfer Benchmark: The Strongest Signal

The PathVQA score could be inflated by distribution overlap between the pretraining data and the test set. The real generalization test is transfer to 8 medical classification benchmarks **unseen during pretraining**, with all compared models evaluated under an identical downstream recipe.
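For readers less familiar with the metric, here is a minimal sketch of macro-F1 and of averaging it over several benchmarks. The toy labels are hypothetical, and averaging the per-benchmark scores with equal weight is an assumption about how the reported average is formed, not a detail confirmed by the source.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def average_macro_f1(per_benchmark):
    """Equal-weight mean of macro-F1 over a list of (y_true, y_pred) pairs (assumed scheme)."""
    return sum(macro_f1(t, p) for t, p in per_benchmark) / len(per_benchmark)


# Hypothetical imbalanced task: the model always predicts the majority class.
y_true = ["benign"] * 8 + ["malignant"] * 2
y_pred = ["benign"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro_f1={score:.3f}")
```

Because every class counts equally, a model that ignores a rare class is penalized far more by macro-F1 than by accuracy, which is why the metric suits imbalanced medical classification tasks.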

Average macro-F1 results:

- OpenMedQ: **0.757**
- PubMedCLIP: 0.746
- BiomedCLIP: 0.745
- PMC-CLIP: 0.745
- From-scratch baseline: 0.616

The gap between OpenMedQ and the BiomedCLIP/PMC-CLIP/PubMedCLIP trio is ~0.011-0.012 macro-F1 points. Not dramatic in absolute terms, but **consistent across 8 heterogeneous tasks**, suggesting a genuinely more robust visual encoder rather than optimization on a specific distribution. The gap versus the from-scratch baseline (+0.141) confirms that domain-specific medical pretraining remains non-negotiable.
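The gaps quoted above follow directly from the reported averages. The short sketch below recomputes them; only the five scores come from the summary, while the helper name `gaps_vs` is illustrative.

```python
# Average macro-F1 values as reported in the summary.
reported = {
    "OpenMedQ": 0.757,
    "PubMedCLIP": 0.746,
    "BiomedCLIP": 0.745,
    "PMC-CLIP": 0.745,
    "From-scratch baseline": 0.616,
}


def gaps_vs(reference, scores):
    """Absolute macro-F1 gap between the reference model and every other entry."""
    ref = scores[reference]
    return {name: round(ref - s, 3) for name, s in scores.items() if name != reference}


gaps = gaps_vs("OpenMedQ", reported)
for name, gap in gaps.items():
    relative = gap / reported[name]
    print(f"{name}: +{gap:.3f} absolute ({relative:.1%} relative)")
```

Averages alone cannot show whether the lead holds on each of the 8 tasks; that requires the per-benchmark numbers from the paper.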

### 4. Potential Losers and Caveats to Watch

**Direct losers:** BiomedCLIP (Microsoft), PMC-CLIP, and PubMedCLIP lose their status as the open-source reference for medical visual encoders. Teams that built downstream pipelines on these models now have to justify not switching to a stronger baseline.

**Med-PaLM M and large closed multimodal models** see their "only scale matters" argument weakened on structured medical QA tasks. If a reasonably sized open-source model beats 562B parameters on PathVQA, the inference cost justification for very large models on this task class becomes hard to defend.

**Caveats not to overlook:** The paper does not report OpenMedQ's exact model size or pretraining compute, making a full efficiency comparison impossible. BLEU-1 as the primary metric on PathVQA is a weak proxy for clinical accuracy: it measures unigram overlap with the reference answer, not factual correctness. The abstract does not name the 8 benchmarks behind the macro-F1 average, so it is worth checking whether any are close to the pretraining sources. Finally, "fully open" refers to data availability and code, not necessarily model weights in all configurations.
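To make the BLEU-1 caveat concrete, the sketch below implements a simplified single-reference BLEU-1 (clipped unigram precision times the brevity penalty) with whitespace tokenization. It is not the evaluation script used in the paper, and the example sentences are invented for illustration rather than taken from PathVQA.

```python
import math
from collections import Counter


def bleu1(reference, candidate):
    """Simplified single-reference BLEU-1: clipped unigram precision x brevity penalty."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(ref_tokens)
    cand_counts = Counter(cand_tokens)
    # Each candidate token is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    precision = clipped / len(cand_tokens)
    # Brevity penalty: candidates shorter than the reference are scaled down.
    if len(cand_tokens) >= len(ref_tokens):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref_tokens) / len(cand_tokens))
    return bp * precision


reference = "the tumor is malignant"      # hypothetical ground-truth answer
wrong_answer = "the tumor is benign"      # clinically opposite, high word overlap
terse_answer = "malignant"                # clinically correct, but short

wrong_score = bleu1(reference, wrong_answer)
terse_score = bleu1(reference, terse_answer)
print(f"wrong but fluent: {wrong_score:.3f}")
print(f"correct but terse: {terse_score:.3f}")
```

In this toy case the clinically wrong answer scores higher than the correct one, which is the failure mode to keep in mind when reading BLEU-1 leaderboards for medical QA.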

For practitioners building medical vision systems, OpenMedQ is now the baseline to beat — and its full reproducibility makes it a legitimate starting point for specialized fine-tuning.


Summary generated by Claude — human-verified