When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering
Signal: 78
Hype: 25
In three lines
OGCaReBench is a retrieval-focused benchmark that evaluates LLMs on off-guideline clinical questions extracted from published medical case reports. GPT-5.2 achieves 56% without retrieval and 82% with retrieved medical articles. Specialized models reach only 42%.
Summary generated by Claude — human-verified