arXiv cs.CL

When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering

Signal: 78
Hype: 25
In three lines
OGCaReBench is a retrieval-focused benchmark that evaluates LLMs on off-guideline clinical questions extracted from published medical case reports. GPT-5.2 achieves 56% without retrieval and 82% with retrieved medical articles. Specialized models reach only 42%.
Benchmarks, RAG, Reasoning, GPT

Summary generated by Claude — human-verified