arXiv cs.CL

Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

Signal: 78
Hype: 15
In three lines
The paper evaluates the semantic stability of 16 LLMs (general-purpose and medical) under clinically equivalent prompt reformulations. It proposes an NLI-based verification framework and three sensitivity metrics (MVS, ΔC, WCI). Finding: domain specialization does not consistently improve robustness to meaning-preserving variations.
Evals · AI safety · Reasoning · Benchmarks

Summary generated by Claude — human-verified