When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening
Signal: 72
Hype: 25
In three lines: A SCID-anchored benchmark of 555 semi-structured interviews evaluates 5 LLMs (including GPT-4.1 Mini and GPT-5 Mini) on psychiatric screening for anxiety, depression, and PTSD. Accuracy ranges from 0.49 to 0.86 and the Matthews correlation coefficient (MCC) from 0.16 to 0.38. False negatives show that the models downweight symptoms when functioning is preserved or social support is present, so clinical validation is needed before deployment.
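Accuracy and MCC can diverge sharply when a screen misses many positive cases, which is the failure mode the summary highlights. The sketch below is illustrative only: the confusion-matrix counts are hypothetical and are not taken from the benchmark, and the function names are assumptions made for this example. It uses the standard definitions of accuracy and MCC for a binary classifier.

```python
import math


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts (NOT from the paper): a screen that misses
# 60 of 80 true cases but is right on most of the 420 negatives.
tp, fp, fn, tn = 20, 10, 60, 410

print(f"accuracy = {accuracy(tp, fp, fn, tn):.2f}")
print(f"MCC      = {mcc(tp, fp, fn, tn):.2f}")
```

With these made-up counts, accuracy looks strong because negatives dominate the sample, while MCC stays modest because most true cases are missed. That pattern is one reason a false-negative-heavy screen can report high accuracy alongside a low MCC.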
Summary generated by Claude — human-verified