HalBench: I built a custom sycophancy and hallucination benchmark and tested 4 frontier models (Sonnet 4.6, Grok 4.3, GPT-5.4, and Gemini 3.1 Pro). Looking for input on which OSS models to run next!
Signal: 75
Hype: 25
In three lines
HalBench: open-source benchmark measuring sycophancy and hallucinations across 3,200 false-premise prompts, tested on 4 models (Sonnet 4.6, Grok 4.3, GPT-5.4, Gemini 3.1 Pro). Sonnet 4.6 scores 0.565/1, Grok 4.3 0.498, GPT-5.4 0.381, Gemini 3.1 Pro 0.339. Dataset, code, and results are public.
Summary generated by Claude — human-verified