Reddit r/LocalLLaMA·20 May 2026

HalBench: I built a custom sycophancy and hallucination benchmark and tested 4 frontier models (Sonnet 4.6, Grok 4.3, GPT 5.4 and Gemini 3.1 Pro), looking for input on what OSS models to run next!


In three lines: HalBench is an open-source benchmark that measures sycophancy and hallucination using 3,200 false-premise prompts, run against 4 models (Sonnet 4.6, Grok 4.3, GPT-5.4, Gemini 3.1 Pro). Scores out of 1: Sonnet 4.6 0.565, Grok 4.3 0.498, GPT-5.4 0.381, Gemini 3.1 Pro 0.339. The dataset, code, and results are public.


Tags: Benchmarks, Evals, AI safety, Alignment

Summary generated by Claude — human-verified
