Reddit r/LocalLLaMA · 20 May 2026

HalBench: I built a custom sycophancy and hallucination benchmark and tested 4 frontier models (Sonnet 4.6, Grok 4.3, GPT 5.4 and Gemini 3.1 Pro), looking for input on what OSS models to run next!


In 3 lines: HalBench is an open-source benchmark measuring sycophancy and hallucination on 3,200 false-premise prompts, tested on 4 models (Sonnet 4.6, Grok 4.3, GPT-5.4, Gemini 3.1 Pro). Sonnet 4.6 scores 0.565 out of 1, Grok 4.3 0.498, GPT-5.4 0.381, and Gemini 3.1 Pro 0.339. The dataset, code, and results are public.


Benchmarks, Evaluations, AI Safety, Alignment

Summary generated by Claude, verified by a human
