arXiv cs.CL·19 May 2026

Permutation-Consensus Listwise Judging for Robust Factuality Evaluation


In three lines: PCFJudge, an inference-time method, evaluates factuality by rerunning a listwise prompt over multiple orderings of the candidates and aggregating the scores. On RewardBench 2 Factuality with K=7 permutations, top-1 accuracy improves from 86% to 91.33% (GPT-5.4) and from 86.33% to 89.67% (Claude Sonnet 4.6).
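The permutation-consensus idea in the summary can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: the summary does not say how scores are aggregated, so averaging is an assumption, and `permutation_consensus`, the `Judge` signature, and `toy_judge` (a stand-in for a real LLM call, with an artificial first-position bias) are all hypothetical names.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical judge interface: receives the candidates in a given order and
# returns one score per position. A real judge would be a listwise LLM prompt.
Judge = Callable[[Sequence[str]], Sequence[float]]


def permutation_consensus(
    candidates: Sequence[str], judge: Judge, k: int = 7, seed: int = 0
) -> Tuple[int, List[float]]:
    """Score candidates under k random orderings and average per candidate.

    Averaging is an assumed aggregation rule; the source only says that
    scores are aggregated across permutations.
    """
    rng = random.Random(seed)
    n = len(candidates)
    per_candidate: List[List[float]] = [[] for _ in range(n)]
    for _ in range(k):
        order = list(range(n))
        rng.shuffle(order)  # one random ordering of the candidate indices
        scores = judge([candidates[i] for i in order])
        # Map each positional score back to the candidate's original index.
        for pos, idx in enumerate(order):
            per_candidate[idx].append(scores[pos])
    consensus = [mean(s) for s in per_candidate]
    best = max(range(n), key=consensus.__getitem__)
    return best, consensus


def toy_judge(ordered: Sequence[str]) -> List[float]:
    # Stand-in for an LLM call: a fixed quality lookup plus an artificial
    # bonus for whichever candidate happens to be listed first.
    base = {"A": 0.6, "B": 0.9, "C": 0.3}
    return [base[c] + (0.2 if pos == 0 else 0.0) for pos, c in enumerate(ordered)]


best, consensus = permutation_consensus(["A", "B", "C"], toy_judge, k=7, seed=0)
print(best, consensus)
```

In this toy setup, a single pass that happened to list "A" first would score it at 0.8 against a true quality of 0.6. Averaging over several orderings spreads that position bonus across candidates, which is the intuition behind rerunning the prompt under permutations.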


Evals GPT Claude Reasoning

Summary generated by Claude — human-verified

