arXiv cs.CL

Permutation-Consensus Listwise Judging for Robust Factuality Evaluation

Signal: 72
Hype: 18
In three lines: PCFJudge is an inference-time method that evaluates factuality by rerunning a listwise prompt over multiple orderings of the candidates and aggregating the resulting scores. On RewardBench 2 Factuality with K=7 permutations, top-1 accuracy improves from 86% to 91.33% (GPT-5.4) and from 86.33% to 89.67% (Claude Sonnet 4.6).
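
The rerun-and-aggregate idea can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the `judge` callable is a hypothetical stand-in for one listwise LLM call, random shuffles stand in for whatever permutation scheme the authors use, and mean-score aggregation is assumed because the summary does not say how scores are combined.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical judge interface (an assumption, not from the paper):
# takes the candidates in presentation order and returns one score per
# candidate, in that same order.
Judge = Callable[[Sequence[str]], Sequence[float]]


def permutation_consensus(
    candidates: Sequence[str],
    judge: Judge,
    k: int = 7,
    seed: int = 0,
) -> Tuple[int, List[float]]:
    """Score candidates under k random orderings and average per candidate.

    Returns the index of the top-1 candidate and the averaged scores,
    both expressed in the original candidate order.
    """
    rng = random.Random(seed)
    n = len(candidates)
    per_candidate: List[List[float]] = [[] for _ in range(n)]

    for _ in range(k):
        # Draw a random presentation order for this run.
        order = list(range(n))
        rng.shuffle(order)
        scores = judge([candidates[i] for i in order])
        # Map each positional score back to the original candidate index.
        for position, original_index in enumerate(order):
            per_candidate[original_index].append(scores[position])

    # Assumed aggregation rule: arithmetic mean across the k runs.
    averaged = [mean(s) for s in per_candidate]
    best = max(range(n), key=lambda i: averaged[i])
    return best, averaged
```

Calling `permutation_consensus(responses, judge, k=7)` mirrors the K=7 setting above. Because every score is mapped back to its original index before averaging, a judge that favors a particular slot has that preference spread across candidates rather than concentrated on one.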
Tags: Evals, GPT, Claude, Reasoning

Summary generated by Claude — human-verified