DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models
Signal: 82
Hype: 15
In three lines
DeFAb is a benchmark of 372,648+ instances for evaluating defeasible abduction in language models. The best frontier models reach 65% under standard conditions but drop to 23.5% under rendering-robust evaluation, versus 100% for symbolic logic solvers. The benchmark includes three difficulty levels with polynomial-time verifiable gold standards.
Summary generated by Claude — human-verified