arXiv cs.AI

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

Signal: 82
Hype: 15
In three lines: DeFAb is a benchmark of 372,648+ instances for evaluating defeasible abduction in language models. The best frontier models reach 65% under standard conditions but drop to 23.5% under rendering-robust evaluation, whereas symbolic logic solvers score 100%. The benchmark includes three difficulty levels, with gold standards that can be verified in polynomial time.
Benchmarks, Reasoning, Evals

Summary generated by Claude — human-verified