arXiv cs.AI · 18 June 2026

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

In three lines: DeFAb is a benchmark of 372,648+ instances for evaluating defeasible abductive reasoning in language models. The best frontier models reach 65% under standard conditions but drop to 23.5% under rendering-robust evaluation, versus 100% for symbolic logic solvers. The benchmark has three difficulty levels, each with gold standards that can be verified in polynomial time.

Benchmarks Reasoning Evals

Summary generated by Claude — human-verified
