Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling
Signal
82
Hype
15
In three linesSystematic audit of FOLIO and MALLS benchmarks reveals 39% and 36% errors in FOL formalizations respectively. Authors release corrected annotations and an LLM-based framework to guide manual relabeling, achieving 90% dataset accuracy by reviewing <24% of instances versus >70% for unguided review. Testing on Gemma 31B, Qwen3-30B, and GPT-4o-mini shows +9 to +22 percentage point accuracy gains.Read source
Your take?
Summary generated by Claude — human-verified