arXiv cs.CL · 3 June 2026

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

In three lines: A systematic audit of the FOLIO and MALLS benchmarks finds errors in 39% and 36% of their first-order logic (FOL) formalizations, respectively. The authors release corrected annotations and an LLM-based framework that guides manual relabeling, reaching 90% dataset accuracy after reviewing fewer than 24% of instances, versus more than 70% for unguided review. Tests on Gemma 31B, Qwen3-30B, and GPT-4o-mini show accuracy gains of +9 to +22 percentage points.
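The "more than 70%" figure for unguided review can be sanity-checked with a back-of-the-envelope calculation. The sketch below is not the paper's method; it assumes that unguided review samples instances uniformly at random and that every reviewed instance ends up correct, so the remaining error is the original error rate times the unreviewed fraction. The function name `unguided_review_fraction` is hypothetical.

```python
def unguided_review_fraction(error_rate: float, target_accuracy: float) -> float:
    """Fraction of instances to review at random to reach the target accuracy.

    Assumption (not from the paper): reviewed instances become fully correct,
    and errors are spread uniformly, so the residual error after reviewing a
    fraction f is error_rate * (1 - f).
    """
    allowed_error = 1.0 - target_accuracy
    if error_rate <= allowed_error:
        return 0.0  # already at or above the target, nothing to review
    # Solve error_rate * (1 - f) = allowed_error for f.
    return 1.0 - allowed_error / error_rate


# Error rates reported in the summary above.
for name, error_rate in [("FOLIO", 0.39), ("MALLS", 0.36)]:
    fraction = unguided_review_fraction(error_rate, target_accuracy=0.90)
    print(f"{name}: review about {fraction:.1%} of instances")
```

Under these assumptions, both benchmarks need roughly 72% to 74% of instances reviewed at random to reach 90% accuracy, which is in line with the "more than 70%" baseline quoted above. The guided figure of under 24% depends on how well the LLM-based framework ranks likely errors, which this sketch does not model.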

Benchmarks Evals Reasoning Papers

Summary generated by Claude — human-verified
