arXiv cs.CL·3 juin 2026

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

Signal

Hype

En 3 lignesAudit systématique des benchmarks FOLIO et MALLS révélant 39% et 36% d'erreurs dans les formalisations FOL. Les auteurs publient des annotations corrigées et un framework LLM pour guider la relabélisation manuelle, permettant d'atteindre 90% de précision en révisant <24% des instances. Tests sur Gemma 31B, Qwen3-30B et GPT-4o-mini montrent des gains de +9 à +22 points.

Lire la source

Ton avis ?

Benchmarks Évaluations Raisonnement Papers

Résumé généré par Claude — vérifié par l'humain

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

Autres angles sur ce sujet