arXiv cs.CL · 19 May 2026

Validate Your Authority: Benchmarking LLMs on Multi-Label Precedent Treatment Classification

In three lines: A benchmark of LLMs on multi-label legal precedent treatment classification, using an expert-annotated dataset of 239 real-world citations. Gemini 2.5 Flash achieves 79.1% on high-level classification; GPT-5-mini achieves 67.7% on the fine-grained schema. A novel Average Severity Error metric measures the practical impact of misclassifications.

Benchmarks Gemini GPT Evals

Summary generated by Claude — human-verified
