arXiv cs.AI · 19 May 2026

Estimating Item Difficulty with Large Language Models as Experts

In three lines: This study evaluates three off-the-shelf LLMs as estimators of educational item difficulty when no response data are available. Across six primary-school mathematics domains, Spearman correlations indicate moderate-to-strong alignment between LLM estimates and empirical difficulties. Pairwise comparisons outperform absolute judgements, and adding token probabilities and few-shot examples improves results.
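
To make the evaluation idea concrete, here is a minimal sketch of how pairwise "which item is harder?" judgements could be turned into a difficulty ranking and scored against empirical difficulties with a Spearman correlation. The summary does not say how the study aggregated pairwise comparisons, so the win-count aggregation is an assumption, and `judge_harder`, `llm_score`, and the numbers are hypothetical stand-ins for an LLM call and real data.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def win_counts(n_items, judge_harder):
    """Count, for each item, how many pairwise comparisons it 'wins'.

    judge_harder(i, j) is a hypothetical callable returning True when
    item i is judged harder than item j (e.g. by prompting an LLM).
    """
    wins = [0] * n_items
    for i, j in combinations(range(n_items), 2):
        if judge_harder(i, j):
            wins[i] += 1
        else:
            wins[j] += 1
    return wins


# Illustrative data only: made-up empirical difficulties and a made-up
# latent score standing in for the LLM's pairwise preferences.
empirical = [0.2, 0.5, 0.9, 0.4]
llm_score = [0.1, 0.6, 0.8, 0.3]

wins = win_counts(len(empirical), lambda i, j: llm_score[i] > llm_score[j])
rho = spearman(wins, empirical)
print(f"win counts: {wins}, Spearman rho: {rho:.2f}")
```

Win counting is the simplest aggregation. With noisy or incomplete comparisons, a fitted model such as Bradley-Terry is a common alternative, though nothing here indicates which approach the authors used.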

Prompt engineering · Evals · Benchmarks

Summary generated by Claude — human-verified
