arXiv cs.AI·19 May 2026

Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings

In three lines: Scales++ selects benchmark evaluation subsets based on intrinsic task properties rather than model-specific failure patterns. Using 0.25% of the data on the Open LLM Leaderboard, it predicts full benchmark scores with 3.2% mean absolute error and reduces selection cost by 18x.
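To make the two headline figures concrete, here is a minimal Python sketch of what a 0.25% subset and a mean absolute error mean in practice. It does not implement the Scales++ selection method. The helper names `subset_size` and `mean_absolute_error`, the 10,000-item benchmark size, and all scores are made up for illustration.

```python
from typing import Sequence


def subset_size(total_items: int, fraction: float) -> int:
    """Number of benchmark items kept when evaluating on a fraction of the data."""
    return max(1, round(total_items * fraction))


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute gap between predicted and true full-benchmark scores."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical scores (percentage points) for three imaginary models.
full_scores = [61.0, 48.5, 72.0]        # scores on the full benchmark
predicted_scores = [58.0, 51.5, 75.0]   # scores estimated from the small subset

# A 0.25% subset of a hypothetical 10,000-item benchmark.
print(subset_size(10_000, 0.0025))
# Average prediction error across the three imaginary models.
print(mean_absolute_error(predicted_scores, full_scores))
```

In this toy setup the subset holds 25 items and every prediction is off by exactly 3 points, so the mean absolute error is 3.0, roughly the scale of error the summary reports.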



Summary generated by Claude — human-verified
