Reddit r/MachineLearning · 10 June 2026

Routing LLMs by task verifiability: a small experiment (n=120, 3 models) inspired by Karpathy's framework [D]

In three lines: An experiment on 120 tasks tests whether weaker models match frontier models on high-verifiability tasks, following Karpathy's framework, and compares Claude Sonnet 4.6, GPT 5.5, and Mistral 3 8B. On code and structured extraction, the gaps narrow with retry (Mistral goes from 87% to 95% on code). On multi-hop reasoning there is a real capability gap (Sonnet 78%, Mistral 51%), and on creative summarization the stronger models hold the expected advantage.
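One way to picture the retry effect described above is a verify-and-retry router: when a task has an automatic check, sample the weaker model a few times and keep the first output that passes, and only escalate to the stronger model when the check keeps failing or no check exists. The sketch below is an illustration under stated assumptions, not the poster's code. Every name in it (`route`, `RouteResult`, `is_valid_json`, `scripted_model`) is hypothetical, a "model" is any callable from prompt to text, and the retry budget of 3 is arbitrary.

```python
import json
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# Hypothetical types: a model is any callable str -> str,
# a verifier is any callable str -> bool.
Model = Callable[[str], str]
Verifier = Callable[[str], bool]


@dataclass
class RouteResult:
    output: str
    model_used: str  # "weak" or "strong"
    attempts: int    # number of weak-model calls made
    verified: bool   # True only if a verifier accepted the output


def route(prompt: str, weak: Model, strong: Model,
          verifier: Optional[Verifier] = None,
          max_retries: int = 3) -> RouteResult:
    # Low-verifiability task (no automatic check): use the stronger model.
    if verifier is None:
        return RouteResult(strong(prompt), "strong", 0, False)
    # High-verifiability task: sample the weak model until an output passes.
    for attempt in range(1, max_retries + 1):
        candidate = weak(prompt)
        if verifier(candidate):
            return RouteResult(candidate, "weak", attempt, True)
    # Retry budget exhausted: escalate, and still report the check result.
    output = strong(prompt)
    return RouteResult(output, "strong", max_retries, verifier(output))


def is_valid_json(text: str) -> bool:
    # Example verifier for structured extraction: does the output parse?
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def scripted_model(responses: Iterable[str]) -> Model:
    # Stand-in for a real model call: replays canned responses in order.
    it = iter(responses)
    return lambda prompt: next(it)


if __name__ == "__main__":
    weak = scripted_model(['{"name": ', '{"name": "Ada"}'])
    strong = scripted_model(['{"name": "Ada"}'])
    print(route("Extract the name as JSON.", weak, strong, is_valid_json))
```

The design choice worth noting is that retries only help when the verifier is cheap and reliable, which is why the sketch sends unverifiable tasks straight to the stronger model instead of sampling the weaker one repeatedly.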

Tags: Claude, GPT, Mistral, Evals, Reasoning

Summary generated by Claude — human-verified
