Reddit r/MachineLearning

Routing LLMs by task verifiability: a small experiment (n=120, 3 models) inspired by Karpathy's framework [D]

Signal: 45
Hype: 25
In three lines: An experiment on 120 tasks tests whether weaker models match frontier models on high-verifiability tasks (Karpathy's framework), comparing Claude Sonnet 4.6, GPT 5.5, and Mistral 3 8B. On code and structured extraction, the gap narrows with retries (Mistral improves from 87% to 95% on code). On multi-hop reasoning there is a real capability gap (Sonnet 78%, Mistral 51%), and on creative summarization stronger models hold the expected advantage.
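The routing idea in the summary can be sketched roughly as follows. This is a minimal illustration and not the poster's code: the `route` function, the `weak`/`strong` model callables, the `max_attempts` parameter, and the JSON check used as a verifier are all hypothetical names chosen for the example. The assumption is that a task counts as "verifiable" when a cheap automatic check exists for its output; such tasks go to the weaker model with retries, and everything else (or anything that keeps failing the check) goes to the stronger model.

```python
import json
from typing import Callable, Optional, Tuple

# Hypothetical types: a model maps a prompt to text, a verifier checks the text.
Model = Callable[[str], str]
Verifier = Callable[[str], bool]


def route(
    prompt: str,
    weak: Model,
    strong: Model,
    verifier: Optional[Verifier] = None,
    max_attempts: int = 2,
) -> Tuple[str, str]:
    """Return (answer, label), where label names the model that produced it."""
    # No automatic check available: treat the task as low-verifiability
    # and send it straight to the stronger model.
    if verifier is None:
        return strong(prompt), "strong"

    # Verifiable task: let the weaker model try, retrying on failed checks.
    for _ in range(max_attempts):
        answer = weak(prompt)
        if verifier(answer):
            return answer, "weak"

    # The weaker model never passed the check, so escalate.
    return strong(prompt), "strong"


def is_json(text: str) -> bool:
    """Example verifier for structured extraction: does the output parse as JSON?"""
    try:
        json.loads(text)
    except ValueError:
        return False
    return True


def make_flaky_model() -> Model:
    """Stub standing in for a weaker model: fails once, then returns valid JSON."""
    calls = {"n": 0}

    def model(prompt: str) -> str:
        calls["n"] += 1
        return "not json" if calls["n"] == 1 else '{"ok": true}'

    return model


def strong_stub(prompt: str) -> str:
    """Stub standing in for a stronger model."""
    return '{"ok": true, "source": "strong"}'


if __name__ == "__main__":
    print(route("Extract the fields as JSON", make_flaky_model(), strong_stub, is_json))
```

With the stubs above, the weaker model fails the check on its first attempt and passes on the retry, so the call never reaches the stronger model. Real verifiers for code would more likely run unit tests than parse JSON; the structure of the loop stays the same.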
Tags: Claude, GPT, Mistral, Evals, Reasoning

Summary generated by Claude — human-verified