arXiv cs.AI · 27 May 2026

LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

In three lines: LiveK12Bench is a dynamic, multi-disciplinary benchmark that evaluates the reasoning capabilities of multimodal models on 2K+ real exam questions in Math, Physics, Chemistry, and Biology. Tests reveal major performance degradation: GPT-5 drops from 79 to 53 out of 100 under realistic exam constraints. The framework includes an automated anti-contamination pipeline and an end-to-end 'Mock Exam' evaluation scheme.

Benchmarks Vision Reasoning Evals

Summary generated by Claude — human-verified
