LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?
Signal: 78
Hype: 25
In three lines: LiveK12Bench is a dynamic, multi-disciplinary benchmark that evaluates the reasoning capabilities of multimodal models on 2K+ real exam questions (Math, Physics, Chemistry, Biology). Tests reveal major performance degradation: GPT-5 drops from 79 to 53 out of 100 under realistic exam constraints. The framework includes an automated anti-contamination pipeline and an end-to-end 'Mock Exam' evaluation scheme.
Summary generated by Claude — human-verified