arXiv cs.AI

Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

Signal
72
Hype
25
In three lines
Students construct QuestBench, a 256-question benchmark spanning the humanities and social sciences, to evaluate deep research systems. In testing, GPT-4.5 reaches a 57.58% pass rate, while the mean across 13 systems is 16.85%, exposing hidden failures. This classroom practice teaches students to judge the quality of AI output and to remain responsible knowledge actors.
Benchmarks, Evals, GPT

Summary generated by Claude — human-verified