ProtStructQA: A Denotation Threshold in Protein Structural Reasoning
ProtStructQA is an executable benchmark for protein structural question answering. 382.2K questions generated from hidden domain-specific language, evaluated on Qwen3 (0.6B–8B) and Gemma-3. Key finding: capability threshold between Qwen3-1.7B and 4B where models transition from failing to produce executable denotations to mastering chain-of-thought reasoning.