Why do we benchmark quants on perplexity and prose but never on tool call validity?
An r/LocalLLaMA user argues that quantization benchmarks measure perplexity and prose quality but ignore tool call validity. They hypothesize that quantization error degrades structured outputs (JSON, schema-conforming output) earlier than it degrades free text, which would make current metrics inadequate for agentic use cases.
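
One way to make the hypothesis testable is to score each quant by the fraction of its outputs that parse as a valid tool call, alongside perplexity. The following is only a minimal sketch, not the poster's method: it assumes tool calls are emitted as JSON objects with `name` and `arguments` keys, and the `get_weather` tool, its schema layout, and the sample strings are all hypothetical.

```python
import json

# Hypothetical tool definition: required argument names mapped to expected types.
WEATHER_TOOL = {
    "name": "get_weather",
    "required": {"city": str, "days": int},
}


def is_valid_tool_call(raw: str, tool: dict) -> bool:
    """Return True if raw parses as JSON and matches the tool's name and argument types."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False  # malformed JSON, e.g. a dropped brace or quote
    if not isinstance(call, dict) or call.get("name") != tool["name"]:
        return False
    args = call.get("arguments")
    if not isinstance(args, dict):
        return False
    for key, expected in tool["required"].items():
        value = args.get(key)
        # bool is a subclass of int in Python, so reject booleans outright
        # (no field in this hypothetical schema expects one).
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
    return True


def validity_rate(outputs: list[str], tool: dict) -> float:
    """Fraction of outputs that are valid calls to the given tool."""
    if not outputs:
        return 0.0
    return sum(is_valid_tool_call(o, tool) for o in outputs) / len(outputs)


# Made-up strings for illustration only; these are not real model outputs.
samples = [
    '{"name": "get_weather", "arguments": {"city": "Oslo", "days": 3}}',
    '{"name": "get_weather", "arguments": {"city": "Oslo", "days": "3"}}',  # wrong type
    '{"name": "get_weather", "arguments": {"city": "Oslo", "days": 3}',  # missing brace
]
print(validity_rate(samples, WEATHER_TOOL))
```

Running the same prompts through each quantization level and comparing `validity_rate` against perplexity would show whether structured output actually breaks first, which is the claim the post leaves untested.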