arXiv cs.AI · 19 May 2026

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

In three lines: A systematic study of compressing a large MoE model (Qwen3-Next-80A3B → 23A2B) through pruning and distillation at pre-training scale. Pruning the larger model outperforms training the smaller one from scratch, multi-token prediction (MTP) distillation improves performance, and progressive schedules beat one-shot compression.

Summary generated by Claude — human-verified
