Reddit r/LocalLLaMA · May 25, 2026

We added W8A8 activation quantization to MLX — prefill went from 2.84s to 2.52s on M5 Pro


In 3 lines: Mininglamp AI added W8A8 quantization (weights and activations both in INT8) to MLX via Cider, an SDK with custom Metal kernels. On an M5 Pro, prefill drops from 2.84s to 2.52s for a 4B VLM. It is compatible with any MLX model, but the INT8 TensorOps require an M5 or later.
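To make "W8A8" concrete, here is a minimal pure-Python sketch of the general idea: quantize both the activations and the weights to INT8 with one symmetric scale each, accumulate in integers, and rescale once at the end. The function names (`quantize_int8`, `w8a8_dot`) and the sample values are hypothetical, and the per-tensor symmetric scheme is an assumption chosen for simplicity. The post does not describe Cider's actual kernels or API, and none of this is MLX code.

```python
# Illustrative W8A8 sketch: symmetric per-tensor INT8 quantization of both
# weights and activations, integer accumulation, then a single rescale.
# All names are hypothetical; this is not Cider's or MLX's API.

def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def w8a8_dot(activations, weights):
    """Dot product computed on INT8 values, rescaled to float at the end."""
    qa, sa = quantize_int8(activations)
    qw, sw = quantize_int8(weights)
    acc = sum(a * w for a, w in zip(qa, qw))  # integer accumulation
    return acc * sa * sw

# Made-up sample data, only to exercise the functions.
activations = [0.5, -1.25, 2.0, 0.75]
weights = [0.1, 0.4, -0.3, 0.2]
exact = sum(a * w for a, w in zip(activations, weights))
approx = w8a8_dot(activations, weights)
print(f"float dot = {exact:.4f}, W8A8 dot = {approx:.4f}")

# Relative prefill gain computed from the two timings quoted in the summary.
before_s, after_s = 2.84, 2.52
speedup = before_s / after_s
time_saved = (before_s - after_s) / before_s
print(f"speedup = {speedup:.3f}x, time saved = {time_saved:.1%}")
```

The last lines put the quoted timings in perspective: going from 2.84s to 2.52s is roughly an 11% reduction in prefill time, or about a 1.13x speedup.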


Open source · Infrastructure · Tools · Benchmarks

Summary generated by Claude, verified by a human
