arXiv cs.AI

PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

Signal: 78 · Hype: 15
In three lines

PALS is an LLM inference optimization system integrated into vLLM that treats GPU power caps as a controllable parameter. By combining offline power-performance models with feedback-driven control, it improves energy efficiency by up to 26.3% and reduces QoS violations by 4x to 7x across dense and mixture-of-experts models.
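
To make the idea of combining an offline model with feedback control concrete, here is a minimal sketch. It is not PALS's actual algorithm or vLLM's API: the class `PowerCapController`, its parameters, and the profile numbers are all hypothetical, and the step-based rule is just one simple way such a loop could work. The offline profile picks a starting power cap, and observed latency then nudges the cap up or down.

```python
from dataclasses import dataclass


@dataclass
class PowerCapController:
    """Hypothetical controller, for illustration only (not the PALS design)."""

    slo_ms: float           # latency target used as the QoS threshold
    min_cap_w: float        # lowest power cap the controller may set
    max_cap_w: float        # highest power cap the controller may set
    step_w: float = 10.0    # how far one feedback step moves the cap
    headroom: float = 0.9   # lower the cap only below 90% of the SLO

    def initial_cap(self, profile: dict[float, float]) -> float:
        """Pick the lowest profiled cap predicted to meet the SLO.

        `profile` maps a power cap in watts to a latency in milliseconds
        measured offline. Falls back to the maximum cap if none qualifies.
        """
        feasible = [
            cap
            for cap, latency_ms in profile.items()
            if latency_ms <= self.slo_ms
            and self.min_cap_w <= cap <= self.max_cap_w
        ]
        return min(feasible) if feasible else self.max_cap_w

    def update(self, cap_w: float, observed_ms: float) -> float:
        """One feedback step: raise the cap on a violation, lower it with slack."""
        if observed_ms > self.slo_ms:
            cap_w += self.step_w
        elif observed_ms < self.headroom * self.slo_ms:
            cap_w -= self.step_w
        # Clamp to the allowed range.
        return max(self.min_cap_w, min(self.max_cap_w, cap_w))


# Made-up offline profile: power cap (W) -> latency (ms).
profile = {200.0: 140.0, 250.0: 110.0, 300.0: 95.0, 350.0: 85.0, 400.0: 80.0}
ctrl = PowerCapController(slo_ms=100.0, min_cap_w=200.0, max_cap_w=400.0)
cap = ctrl.initial_cap(profile)
cap = ctrl.update(cap, observed_ms=120.0)  # SLO violated, so the cap rises
```

The dead band between 90% and 100% of the SLO is an assumption added here to keep the cap from oscillating when latency sits close to the target.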
Infrastructure · Benchmarks · Tools

Summary generated by Claude — human-verified