arXiv cs.AI · 22 May 2026

PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

In three lines: PALS is an LLM inference optimization system integrated into vLLM that treats GPU power caps as a controllable parameter. By combining offline power-performance models with feedback-driven control, it improves energy efficiency by up to 26.3% and reduces QoS violations by 4x to 7x across dense and mixture-of-experts models.
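
To make the idea of pairing an offline power-performance model with feedback control concrete, here is a minimal sketch. It is not the PALS algorithm: the profile values, the step size, the cap bounds, and the function names `initial_cap` and `adjust_cap` are all hypothetical, chosen only to illustrate the general pattern of picking a starting power cap from a profiled table and then nudging it based on observed latency.

```python
# Hypothetical offline profile: GPU power cap (W) -> predicted p99 latency (ms).
# These numbers are made up for illustration; they are not from the paper.
OFFLINE_PROFILE = {200: 180.0, 250: 140.0, 300: 115.0, 350: 100.0, 400: 95.0}


def initial_cap(slo_ms, profile=OFFLINE_PROFILE):
    """Return the lowest cap whose predicted latency meets the SLO.

    Falls back to the highest profiled cap if no cap is predicted to meet it.
    """
    for cap in sorted(profile):
        if profile[cap] <= slo_ms:
            return cap
    return max(profile)


def adjust_cap(cap, observed_ms, slo_ms, step=25, lo=200, hi=400, margin=0.9):
    """One feedback step: raise the cap on an SLO violation, lower it when
    there is comfortable headroom, otherwise leave it unchanged."""
    if observed_ms > slo_ms:
        return min(hi, cap + step)   # violating the SLO: allow more power
    if observed_ms < margin * slo_ms:
        return max(lo, cap - step)   # ample headroom: save energy
    return cap                       # inside the dead band: hold steady


cap = initial_cap(slo_ms=120.0)      # model-based starting point
cap = adjust_cap(cap, observed_ms=130.0, slo_ms=120.0)  # feedback correction
```

In a real deployment the chosen cap would still have to be applied to the device, for example through `nvidia-smi -pl`, and the observed latency would come from the serving engine's own metrics; both are left out of this sketch.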

Infrastructure Benchmarks Tools

Summary generated by Claude — human-verified
