ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU
Signal: 72, Hype: 18
In three lines
ModeSwitch-LLM is a lightweight controller that routes each request to an optimal inference mode (FP16, quantization, speculative decoding, GPTQ + prefix caching, or INT8 + continuous batching) on a single GPU. It was tested with Llama-3.1-8B on an A100. It achieves a 2.10x latency speedup and a 51.7% reduction in energy per token while preserving accuracy (+0.17pp vs. FP16).
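To put the two headline figures on the same scale, the short sketch below converts the reported 2.10x speedup and 51.7% energy reduction into fractions of the baseline. The helper names are hypothetical and introduced only for illustration; the summary does not say which baseline these two figures are measured against, so the code treats it as an unnamed reference value of 1.0.

```python
def fraction_after_speedup(speedup: float) -> float:
    """Fraction of baseline latency that remains after an N-times speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def fraction_after_reduction(reduction_pct: float) -> float:
    """Fraction of the baseline that remains after a percentage reduction."""
    if not 0.0 <= reduction_pct <= 100.0:
        raise ValueError("reduction_pct must be between 0 and 100")
    return 1.0 - reduction_pct / 100.0


# Figures taken from the summary above; the baseline is normalized to 1.0.
latency_left = fraction_after_speedup(2.10)
energy_left = fraction_after_reduction(51.7)

print(f"latency: {latency_left:.1%} of baseline")
print(f"energy per token: {energy_left:.1%} of baseline")
```

Read this way, a 2.10x speedup leaves a bit under half of the baseline latency, and the energy-per-token figure leaves a similar share, so the two improvements are of comparable size.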
Summary generated by Claude — human-verified