arXiv cs.LG

ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU

Signal: 72
Hype: 18
In three lines: ModeSwitch-LLM is a lightweight controller that routes each request to the best-suited inference mode (FP16, quantization, speculative decoding, GPTQ + prefix caching, or INT8 + continuous batching) on a single GPU. Tested with Llama-3.1-8B on an A100, it achieves a 2.10x latency speedup and a 51.7% reduction in energy per token while preserving accuracy (+0.17pp vs FP16).
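To put the headline figures in more concrete terms, here is a minimal Python sketch that converts the reported 2.10x speedup and 51.7% energy reduction into fractions of the baseline. The function names are illustrative only, and the summary does not say which baseline the latency and energy figures are measured against, so the results are relative to whatever baseline the paper uses.

```python
def relative_latency(speedup: float) -> float:
    """Fraction of baseline latency that remains after a given speedup."""
    return 1.0 / speedup


def relative_energy(reduction_pct: float) -> float:
    """Fraction of baseline energy per token that remains after a reduction."""
    return 1.0 - reduction_pct / 100.0


# Figures taken from the summary; the baseline itself is an assumption.
SPEEDUP = 2.10
ENERGY_REDUCTION_PCT = 51.7

latency_fraction = relative_latency(SPEEDUP)             # about 0.476
energy_fraction = relative_energy(ENERGY_REDUCTION_PCT)  # about 0.483
tokens_per_joule_gain = 1.0 / energy_fraction            # about 2.07

print(f"Latency: {latency_fraction:.1%} of baseline")
print(f"Energy per token: {energy_fraction:.1%} of baseline")
print(f"Tokens per joule: {tokens_per_joule_gain:.2f}x baseline")
```

In other words, each request would take a little under half the baseline time, and each token would cost a little under half the baseline energy.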
Llama · Benchmarks

Summary generated by Claude — human-verified