arXiv cs.LG · 25 May 2026

ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU

In three lines: ModeSwitch-LLM is a lightweight controller that routes each request to the best-suited inference mode (FP16, quantization, speculative decoding, GPTQ + prefix caching, or INT8 + continuous batching) on a single GPU. Tested with Llama-3.1-8B on an A100, it achieves a 2.10x latency speedup and a 51.7% reduction in energy per token while preserving accuracy (+0.17 pp vs. FP16).
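To make the idea of per-request mode routing concrete, here is a minimal Python sketch. It is not the paper's controller: the summary does not describe the actual policy, so the request features (`prompt_tokens`, `max_new_tokens`, `shares_prefix`, `queue_depth`), the thresholds, and the rule order are all hypothetical and chosen only for illustration. Only the five mode names come from the summary.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    # The five inference modes named in the summary.
    FP16 = "FP16"
    QUANTIZED = "quantization"
    SPECULATIVE = "speculative decoding"
    GPTQ_PREFIX_CACHE = "GPTQ + prefix caching"
    INT8_CONT_BATCH = "INT8 + continuous batching"


@dataclass
class Request:
    # Hypothetical request features; the summary does not list the real ones.
    prompt_tokens: int
    max_new_tokens: int
    shares_prefix: bool  # prompt starts with a prefix seen in earlier requests
    queue_depth: int     # number of requests currently waiting


def choose_mode(req: Request) -> Mode:
    """Pick an inference mode with hand-written rules.

    Thresholds and rule order are illustrative assumptions,
    not values reported for ModeSwitch-LLM.
    """
    if req.queue_depth >= 8:
        # Many waiting requests: favor throughput via batching.
        return Mode.INT8_CONT_BATCH
    if req.shares_prefix:
        # A reusable prefix can skip part of the prompt computation.
        return Mode.GPTQ_PREFIX_CACHE
    if req.max_new_tokens >= 256:
        # Long generations are dominated by the decode phase.
        return Mode.SPECULATIVE
    if req.prompt_tokens >= 2048:
        # Long prompts: reduce memory pressure with a quantized model.
        return Mode.QUANTIZED
    # Short, isolated requests fall back to the FP16 baseline.
    return Mode.FP16


if __name__ == "__main__":
    example = Request(prompt_tokens=120, max_new_tokens=512,
                      shares_prefix=False, queue_depth=1)
    print(choose_mode(example).value)
```

A real controller would presumably derive its decisions from measured latency and energy rather than fixed thresholds; the sketch only shows the shape of a function that maps one request to one mode.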

Llama Benchmarks

Summary generated by Claude — human-verified
