Reddit r/LocalLLaMA

hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX)

In three lines: hipEngine is an open-source (AGPLv3) LLM inference engine optimized for RDNA3 (RX 7900 XTX, W7900). Written in Python with HIP/C++ kernels, it runs Qwen 3.6 MoE faster than llama.cpp on prefill (2718 tok/s at 512 tokens vs 2436 for llama.cpp HIP on GGUF Q4_K_S). A near-lossless INT8 KV cache enables 256K context in under 24 GB.

## hipEngine: A ROCm-Native Inference Engine That Beats llama.cpp on RDNA3

### What's Actually Happening

hipEngine is an open-source (AGPLv3) LLM inference engine from the author of FastDMS, targeting AMD RDNA3 GPUs exclusively — primarily RX 7900 XTX and Radeon Pro W7900 (gfx1100), with initial support for Strix Halo (gfx1151, Ryzen AI MAX+ 395). The architecture is Python at the surface, but the entire hot path runs through custom HIP/C++ kernels leveraging hipBLASLt, hipGraph, and AOTriton. No heavy PyTorch dependency.

The initial target model is Qwen 3.6 35B-A3B (MoE), with support for GGUF (Q4_K_M, Q4_K_S) and ParoQuant (4.68bpw) formats — the latter ported to ROCm compatibility by the same author.

### The Numbers That Matter

**Prefill (gfx1100, W7900/7900 XTX):**

- At 512 tokens: hipEngine PARO hits **2718 tok/s** vs 2436 for llama.cpp HIP (+11.6%) and 1817 for llama.cpp Vulkan (+49.6%)
- At 4K tokens: **2838 tok/s** vs 2177 for llama.cpp HIP (+30.4%)
- At 128K tokens: **1055 tok/s** vs 710 for llama.cpp HIP (+48.6%)
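The relative gains can be recomputed from the raw throughput figures. The short sketch below does that; the dictionary layout and the `speedup_pct` helper are illustrative names, not part of hipEngine.

```python
# Prefill throughput in tok/s on gfx1100, copied from the figures quoted above.
PREFILL = {
    "512": {"hipEngine PARO": 2718, "llama.cpp HIP": 2436, "llama.cpp Vulkan": 1817},
    "4K": {"hipEngine PARO": 2838, "llama.cpp HIP": 2177},
    "128K": {"hipEngine PARO": 1055, "llama.cpp HIP": 710},
}


def speedup_pct(new: float, baseline: float) -> float:
    """Relative throughput gain of `new` over `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0


for ctx, rates in PREFILL.items():
    ours = rates["hipEngine PARO"]
    for name, rate in rates.items():
        if name != "hipEngine PARO":
            print(f"{ctx:>4} tokens: +{speedup_pct(ours, rate):.1f}% vs {name}")
```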

The advantage grows with context length — precisely where attention and memory management optimizations have the most leverage.

**Decode:** The picture partially reverses. llama.cpp Vulkan outperforms hipEngine in decode at short context (127 tok/s vs 103 at 512 tokens). hipEngine GGUF Q4_K_S reaches 109 tok/s, slightly ahead of hipEngine PARO (103). This matters: for interactive use cases with short prefill and long decode, llama.cpp Vulkan remains the faster choice.
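To see why the decode gap can outweigh the prefill lead, here is a rough latency estimate for a single request. It assumes throughput stays constant at the 512-token figures quoted above, which is a simplification because decode speed drops as context grows. The workload (512 prompt tokens, 1000 generated tokens) and the `request_seconds` helper are hypothetical.

```python
def request_seconds(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    """Estimated wall time: prefill the prompt, then decode token by token."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# (prefill tok/s, decode tok/s) at 512 tokens on gfx1100, from the figures above.
ENGINES = {
    "hipEngine PARO": (2718, 103),
    "llama.cpp Vulkan": (1817, 127),
}

# Hypothetical chat-style workload: short prompt, long answer.
PROMPT_TOKENS, OUTPUT_TOKENS = 512, 1000

for name, (prefill_tps, decode_tps) in ENGINES.items():
    total = request_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, prefill_tps, decode_tps)
    print(f"{name}: {total:.2f} s")
```

Under these assumptions the decode term dominates, so the engine with the faster decode finishes first despite its slower prefill. The balance shifts as the prompt grows relative to the output.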

**Memory:** At 128K context, hipEngine PARO uses **22.1 GiB** vs 25.1 for hipEngine GGUF and 23.6 for llama.cpp. The near-lossless INT8 KVCache (prefill speed loss ~1.4%: 1091 → 1076 tok/s) enables running Qwen 3.6's full 256K context window at **21.96 GiB** sampled peak — under the 24 GiB of a dedicated 7900 XTX. At 256K/INT8: 670 tok/s prefill, 40 tok/s decode.
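For readers unfamiliar with INT8 KV caching, the sketch below shows one common scheme: symmetric absmax quantization with one scale per vector. This is a generic illustration, not hipEngine's actual kernel or cache layout, which the summary does not describe. The sample values are made up.

```python
def quantize_int8(values):
    """Symmetric absmax quantization: int8 codes in [-127, 127] plus one scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from int8 codes and their scale."""
    return [c * scale for c in codes]


# Made-up key vector standing in for one attention head's cached keys.
key_vector = [0.12, -1.50, 0.73, 0.004, -0.31, 2.10, -0.88, 0.45]

codes, scale = quantize_int8(key_vector)
restored = dequantize_int8(codes, scale)
max_err = max(abs(a - b) for a, b in zip(key_vector, restored))

print("codes:", codes)
print("scale:", scale)
print("max abs error:", max_err)
```

Each element then takes one byte instead of two, assuming a 16-bit baseline and ignoring the small overhead of storing scales. The rounding error per element is bounded by half the scale.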

**Strix Halo (gfx1151, iGPU):** Without dedicated optimization, hipEngine PARO already hits **1029 tok/s** at 4K tokens vs 1004 for llama.cpp HIP and 595 for Vulkan. In decode, hipEngine beats llama.cpp HIP at all context lengths (63 vs 49 tok/s at 4K). That is a notable result for an APU whose GPU shares memory with the CPU.

### Why llama.cpp Is Structurally Disadvantaged Here

llama.cpp is a generalist multi-backend engine (CUDA, Metal, Vulkan, ROCm, CPU). Its ROCm/HIP kernels are ports, not native implementations. hipEngine writes directly for gfx1100: 100+ custom kernels documented in KERNELS.md, with fused/unfused variants and CPU reference oracles. The author also publishes ROOFLINE.md — per-kernel roofline analysis — which is rare in open-source projects and signals unusual optimization rigor for a self-described "sidequest."
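As background on what a per-kernel roofline analysis checks, the sketch below applies the standard roofline model: attainable throughput is the lesser of peak compute and memory bandwidth times arithmetic intensity. The hardware peaks and kernel figures are round placeholder values chosen for illustration, not measured RDNA3 numbers and not taken from ROOFLINE.md.

```python
def attainable_flops(peak_flops: float, peak_bandwidth: float,
                     flops: float, bytes_moved: float) -> float:
    """Roofline bound in FLOP/s for a kernel doing `flops` work on `bytes_moved`."""
    intensity = flops / bytes_moved  # FLOP per byte of memory traffic
    return min(peak_flops, peak_bandwidth * intensity)


def limiting_resource(peak_flops: float, peak_bandwidth: float,
                      flops: float, bytes_moved: float) -> str:
    """Report whether the kernel sits under the memory or the compute roof."""
    intensity = flops / bytes_moved
    return "memory" if peak_bandwidth * intensity < peak_flops else "compute"


# Placeholder hardware peaks (hypothetical, for illustration only).
PEAK_FLOPS = 100e12      # 100 TFLOP/s
PEAK_BANDWIDTH = 1e12    # 1 TB/s

# Hypothetical kernels: (FLOPs performed, bytes moved).
KERNELS = {
    "decode-style GEMV": (2e9, 1e9),     # low intensity: 2 FLOP/byte
    "prefill-style GEMM": (1e12, 2e9),   # high intensity: 500 FLOP/byte
}

for name, (flops, bytes_moved) in KERNELS.items():
    bound = attainable_flops(PEAK_FLOPS, PEAK_BANDWIDTH, flops, bytes_moved)
    which = limiting_resource(PEAK_FLOPS, PEAK_BANDWIDTH, flops, bytes_moved)
    print(f"{name}: <= {bound / 1e12:.1f} TFLOP/s ({which}-bound)")
```

Comparing a kernel's measured throughput against this bound shows how much headroom remains and whether further tuning should target memory traffic or arithmetic.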

The ParoQuant ROCm port is non-trivial additional work: ParoQuant quantization (4.68bpw) takes days to run, but delivers higher density than standard GGUF Q4_K_S while being faster in prefill.

### Potential Losers

**llama.cpp on AMD**: For RDNA3 users doing long-context processing (RAG, document summarization), hipEngine offers a measurable edge. The llama.cpp AMD contributor and user base may fragment.

**Ollama/LM Studio on AMD**: These frontends rely on llama.cpp. They don't automatically inherit hipEngine's gains.

**NVIDIA users**: hipEngine has no CUDA support. It's explicitly AMD-first, which limits adoption but concentrates optimization.

### Limitations and Context

The project is described by its author as a "fun sidequest" inspired by DeepSeek-V4. Current support: Qwen 3.6 MoE and dense only. GGUF is in "good enough initial" state — slightly behind ParoQuant in speed. The author mentions Gemma 4 and StepFun 3.5 as potential next architectures. No built-in HTTP server or documented OpenAI-compatible API — this is a low-level engine, not a deployment-ready service.

The AGPLv3 license imposes constraints on commercial use: anyone who runs a modified version as a network service must make the modified source code available to that service's users.

For practitioners on AMD RDNA3 working with Qwen 3.6 and long contexts, hipEngine is the fastest option for prefill among the engines benchmarked here, while llama.cpp Vulkan keeps the lead in short-context decode.


Summary generated by Claude — human-verified