Back to feed
Reddit r/LocalLLaMA·

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100

Signal
82
Hype
18
In three linesmistral.rs v0.8.2 achieves up to 2.8x faster CUDA inference than llama.cpp on Gemma 4 (dense and MoE) across GB10, B200, and H100. Reproducible results published with Q4K and eQ8_0 support, includes OpenAI-compatible server.

## mistral.rs v0.8.2: anatomy of a 2.8x gain over llama.cpp

### 1. What is measured and how

EricLBuehler publishes a reproducible benchmark report — a non-trivial distinction in an ecosystem where performance numbers are frequently unverifiable. Tests cover Gemma 4 in both variants (dense and MoE), across three distinct NVIDIA GPUs (GB10, B200, H100), with two quantization types (eQ8_0 and Q4K). The peak gain claimed is 2.8x in tokens/second versus llama.cpp, and the central claim is that mistral.rs is faster at **every point** in the release sweep — not just on a cherry-picked configuration.

Reproducibility is documented step-by-step in the GitHub report (`releases/v0.8.2/report.md`), allowing any operator with the relevant hardware to validate or challenge the numbers. This is an uncommon posture for announcements of this type.

### 2. Technical context: why llama.cpp was the baseline

llama.cpp established itself as the de facto local inference runtime since late 2023 through portability (CPU, Metal, CUDA, Vulkan), quantization maturity (GGUF/GGML), and a broad integration ecosystem (Ollama, LM Studio, Jan, etc.). On CUDA specifically, llama.cpp benefits from years of kernel optimizations and a large contributor base.

mistral.rs is a Rust runtime by EricLBuehler, initially positioned around flexibility (multi-model support, agentic pipeline, native OpenAI-compatible server) rather than raw CUDA throughput. v0.8.2 marks an explicit pivot toward GPU throughput, with a declared focus on Ampere/Hopper/Blackwell architectures.

### 3. What the numbers mean in practice

A 2.8x factor on H100 or B200 is not a marginal optimization detail. On an H100 at ~$2-3/hour in cloud, this translates directly to inference cost divided by 2.8 for equivalent throughput, or the ability to serve ~2.8x more concurrent requests on the same GPU budget. For operators deploying Gemma 4 MoE (a model with 80B+ effective parameters depending on configuration), the delta is economically significant.

Support for eQ8_0 (balanced 8-bit quantization) and Q4K (4-bit with grouping) indicates the gains are not limited to FP16/BF16 precision — they hold in the quantization regimes actually used in production.

The GB10 (Grace Blackwell Superchip, NVLink-C2C architecture) is particularly notable: it's the hardware inside DGX Spark and Project DIGITS, targeting edge/high-end workstation deployment. Being optimized on GB10 now positions mistral.rs on a rapidly growing hardware segment.

### 4. Potential losers and limits to watch

**llama.cpp and its ecosystem** are the direct losers if benchmarks hold at scale. Ollama, LM Studio, and Jan all rely on llama.cpp as their primary CUDA backend. If mistral.rs sustains this performance advantage, pressure to integrate an alternative backend will intensify — but migration is non-trivial given the GGUF format and integration dependencies.

**Limits to note:** (a) Benchmarks are published by the project author — the absence of independent third-party validation is a structural bias. (b) The sweep covers only Gemma 4; gains on Llama 3, Mistral, Qwen, or Phi are not documented in this release. (c) mistral.rs does not yet natively support GGUF format, creating adoption friction for users with existing GGUF models. (d) Ecosystem maturity (plugins, third-party integrations, documentation) remains below llama.cpp.

The integrated OpenAI-compatible server with agentic features (`mistralrs serve --agent`) is a functional differentiator, but it's also territory where vLLM and SGLang are already well-established on cloud GPU.

Bottom line: v0.8.2 is a technically serious demonstration with reproducible methodology, on relevant hardware (H100, B200, GB10), under realistic quantization conditions. The question is no longer whether mistral.rs can beat llama.cpp on CUDA, but whether the surrounding ecosystem will follow the performance.

Read source
Your take?
MistralBenchmarksCode generationOpen sourceInfrastructure

Summary generated by Claude — human-verified