quicktok: a faster tokenizer (exact and byte-identical with tiktoken) [P]
In three lines: quicktok is a BPE tokenizer written in C++ that produces byte-identical tokens to tiktoken. It encodes 2–3.6× faster than bpe-openai and 4–11× faster than tiktoken itself. Supports cl100k, o200k, GPT-OSS, Llama-3, Qwen2.5/3. Optimizations: 2-byte trie, dense caches, hand-compiled pretokenizer.
## quicktok: anatomy of a 4–11× speedup on BPE tokenization
### 1. What is actually happening
quicktok is a C++ BPE tokenizer producing byte-identical output to tiktoken while running 4–11× faster than tiktoken and 2–3.6× faster than bpe-openai, previously the fastest known alternative. Benchmarks run on Apple M1, single-thread, in MB/s, with token-for-token verification before every timing run — which removes the usual "faster but approximate" objection.
On cl100k_base, native throughput is: quicktok 121.7 MB/s (The Pile), 139.2 MB/s (Code), 71.3 MB/s (Common Crawl). bpe-openai caps at 36.6 / 38.7 / 28.9 MB/s on the same corpora. tiktoken Python runs at 13.6 / 12.8 / 12.3 MB/s. The gap between quicktok native and tiktoken Python reaches ~10.9× on Code, ~8.9× on The Pile, and ~5.8× on Common Crawl.
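The ratios implied by these figures can be recomputed directly from the quoted throughputs. The snippet below is a small arithmetic check, not a benchmark; the dictionary layout and the `speedup` helper are illustrative names, and the only data used are the MB/s numbers stated above.

```python
# Throughput in MB/s as quoted above for cl100k_base (Apple M1, single thread).
throughput = {
    "quicktok": {"The Pile": 121.7, "Code": 139.2, "Common Crawl": 71.3},
    "bpe-openai": {"The Pile": 36.6, "Code": 38.7, "Common Crawl": 28.9},
    "tiktoken (Python)": {"The Pile": 13.6, "Code": 12.8, "Common Crawl": 12.3},
}


def speedup(baseline: str, corpus: str) -> float:
    """Ratio of quicktok native throughput to a baseline on one corpus."""
    return throughput["quicktok"][corpus] / throughput[baseline][corpus]


# Print one ratio per (baseline, corpus) pair.
for baseline in ("bpe-openai", "tiktoken (Python)"):
    for corpus in throughput["quicktok"]:
        print(f"{baseline:18s} {corpus:13s} {speedup(baseline, corpus):5.2f}x")
```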
### 2. Why these gains are structural, not cosmetic
The underlying algorithm is unchanged: exact backtracking BPE, identical to bpe-openai. All gains come from data structure engineering, making them robust and reproducible:
**2-byte trie for the longest-match walk.** Instead of traversing a trie character by character, quicktok indexes directly on byte pairs. This reduces traversal depth and improves cache locality on common ASCII sequences — the bulk of English text and source code.
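To make the idea concrete, here is a minimal sketch of a trie keyed on byte pairs, written in Python for readability rather than speed. It is not quicktok's code: the node layout, the `odd` side table for odd-length tokens, and the toy vocabulary are all assumptions made for illustration. A C++ version would more plausibly use flat arrays indexed by the 16-bit key instead of dictionaries.

```python
from typing import Dict, Optional, Tuple


class PairTrieNode:
    """Trie node reached after consuming an even number of bytes."""

    __slots__ = ("token", "odd", "children")

    def __init__(self) -> None:
        self.token: Optional[int] = None  # id of the token ending exactly here
        self.odd: Dict[int, int] = {}  # one extra byte -> id of odd-length token
        self.children: Dict[int, "PairTrieNode"] = {}  # 16-bit pair key -> child


def build_pair_trie(vocab: Dict[bytes, int]) -> PairTrieNode:
    root = PairTrieNode()
    for tok, tid in vocab.items():
        node = root
        even_len = len(tok) - (len(tok) % 2)
        for k in range(0, even_len, 2):
            key = (tok[k] << 8) | tok[k + 1]  # pack two bytes into one key
            node = node.children.setdefault(key, PairTrieNode())
        if len(tok) % 2:
            node.odd[tok[-1]] = tid  # trailing single byte
        else:
            node.token = tid
    return root


def longest_match(
    root: PairTrieNode, data: bytes, start: int
) -> Optional[Tuple[int, int]]:
    """Return (token_id, length) of the longest vocab token at data[start:]."""
    node, pos, best = root, start, None
    while pos < len(data):
        # Candidate of odd length: current node plus one more byte.
        tid = node.odd.get(data[pos])
        if tid is not None:
            best = (tid, pos - start + 1)
        if pos + 1 >= len(data):
            break
        # Advance two bytes with a single lookup.
        node = node.children.get((data[pos] << 8) | data[pos + 1])
        if node is None:
            break
        pos += 2
        if node.token is not None:
            best = (node.token, pos - start)
    return best


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {b"a": 0, b"b": 1, b"c": 2, b"d": 3, b"ab": 4, b"abc": 5, b"abcd": 6}
ROOT = build_pair_trie(TOY_VOCAB)
```

Candidate lengths are visited in increasing order (1, 2, 3, ...), so the last recorded candidate is the longest match, while the walk itself takes roughly half as many node transitions as a byte-at-a-time trie.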
**Dense exactly-keyed caches for merge-validity checks.** Each BPE merge requires repeated vocabulary lookups. Replacing generic hash maps with dense caches sized exactly for the target vocabulary (cl100k = 100,277 tokens, o200k = 200,019 tokens) eliminates collisions and cuts indirect memory accesses.
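One possible reading of "dense, exactly sized" is a pair of flat tables indexed directly by token id, so a lookup is an array read instead of a hash probe. The sketch below illustrates that reading on a toy merge list; it is an assumption about the general technique, not a description of quicktok's actual cache layout, and every name in it (`parent_left`, `is_merge_of_dense`, the toy merges) is hypothetical.

```python
from array import array

# Toy vocabulary: ids 0..3 are base symbols, later ids come from merges.
BASE = [b"a", b"b", b"c", b"d"]
MERGES = [(0, 1), (2, 3), (4, 5)]  # ab -> 4, cd -> 5, abcd -> 6
N_VOCAB = len(BASE) + len(MERGES)

# Generic structure: hash map keyed on the (left, right) pair.
pair_to_id = {pair: len(BASE) + i for i, pair in enumerate(MERGES)}

# Dense structure: two arrays with exactly N_VOCAB slots, indexed by token id.
NO_PARENT = -1
parent_left = array("i", [NO_PARENT]) * N_VOCAB
parent_right = array("i", [NO_PARENT]) * N_VOCAB
for i, (left, right) in enumerate(MERGES):
    parent_left[len(BASE) + i] = left
    parent_right[len(BASE) + i] = right


def is_merge_of_dict(token: int, left: int, right: int) -> bool:
    """Hash-map version: hash the pair, probe, compare."""
    return pair_to_id.get((left, right)) == token


def is_merge_of_dense(token: int, left: int, right: int) -> bool:
    """Dense version: two direct array reads, no hashing, no collisions."""
    return parent_left[token] == left and parent_right[token] == right
```

Both functions answer the same question ("was `token` produced by merging `left` and `right`?"). The dense version trades a small fixed memory footprint, proportional to the vocabulary size, for predictable memory access.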
**Hand-compiled pretokenizer.** tiktoken uses a general regex engine (the GPT-2/GPT-4 style patterns are multi-alternative regexes with Unicode character classes and a negative lookahead). quicktok replaces this with C++ code compiled statically per vocabulary, which removes the regex engine's dispatch overhead from the hot path.
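The sketch below shows what "hand-compiling" a pretokenizer pattern means in practice, using a deliberately simplified, ASCII-only pattern in the GPT-2 style (no contractions, no Unicode classes). Both the pattern and the scanner are assumptions made for illustration; they are not the cl100k pattern and not quicktok's code. The point is only that each regex alternative becomes a fixed branch in a single left-to-right scan.

```python
import re
import string

# Simplified ASCII analogue of a GPT-2 style pattern (illustrative only).
PATTERN = re.compile(
    r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+", re.ASCII
)

LETTER, DIGIT, OTHER, SPACE = range(4)
_LETTERS = frozenset(string.ascii_letters)
_DIGITS = frozenset(string.digits)
_SPACES = frozenset(" \t\n\r\f\v")  # same set as \s under re.ASCII


def classify(c: str) -> int:
    if c in _LETTERS:
        return LETTER
    if c in _DIGITS:
        return DIGIT
    if c in _SPACES:
        return SPACE
    return OTHER


def run_end(s: str, i: int, kind: int) -> int:
    """Index just past the run of characters of class `kind` starting at i."""
    while i < len(s) and classify(s[i]) == kind:
        i += 1
    return i


def pretokenize(s: str) -> list:
    """Hand-written scanner equivalent to PATTERN.findall(s)."""
    out, i, n = [], 0, len(s)
    while i < n:
        kind = classify(s[i])
        if kind != SPACE:
            # Alternatives 1-3 without the optional leading space.
            j = run_end(s, i, kind)
        elif s[i] == " " and i + 1 < n and classify(s[i + 1]) != SPACE:
            # Alternatives 1-3 with the optional space glued to the next run.
            j = run_end(s, i + 1, classify(s[i + 1]))
        else:
            # Whitespace run: `\s+(?!\S)` leaves the last whitespace character
            # for the next piece unless the run reaches the end of the input;
            # a lone whitespace character falls through to plain `\s+`.
            j = run_end(s, i, SPACE)
            if j < n and j - i > 1:
                j -= 1
        out.append(s[i:j])
        i = j
    return out
```

A quick way to gain confidence in such a scanner is to compare it against the regex on many inputs, for example `pretokenize(text) == PATTERN.findall(text)`, which mirrors the token-for-token verification approach described earlier.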
These three optimizations target the three classic bottlenecks of a high-frequency BPE tokenizer: structure traversal, merge validation, and initial segmentation.
### 3. Scope and current limitations
quicktok supports cl100k_base (GPT-3.5/4), o200k_base (GPT-4o), GPT-OSS, Llama-3, and Qwen2.5/3, which covers many of the most widely deployed BPE vocabularies in production today. Installation is via `pip install quicktok-v1`. The Python binding reaches 77.9–83.6 MB/s, still 5.7–6.5× faster than tiktoken Python.
Known limitations: single-thread only in published benchmarks (no parallelization data), no SentencePiece/Unigram support (excludes Gemma, Mistral v1, T5 vocabularies), and the project is at `v1` stage from a solo developer — long-term maintenance is an open question.
### 4. Who loses, who gains
**Direct losers:** tiktoken (OpenAI) and bpe-openai lose their performance reference status on their own algorithm. TokenDagger, sitting at 11.1 MB/s on The Pile, is now ~11× below quicktok native. rs-bpe (Rust) at 30.9 MB/s lands 4× below.
**Immediate winners:** large-scale preprocessing pipelines, such as corpus ingestion for fine-tuning, data filtering, and token count computation over terabytes of text. At ~121.7 MB/s single-thread (the Pile figure), tokenizing 1 TB of text takes ~2.3 hours versus ~20 hours with tiktoken Python at 13.6 MB/s. For teams running these pipelines daily, that is close to an order-of-magnitude reduction in CPU time for the tokenization step.
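The wall-clock estimate follows from dividing corpus size by throughput. The helper below reproduces it under two stated assumptions: decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and sustained single-thread throughput equal to the Pile figures quoted earlier. The function name is illustrative.

```python
def hours_to_tokenize(total_bytes: float, mb_per_s: float) -> float:
    """Single-thread wall-clock hours, assuming decimal MB and steady throughput."""
    seconds = total_bytes / (mb_per_s * 1e6)
    return seconds / 3600


ONE_TB = 1e12  # decimal terabyte (assumption)

quicktok_hours = hours_to_tokenize(ONE_TB, 121.7)  # quicktok native, The Pile
tiktoken_hours = hours_to_tokenize(ONE_TB, 13.6)   # tiktoken Python, The Pile

print(f"quicktok: {quicktok_hours:.1f} h, tiktoken: {tiktoken_hours:.1f} h")
```

Real pipelines also spend time on I/O and decompression, so these numbers are a lower bound on end-to-end time rather than a prediction.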
**Secondary use case:** inference servers computing token counts server-side before dispatch (for routing, billing, context window management) can absorb this workload with less CPU — or offload it from GPU to a CPU thread without creating a bottleneck.
**Structural signal:** the fact that a solo developer can outperform OpenAI's official implementation by 4–11× on their own algorithm, in open source, indicates tiktoken was never optimized for raw throughput — it was optimized for correctness and maintainability. That is a legitimate choice, but it leaves significant performance on the table for anyone who prioritizes throughput.
Summary generated by Claude — human-verified