Tiny LLM Benchmark: Jetson Orin Nano Super 8GB - Four Power Modes × Eight Models
In three lines: Comprehensive benchmark of 8 tiny LLMs (135M–1.2B) on Jetson Orin Nano Super 8GB with llama.cpp CUDA across 4 power modes (7W–MAXN). 25W mode is optimal: SmolLM2-135M achieves 165 tok/s and 22.6 tok/J; LFM2.5-1.2B is best in the ~1B class (54.1 tok/s). 384 benchmark cells, raw datasets published.
## Jetson Orin Nano Super: 384 benchmark cells mapping energy efficiency of small LLMs
### What this benchmark actually contributes
Rigorous edge hardware benchmarks are scarce. This one covers 8 models and 4 power modes across a grid of prompt/generation combinations, 384 benchmark cells in total, with the raw measurements published on HuggingFace. That's directly usable by anyone deploying local inference on Jetson or constrained Ampere hardware.
The $250 Jetson Orin Nano Super 8GB is currently the entry-level reference for NVIDIA edge inference. Its unified CPU+GPU memory architecture (8 GB LPDDR5 at 102 GB/s, no VRAM split) differentiates it from discrete GPUs: all memory is accessible without PCIe transfers, which benefits models that fit in unified RAM. Before this benchmark, public data on this device was limited to isolated tests with no systematic power mode comparison.
### The central finding: 25W is the Pareto point, not MAXN
This is the most actionable conclusion. MAXN mode (maximum clocks, uncapped consumption) delivers 8–35% fewer tok/J than 25W mode depending on the model. Pushing the SoC to maximum measurably degrades energy efficiency. The 25W mode delivers 36–47% more throughput than 15W while maintaining 3–26% better energy efficiency. Junction temperature stayed ≤ 73°C across all runs with active cooling — no thermal throttling observed.
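
One way to make the "Pareto point" framing concrete is to check which power modes are non-dominated on throughput and energy efficiency. The sketch below is illustrative only: the per-mode numbers are placeholders chosen to fall inside the ranges quoted above, not values from the published dataset, and `ModeResult` and `pareto_front` are hypothetical names.

```python
from typing import NamedTuple


class ModeResult(NamedTuple):
    mode: str
    tok_per_s: float  # generation throughput
    tok_per_j: float  # energy efficiency


def pareto_front(results: list[ModeResult]) -> list[ModeResult]:
    """Return the results that no other result matches or beats on both metrics."""
    front = []
    for r in results:
        dominated = any(
            o.tok_per_s >= r.tok_per_s
            and o.tok_per_j >= r.tok_per_j
            and (o.tok_per_s > r.tok_per_s or o.tok_per_j > r.tok_per_j)
            for o in results
        )
        if not dominated:
            front.append(r)
    return front


# Placeholder numbers for a single model, NOT taken from the benchmark data.
example = [
    ModeResult("7W", 40.0, 12.0),
    ModeResult("15W", 80.0, 15.0),
    ModeResult("25W", 115.0, 17.0),
    ModeResult("MAXN", 120.0, 13.0),
]

front_modes = [r.mode for r in pareto_front(example)]
print(front_modes)
```

With numbers shaped like this, 7W and 15W drop off the front because 25W beats them on both axes. MAXN stays on it only if it is actually faster than 25W, and then only by trading a large share of efficiency for a small throughput gain.
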
For battery-powered or fixed-energy-budget deployments (drones, robots, industrial sensors), this directly changes the recommended system configuration.
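
For a fixed energy budget, a tok/J figure translates directly into a token count. The following is a minimal sketch, assuming a hypothetical 10 Wh of battery capacity reserved entirely for generation and using the 22.6 tok/J and 5.26 tok/J figures quoted in this summary. `tokens_for_budget` is an illustrative helper, and a real deployment would also have to account for idle draw and prompt processing.

```python
JOULES_PER_WATT_HOUR = 3600.0


def tokens_for_budget(budget_wh: float, tok_per_j: float) -> float:
    """Tokens that can be generated from a given energy budget."""
    return budget_wh * JOULES_PER_WATT_HOUR * tok_per_j


budget_wh = 10.0  # hypothetical energy reserved for inference
small = tokens_for_budget(budget_wh, 22.6)  # SmolLM2-135M at 25W
large = tokens_for_budget(budget_wh, 5.26)  # LFM2.5-1.2B
print(f"{small:,.0f} vs {large:,.0f} tokens")
```

Under these assumptions the same budget yields roughly 814,000 tokens from the 135M model versus roughly 189,000 from the 1.2B model.
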
### Model rankings: size vs. efficiency
**Sub-1B class:**

- SmolLM2-135M: 165 tok/s, 22.6 tok/J, 101 MB, ~5.4W at 25W. Best energy efficiency in the entire suite. For classification, simple extraction, or short completion tasks, this is the rational choice on this hardware.
- LFM2.5-350M: 120 tok/s in 219 MB. Matches SmolLM2-360M (369 MB) at roughly 59% of the memory footprint, a direct architectural advantage of LFMs (Liquid Foundation Models) over classic transformers at this scale.

**~1B class (ctx=2048, gen=256):**

- LFM2.5-1.2B: 54.1 tok/s, 5.26 tok/J, 698 MB, 8.46W. Best throughput and best raw tok/J in the class.
- Gemma3-1B: slightly ahead on total tok/J (118.5 vs 116.2) thanks to lower power draw (6.87W vs 8.46W) compensating for lower throughput. If the constraint is strictly energy rather than latency, Gemma3-1B wins.
- Llama3.2-1B: 47.0 tok/s, 4.67 tok/J, third on both key metrics in its class.

### The implicit losers
**Llama3.2-1B** is clearly dominated in its class by LFM2.5-1.2B on throughput (+15%) and tok/J (+12.6%). For Meta, whose 3.2 series explicitly targeted edge deployment, this positioning is uncomfortable against alternative architectures.
**MAXN mode as a default** in Jetson tutorials is called into question. Many guides recommend MAXN for "best performance" without mentioning the efficiency cost; this benchmark shows that is a mistake for LLM inference workloads.
**Classic dense transformers** in the 350M–400M range face pressure from LFMs: LFM2.5-350M at 219 MB vs SmolLM2-360M at 369 MB for equivalent performance represents a 0.59× memory ratio.
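
The percentages and the memory ratio in this section follow from the quoted figures. A quick check, using only numbers stated in this summary (`pct_gain` is an illustrative helper):

```python
def pct_gain(new: float, base: float) -> float:
    """Relative gain of `new` over `base`, in percent."""
    return (new / base - 1.0) * 100.0


# LFM2.5-1.2B vs Llama3.2-1B
throughput_gain = pct_gain(54.1, 47.0)
efficiency_gain = pct_gain(5.26, 4.67)

# LFM2.5-350M vs SmolLM2-360M model size in MB
memory_ratio = 219 / 369

print(f"throughput: +{throughput_gain:.1f}%")   # +15.1%
print(f"tok/J:      +{efficiency_gain:.1f}%")   # +12.6%
print(f"memory:     {memory_ratio:.2f}x")       # 0.59x
```
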
### Limitations and what's missing
The benchmark measures generation throughput only (output tokens/s and tok/J). It does not evaluate output quality — no MMLU, HellaSwag, or code benchmark scores. A slower but more accurate model may still be preferable depending on the use case. The aiperf methodology with 20 requests per combination is sound but doesn't test concurrency scenarios (multiple simultaneous requests), which matter for edge server deployments.
The raw HuggingFace data allows the community to extend the analysis — that's the primary added value of this publication over closed benchmarks.
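
As an example of the kind of follow-up analysis the raw data makes possible, the sketch below averages tok/J per model and power mode. The record layout (`model`, `power_mode`, `tok_per_j`) and the sample values are assumptions for illustration only; the published dataset's actual column names should be checked before adapting it.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows standing in for per-cell records from the dataset.
rows = [
    {"model": "model-a", "power_mode": "15W", "tok_per_j": 14.0},
    {"model": "model-a", "power_mode": "15W", "tok_per_j": 16.0},
    {"model": "model-a", "power_mode": "25W", "tok_per_j": 17.0},
    {"model": "model-a", "power_mode": "25W", "tok_per_j": 19.0},
]


def mean_efficiency(records):
    """Average tok/J for each (model, power_mode) pair."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["model"], rec["power_mode"])].append(rec["tok_per_j"])
    return {key: mean(values) for key, values in groups.items()}


summary = mean_efficiency(rows)
for (model, mode), value in sorted(summary.items()):
    print(model, mode, value)
```

The same grouping extends naturally to other dimensions, such as prompt length or generation length, once the real field names are known.
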
Summary generated by Claude — human-verified