Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU
Signal: 82
Hype: 15
In three lines: Rigel empirically characterizes the Metal 4.1 tensor compute path on the Apple M4 Max. The researchers find that fp8 (E4M3) matmul2d is emulated rather than accelerated (0.94x fp16 throughput), executes on the GPU shader cores without a dedicated matrix datapath, and accumulates at fp32 precision or higher. A hand-fused GEMM+bias+GELU kernel gains 6.5% to 12.9% in the cache-resident regime.
Summary generated by Claude — human-verified