Files
LocalAI/backend/cpp/llama-cpp
Ettore Di Giacinto ea634ee958 docs(paged): scope track B - FP4-MMA decode-GEMM roofline + parity go/no-go
Roofline at the decode batch shape (M=128, NVFP4 weights) on GB10 (sm_121):
the dense weight-read floor (~1,940 tok/s) and MoE floor (~1,590 tok/s) sit
4-6x above vLLM's 391/811, so 273 GB/s is NOT the wall. At FP4 peak the GEMM is
bandwidth-bound (crossover M*~611 >> 128); at the kernel's ~3% achieved FP4
efficiency it is compute-bound by its own inefficiency (471 ms vs a 66 ms floor).

Verdict: dense decode parity is plausibly reachable via a tuned FP4-MMA decode
M-tile (track B) + fused act-quant (track A), landing 376-394 tok/s = 90-103% of
vLLM 391, but only at the top of the demonstrated GB10 FP4 envelope (~17-21%) and
with no margin (occupancy wall is the binding constraint, not bandwidth). MoE
parity is NOT reachable from the GEMM alone (ceiling ~60-76% of 811): its floor
is the hardest grouped-GEMM regime and ~24% of its step is non-GEMM work outside
track B. GO (conditional) for dense, PARTIAL for MoE. Build-ready phased plan
included; tune the existing block_fp4_mmq path, not a W4A16 rewrite.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-24 14:09:41 +00:00
..
2026-04-12 08:51:30 +02:00