analysis: NVFP4 closes the decode gap too (547->619, ~93% of vLLM)

Measured npl=128 cold A/B: NVFP4 decode 619 vs Q4_K 547 (+13%), closing the gap to
vLLM (667) from ~22% to ~7%. NVFP4's FP4-MMA kernel is more bandwidth-efficient at
the thin n=128 decode shape than Q4_K int8-MMQ (which ran 2.1x above the floor), so
it IS the better int4 decode GEMM the diagnosis called for - no multi-day
Marlin-for-K-quants needed. With NVFP4, llama.cpp on GB10 is ahead on prefill
(1209 vs 800) and within ~7% on decode. Remaining 7% = optional FP4 kernel tuning.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2026-06-21 21:42:17 +00:00
parent 6e0b910210
commit faeb5b457c

View File

@@ -194,3 +194,22 @@ GGML_CUDA_DISABLE_GRAPHS=1 ...same... # graphs off
# GPU active (graphs off): nsys profile -t cuda --delay=6 --duration=8 ...
# nsys stats --report cuda_gpu_kern_sum -> sum/0.516 ~= 7.72s of 8s = ~96%
```
## UPDATE: NVFP4 closes most of the decode gap (no Marlin-for-K-quants needed)
The diagnosis above said the lever is "a more bandwidth-efficient int4 decode GEMM"
and feared a multi-day Marlin-for-K-quants kernel. But the FP4-MMA path is already
that kernel. Measured (npl=128, cold A/B, npp=16 ntg=128):
| quant | decode S_TG (t/s) | vs Q4_K | vs vLLM 667 |
|---|---|---|---|
| Q4_K_M | 547 (548/546) | - | 82% |
| **NVFP4** | **619 (617/622)** | **+13%** | **93%** |
NVFP4's `mul_mat_q<NVFP4>` runs closer to the GB10 bandwidth floor at the thin n=128
decode shape than Q4_K's int8-MMQ (which ran ~2.1x above it). So shipping the model
as NVFP4 closes the decode gap from ~22% to ~7% AND wins prefill (1209 vs Q4 767 /
vLLM 800). Net on GB10: llama.cpp+NVFP4 is ahead on prefill (1.5x) and within ~7% on
decode. The remaining ~7% would be incremental FP4-MMA decode-kernel tuning, NOT a
from-scratch Marlin kernel - a much smaller, optional effort. NVFP4 is the answer to
both the prefill and the decode gap.