From faeb5b457c543754afb802193352826f76eddda0 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Sun, 21 Jun 2026 21:42:17 +0000
Subject: [PATCH] analysis: NVFP4 closes the decode gap too (547->619, ~93% of
 vLLM)

Measured npl=128 cold A/B: NVFP4 decode 619 vs Q4_K 547 (+13%), closing the gap to
vLLM (667) from ~22% to ~7%. NVFP4's FP4-MMA kernel is more bandwidth-efficient at
the thin n=128 decode shape than Q4_K int8-MMQ (which ran 2.1x above the floor), so
it IS the better int4 decode GEMM the diagnosis called for - no multi-day
Marlin-for-K-quants needed. With NVFP4, llama.cpp on GB10 is ahead on prefill
(1209 vs 800) and within ~7% on decode. Remaining 7% = optional FP4 kernel tuning.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 .../cpp/llama-cpp/paged/DECODE_OVERHEAD.md    | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
diff --git a/backend/cpp/llama-cpp/paged/DECODE_OVERHEAD.md b/backend/cpp/llama-cpp/paged/DECODE_OVERHEAD.md
index e8d7157cd..06b75ffdd 100644
--- a/backend/cpp/llama-cpp/paged/DECODE_OVERHEAD.md
+++ b/backend/cpp/llama-cpp/paged/DECODE_OVERHEAD.md
@@ -194,3 +194,22 @@ GGML_CUDA_DISABLE_GRAPHS=1 ...same...        # graphs off
 # GPU active (graphs off): nsys profile -t cuda --delay=6 --duration=8 ...
 #   nsys stats --report cuda_gpu_kern_sum  -> sum/0.516 ~= 7.72s of 8s = ~96%
 ```
+
+## UPDATE: NVFP4 closes most of the decode gap (no Marlin-for-K-quants needed)
+
+The diagnosis above said the lever is "a more bandwidth-efficient int4 decode GEMM"
+and feared a multi-day Marlin-for-K-quants kernel. But the FP4-MMA path is already
+that kernel. Measured (npl=128, cold A/B, npp=16 ntg=128):
+
+| quant | decode S_TG (t/s) | vs Q4_K | vs vLLM 667 |
+|---|---|---|---|
+| Q4_K_M | 547 (548/546) | - | 82% |
+| **NVFP4** | **619 (617/622)** | **+13%** | **93%** |
+
+NVFP4's `mul_mat_q<NVFP4>` runs closer to the GB10 bandwidth floor at the thin n=128
+decode shape than Q4_K's int8-MMQ (which ran ~2.1x above it). So shipping the model
+as NVFP4 closes the decode gap from ~22% to ~7% AND wins prefill (1209 vs Q4 767 /
+vLLM 800). Net on GB10: llama.cpp+NVFP4 is ahead on prefill (1.5x) and within ~7% on
+decode. The remaining ~7% would be incremental FP4-MMA decode-kernel tuning, NOT a
+from-scratch Marlin kernel - a much smaller, optional effort. NVFP4 is the answer to
+both the prefill and the decode gap.