Measured cold A/B at npl=128: NVFP4 decode 619 vs Q4_K 547 (+13%), closing the gap
to vLLM (667) from ~22% to ~7%. NVFP4's FP4-MMA kernel is more bandwidth-efficient
at the thin n=128 decode shape than Q4_K int8-MMQ (which ran ~2.1x above the
bandwidth floor), so it IS the better 4-bit decode GEMM the diagnosis called for -
no multi-day Marlin-for-K-quants work needed. With NVFP4, llama.cpp on GB10 is
ahead on prefill (1209 vs 800) and within ~7% on decode. Remaining 7% = optional
FP4 kernel tuning.
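
The percentages above can be rechecked from the three decode figures alone. A
minimal sketch; the variable and helper names are illustrative, and only the
numbers come from this message. Note that ~22% is vLLM's lead measured against
Q4_K, while ~7% matches NVFP4's shortfall measured against vLLM (about 7.2%);
on the same base as the 22% figure the remaining gap is about 7.8%.

```python
# Decode throughput figures quoted in this message (units as reported above).
NVFP4 = 619
Q4_K = 547
VLLM = 667


def pct_over(a: float, b: float) -> float:
    """Percent by which a exceeds b, using b as the base."""
    return (a - b) / b * 100.0


speedup = pct_over(NVFP4, Q4_K)                 # NVFP4 gain over Q4_K
gap_before = pct_over(VLLM, Q4_K)               # vLLM lead over Q4_K
gap_after = pct_over(VLLM, NVFP4)               # vLLM lead over NVFP4
shortfall_after = (VLLM - NVFP4) / VLLM * 100.0  # NVFP4 deficit, vLLM as base

print(f"NVFP4 vs Q4_K:        +{speedup:.1f}%")
print(f"vLLM lead over Q4_K:  {gap_before:.1f}%")
print(f"vLLM lead over NVFP4: {gap_after:.1f}%")
print(f"NVFP4 below vLLM:     {shortfall_after:.1f}%")
```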
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Rigorous re-measurement on pr24423: concurrent decode is GPU-compute-bound (~96%
util, sampled), CUDA graphs ARE enabled at npl=128 (94/98 calls replay a captured
graph; n_kv padded to 256 keeps topology stable), and graphs ON vs OFF is only
+1.5% at npl=128. The earlier '20% GPU util / 170ms host' reading was a windowing
error (whole-run nsys vs decode-windowed), so no host/graph patch helps. The real
547->667 gap is the quantized DECODE GEMM: mul_mat_q (Q4_K/Q6_K) is ~68% of decode
GPU time and runs ~2.1x above the GB10 bandwidth floor (poorly tuned for the thin
n=128 shape); vLLM's Marlin int4 kernel runs closer to that floor. Lever = a
Marlin-style int4 decode kernel for K-quants (or a Marlin-friendly int4 serving
format), not host work.
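
For readers unfamiliar with the "x above the bandwidth floor" metric, a minimal
sketch follows. It ASSUMES the floor is modeled as bytes read from memory divided
by peak memory bandwidth; this message does not state how the floor was derived.
The function names are hypothetical and the inputs are placeholders, not
measurements from this work.

```python
def bandwidth_floor_s(bytes_moved: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on kernel time if it were purely memory-bandwidth-limited.

    Assumption: every byte is read once at peak bandwidth.
    """
    return bytes_moved / bandwidth_bytes_per_s


def ratio_above_floor(measured_s: float, bytes_moved: float,
                      bandwidth_bytes_per_s: float) -> float:
    """How many times slower the measured kernel is than the bandwidth floor."""
    return measured_s / bandwidth_floor_s(bytes_moved, bandwidth_bytes_per_s)


# Placeholder inputs for illustration only (NOT GB10 or llama.cpp measurements):
# 1 GB of weights read, 100 GB/s of bandwidth, 30 ms measured kernel time.
example = ratio_above_floor(measured_s=0.03,
                            bytes_moved=1.0e9,
                            bandwidth_bytes_per_s=100.0e9)
print(f"{example:.1f}x above the floor")
```

A ratio near 1.0 means the kernel is close to bandwidth-bound; a larger ratio
means time is being lost elsewhere.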
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>