# NVFP4-dense on DGX Spark (GB10, sm_121): is it the quality-preserving FP4 win MXFP4 wasn't? Test rig: DGX Spark GB10 (sm_121), `~/llama.cpp-pr24423/build` (PR #24423, FP4 MMA + NVFP4 kernel), wikitext-2-raw, clean BF16 source `q3-4b-bf16.gguf` (the same source used for the established MXFP4 / Q4_K fair test). NVFP4 and all comparison quants were produced clean from BF16, no imatrix. ## Verdict (short) YES on all the load-bearing questions, with one honest caveat: 1. llama.cpp CAN produce an NVFP4 GGUF. 2. NVFP4 quality is Q4_K-class, NOT MXFP4-class: +7.4% PPL vs BF16 (MXFP4 was +30.8%). It is slightly behind Q4_K (+4.8% relative) but in the same ballpark, not on the quality cliff. 3. NVFP4 routes onto the FP4 MMA kernel and gets the FP4 prefill speedup: ~1.29x Q4_K on the 4B, tracking MXFP4 to within 5% (MXFP4 hit 1.58x on the 32B; NVFP4 should track it there too). 4. Output is coherent. Bottom line: NVFP4-dense IS the quality-preserving FP4 win MXFP4 wasn't. It delivers essentially the full FP4 prefill speedup at roughly Q4_K quality, where MXFP4 paid a 27% quality tax for the same speed. LocalAI can support/recommend NVFP4-dense on Blackwell for prefill-bound workloads, with the caveat that it is marginally (~5%) behind Q4_K on perplexity; an imatrix-guided NVFP4 quant would likely close most of that remaining gap. ## 1. Feasibility: can llama-quantize produce an NVFP4 GGUF? YES - The type exists with a full quantize path, not just a kernel: - `GGML_TYPE_NVFP4 = 40` (`ggml.h`), `GGML_FTYPE_MOSTLY_NVFP4 = 26` - `quantize_nvfp4` / `quantize_row_nvfp4_ref` / `dequantize_row_nvfp4` registered in `ggml.c` - type_name is `"nvfp4"`, block `QK_NVFP4` (per-16 FP8/E4M3 block scale + global scale) - NVFP4 is NOT a top-level `llama-quantize` ftype (no `NVFP4` entry in the allowed-types list, no reference in `tools/quantize/quantize.cpp` or `src/llama-quant.cpp`), BUT `--tensor-type name=nvfp4` resolves it: `parse_ggml_type` matches the arg against `ggml_type_name(...)`, which returns `"nvfp4"`. This is the exact same mechanism that produced MXFP4-dense. - Recipe used (mirrors the MXFP4-dense GGUF byte-for-byte in structure: token_embd Q8_0, all norms F32, all 2D attn+ffn weights to FP4): ``` llama-quantize --tensor-type "attn=nvfp4" --tensor-type "ffn=nvfp4" \ q3-4b-bf16.gguf q3-4b-nvfp4.gguf Q8_0 ``` Result: `q3-4b-nvfp4.gguf`, 2343.93 MiB, 4.89 BPW, ~5 s. (MXFP4-dense was 2350 MiB; same shape.) Every `blk.N.attn_*` and `blk.N.ffn_*` reported `converting to nvfp4`; token_embd Q8_0; norms F32. The on-box `~/bench/q3-32b-nvfp4*` dirs are vLLM HF safetensors (already 4-bit), not GGUF, and do not feed llama.cpp - confirmed and irrelevant. ## 2. Quality (decisive): NVFP4 is Q4_K-class, not MXFP4-class `llama-perplexity -f wiki.test.raw --chunks 50 -c 512 -ngl 99`, all clean from the same BF16 4B: | Quant | PPL | vs BF16 | vs Q4_K | |---------|--------|----------|----------| | BF16 | 13.32 | - | - | | Q4_K_M | 13.66 | +2.6% | - | | NVFP4 | 14.31 | +7.4% | +4.8% | | MXFP4 | 17.42 | +30.8% | +27.6% | (NVFP4 measured this run: Final PPL = 14.3097 +/- 0.4457.) NVFP4 lands much closer to Q4_K (gap 0.65 PPL) than to MXFP4 (gap 3.11 PPL). MXFP4's finer sibling delivers: the two-level scaling (per-16 FP8 block scale + global scale) recovers almost all of the quality MXFP4's coarse per-32 E8M0 scale threw away. It is not quite Q4_K, but it is firmly in the "acceptable 4-bit" regime, not the lossy one. ## 3. Speed: NVFP4 routes onto the FP4 MMA kernel No clean BF16 32B was on the box (only the vLLM NVFP4 safetensors and the Q4_K/MXFP4 32B GGUFs), so per the brief this is the 4B speed signal - a 3-way cold A/B on the SAME 4B model, 45 s cooldowns between runs (`-npp 512 -ntg 128 -npl 8,32,64 -b 2048 -ub 2048 -ngl 99`): Prefill S_PP (t/s): | B | Q4_K | NVFP4 | MXFP4 | NVFP4 / Q4_K | NVFP4 / MXFP4 | |-----|--------|--------|--------|--------------|---------------| | 8 | 4862 | 6313 | 6602 | 1.30x | 0.96x | | 32 | 5020 | 6497 | 6836 | 1.29x | 0.95x | | 64 | 5031 | 6490 | 6831 | 1.29x | 0.95x | - NVFP4 prefill is within ~5% of MXFP4 at every batch size -> both land on the same FP4 MMA kernel. NVFP4 does NOT fall back to a slow path. - NVFP4 beats Q4_K's int8-MMQ prefill by ~1.29x on the 4B. The established 32B figures were Q4_K S_PP ~767 and MXFP4 ~1209 (1.58x); since NVFP4 tracks MXFP4 to within 5%, NVFP4 on the 32B should likewise approach ~1.5x. (The 4B shows a smaller multiplier than the 32B because a smaller model spends proportionally less time in the matmul the FP4 kernel accelerates.) - Token-gen (S_TG) is comparable across all three (memory-bound), as expected. ## 4. Coherence `llama-simple` (llama-cli hangs - avoided), NVFP4 4B: - "The capital of France is" -> "...Paris. ...Germany is in Berlin. ...Italy is in Rome. ...Spain is in Madrid. ...Netherlands is in Amsterdam." (all correct) - "Q: What is 17 plus 25? A:" -> "42." (correct) Coherent and factually accurate. ## Recommendation for LocalAI on Blackwell Support and recommend NVFP4-dense as the FP4 prefill option on Blackwell (sm_120/121), produced via `--tensor-type attn=nvfp4 --tensor-type ffn=nvfp4` over a BF16 source (token_embd Q8_0, norms F32). It gives ~the full FP4 prefill speedup (FP4 MMA kernel, ~1.3x Q4_K on 4B and expected ~1.5x on larger models) at roughly Q4_K quality (+7.4% PPL vs BF16). This is the win MXFP4 failed to deliver: MXFP4 paid a +30.8% quality tax for the same speed and was rejected. Caveats / follow-ups: - NVFP4 is still ~4.8% behind Q4_K on PPL. For quality-first deployments where the prefill win does not matter, Q4_K_M remains the better pick. - These NVFP4/Q4_K numbers are clean (no imatrix). An imatrix-guided NVFP4 quant is the obvious next step and would likely close most of the remaining gap to Q4_K - worth measuring before a blanket recommendation. - A direct 32B NVFP4-vs-Q4_K speed run (needs a clean BF16 32B GGUF, not on the box) would confirm the projected ~1.5x; the 4B signal plus the MXFP4-tracking already make this very likely.