mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 08:08:52 -04:00

Files

Ettore Di Giacinto aaf7b4112e test(llama-cpp): NVFP4-dense FP4 quality+speed eval on GB10

NVFP4-dense is producible via --tensor-type attn=nvfp4 --tensor-type ffn=nvfp4
(GGML_TYPE_NVFP4 has a full quantize path; no top-level ftype needed). Clean-from-BF16
4B PPL: NVFP4 14.31 vs Q4_K 13.66 vs MXFP4 17.42 vs BF16 13.32 - Q4_K-class, not
MXFP4-class. Prefill routes onto the FP4 MMA kernel (~1.29x Q4_K on 4B, within 5% of
MXFP4). It is the quality-preserving FP4 win MXFP4 was not.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-21 18:44:57 +00:00

6.1 KiB

Raw Blame History

NVFP4-dense on DGX Spark (GB10, sm_121): is it the quality-preserving FP4 win MXFP4 wasn't?

Test rig: DGX Spark GB10 (sm_121), ~/llama.cpp-pr24423/build (PR #24423, FP4 MMA + NVFP4 kernel), wikitext-2-raw, clean BF16 source q3-4b-bf16.gguf (the same source used for the established MXFP4 / Q4_K fair test). NVFP4 and all comparison quants were produced clean from BF16, no imatrix.

Verdict (short)

YES on all the load-bearing questions, with one honest caveat:

llama.cpp CAN produce an NVFP4 GGUF.
NVFP4 quality is Q4_K-class, NOT MXFP4-class: +7.4% PPL vs BF16 (MXFP4 was +30.8%). It is slightly behind Q4_K (+4.8% relative) but in the same ballpark, not on the quality cliff.
NVFP4 routes onto the FP4 MMA kernel and gets the FP4 prefill speedup: ~1.29x Q4_K on the 4B, tracking MXFP4 to within 5% (MXFP4 hit 1.58x on the 32B; NVFP4 should track it there too).
Output is coherent.

Bottom line: NVFP4-dense IS the quality-preserving FP4 win MXFP4 wasn't. It delivers essentially the full FP4 prefill speedup at roughly Q4_K quality, where MXFP4 paid a 27% quality tax for the same speed. LocalAI can support/recommend NVFP4-dense on Blackwell for prefill-bound workloads, with the caveat that it is marginally (~5%) behind Q4_K on perplexity; an imatrix-guided NVFP4 quant would likely close most of that remaining gap.

1. Feasibility: can llama-quantize produce an NVFP4 GGUF? YES

The type exists with a full quantize path, not just a kernel:
- GGML_TYPE_NVFP4 = 40 (ggml.h), GGML_FTYPE_MOSTLY_NVFP4 = 26
- quantize_nvfp4 / quantize_row_nvfp4_ref / dequantize_row_nvfp4 registered in ggml.c
- type_name is "nvfp4", block QK_NVFP4 (per-16 FP8/E4M3 block scale + global scale)
NVFP4 is NOT a top-level llama-quantize ftype (no NVFP4 entry in the allowed-types list, no reference in tools/quantize/quantize.cpp or src/llama-quant.cpp), BUT --tensor-type name=nvfp4 resolves it: parse_ggml_type matches the arg against ggml_type_name(...), which returns "nvfp4". This is the exact same mechanism that produced MXFP4-dense.
Recipe used (mirrors the MXFP4-dense GGUF byte-for-byte in structure: token_embd Q8_0, all norms F32, all 2D attn+ffn weights to FP4):
```
llama-quantize --tensor-type "attn=nvfp4" --tensor-type "ffn=nvfp4" \
               q3-4b-bf16.gguf q3-4b-nvfp4.gguf Q8_0
```
Result: q3-4b-nvfp4.gguf, 2343.93 MiB, 4.89 BPW, ~5 s. (MXFP4-dense was 2350 MiB; same shape.) Every blk.N.attn_* and blk.N.ffn_* reported converting to nvfp4; token_embd Q8_0; norms F32.

The on-box ~/bench/q3-32b-nvfp4* dirs are vLLM HF safetensors (already 4-bit), not GGUF, and do not feed llama.cpp - confirmed and irrelevant.

2. Quality (decisive): NVFP4 is Q4_K-class, not MXFP4-class

llama-perplexity -f wiki.test.raw --chunks 50 -c 512 -ngl 99, all clean from the same BF16 4B:

Quant	PPL	vs BF16	vs Q4_K
BF16	13.32	-	-
Q4_K_M	13.66	+2.6%	-
NVFP4	14.31	+7.4%	+4.8%
MXFP4	17.42	+30.8%	+27.6%

(NVFP4 measured this run: Final PPL = 14.3097 +/- 0.4457.)

NVFP4 lands much closer to Q4_K (gap 0.65 PPL) than to MXFP4 (gap 3.11 PPL). MXFP4's finer sibling delivers: the two-level scaling (per-16 FP8 block scale + global scale) recovers almost all of the quality MXFP4's coarse per-32 E8M0 scale threw away. It is not quite Q4_K, but it is firmly in the "acceptable 4-bit" regime, not the lossy one.

3. Speed: NVFP4 routes onto the FP4 MMA kernel

No clean BF16 32B was on the box (only the vLLM NVFP4 safetensors and the Q4_K/MXFP4 32B GGUFs), so per the brief this is the 4B speed signal - a 3-way cold A/B on the SAME 4B model, 45 s cooldowns between runs (-npp 512 -ntg 128 -npl 8,32,64 -b 2048 -ub 2048 -ngl 99):

Prefill S_PP (t/s):

B	Q4_K	NVFP4	MXFP4	NVFP4 / Q4_K	NVFP4 / MXFP4
8	4862	6313	6602	1.30x	0.96x
32	5020	6497	6836	1.29x	0.95x
64	5031	6490	6831	1.29x	0.95x

NVFP4 prefill is within ~5% of MXFP4 at every batch size -> both land on the same FP4 MMA kernel. NVFP4 does NOT fall back to a slow path.
NVFP4 beats Q4_K's int8-MMQ prefill by ~1.29x on the 4B. The established 32B figures were Q4_K S_PP ~767 and MXFP4 ~1209 (1.58x); since NVFP4 tracks MXFP4 to within 5%, NVFP4 on the 32B should likewise approach ~1.5x. (The 4B shows a smaller multiplier than the 32B because a smaller model spends proportionally less time in the matmul the FP4 kernel accelerates.)
Token-gen (S_TG) is comparable across all three (memory-bound), as expected.

4. Coherence

llama-simple (llama-cli hangs - avoided), NVFP4 4B:

"The capital of France is" -> "...Paris. ...Germany is in Berlin. ...Italy is in Rome. ...Spain is in Madrid. ...Netherlands is in Amsterdam." (all correct)
"Q: What is 17 plus 25? A:" -> "42." (correct)

Coherent and factually accurate.

Recommendation for LocalAI on Blackwell

Support and recommend NVFP4-dense as the FP4 prefill option on Blackwell (sm_120/121), produced via --tensor-type attn=nvfp4 --tensor-type ffn=nvfp4 over a BF16 source (token_embd Q8_0, norms F32). It gives ~the full FP4 prefill speedup (FP4 MMA kernel, ~1.3x Q4_K on 4B and expected ~1.5x on larger models) at roughly Q4_K quality (+7.4% PPL vs BF16). This is the win MXFP4 failed to deliver: MXFP4 paid a +30.8% quality tax for the same speed and was rejected.

Caveats / follow-ups:

NVFP4 is still ~4.8% behind Q4_K on PPL. For quality-first deployments where the prefill win does not matter, Q4_K_M remains the better pick.
These NVFP4/Q4_K numbers are clean (no imatrix). An imatrix-guided NVFP4 quant is the obvious next step and would likely close most of the remaining gap to Q4_K - worth measuring before a blanket recommendation.
A direct 32B NVFP4-vs-Q4_K speed run (needs a clean BF16 32B GGUF, not on the box) would confirm the projected ~1.5x; the 4B signal plus the MXFP4-tracking already make this very likely.

6.1 KiB Raw Blame History