mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 08:08:52 -04:00
test(llama-cpp): NVFP4-dense FP4 quality+speed eval on GB10
NVFP4-dense is producible via --tensor-type attn=nvfp4 --tensor-type ffn=nvfp4 (GGML_TYPE_NVFP4 has a full quantize path; no top-level ftype needed). Clean-from-BF16 4B PPL: NVFP4 14.31 vs Q4_K 13.66 vs MXFP4 17.42 vs BF16 13.32 - Q4_K-class, not MXFP4-class. Prefill routes onto the FP4 MMA kernel (~1.29x Q4_K on 4B, within 5% of MXFP4). It is the quality-preserving FP4 win MXFP4 was not. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
114
backend/cpp/llama-cpp/paged/NVFP4_TEST.md
Normal file
114
backend/cpp/llama-cpp/paged/NVFP4_TEST.md
Normal file
@@ -0,0 +1,114 @@
|
||||
# NVFP4-dense on DGX Spark (GB10, sm_121): is it the quality-preserving FP4 win MXFP4 wasn't?
|
||||
|
||||
Test rig: DGX Spark GB10 (sm_121), `~/llama.cpp-pr24423/build` (PR #24423, FP4 MMA + NVFP4
|
||||
kernel), wikitext-2-raw, clean BF16 source `q3-4b-bf16.gguf` (the same source used for the
|
||||
established MXFP4 / Q4_K fair test). NVFP4 and all comparison quants were produced clean from
|
||||
BF16, no imatrix.
|
||||
|
||||
## Verdict (short)
|
||||
|
||||
YES on all the load-bearing questions, with one honest caveat:
|
||||
|
||||
1. llama.cpp CAN produce an NVFP4 GGUF.
|
||||
2. NVFP4 quality is Q4_K-class, NOT MXFP4-class: +7.4% PPL vs BF16 (MXFP4 was +30.8%). It is
|
||||
slightly behind Q4_K (+4.8% relative) but in the same ballpark, not on the quality cliff.
|
||||
3. NVFP4 routes onto the FP4 MMA kernel and gets the FP4 prefill speedup: ~1.29x Q4_K on the
|
||||
4B, tracking MXFP4 to within 5% (MXFP4 hit 1.58x on the 32B; NVFP4 should track it there too).
|
||||
4. Output is coherent.
|
||||
|
||||
Bottom line: NVFP4-dense IS the quality-preserving FP4 win MXFP4 wasn't. It delivers
|
||||
essentially the full FP4 prefill speedup at roughly Q4_K quality, where MXFP4 paid a 27% quality
|
||||
tax for the same speed. LocalAI can support/recommend NVFP4-dense on Blackwell for prefill-bound
|
||||
workloads, with the caveat that it is marginally (~5%) behind Q4_K on perplexity; an imatrix-guided
|
||||
NVFP4 quant would likely close most of that remaining gap.
|
||||
|
||||
## 1. Feasibility: can llama-quantize produce an NVFP4 GGUF? YES
|
||||
|
||||
- The type exists with a full quantize path, not just a kernel:
|
||||
- `GGML_TYPE_NVFP4 = 40` (`ggml.h`), `GGML_FTYPE_MOSTLY_NVFP4 = 26`
|
||||
- `quantize_nvfp4` / `quantize_row_nvfp4_ref` / `dequantize_row_nvfp4` registered in `ggml.c`
|
||||
- type_name is `"nvfp4"`, block `QK_NVFP4` (per-16 FP8/E4M3 block scale + global scale)
|
||||
- NVFP4 is NOT a top-level `llama-quantize` ftype (no `NVFP4` entry in the allowed-types list,
|
||||
no reference in `tools/quantize/quantize.cpp` or `src/llama-quant.cpp`), BUT
|
||||
`--tensor-type name=nvfp4` resolves it: `parse_ggml_type` matches the arg against
|
||||
`ggml_type_name(...)`, which returns `"nvfp4"`. This is the exact same mechanism that produced
|
||||
MXFP4-dense.
|
||||
- Recipe used (mirrors the MXFP4-dense GGUF byte-for-byte in structure: token_embd Q8_0, all
|
||||
norms F32, all 2D attn+ffn weights to FP4):
|
||||
|
||||
```
|
||||
llama-quantize --tensor-type "attn=nvfp4" --tensor-type "ffn=nvfp4" \
|
||||
q3-4b-bf16.gguf q3-4b-nvfp4.gguf Q8_0
|
||||
```
|
||||
|
||||
Result: `q3-4b-nvfp4.gguf`, 2343.93 MiB, 4.89 BPW, ~5 s. (MXFP4-dense was 2350 MiB; same shape.)
|
||||
Every `blk.N.attn_*` and `blk.N.ffn_*` reported `converting to nvfp4`; token_embd Q8_0; norms F32.
|
||||
|
||||
The on-box `~/bench/q3-32b-nvfp4*` dirs are vLLM HF safetensors (already 4-bit), not GGUF, and
|
||||
do not feed llama.cpp - confirmed and irrelevant.
|
||||
|
||||
## 2. Quality (decisive): NVFP4 is Q4_K-class, not MXFP4-class
|
||||
|
||||
`llama-perplexity -f wiki.test.raw --chunks 50 -c 512 -ngl 99`, all clean from the same BF16 4B:
|
||||
|
||||
| Quant | PPL | vs BF16 | vs Q4_K |
|
||||
|---------|--------|----------|----------|
|
||||
| BF16 | 13.32 | - | - |
|
||||
| Q4_K_M | 13.66 | +2.6% | - |
|
||||
| NVFP4 | 14.31 | +7.4% | +4.8% |
|
||||
| MXFP4 | 17.42 | +30.8% | +27.6% |
|
||||
|
||||
(NVFP4 measured this run: Final PPL = 14.3097 +/- 0.4457.)
|
||||
|
||||
NVFP4 lands much closer to Q4_K (gap 0.65 PPL) than to MXFP4 (gap 3.11 PPL). MXFP4's finer
|
||||
sibling delivers: the two-level scaling (per-16 FP8 block scale + global scale) recovers almost
|
||||
all of the quality MXFP4's coarse per-32 E8M0 scale threw away. It is not quite Q4_K, but it is
|
||||
firmly in the "acceptable 4-bit" regime, not the lossy one.
|
||||
|
||||
## 3. Speed: NVFP4 routes onto the FP4 MMA kernel
|
||||
|
||||
No clean BF16 32B was on the box (only the vLLM NVFP4 safetensors and the Q4_K/MXFP4 32B GGUFs),
|
||||
so per the brief this is the 4B speed signal - a 3-way cold A/B on the SAME 4B model, 45 s
|
||||
cooldowns between runs (`-npp 512 -ntg 128 -npl 8,32,64 -b 2048 -ub 2048 -ngl 99`):
|
||||
|
||||
Prefill S_PP (t/s):
|
||||
|
||||
| B | Q4_K | NVFP4 | MXFP4 | NVFP4 / Q4_K | NVFP4 / MXFP4 |
|
||||
|-----|--------|--------|--------|--------------|---------------|
|
||||
| 8 | 4862 | 6313 | 6602 | 1.30x | 0.96x |
|
||||
| 32 | 5020 | 6497 | 6836 | 1.29x | 0.95x |
|
||||
| 64 | 5031 | 6490 | 6831 | 1.29x | 0.95x |
|
||||
|
||||
- NVFP4 prefill is within ~5% of MXFP4 at every batch size -> both land on the same FP4 MMA
|
||||
kernel. NVFP4 does NOT fall back to a slow path.
|
||||
- NVFP4 beats Q4_K's int8-MMQ prefill by ~1.29x on the 4B. The established 32B figures were
|
||||
Q4_K S_PP ~767 and MXFP4 ~1209 (1.58x); since NVFP4 tracks MXFP4 to within 5%, NVFP4 on the
|
||||
32B should likewise approach ~1.5x. (The 4B shows a smaller multiplier than the 32B because a
|
||||
smaller model spends proportionally less time in the matmul the FP4 kernel accelerates.)
|
||||
- Token-gen (S_TG) is comparable across all three (memory-bound), as expected.
|
||||
|
||||
## 4. Coherence
|
||||
|
||||
`llama-simple` (llama-cli hangs - avoided), NVFP4 4B:
|
||||
- "The capital of France is" -> "...Paris. ...Germany is in Berlin. ...Italy is in Rome.
|
||||
...Spain is in Madrid. ...Netherlands is in Amsterdam." (all correct)
|
||||
- "Q: What is 17 plus 25? A:" -> "42." (correct)
|
||||
|
||||
Coherent and factually accurate.
|
||||
|
||||
## Recommendation for LocalAI on Blackwell
|
||||
|
||||
Support and recommend NVFP4-dense as the FP4 prefill option on Blackwell (sm_120/121), produced
|
||||
via `--tensor-type attn=nvfp4 --tensor-type ffn=nvfp4` over a BF16 source (token_embd Q8_0,
|
||||
norms F32). It gives ~the full FP4 prefill speedup (FP4 MMA kernel, ~1.3x Q4_K on 4B and
|
||||
expected ~1.5x on larger models) at roughly Q4_K quality (+7.4% PPL vs BF16). This is the win
|
||||
MXFP4 failed to deliver: MXFP4 paid a +30.8% quality tax for the same speed and was rejected.
|
||||
|
||||
Caveats / follow-ups:
|
||||
- NVFP4 is still ~4.8% behind Q4_K on PPL. For quality-first deployments where the prefill win
|
||||
does not matter, Q4_K_M remains the better pick.
|
||||
- These NVFP4/Q4_K numbers are clean (no imatrix). An imatrix-guided NVFP4 quant is the obvious
|
||||
next step and would likely close most of the remaining gap to Q4_K - worth measuring before a
|
||||
blanket recommendation.
|
||||
- A direct 32B NVFP4-vs-Q4_K speed run (needs a clean BF16 32B GGUF, not on the box) would
|
||||
confirm the projected ~1.5x; the 4B signal plus the MXFP4-tracking already make this very likely.
|
||||
Reference in New Issue
Block a user