LocalAI/backend/cpp/llama-cpp-localai-paged/docs/PREFILL_GEMM_RESULTS.md
Ettore Di Giacinto 000705321f feat(paged): FP4 prefill large-M dequant->bf16 cuBLAS scaffold (patch 0033, default-off)
Option (a) of PREFILL_GEMM_SCOPE.md: route large-M (prefill) NVFP4 dense weight
GEMMs off the decode-tuned FP4-MMQ kernel onto the dequant->bf16 cuBLAS (nvjet)
tensor-core path, wired via an M-threshold in ggml_cuda_should_use_mmq. Lands the
validated, bit-exact-gated mechanism and records the honest GB10 result: it is a
regression, so it ships default-off (== stock), mirroring the patch-0017
default-off discipline.

Three-edit scaffold (no new kernel): should_use_mmq routes NVFP4+Blackwell+dense
M>LLAMA_FP4_PREFILL_M to cuBLAS; op_mul_mat_cublas gains an NVFP4 branch that
dequants the FP4 weights to a transient bf16 pool buffer (not cached - stays
FP4-resident) and runs cublasGemmEx CUDA_R_16BF/COMPUTE_32F; ggml_get_to_bf16_cuda
gains the NVFP4 case.

Bit-exact gate PASS (benign): test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID
806/806; the forced path (LLAMA_FP4_PREFILL_M=64) is green CUDA-vs-CPU at NVFP4
large-M shapes; greedy md5 on q36-27b is byte-identical to FP4-MMQ both for
short prefill (5951a5b4, decode untouched) and for a >threshold prefill that
exercises the bf16 path (5f3967df - no greedy argmax flips).

Performance REGRESSES on GB10 (S_PP, q36-27b dense, A/B via env): M=512 958.99
-> 486.65 (-49%), M=1024 1013.65 -> 587.27 (-42%), M=2048 918.46 -> 649.42
(-29%). The scope premise (FP4-MMQ ~3% of FP4 peak at large M) is false here:
FP4-MMQ beats bf16-cuBLAS because bf16 peak is ~half FP4 peak and the per-step
weight dequant + 4x bf16 weight traffic (~8x total vs the FP4 read) dominate,
only partially amortizing as M grows. Default-off keeps stock S_PP (966.98).

Phase 2 (MoE grouped large-M) not implemented: it inherits the same
bf16-peak<FP4-peak ceiling plus a per-expert dequant, so grouped bf16-cuBLAS
would regress for the same reason; a real prefill GEMM win needs option (b), a
native FP4-MMA large-M kernel. Full A/B in docs/PREFILL_GEMM_RESULTS.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 17:42:15 +00:00


PREFILL_GEMM_RESULTS - option (a) dequant->bf16 cuBLAS, measured on GB10

Companion to PREFILL_GEMM_SCOPE.md. This records the GPU A/B for the #1 prefill lever (route large-M NVFP4 dense GEMMs off FP4-MMQ onto dequant->bf16 cuBLAS / nvjet). Shipped as patch 0033, default-off because the measured result is a regression on this hardware.

Hardware: NVIDIA GB10 (sm_121), CUDA 13.0. Backend pin 9d5d882d. Models: q36-27b-nvfp4.gguf (dense), q36-35b-a3b-nvfp4.gguf (MoE). Binary: build-cuda/bin/llama-batched-bench -fa on -ngl 99, LLAMA_KV_PAGED=1. A/B is a single build toggled by LLAMA_FP4_PREFILL_M (0 = MMQ baseline, >0 = route prefill M>threshold to bf16 cuBLAS), so it isolates exactly this lever.
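The toggle described above can be modelled in a few lines. This is a minimal sketch of the routing rule only, not the code in `ggml_cuda_should_use_mmq`: the Python function names, the boolean parameters, and the handling of an unset or unparsable value (treated as 0, i.e. lever off) are assumptions made for illustration.

```python
import os


def fp4_prefill_threshold(env=os.environ):
    """Read LLAMA_FP4_PREFILL_M; unset, empty or invalid is treated as 0 (assumption)."""
    try:
        return max(0, int(env.get("LLAMA_FP4_PREFILL_M", "0")))
    except ValueError:
        return 0


def should_use_mmq(m, threshold, is_nvfp4=True, is_blackwell=True, is_dense=True):
    """Return True to stay on FP4-MMQ, False to route to dequant->bf16 cuBLAS."""
    if threshold <= 0:
        # Lever off: identical to the stock MMQ baseline.
        return True
    if not (is_nvfp4 and is_blackwell and is_dense):
        # Only NVFP4 dense weights on Blackwell are eligible for rerouting.
        return True
    # Decode and small-M prefill stay on MMQ; only M > threshold is rerouted.
    return m <= threshold


if __name__ == "__main__":
    t = fp4_prefill_threshold({"LLAMA_FP4_PREFILL_M": "64"})
    for m in (1, 64, 512, 2048):
        path = "FP4-MMQ" if should_use_mmq(m, t) else "bf16 cuBLAS"
        print(f"M={m:<5} -> {path}")
```

Because the baseline and the lever differ only in this predicate, a single binary can produce both sides of the A/B.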

1. Bit-exact / numeric gate (PASS - divergence benign)

| Gate | Result |
|---|---|
| test-backend-ops -o MUL_MAT (default, threshold off) | 1146/1146 pass |
| test-backend-ops -o MUL_MAT_ID (default) | 806/806 pass (MoE untouched) |
| test-backend-ops -o MUL_MAT, path FORCED (LLAMA_FP4_PREFILL_M=64) | NVFP4 large-M cases (m=2048/1600/2050, n=128, k=2048) green CUDA-vs-CPU |
| greedy md5, short prefill (< threshold), lever vs base | identical: 5951a5b4d624ce891e22ab5fca9bc439 (== documented dense reference; decode byte-untouched) |
| greedy md5, long prefill (> threshold, exercises bf16 path), lever vs base | identical: 5f3967df5781445feeb25762abb9eae7 (the new FP path flips no greedy argmax) |
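The greedy md5 rows reduce to hashing the generated output of the baseline and of the lever run and requiring the digests to match. A minimal sketch, assuming the outputs have already been captured as bytes; the helper names are hypothetical and this is not the harness that produced the digests above.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex md5 digest of a captured greedy output."""
    return hashlib.md5(data).hexdigest()


def greedy_gate(base_output: bytes, lever_output: bytes) -> bool:
    """PASS only if baseline and lever outputs are byte-identical."""
    return md5_hex(base_output) == md5_hex(lever_output)


if __name__ == "__main__":
    base = b"example greedy completion"   # placeholder, not real model output
    lever = b"example greedy completion"  # placeholder, not real model output
    print("PASS" if greedy_gate(base, lever) else "FAIL", md5_hex(base))
```

A single flipped argmax anywhere in the generation changes the digest, which is why byte-identity is a stricter check than a numeric tolerance on the GEMM output.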

The new path (NVFP4->bf16 round, bf16 tensor cores, f32 accumulate) is a different FP path from fused FP4xQ8_1 MMQ, but it is precision-neutral-to-better: keeping activations in bf16 instead of Q8_1 is strictly more precise, and the greedy output is byte-identical. This matches the scope's prediction (KLD(dequant-bf16 || f16) <= KLD(FP4-MMQ || f16)).

2. Performance (REGRESSION - the lever loses on GB10)

S_PP (prefill tokens/s), q36-27b dense, A/B LLAMA_FP4_PREFILL_M off vs on:

| prefill ubatch M | npl | base S_PP, MMQ (tok/s) | lever S_PP, bf16 cuBLAS (tok/s) | delta |
|---|---|---|---|---|
| 512 | 32 | 958.99 | 486.65 | -49% |
| 1024 | 8 | 1013.65 | 587.27 | -42% |
| 2048 | 8 | 918.46 | 649.42 | -29% |

Default-off control (no env): S_PP 966.98 == base (within noise) -> the patch is inert by default.
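The delta column can be recomputed from the two S_PP columns. The sketch below uses only the figures reported above; the relative-change formula (lever / base - 1) is the assumed definition of "delta".

```python
# (M, npl, base S_PP, lever S_PP), copied from the A/B results above.
ROWS = [
    (512, 32, 958.99, 486.65),
    (1024, 8, 1013.65, 587.27),
    (2048, 8, 918.46, 649.42),
]


def delta_pct(base: float, lever: float) -> float:
    """Relative change of the lever vs the baseline, in percent."""
    return (lever / base - 1.0) * 100.0


if __name__ == "__main__":
    for m, npl, base, lever in ROWS:
        print(f"M={m:<5} npl={npl:<3} {base:8.2f} -> {lever:8.2f}  "
              f"{delta_pct(base, lever):+.1f}%")
```

Rounded to whole percent, the results reproduce the -49%, -42% and -29% entries.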

3. Why it loses (the scope premise was wrong for GB10)

The scope assumed FP4-MMQ is register-bound to ~3% of FP4 peak at large M, so a vendor large-M kernel would win. Measured, FP4-MMQ at M=512..2048 beats dequant->bf16 cuBLAS by 29-49%. Two compounding reasons:

  1. bf16 tensor-core peak is ~half FP4 peak on GB10. Even a perfect bf16 GEMM caps at ~half the throughput the FP4-MMA path can reach.
  2. The dequant tax is an extra memory pass that is only partially amortized. Per prefill step the new path reads the FP4 weights (~0.5 B/elt), writes bf16 (2 B/elt), and the GEMM then reads that bf16 back (2 B/elt): ~4 B/elt of bf16 traffic on top of the FP4 read, i.e. ~8x the weight byte traffic of the FP4-MMQ read (~0.5 B/elt). The dequant cost is M-independent, so it amortizes only as M grows: the deficit shrinks 49% -> 42% -> 29% from M=512 -> 1024 -> 2048 but never reaches break-even, even at M=2048 (above the default n_ubatch).
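The per-element byte accounting in the second point can be written out explicitly. This is a back-of-the-envelope sketch using the approximate sizes quoted above (~0.5 B/elt for FP4, 2 B/elt for bf16); it ignores scale metadata, caching and activation traffic, all of which are simplifying assumptions.

```python
FP4_BYTES = 0.5   # approximate NVFP4 weight size per element, as quoted above
BF16_BYTES = 2.0  # bf16 weight size per element


def weight_traffic_per_element():
    """Return (mmq_bytes, lever_bytes) of weight traffic per element per prefill step."""
    mmq = FP4_BYTES               # FP4-MMQ reads the FP4 weights once
    dequant_read = FP4_BYTES      # dequant kernel reads FP4 ...
    dequant_write = BF16_BYTES    # ... and writes the transient bf16 buffer
    gemm_read = BF16_BYTES        # cuBLAS GEMM reads the bf16 buffer back
    return mmq, dequant_read + dequant_write + gemm_read


if __name__ == "__main__":
    mmq, lever = weight_traffic_per_element()
    bf16_only_ratio = (2 * BF16_BYTES) / mmq  # counts only the two bf16 passes
    total_ratio = lever / mmq                 # also counts the dequant's FP4 read
    print(f"MMQ: {mmq} B/elt, lever: {lever} B/elt")
    print(f"bf16 passes only: {bf16_only_ratio:.0f}x, all passes: {total_ratio:.0f}x")
```

Counting only the two bf16 passes gives the ~8x figure; including the FP4 read done by the dequant kernel raises it to 9x. Either way the lever moves close to an order of magnitude more weight bytes than the FP4-MMQ read.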

This is also consistent with the README decode finding that the dense path was already ~96-97% of vLLM: the dense GEMM was never the bottleneck that the prefill ground truth (measured on the MoE decision model) suggested.

4. Status of the phases

  • Phase 1 (dense): REJECTED on GB10, landed default-off as a validated, env-gated scaffold (mechanism + bit-exact gate reusable by option (b) and by non-GB10 hardware where bf16 may fare differently).
  • Phase 2 (MoE grouped large-M): NOT implemented. It inherits the same bf16-peak < FP4-peak ceiling plus a per-expert dequant, so a grouped bf16-cuBLAS would regress for the same reason. The MoE id-path also has a graph-safety catch: when should_use_mmq returns false it falls back to the host-sync sorted loop, which is not CUDA-graph-safe. Not worth the multi-day grouped-cuBLAS + graph work on a path the dense A/B already shows loses.
  • The only route to a real prefill GEMM win is option (b): a native Blackwell FP4-MMA large-M kernel, a multi-week effort worth greenlighting only if the prefill regime is funded. The committed scaffold gives option (b) its M-threshold routing and its bit-exact gate for free.