feat(paged): FP4 prefill large-M dequant->bf16 cuBLAS scaffold (patch 0033, default-off)

Option (a) of PREFILL_GEMM_SCOPE.md: route large-M (prefill) NVFP4 dense weight
GEMMs off the decode-tuned FP4-MMQ kernel onto the dequant->bf16 cuBLAS (nvjet)
tensor-core path, wired via an M-threshold in ggml_cuda_should_use_mmq. Lands the
validated, bit-exact-gated mechanism and records the honest GB10 result: it is a
regression, so it ships default-off (== stock), mirroring the patch-0017
default-off discipline.

Three-edit scaffold (no new kernel): should_use_mmq routes NVFP4+Blackwell+dense
M>LLAMA_FP4_PREFILL_M to cuBLAS; op_mul_mat_cublas gains an NVFP4 branch that
dequants the FP4 weights to a transient bf16 pool buffer (not cached - stays
FP4-resident) and runs cublasGemmEx CUDA_R_16BF/COMPUTE_32F; ggml_get_to_bf16_cuda
gains the NVFP4 case.

Bit-exact gate PASS (benign): test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID
806/806; the forced path (LLAMA_FP4_PREFILL_M=64) is green CUDA-vs-CPU at NVFP4
large-M shapes; greedy md5 on q36-27b is byte-identical to FP4-MMQ both for
short prefill (5951a5b4, decode untouched) and for a >threshold prefill that
exercises the bf16 path (5f3967df - no greedy argmax flips).

Performance REGRESSES on GB10 (S_PP, q36-27b dense, A/B via env): M=512 958.99
-> 486.65 (-49%), M=1024 1013.65 -> 587.27 (-42%), M=2048 918.46 -> 649.42
(-29%). The scope premise (FP4-MMQ ~3% of FP4 peak at large M) is false here:
FP4-MMQ beats bf16-cuBLAS because bf16 peak is ~half FP4 peak and the per-step
weight dequant + 4x bf16 weight traffic (~8x total vs the FP4 read) dominate,
only partially amortizing as M grows. Default-off keeps stock S_PP (966.98).

Phase 2 (MoE grouped large-M) not implemented: it inherits the same
bf16-peak<FP4-peak ceiling plus a per-expert dequant, so grouped bf16-cuBLAS
would regress for the same reason; a real prefill GEMM win needs option (b), a
native FP4-MMA large-M kernel. Full A/B in docs/PREFILL_GEMM_RESULTS.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2026-06-28 17:42:15 +00:00
parent 4bdd26a7f0
commit 000705321f
2 changed files with 250 additions and 0 deletions

View File

@@ -0,0 +1,76 @@
# PREFILL_GEMM_RESULTS - option (a) dequant->bf16 cuBLAS, measured on GB10
Companion to `PREFILL_GEMM_SCOPE.md`. This records the GPU A/B for the #1
prefill lever (route large-M NVFP4 dense GEMMs off FP4-MMQ onto dequant->bf16
cuBLAS / nvjet). Shipped as patch `0033`, **default-off** because the measured
result is a regression on this hardware.
Hardware: NVIDIA GB10 (sm_121), CUDA 13.0. Backend pin `9d5d882d`.
Models: `q36-27b-nvfp4.gguf` (dense), `q36-35b-a3b-nvfp4.gguf` (MoE).
Binary: `build-cuda/bin/llama-batched-bench -fa on -ngl 99`, `LLAMA_KV_PAGED=1`.
A/B is a single build toggled by `LLAMA_FP4_PREFILL_M` (0 = MMQ baseline, >0 =
route prefill M>threshold to bf16 cuBLAS), so it isolates exactly this lever.
## 1. Bit-exact / numeric gate (PASS - divergence benign)
| Gate | Result |
|---|---|
| `test-backend-ops -o MUL_MAT` (default, threshold off) | 1146/1146 pass |
| `test-backend-ops -o MUL_MAT_ID` (default) | 806/806 pass (MoE untouched) |
| `test-backend-ops -o MUL_MAT`, path FORCED (`LLAMA_FP4_PREFILL_M=64`) | NVFP4 large-M cases (m=2048/1600/2050, n=128, k=2048) green CUDA-vs-CPU |
| greedy md5, short prefill (< threshold), lever vs base | identical: `5951a5b4d624ce891e22ab5fca9bc439` (== documented dense reference; decode byte-untouched) |
| greedy md5, long prefill (> threshold, exercises bf16 path), lever vs base | identical: `5f3967df5781445feeb25762abb9eae7` (the new FP path flips no greedy argmax) |
The new path (NVFP4->bf16 round, bf16 tensor cores, f32 accumulate) is a
different FP path from fused FP4xQ8_1 MMQ, but it is precision-neutral-to-better:
keeping activations in bf16 instead of Q8_1 is strictly more precise, and the
greedy output is byte-identical. This matches the scope's prediction
(KLD(dequant-bf16 || f16) <= KLD(FP4-MMQ || f16)).
## 2. Performance (REGRESSION - the lever loses on GB10)
S_PP (prefill tokens/s), q36-27b dense, A/B `LLAMA_FP4_PREFILL_M` off vs on:
| prefill ubatch M | npl | base S_PP (MMQ) | lever S_PP (bf16 cuBLAS) | delta |
|---|---|---|---|---|
| 512 | 32 | 958.99 | 486.65 | -49% |
| 1024 | 8 | 1013.65 | 587.27 | -42% |
| 2048 | 8 | 918.46 | 649.42 | -29% |
Default-off control (no env): S_PP 966.98 == base (within noise) -> the patch is
inert by default.
## 3. Why it loses (the scope premise was wrong for GB10)
The scope assumed FP4-MMQ is register-bound to ~3% of FP4 peak at large M, so a
vendor large-M kernel would win. **Measured, FP4-MMQ at M=512..2048 beats
dequant->bf16 cuBLAS by 29-49%.** Two compounding reasons:
1. **bf16 tensor-core peak is ~half FP4 peak on GB10.** Even a perfect bf16 GEMM
caps at ~half the throughput the FP4-MMA path can reach.
2. **The dequant tax is an un-amortized memory pass.** Per prefill step the new
path reads FP4 weights (~0.5 B/elt), writes bf16 (2 B/elt), then the GEMM
reads bf16 (2 B/elt) = ~8x the weight byte traffic of the FP4-MMQ read
(~0.5 B/elt). The dequant write is M-independent, so it only amortizes as M
grows: the gap shrinks 49% -> 42% -> 29% from M=512 -> 2048 but never crosses
even at M=2048 (above the default n_ubatch).
This is also consistent with the README decode finding that the dense path was
already ~96-97% of vLLM - the dense GEMM was never the bottleneck the way the
prefill ground-truth (measured on the MoE decision model) implied.
## 4. Status of the phases
- **Phase 1 (dense): REJECTED on GB10**, landed default-off as a validated,
env-gated scaffold (mechanism + bit-exact gate reusable by option (b) and by
non-GB10 hardware where bf16 may fare differently).
- **Phase 2 (MoE grouped large-M): NOT implemented.** It inherits the same
bf16-peak < FP4-peak ceiling plus a per-expert dequant, so a grouped
bf16-cuBLAS would regress for the same reason; the MoE id-path also has the
graph-safety catch (a false `should_use_mmq` falls to the host-sync sorted
loop, not CUDA-graph-safe). Not worth the multi-day grouped-cuBLAS + graph
work on a path the dense A/B already shows loses.
- **The only route to a real prefill GEMM win is option (b)** - a native
Blackwell FP4-MMA large-M kernel (multi-week), to greenlight only if the
prefill regime is funded. The committed scaffold gives option (b) its
M-threshold routing and its bit-exact gate for free.

View File

@@ -0,0 +1,174 @@
From 0033003300330033003300330033003300330033 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Sun, 28 Jun 2026 19:35:00 +0200
Subject: [PATCH] feat(paged): FP4 prefill large-M dequant->bf16 cuBLAS scaffold
(default-off, rejected on GB10) (patch 0033)
Option (a) of docs/PREFILL_GEMM_SCOPE.md: route large-M (prefill) NVFP4 dense
weight GEMMs OFF the decode-tuned FP4-MMQ kernel and through the dequant->bf16
cuBLAS (nvjet) tensor-core path. This lands the validated, bit-exact-gated
mechanism and records the honest result: on GB10 (sm_121) the lever is a
REGRESSION, so it is kept default-OFF (byte-identical to stock), mirroring the
patch-0017 default-off discipline.
Mechanism (all three edits are the integration scaffold, no new kernel):
- ggml/src/ggml-cuda/mmq.cu (ggml_cuda_should_use_mmq): NVFP4 + Blackwell +
dense (n_experts==0) + M > LLAMA_FP4_PREFILL_M returns false, so the dense
dispatch falls through to ggml_cuda_op_mul_mat_cublas. -D / env
LLAMA_FP4_PREFILL_M tunable; default 0 == disabled == stock. Decode and
small batches (M <= threshold) stay on FP4-MMQ.
- ggml/src/ggml-cuda/ggml-cuda.cu (ggml_cuda_op_mul_mat_cublas): new NVFP4
branch dequants the FP4 weights to a TRANSIENT bf16 pool buffer (not cached,
so the model stays FP4-resident) and runs cublasGemmEx CUDA_R_16BF /
COMPUTE_32F (tensor cores) instead of the f32 cublasSgemm fallback (no
tensor cores) that NVFP4 would otherwise hit.
- ggml/src/ggml-cuda/convert.cu (ggml_get_to_bf16_cuda): add the NVFP4 case
(the dequant kernel is dst-type generic; bf16 preserves the model's native
activation range vs f16). nullptr-by-default for other types is unchanged.
Bit-exact / numeric gate (PASS, divergence benign):
- test-backend-ops MUL_MAT 1146/1146, MUL_MAT_ID 806/806 at default; and with
the path FORCED (LLAMA_FP4_PREFILL_M=64) the NVFP4 large-M cases are green
CUDA-vs-CPU (the bf16 path is numerically within the project tolerance).
- greedy md5 (q36-27b dense, "The capital of France is", -n 48, temp 0):
lever == base == 5951a5b4d624ce891e22ab5fca9bc439 (the documented dense
reference) for short prefill (decode byte-untouched), AND identical for a
>threshold prefill that exercises the new bf16 path (5f3967df...): the new
FP path does not flip a single greedy argmax. As predicted by the scope,
bf16 activations are strictly more precise than the FP4-MMQ Q8_1 path, so
this is precision-neutral-to-better, not a regression.
Honest performance result (S_PP t/s, q36-27b dense, llama-batched-bench
-fa on -ngl 99, A/B via env), see docs/PREFILL_GEMM_RESULTS.md:
-npp 512 -npl 32 : base(MMQ) 958.99 -> lever 486.65 (-49%)
-npp 1024 -npl 8 : base(MMQ)1013.65 -> lever 587.27 (-42%)
-npp 2048 -npl 8 : base(MMQ) 918.46 -> lever 649.42 (-29%)
The scope premise (FP4-MMQ ~3% of FP4 peak at large M) is FALSE on GB10:
FP4-MMQ at M=512..2048 beats dequant->bf16 cuBLAS, because bf16 tensor-core peak
is ~half FP4 peak AND the per-step weight dequant + 4x bf16 weight traffic
(~8x total vs the FP4 read) dominate, only partially amortizing as M grows
(gap shrinks 49%->29%, never crosses). Default-off keeps stock S_PP (966.98,
within noise of base).
Phase 2 (MoE grouped large-M) is NOT implemented: it inherits the same
bf16-peak < FP4-peak ceiling plus a per-expert dequant, so grouped bf16-cuBLAS
would regress for the same reason. The only route to a real prefill GEMM win is
option (b) - a native FP4-MMA large-M kernel (multi-week). This patch is the
validated, env-gated scaffold that option (b) / non-GB10 hardware can reuse for
the M-threshold routing + bit-exact gate.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
diff --git a/ggml/src/ggml-cuda/convert.cu b/ggml/src/ggml-cuda/convert.cu
index 61630a3..f0273c1 100644
--- a/ggml/src/ggml-cuda/convert.cu
+++ b/ggml/src/ggml-cuda/convert.cu
@@ -704,6 +704,15 @@ to_bf16_cuda_t ggml_get_to_bf16_cuda(ggml_type type) {
return convert_unary_cont_cuda<float>;
case GGML_TYPE_F16:
return convert_unary_cont_cuda<half>;
+ // Paged prefill lever (patch 0033): NVFP4 -> bf16 dequant for the large-M
+ // dequant->bf16 cuBLAS (nvjet) prefill GEMM path in
+ // ggml_cuda_op_mul_mat_cublas. The dequant kernel is dst-type generic, so
+ // this instantiates the bf16 variant; bf16 (not f16) preserves the model's
+ // native bf16 activation range and avoids f16 overflow on large prefill
+ // activations. Only the new prefill path consumes this; nullptr-by-default
+ // for all other types is unchanged.
+ case GGML_TYPE_NVFP4:
+ return dequantize_row_nvfp4_cuda;
default:
return nullptr;
}
diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu
index 9933fa6..2dcaaab 100644
--- a/ggml/src/ggml-cuda/mmq.cu
+++ b/ggml/src/ggml-cuda/mmq.cu
@@ -321,6 +321,33 @@ bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11, int64_t
return false;
}
+ // Paged prefill lever (patch 0033): OPTION-(a) route large-M NVFP4 dense GEMMs
+ // OFF the FP4-MMQ kernel and through the dequant->bf16 cuBLAS (nvjet)
+ // tensor-core path (ggml_cuda_op_mul_mat_cublas, NVFP4 bf16 branch). The
+ // scope premise was that FP4-MMQ is register-bound to ~3% of FP4 peak at
+ // large M. MEASURED ON GB10 THIS IS FALSE: FP4-MMQ at M=512..2048 beats
+ // dequant->bf16 cuBLAS by 29-49% (S_PP A/B in docs/PREFILL_GEMM_RESULTS.md),
+ // because bf16 tensor-core peak is ~half FP4 peak AND the per-step weight
+ // dequant + 4x bf16 weight traffic (~8x total vs the FP4 read) dominate and
+ // only partially amortize as M grows. The path is NUMERICALLY VALID and
+ // benign (greedy md5 byte-identical to FP4-MMQ; test-backend-ops passes), so
+ // it is kept as a validated, env-gated scaffold (for option-(b) native FP4
+ // large-M kernels and non-GB10 hardware), but DEFAULT-DISABLED (== stock).
+ // Set -D LLAMA_FP4_PREFILL_M=<M> or env LLAMA_FP4_PREFILL_M=<M> to A/B it;
+ // 0 (default) disables. Dense only (n_experts == 0).
+#ifndef LLAMA_FP4_PREFILL_M
+#define LLAMA_FP4_PREFILL_M 0
+#endif // LLAMA_FP4_PREFILL_M
+ if (type == GGML_TYPE_NVFP4 && n_experts == 0 && blackwell_mma_available(cc)) {
+ static const int64_t fp4_prefill_m = [] {
+ const char * e = getenv("LLAMA_FP4_PREFILL_M");
+ return e != nullptr ? (int64_t) atoll(e) : (int64_t) LLAMA_FP4_PREFILL_M;
+ }();
+ if (fp4_prefill_m > 0 && ne11 > fp4_prefill_m) {
+ return false;
+ }
+ }
+
if (turing_mma_available(cc)) {
return true;
}
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 0dad6e1..6476d46 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -1660,7 +1660,47 @@ static void ggml_cuda_op_mul_mat_cublas(
row_diff == src0->ne[1] &&
dst->op_params[0] == GGML_PREC_DEFAULT;
- if (supports_bf16 && src0->type == GGML_TYPE_BF16 && ggml_is_contiguous(src0) && row_diff == src0->ne[1]) {
+ if (supports_bf16 && src0->type == GGML_TYPE_NVFP4 && ggml_is_contiguous(src0) && row_diff == src0->ne[1]) {
+ // Paged prefill lever (patch 0033): NVFP4 only reaches cuBLAS when
+ // ggml_cuda_should_use_mmq() returned false (large-M dense prefill).
+ // Dequant the FP4 weights to a TRANSIENT bf16 pool buffer and run a
+ // tensor-core bf16 GEMM (nvjet) instead of the f32 cublasSgemm fallback
+ // (no tensor cores) that the final else-branch would otherwise use. The
+ // weights are NOT cached as bf16 (pool scratch, freed at step end) so the
+ // model stays FP4-resident and the backend keeps its memory advantage.
+ ggml_cuda_pool_alloc<nv_bfloat16> src0_as_bf16(ctx.pool(id), row_diff*ne00);
+ const to_bf16_cuda_t to_bf16_cuda_src0 = ggml_get_to_bf16_cuda(GGML_TYPE_NVFP4);
+ GGML_ASSERT(to_bf16_cuda_src0 != nullptr);
+ to_bf16_cuda_src0(src0_dd_i, src0_as_bf16.get(), row_diff*ne00, stream);
+
+ ggml_cuda_pool_alloc<nv_bfloat16> src1_as_bf16(ctx.pool(id));
+ if (src1->type != GGML_TYPE_BF16) {
+ const to_bf16_cuda_t to_bf16_cuda = ggml_get_to_bf16_cuda(src1->type);
+ GGML_ASSERT(to_bf16_cuda != nullptr);
+ size_t ne = src1_ncols*ne10;
+ src1_as_bf16.alloc(ne);
+ to_bf16_cuda(src1_ddf_i, src1_as_bf16.get(), ne, stream);
+ }
+ const nv_bfloat16 * src1_ptr = src1->type == GGML_TYPE_BF16 ? (const nv_bfloat16 *) src1_ddf_i : src1_as_bf16.get();
+ const nv_bfloat16 * src0_ptr = src0_as_bf16.get();
+ ggml_cuda_pool_alloc<nv_bfloat16> dst_bf16(ctx.pool(id), row_diff*src1_ncols);
+
+ const float alpha_f32 = 1.0f;
+ const float beta_f32 = 0.0f;
+
+ CUBLAS_CHECK(cublasSetStream(ctx.cublas_handle(id), stream));
+ CUBLAS_CHECK(
+ cublasGemmEx(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N,
+ row_diff, src1_ncols, ne10,
+ &alpha_f32, src0_ptr, CUDA_R_16BF, ne00,
+ src1_ptr, CUDA_R_16BF, ne10,
+ &beta_f32, dst_bf16.get(), CUDA_R_16BF, ldc,
+ CUBLAS_COMPUTE_32F,
+ CUBLAS_GEMM_DEFAULT_TENSOR_OP));
+
+ const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(GGML_TYPE_BF16);
+ to_fp32_cuda(dst_bf16.get(), dst_dd_i, row_diff*src1_ncols, stream);
+ } else if (supports_bf16 && src0->type == GGML_TYPE_BF16 && ggml_is_contiguous(src0) && row_diff == src0->ne[1]) {
ggml_cuda_pool_alloc<nv_bfloat16> src1_as_bf16(ctx.pool(id));
if (src1->type != GGML_TYPE_BF16) {
const to_bf16_cuda_t to_bf16_cuda = ggml_get_to_bf16_cuda(src1->type);
--
2.43.0