feat(paged): FP4 prefill large-M dequant->bf16 cuBLAS scaffold (patch 0033, default-off)

Option (a) of PREFILL_GEMM_SCOPE.md: route large-M (prefill) NVFP4 dense weight GEMMs off the decode-tuned FP4-MMQ kernel onto the dequant->bf16 cuBLAS (nvjet) tensor-core path, wired via an M-threshold in ggml_cuda_should_use_mmq. Lands the validated, bit-exact-gated mechanism and records the honest GB10 result: it is a regression, so it ships default-off (== stock), mirroring the patch-0017 default-off discipline. Three-edit scaffold (no new kernel): should_use_mmq routes NVFP4+Blackwell+dense M>LLAMA_FP4_PREFILL_M to cuBLAS; op_mul_mat_cublas gains an NVFP4 branch that dequants the FP4 weights to a transient bf16 pool buffer (not cached - stays FP4-resident) and runs cublasGemmEx CUDA_R_16BF/COMPUTE_32F; ggml_get_to_bf16_cuda gains the NVFP4 case. Bit-exact gate PASS (benign): test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID 806/806; the forced path (LLAMA_FP4_PREFILL_M=64) is green CUDA-vs-CPU at NVFP4 large-M shapes; greedy md5 on q36-27b is byte-identical to FP4-MMQ both for short prefill (5951a5b4, decode untouched) and for a >threshold prefill that exercises the bf16 path (5f3967df - no greedy argmax flips). Performance REGRESSES on GB10 (S_PP, q36-27b dense, A/B via env): M=512 958.99 -> 486.65 (-49%), M=1024 1013.65 -> 587.27 (-42%), M=2048 918.46 -> 649.42 (-29%). The scope premise (FP4-MMQ ~3% of FP4 peak at large M) is false here: FP4-MMQ beats bf16-cuBLAS because bf16 peak is ~half FP4 peak and the per-step weight dequant + 4x bf16 weight traffic (~8x total vs the FP4 read) dominate, only partially amortizing as M grows. Default-off keeps stock S_PP (966.98). Phase 2 (MoE grouped large-M) not implemented: it inherits the same bf16-peak<FP4-peak ceiling plus a per-expert dequant, so grouped bf16-cuBLAS would regress for the same reason; a real prefill GEMM win needs option (b), a native FP4-MMA large-M kernel. Full A/B in docs/PREFILL_GEMM_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-29 19:06:43 -04:00 · 2026-06-28 17:42:15 +00:00
parent 4bdd26a7f0
commit 000705321f
2 changed files with 250 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/PREFILL_GEMM_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PREFILL_GEMM_RESULTS.md
@@ -0,0 +1,76 @@
+# PREFILL_GEMM_RESULTS - option (a) dequant->bf16 cuBLAS, measured on GB10
+
+Companion to `PREFILL_GEMM_SCOPE.md`. This records the GPU A/B for the #1
+prefill lever (route large-M NVFP4 dense GEMMs off FP4-MMQ onto dequant->bf16
+cuBLAS / nvjet). Shipped as patch `0033`, **default-off** because the measured
+result is a regression on this hardware.
+
+Hardware: NVIDIA GB10 (sm_121), CUDA 13.0. Backend pin `9d5d882d`.
+Models: `q36-27b-nvfp4.gguf` (dense), `q36-35b-a3b-nvfp4.gguf` (MoE).
+Binary: `build-cuda/bin/llama-batched-bench -fa on -ngl 99`, `LLAMA_KV_PAGED=1`.
+A/B is a single build toggled by `LLAMA_FP4_PREFILL_M` (0 = MMQ baseline, >0 =
+route prefill M>threshold to bf16 cuBLAS), so it isolates exactly this lever.
+
+## 1. Bit-exact / numeric gate (PASS - divergence benign)
+
+| Gate | Result |
+|---|---|
+| `test-backend-ops -o MUL_MAT` (default, threshold off) | 1146/1146 pass |
+| `test-backend-ops -o MUL_MAT_ID` (default) | 806/806 pass (MoE untouched) |
+| `test-backend-ops -o MUL_MAT`, path FORCED (`LLAMA_FP4_PREFILL_M=64`) | NVFP4 large-M cases (m=2048/1600/2050, n=128, k=2048) green CUDA-vs-CPU |
+| greedy md5, short prefill (< threshold), lever vs base | identical: `5951a5b4d624ce891e22ab5fca9bc439` (== documented dense reference; decode byte-untouched) |
+| greedy md5, long prefill (> threshold, exercises bf16 path), lever vs base | identical: `5f3967df5781445feeb25762abb9eae7` (the new FP path flips no greedy argmax) |
+
+The new path (NVFP4->bf16 round, bf16 tensor cores, f32 accumulate) is a
+different FP path from fused FP4xQ8_1 MMQ, but it is precision-neutral-to-better:
+keeping activations in bf16 instead of Q8_1 is strictly more precise, and the
+greedy output is byte-identical. This matches the scope's prediction
+(KLD(dequant-bf16 || f16) <= KLD(FP4-MMQ || f16)).
+
+## 2. Performance (REGRESSION - the lever loses on GB10)
+
+S_PP (prefill tokens/s), q36-27b dense, A/B `LLAMA_FP4_PREFILL_M` off vs on:
+
+| prefill ubatch M | npl | base S_PP (MMQ) | lever S_PP (bf16 cuBLAS) | delta |
+|---|---|---|---|---|
+| 512  | 32 | 958.99  | 486.65 | -49% |
+| 1024 | 8  | 1013.65 | 587.27 | -42% |
+| 2048 | 8  | 918.46  | 649.42 | -29% |
+
+Default-off control (no env): S_PP 966.98 == base (within noise) -> the patch is
+inert by default.
+
+## 3. Why it loses (the scope premise was wrong for GB10)
+
+The scope assumed FP4-MMQ is register-bound to ~3% of FP4 peak at large M, so a
+vendor large-M kernel would win. **Measured, FP4-MMQ at M=512..2048 beats
+dequant->bf16 cuBLAS by 29-49%.** Two compounding reasons:
+
+1. **bf16 tensor-core peak is ~half FP4 peak on GB10.** Even a perfect bf16 GEMM
+   caps at ~half the throughput the FP4-MMA path can reach.
+2. **The dequant tax is an un-amortized memory pass.** Per prefill step the new
+   path reads FP4 weights (~0.5 B/elt), writes bf16 (2 B/elt), then the GEMM
+   reads bf16 (2 B/elt) = ~8x the weight byte traffic of the FP4-MMQ read
+   (~0.5 B/elt). The dequant write is M-independent, so it only amortizes as M
+   grows: the gap shrinks 49% -> 42% -> 29% from M=512 -> 2048 but never crosses
+   even at M=2048 (above the default n_ubatch).
+
+This is also consistent with the README decode finding that the dense path was
+already ~96-97% of vLLM - the dense GEMM was never the bottleneck the way the
+prefill ground-truth (measured on the MoE decision model) implied.
+
+## 4. Status of the phases
+
+- **Phase 1 (dense): REJECTED on GB10**, landed default-off as a validated,
+  env-gated scaffold (mechanism + bit-exact gate reusable by option (b) and by
+  non-GB10 hardware where bf16 may fare differently).
+- **Phase 2 (MoE grouped large-M): NOT implemented.** It inherits the same
+  bf16-peak < FP4-peak ceiling plus a per-expert dequant, so a grouped
+  bf16-cuBLAS would regress for the same reason; the MoE id-path also has the
+  graph-safety catch (a false `should_use_mmq` falls to the host-sync sorted
+  loop, not CUDA-graph-safe). Not worth the multi-day grouped-cuBLAS + graph
+  work on a path the dense A/B already shows loses.
+- **The only route to a real prefill GEMM win is option (b)** - a native
+  Blackwell FP4-MMA large-M kernel (multi-week), to greenlight only if the
+  prefill regime is funded. The committed scaffold gives option (b) its
+  M-threshold routing and its bit-exact gate for free.
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0033-fp4-prefill-large-m-bf16-cublas-scaffold.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0033-fp4-prefill-large-m-bf16-cublas-scaffold.patch
@@ -0,0 +1,174 @@
+From 0033003300330033003300330033003300330033 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Sun, 28 Jun 2026 19:35:00 +0200
+Subject: [PATCH] feat(paged): FP4 prefill large-M dequant->bf16 cuBLAS scaffold
+ (default-off, rejected on GB10) (patch 0033)
+
+Option (a) of docs/PREFILL_GEMM_SCOPE.md: route large-M (prefill) NVFP4 dense
+weight GEMMs OFF the decode-tuned FP4-MMQ kernel and through the dequant->bf16
+cuBLAS (nvjet) tensor-core path. This lands the validated, bit-exact-gated
+mechanism and records the honest result: on GB10 (sm_121) the lever is a
+REGRESSION, so it is kept default-OFF (byte-identical to stock), mirroring the
+patch-0017 default-off discipline.
+
+Mechanism (all three edits are the integration scaffold, no new kernel):
+  - ggml/src/ggml-cuda/mmq.cu (ggml_cuda_should_use_mmq): NVFP4 + Blackwell +
+    dense (n_experts==0) + M > LLAMA_FP4_PREFILL_M returns false, so the dense
+    dispatch falls through to ggml_cuda_op_mul_mat_cublas. -D / env
+    LLAMA_FP4_PREFILL_M tunable; default 0 == disabled == stock. Decode and
+    small batches (M <= threshold) stay on FP4-MMQ.
+  - ggml/src/ggml-cuda/ggml-cuda.cu (ggml_cuda_op_mul_mat_cublas): new NVFP4
+    branch dequants the FP4 weights to a TRANSIENT bf16 pool buffer (not cached,
+    so the model stays FP4-resident) and runs cublasGemmEx CUDA_R_16BF /
+    COMPUTE_32F (tensor cores) instead of the f32 cublasSgemm fallback (no
+    tensor cores) that NVFP4 would otherwise hit.
+  - ggml/src/ggml-cuda/convert.cu (ggml_get_to_bf16_cuda): add the NVFP4 case
+    (the dequant kernel is dst-type generic; bf16 preserves the model's native
+    activation range vs f16). nullptr-by-default for other types is unchanged.
+
+Bit-exact / numeric gate (PASS, divergence benign):
+  - test-backend-ops MUL_MAT 1146/1146, MUL_MAT_ID 806/806 at default; and with
+    the path FORCED (LLAMA_FP4_PREFILL_M=64) the NVFP4 large-M cases are green
+    CUDA-vs-CPU (the bf16 path is numerically within the project tolerance).
+  - greedy md5 (q36-27b dense, "The capital of France is", -n 48, temp 0):
+    lever == base == 5951a5b4d624ce891e22ab5fca9bc439 (the documented dense
+    reference) for short prefill (decode byte-untouched), AND identical for a
+    >threshold prefill that exercises the new bf16 path (5f3967df...): the new
+    FP path does not flip a single greedy argmax. As predicted by the scope,
+    bf16 activations are strictly more precise than the FP4-MMQ Q8_1 path, so
+    this is precision-neutral-to-better, not a regression.
+
+Honest performance result (S_PP t/s, q36-27b dense, llama-batched-bench
+-fa on -ngl 99, A/B via env), see docs/PREFILL_GEMM_RESULTS.md:
+  -npp 512  -npl 32 : base(MMQ) 958.99 -> lever 486.65  (-49%)
+  -npp 1024 -npl 8  : base(MMQ)1013.65 -> lever 587.27  (-42%)
+  -npp 2048 -npl 8  : base(MMQ) 918.46 -> lever 649.42  (-29%)
+The scope premise (FP4-MMQ ~3% of FP4 peak at large M) is FALSE on GB10:
+FP4-MMQ at M=512..2048 beats dequant->bf16 cuBLAS, because bf16 tensor-core peak
+is ~half FP4 peak AND the per-step weight dequant + 4x bf16 weight traffic
+(~8x total vs the FP4 read) dominate, only partially amortizing as M grows
+(gap shrinks 49%->29%, never crosses). Default-off keeps stock S_PP (966.98,
+within noise of base).
+
+Phase 2 (MoE grouped large-M) is NOT implemented: it inherits the same
+bf16-peak < FP4-peak ceiling plus a per-expert dequant, so grouped bf16-cuBLAS
+would regress for the same reason. The only route to a real prefill GEMM win is
+option (b) - a native FP4-MMA large-M kernel (multi-week). This patch is the
+validated, env-gated scaffold that option (b) / non-GB10 hardware can reuse for
+the M-threshold routing + bit-exact gate.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+---
+diff --git a/ggml/src/ggml-cuda/convert.cu b/ggml/src/ggml-cuda/convert.cu
+index 61630a3..f0273c1 100644
+--- a/ggml/src/ggml-cuda/convert.cu
+++ b/ggml/src/ggml-cuda/convert.cu
+@@ -704,6 +704,15 @@ to_bf16_cuda_t ggml_get_to_bf16_cuda(ggml_type type) {
+             return convert_unary_cont_cuda<float>;
+         case GGML_TYPE_F16:
+             return convert_unary_cont_cuda<half>;
+        // Paged prefill lever (patch 0033): NVFP4 -> bf16 dequant for the large-M
+        // dequant->bf16 cuBLAS (nvjet) prefill GEMM path in
+        // ggml_cuda_op_mul_mat_cublas. The dequant kernel is dst-type generic, so
+        // this instantiates the bf16 variant; bf16 (not f16) preserves the model's
+        // native bf16 activation range and avoids f16 overflow on large prefill
+        // activations. Only the new prefill path consumes this; nullptr-by-default
+        // for all other types is unchanged.
+        case GGML_TYPE_NVFP4:
+            return dequantize_row_nvfp4_cuda;
+         default:
+             return nullptr;
+     }
+diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu
+index 9933fa6..2dcaaab 100644
+--- a/ggml/src/ggml-cuda/mmq.cu
+++ b/ggml/src/ggml-cuda/mmq.cu
+@@ -321,6 +321,33 @@ bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11, int64_t
+         return false;
+     }
+ 
+    // Paged prefill lever (patch 0033): OPTION-(a) route large-M NVFP4 dense GEMMs
+    // OFF the FP4-MMQ kernel and through the dequant->bf16 cuBLAS (nvjet)
+    // tensor-core path (ggml_cuda_op_mul_mat_cublas, NVFP4 bf16 branch). The
+    // scope premise was that FP4-MMQ is register-bound to ~3% of FP4 peak at
+    // large M. MEASURED ON GB10 THIS IS FALSE: FP4-MMQ at M=512..2048 beats
+    // dequant->bf16 cuBLAS by 29-49% (S_PP A/B in docs/PREFILL_GEMM_RESULTS.md),
+    // because bf16 tensor-core peak is ~half FP4 peak AND the per-step weight
+    // dequant + 4x bf16 weight traffic (~8x total vs the FP4 read) dominate and
+    // only partially amortize as M grows. The path is NUMERICALLY VALID and
+    // benign (greedy md5 byte-identical to FP4-MMQ; test-backend-ops passes), so
+    // it is kept as a validated, env-gated scaffold (for option-(b) native FP4
+    // large-M kernels and non-GB10 hardware), but DEFAULT-DISABLED (== stock).
+    // Set -D LLAMA_FP4_PREFILL_M=<M> or env LLAMA_FP4_PREFILL_M=<M> to A/B it;
+    // 0 (default) disables. Dense only (n_experts == 0).
+#ifndef LLAMA_FP4_PREFILL_M
+#define LLAMA_FP4_PREFILL_M 0
+#endif // LLAMA_FP4_PREFILL_M
+    if (type == GGML_TYPE_NVFP4 && n_experts == 0 && blackwell_mma_available(cc)) {
+        static const int64_t fp4_prefill_m = [] {
+            const char * e = getenv("LLAMA_FP4_PREFILL_M");
+            return e != nullptr ? (int64_t) atoll(e) : (int64_t) LLAMA_FP4_PREFILL_M;
+        }();
+        if (fp4_prefill_m > 0 && ne11 > fp4_prefill_m) {
+            return false;
+        }
+    }
+
+     if (turing_mma_available(cc)) {
+         return true;
+     }
+diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
+index 0dad6e1..6476d46 100644
+--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
+@@ -1660,7 +1660,47 @@ static void ggml_cuda_op_mul_mat_cublas(
+         row_diff == src0->ne[1] &&
+         dst->op_params[0] == GGML_PREC_DEFAULT;
+ 
+-    if (supports_bf16 && src0->type == GGML_TYPE_BF16 && ggml_is_contiguous(src0) && row_diff == src0->ne[1]) {
+    if (supports_bf16 && src0->type == GGML_TYPE_NVFP4 && ggml_is_contiguous(src0) && row_diff == src0->ne[1]) {
+        // Paged prefill lever (patch 0033): NVFP4 only reaches cuBLAS when
+        // ggml_cuda_should_use_mmq() returned false (large-M dense prefill).
+        // Dequant the FP4 weights to a TRANSIENT bf16 pool buffer and run a
+        // tensor-core bf16 GEMM (nvjet) instead of the f32 cublasSgemm fallback
+        // (no tensor cores) that the final else-branch would otherwise use. The
+        // weights are NOT cached as bf16 (pool scratch, freed at step end) so the
+        // model stays FP4-resident and the backend keeps its memory advantage.
+        ggml_cuda_pool_alloc<nv_bfloat16> src0_as_bf16(ctx.pool(id), row_diff*ne00);
+        const to_bf16_cuda_t to_bf16_cuda_src0 = ggml_get_to_bf16_cuda(GGML_TYPE_NVFP4);
+        GGML_ASSERT(to_bf16_cuda_src0 != nullptr);
+        to_bf16_cuda_src0(src0_dd_i, src0_as_bf16.get(), row_diff*ne00, stream);
+
+        ggml_cuda_pool_alloc<nv_bfloat16> src1_as_bf16(ctx.pool(id));
+        if (src1->type != GGML_TYPE_BF16) {
+            const to_bf16_cuda_t to_bf16_cuda = ggml_get_to_bf16_cuda(src1->type);
+            GGML_ASSERT(to_bf16_cuda != nullptr);
+            size_t ne = src1_ncols*ne10;
+            src1_as_bf16.alloc(ne);
+            to_bf16_cuda(src1_ddf_i, src1_as_bf16.get(), ne, stream);
+        }
+        const nv_bfloat16 * src1_ptr = src1->type == GGML_TYPE_BF16 ? (const nv_bfloat16 *) src1_ddf_i : src1_as_bf16.get();
+        const nv_bfloat16 * src0_ptr = src0_as_bf16.get();
+        ggml_cuda_pool_alloc<nv_bfloat16> dst_bf16(ctx.pool(id), row_diff*src1_ncols);
+
+        const float alpha_f32 = 1.0f;
+        const float beta_f32  = 0.0f;
+
+        CUBLAS_CHECK(cublasSetStream(ctx.cublas_handle(id), stream));
+        CUBLAS_CHECK(
+            cublasGemmEx(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N,
+                    row_diff, src1_ncols, ne10,
+                    &alpha_f32,  src0_ptr,       CUDA_R_16BF, ne00,
+                                 src1_ptr,       CUDA_R_16BF, ne10,
+                    &beta_f32,   dst_bf16.get(), CUDA_R_16BF, ldc,
+                    CUBLAS_COMPUTE_32F,
+                    CUBLAS_GEMM_DEFAULT_TENSOR_OP));
+
+        const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(GGML_TYPE_BF16);
+        to_fp32_cuda(dst_bf16.get(), dst_dd_i, row_diff*src1_ncols, stream);
+    } else if (supports_bf16 && src0->type == GGML_TYPE_BF16 && ggml_is_contiguous(src0) && row_diff == src0->ne[1]) {
+         ggml_cuda_pool_alloc<nv_bfloat16> src1_as_bf16(ctx.pool(id));
+         if (src1->type != GGML_TYPE_BF16) {
+             const to_bf16_cuda_t to_bf16_cuda = ggml_get_to_bf16_cuda(src1->type);
+-- 
+2.43.0