diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PREFILL_GEMM_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/PREFILL_GEMM_RESULTS.md new file mode 100644 index 000000000..8dfa0ead3 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/PREFILL_GEMM_RESULTS.md @@ -0,0 +1,76 @@ +# PREFILL_GEMM_RESULTS - option (a) dequant->bf16 cuBLAS, measured on GB10 + +Companion to `PREFILL_GEMM_SCOPE.md`. This records the GPU A/B for the #1 +prefill lever (route large-M NVFP4 dense GEMMs off FP4-MMQ onto dequant->bf16 +cuBLAS / nvjet). Shipped as patch `0033`, **default-off** because the measured +result is a regression on this hardware. + +Hardware: NVIDIA GB10 (sm_121), CUDA 13.0. Backend pin `9d5d882d`. +Models: `q36-27b-nvfp4.gguf` (dense), `q36-35b-a3b-nvfp4.gguf` (MoE). +Binary: `build-cuda/bin/llama-batched-bench -fa on -ngl 99`, `LLAMA_KV_PAGED=1`. +A/B is a single build toggled by `LLAMA_FP4_PREFILL_M` (0 = MMQ baseline, >0 = +route prefill M>threshold to bf16 cuBLAS), so it isolates exactly this lever. + +## 1. Bit-exact / numeric gate (PASS - divergence benign) + +| Gate | Result | +|---|---| +| `test-backend-ops -o MUL_MAT` (default, threshold off) | 1146/1146 pass | +| `test-backend-ops -o MUL_MAT_ID` (default) | 806/806 pass (MoE untouched) | +| `test-backend-ops -o MUL_MAT`, path FORCED (`LLAMA_FP4_PREFILL_M=64`) | NVFP4 large-M cases (m=2048/1600/2050, n=128, k=2048) green CUDA-vs-CPU | +| greedy md5, short prefill (< threshold), lever vs base | identical: `5951a5b4d624ce891e22ab5fca9bc439` (== documented dense reference; decode byte-untouched) | +| greedy md5, long prefill (> threshold, exercises bf16 path), lever vs base | identical: `5f3967df5781445feeb25762abb9eae7` (the new FP path flips no greedy argmax) | + +The new path (NVFP4->bf16 round, bf16 tensor cores, f32 accumulate) is a +different FP path from fused FP4xQ8_1 MMQ, but it is precision-neutral-to-better: +keeping activations in bf16 instead of Q8_1 is strictly more precise, and the +greedy output is byte-identical. This matches the scope's prediction +(KLD(dequant-bf16 || f16) <= KLD(FP4-MMQ || f16)). + +## 2. Performance (REGRESSION - the lever loses on GB10) + +S_PP (prefill tokens/s), q36-27b dense, A/B `LLAMA_FP4_PREFILL_M` off vs on: + +| prefill ubatch M | npl | base S_PP (MMQ) | lever S_PP (bf16 cuBLAS) | delta | +|---|---|---|---|---| +| 512 | 32 | 958.99 | 486.65 | -49% | +| 1024 | 8 | 1013.65 | 587.27 | -42% | +| 2048 | 8 | 918.46 | 649.42 | -29% | + +Default-off control (no env): S_PP 966.98 == base (within noise) -> the patch is +inert by default. + +## 3. Why it loses (the scope premise was wrong for GB10) + +The scope assumed FP4-MMQ is register-bound to ~3% of FP4 peak at large M, so a +vendor large-M kernel would win. **Measured, FP4-MMQ at M=512..2048 beats +dequant->bf16 cuBLAS by 29-49%.** Two compounding reasons: + +1. **bf16 tensor-core peak is ~half FP4 peak on GB10.** Even a perfect bf16 GEMM + caps at ~half the throughput the FP4-MMA path can reach. +2. **The dequant tax is an un-amortized memory pass.** Per prefill step the new + path reads FP4 weights (~0.5 B/elt), writes bf16 (2 B/elt), then the GEMM + reads bf16 (2 B/elt) = ~8x the weight byte traffic of the FP4-MMQ read + (~0.5 B/elt). The dequant write is M-independent, so it only amortizes as M + grows: the gap shrinks 49% -> 42% -> 29% from M=512 -> 2048 but never crosses + even at M=2048 (above the default n_ubatch). + +This is also consistent with the README decode finding that the dense path was +already ~96-97% of vLLM - the dense GEMM was never the bottleneck the way the +prefill ground-truth (measured on the MoE decision model) implied. + +## 4. Status of the phases + +- **Phase 1 (dense): REJECTED on GB10**, landed default-off as a validated, + env-gated scaffold (mechanism + bit-exact gate reusable by option (b) and by + non-GB10 hardware where bf16 may fare differently). +- **Phase 2 (MoE grouped large-M): NOT implemented.** It inherits the same + bf16-peak < FP4-peak ceiling plus a per-expert dequant, so a grouped + bf16-cuBLAS would regress for the same reason; the MoE id-path also has the + graph-safety catch (a false `should_use_mmq` falls to the host-sync sorted + loop, not CUDA-graph-safe). Not worth the multi-day grouped-cuBLAS + graph + work on a path the dense A/B already shows loses. +- **The only route to a real prefill GEMM win is option (b)** - a native + Blackwell FP4-MMA large-M kernel (multi-week), to greenlight only if the + prefill regime is funded. The committed scaffold gives option (b) its + M-threshold routing and its bit-exact gate for free. diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0033-fp4-prefill-large-m-bf16-cublas-scaffold.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0033-fp4-prefill-large-m-bf16-cublas-scaffold.patch new file mode 100644 index 000000000..30060d30a --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0033-fp4-prefill-large-m-bf16-cublas-scaffold.patch @@ -0,0 +1,174 @@ +From 0033003300330033003300330033003300330033 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Sun, 28 Jun 2026 19:35:00 +0200 +Subject: [PATCH] feat(paged): FP4 prefill large-M dequant->bf16 cuBLAS scaffold + (default-off, rejected on GB10) (patch 0033) + +Option (a) of docs/PREFILL_GEMM_SCOPE.md: route large-M (prefill) NVFP4 dense +weight GEMMs OFF the decode-tuned FP4-MMQ kernel and through the dequant->bf16 +cuBLAS (nvjet) tensor-core path. This lands the validated, bit-exact-gated +mechanism and records the honest result: on GB10 (sm_121) the lever is a +REGRESSION, so it is kept default-OFF (byte-identical to stock), mirroring the +patch-0017 default-off discipline. + +Mechanism (all three edits are the integration scaffold, no new kernel): + - ggml/src/ggml-cuda/mmq.cu (ggml_cuda_should_use_mmq): NVFP4 + Blackwell + + dense (n_experts==0) + M > LLAMA_FP4_PREFILL_M returns false, so the dense + dispatch falls through to ggml_cuda_op_mul_mat_cublas. -D / env + LLAMA_FP4_PREFILL_M tunable; default 0 == disabled == stock. Decode and + small batches (M <= threshold) stay on FP4-MMQ. + - ggml/src/ggml-cuda/ggml-cuda.cu (ggml_cuda_op_mul_mat_cublas): new NVFP4 + branch dequants the FP4 weights to a TRANSIENT bf16 pool buffer (not cached, + so the model stays FP4-resident) and runs cublasGemmEx CUDA_R_16BF / + COMPUTE_32F (tensor cores) instead of the f32 cublasSgemm fallback (no + tensor cores) that NVFP4 would otherwise hit. + - ggml/src/ggml-cuda/convert.cu (ggml_get_to_bf16_cuda): add the NVFP4 case + (the dequant kernel is dst-type generic; bf16 preserves the model's native + activation range vs f16). nullptr-by-default for other types is unchanged. + +Bit-exact / numeric gate (PASS, divergence benign): + - test-backend-ops MUL_MAT 1146/1146, MUL_MAT_ID 806/806 at default; and with + the path FORCED (LLAMA_FP4_PREFILL_M=64) the NVFP4 large-M cases are green + CUDA-vs-CPU (the bf16 path is numerically within the project tolerance). + - greedy md5 (q36-27b dense, "The capital of France is", -n 48, temp 0): + lever == base == 5951a5b4d624ce891e22ab5fca9bc439 (the documented dense + reference) for short prefill (decode byte-untouched), AND identical for a + >threshold prefill that exercises the new bf16 path (5f3967df...): the new + FP path does not flip a single greedy argmax. As predicted by the scope, + bf16 activations are strictly more precise than the FP4-MMQ Q8_1 path, so + this is precision-neutral-to-better, not a regression. + +Honest performance result (S_PP t/s, q36-27b dense, llama-batched-bench +-fa on -ngl 99, A/B via env), see docs/PREFILL_GEMM_RESULTS.md: + -npp 512 -npl 32 : base(MMQ) 958.99 -> lever 486.65 (-49%) + -npp 1024 -npl 8 : base(MMQ)1013.65 -> lever 587.27 (-42%) + -npp 2048 -npl 8 : base(MMQ) 918.46 -> lever 649.42 (-29%) +The scope premise (FP4-MMQ ~3% of FP4 peak at large M) is FALSE on GB10: +FP4-MMQ at M=512..2048 beats dequant->bf16 cuBLAS, because bf16 tensor-core peak +is ~half FP4 peak AND the per-step weight dequant + 4x bf16 weight traffic +(~8x total vs the FP4 read) dominate, only partially amortizing as M grows +(gap shrinks 49%->29%, never crosses). Default-off keeps stock S_PP (966.98, +within noise of base). + +Phase 2 (MoE grouped large-M) is NOT implemented: it inherits the same +bf16-peak < FP4-peak ceiling plus a per-expert dequant, so grouped bf16-cuBLAS +would regress for the same reason. The only route to a real prefill GEMM win is +option (b) - a native FP4-MMA large-M kernel (multi-week). This patch is the +validated, env-gated scaffold that option (b) / non-GB10 hardware can reuse for +the M-threshold routing + bit-exact gate. + +Assisted-by: Claude:opus-4.8 [Claude Code] +--- +diff --git a/ggml/src/ggml-cuda/convert.cu b/ggml/src/ggml-cuda/convert.cu +index 61630a3..f0273c1 100644 +--- a/ggml/src/ggml-cuda/convert.cu ++++ b/ggml/src/ggml-cuda/convert.cu +@@ -704,6 +704,15 @@ to_bf16_cuda_t ggml_get_to_bf16_cuda(ggml_type type) { + return convert_unary_cont_cuda; + case GGML_TYPE_F16: + return convert_unary_cont_cuda; ++ // Paged prefill lever (patch 0033): NVFP4 -> bf16 dequant for the large-M ++ // dequant->bf16 cuBLAS (nvjet) prefill GEMM path in ++ // ggml_cuda_op_mul_mat_cublas. The dequant kernel is dst-type generic, so ++ // this instantiates the bf16 variant; bf16 (not f16) preserves the model's ++ // native bf16 activation range and avoids f16 overflow on large prefill ++ // activations. Only the new prefill path consumes this; nullptr-by-default ++ // for all other types is unchanged. ++ case GGML_TYPE_NVFP4: ++ return dequantize_row_nvfp4_cuda; + default: + return nullptr; + } +diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu +index 9933fa6..2dcaaab 100644 +--- a/ggml/src/ggml-cuda/mmq.cu ++++ b/ggml/src/ggml-cuda/mmq.cu +@@ -321,6 +321,33 @@ bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11, int64_t + return false; + } + ++ // Paged prefill lever (patch 0033): OPTION-(a) route large-M NVFP4 dense GEMMs ++ // OFF the FP4-MMQ kernel and through the dequant->bf16 cuBLAS (nvjet) ++ // tensor-core path (ggml_cuda_op_mul_mat_cublas, NVFP4 bf16 branch). The ++ // scope premise was that FP4-MMQ is register-bound to ~3% of FP4 peak at ++ // large M. MEASURED ON GB10 THIS IS FALSE: FP4-MMQ at M=512..2048 beats ++ // dequant->bf16 cuBLAS by 29-49% (S_PP A/B in docs/PREFILL_GEMM_RESULTS.md), ++ // because bf16 tensor-core peak is ~half FP4 peak AND the per-step weight ++ // dequant + 4x bf16 weight traffic (~8x total vs the FP4 read) dominate and ++ // only partially amortize as M grows. The path is NUMERICALLY VALID and ++ // benign (greedy md5 byte-identical to FP4-MMQ; test-backend-ops passes), so ++ // it is kept as a validated, env-gated scaffold (for option-(b) native FP4 ++ // large-M kernels and non-GB10 hardware), but DEFAULT-DISABLED (== stock). ++ // Set -D LLAMA_FP4_PREFILL_M= or env LLAMA_FP4_PREFILL_M= to A/B it; ++ // 0 (default) disables. Dense only (n_experts == 0). ++#ifndef LLAMA_FP4_PREFILL_M ++#define LLAMA_FP4_PREFILL_M 0 ++#endif // LLAMA_FP4_PREFILL_M ++ if (type == GGML_TYPE_NVFP4 && n_experts == 0 && blackwell_mma_available(cc)) { ++ static const int64_t fp4_prefill_m = [] { ++ const char * e = getenv("LLAMA_FP4_PREFILL_M"); ++ return e != nullptr ? (int64_t) atoll(e) : (int64_t) LLAMA_FP4_PREFILL_M; ++ }(); ++ if (fp4_prefill_m > 0 && ne11 > fp4_prefill_m) { ++ return false; ++ } ++ } ++ + if (turing_mma_available(cc)) { + return true; + } +diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu +index 0dad6e1..6476d46 100644 +--- a/ggml/src/ggml-cuda/ggml-cuda.cu ++++ b/ggml/src/ggml-cuda/ggml-cuda.cu +@@ -1660,7 +1660,47 @@ static void ggml_cuda_op_mul_mat_cublas( + row_diff == src0->ne[1] && + dst->op_params[0] == GGML_PREC_DEFAULT; + +- if (supports_bf16 && src0->type == GGML_TYPE_BF16 && ggml_is_contiguous(src0) && row_diff == src0->ne[1]) { ++ if (supports_bf16 && src0->type == GGML_TYPE_NVFP4 && ggml_is_contiguous(src0) && row_diff == src0->ne[1]) { ++ // Paged prefill lever (patch 0033): NVFP4 only reaches cuBLAS when ++ // ggml_cuda_should_use_mmq() returned false (large-M dense prefill). ++ // Dequant the FP4 weights to a TRANSIENT bf16 pool buffer and run a ++ // tensor-core bf16 GEMM (nvjet) instead of the f32 cublasSgemm fallback ++ // (no tensor cores) that the final else-branch would otherwise use. The ++ // weights are NOT cached as bf16 (pool scratch, freed at step end) so the ++ // model stays FP4-resident and the backend keeps its memory advantage. ++ ggml_cuda_pool_alloc src0_as_bf16(ctx.pool(id), row_diff*ne00); ++ const to_bf16_cuda_t to_bf16_cuda_src0 = ggml_get_to_bf16_cuda(GGML_TYPE_NVFP4); ++ GGML_ASSERT(to_bf16_cuda_src0 != nullptr); ++ to_bf16_cuda_src0(src0_dd_i, src0_as_bf16.get(), row_diff*ne00, stream); ++ ++ ggml_cuda_pool_alloc src1_as_bf16(ctx.pool(id)); ++ if (src1->type != GGML_TYPE_BF16) { ++ const to_bf16_cuda_t to_bf16_cuda = ggml_get_to_bf16_cuda(src1->type); ++ GGML_ASSERT(to_bf16_cuda != nullptr); ++ size_t ne = src1_ncols*ne10; ++ src1_as_bf16.alloc(ne); ++ to_bf16_cuda(src1_ddf_i, src1_as_bf16.get(), ne, stream); ++ } ++ const nv_bfloat16 * src1_ptr = src1->type == GGML_TYPE_BF16 ? (const nv_bfloat16 *) src1_ddf_i : src1_as_bf16.get(); ++ const nv_bfloat16 * src0_ptr = src0_as_bf16.get(); ++ ggml_cuda_pool_alloc dst_bf16(ctx.pool(id), row_diff*src1_ncols); ++ ++ const float alpha_f32 = 1.0f; ++ const float beta_f32 = 0.0f; ++ ++ CUBLAS_CHECK(cublasSetStream(ctx.cublas_handle(id), stream)); ++ CUBLAS_CHECK( ++ cublasGemmEx(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, ++ row_diff, src1_ncols, ne10, ++ &alpha_f32, src0_ptr, CUDA_R_16BF, ne00, ++ src1_ptr, CUDA_R_16BF, ne10, ++ &beta_f32, dst_bf16.get(), CUDA_R_16BF, ldc, ++ CUBLAS_COMPUTE_32F, ++ CUBLAS_GEMM_DEFAULT_TENSOR_OP)); ++ ++ const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(GGML_TYPE_BF16); ++ to_fp32_cuda(dst_bf16.get(), dst_dd_i, row_diff*src1_ncols, stream); ++ } else if (supports_bf16 && src0->type == GGML_TYPE_BF16 && ggml_is_contiguous(src0) && row_diff == src0->ne[1]) { + ggml_cuda_pool_alloc src1_as_bf16(ctx.pool(id)); + if (src1->type != GGML_TYPE_BF16) { + const to_bf16_cuda_t to_bf16_cuda = ggml_get_to_bf16_cuda(src1->type); +-- +2.43.0