mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-29 19:06:43 -04:00
feat(paged): FP4 prefill large-M dequant->bf16 cuBLAS scaffold (patch 0033, default-off)
Option (a) of PREFILL_GEMM_SCOPE.md: route large-M (prefill) NVFP4 dense weight GEMMs off the decode-tuned FP4-MMQ kernel onto the dequant->bf16 cuBLAS (nvjet) tensor-core path, wired via an M-threshold in ggml_cuda_should_use_mmq. Lands the validated, bit-exact-gated mechanism and records the honest GB10 result: it is a regression, so it ships default-off (== stock), mirroring the patch-0017 default-off discipline. Three-edit scaffold (no new kernel): should_use_mmq routes NVFP4+Blackwell+dense M>LLAMA_FP4_PREFILL_M to cuBLAS; op_mul_mat_cublas gains an NVFP4 branch that dequants the FP4 weights to a transient bf16 pool buffer (not cached - stays FP4-resident) and runs cublasGemmEx CUDA_R_16BF/COMPUTE_32F; ggml_get_to_bf16_cuda gains the NVFP4 case. Bit-exact gate PASS (benign): test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID 806/806; the forced path (LLAMA_FP4_PREFILL_M=64) is green CUDA-vs-CPU at NVFP4 large-M shapes; greedy md5 on q36-27b is byte-identical to FP4-MMQ both for short prefill (5951a5b4, decode untouched) and for a >threshold prefill that exercises the bf16 path (5f3967df - no greedy argmax flips). Performance REGRESSES on GB10 (S_PP, q36-27b dense, A/B via env): M=512 958.99 -> 486.65 (-49%), M=1024 1013.65 -> 587.27 (-42%), M=2048 918.46 -> 649.42 (-29%). The scope premise (FP4-MMQ ~3% of FP4 peak at large M) is false here: FP4-MMQ beats bf16-cuBLAS because bf16 peak is ~half FP4 peak and the per-step weight dequant + 4x bf16 weight traffic (~8x total vs the FP4 read) dominate, only partially amortizing as M grows. Default-off keeps stock S_PP (966.98). Phase 2 (MoE grouped large-M) not implemented: it inherits the same bf16-peak<FP4-peak ceiling plus a per-expert dequant, so grouped bf16-cuBLAS would regress for the same reason; a real prefill GEMM win needs option (b), a native FP4-MMA large-M kernel. Full A/B in docs/PREFILL_GEMM_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
@@ -0,0 +1,76 @@
|
||||
# PREFILL_GEMM_RESULTS - option (a) dequant->bf16 cuBLAS, measured on GB10
|
||||
|
||||
Companion to `PREFILL_GEMM_SCOPE.md`. This records the GPU A/B for the #1
|
||||
prefill lever (route large-M NVFP4 dense GEMMs off FP4-MMQ onto dequant->bf16
|
||||
cuBLAS / nvjet). Shipped as patch `0033`, **default-off** because the measured
|
||||
result is a regression on this hardware.
|
||||
|
||||
Hardware: NVIDIA GB10 (sm_121), CUDA 13.0. Backend pin `9d5d882d`.
|
||||
Models: `q36-27b-nvfp4.gguf` (dense), `q36-35b-a3b-nvfp4.gguf` (MoE).
|
||||
Binary: `build-cuda/bin/llama-batched-bench -fa on -ngl 99`, `LLAMA_KV_PAGED=1`.
|
||||
A/B is a single build toggled by `LLAMA_FP4_PREFILL_M` (0 = MMQ baseline, >0 =
|
||||
route prefill M>threshold to bf16 cuBLAS), so it isolates exactly this lever.
|
||||
|
||||
## 1. Bit-exact / numeric gate (PASS - divergence benign)
|
||||
|
||||
| Gate | Result |
|
||||
|---|---|
|
||||
| `test-backend-ops -o MUL_MAT` (default, threshold off) | 1146/1146 pass |
|
||||
| `test-backend-ops -o MUL_MAT_ID` (default) | 806/806 pass (MoE untouched) |
|
||||
| `test-backend-ops -o MUL_MAT`, path FORCED (`LLAMA_FP4_PREFILL_M=64`) | NVFP4 large-M cases (m=2048/1600/2050, n=128, k=2048) green CUDA-vs-CPU |
|
||||
| greedy md5, short prefill (< threshold), lever vs base | identical: `5951a5b4d624ce891e22ab5fca9bc439` (== documented dense reference; decode byte-untouched) |
|
||||
| greedy md5, long prefill (> threshold, exercises bf16 path), lever vs base | identical: `5f3967df5781445feeb25762abb9eae7` (the new FP path flips no greedy argmax) |
|
||||
|
||||
The new path (NVFP4->bf16 round, bf16 tensor cores, f32 accumulate) is a
|
||||
different FP path from fused FP4xQ8_1 MMQ, but it is precision-neutral-to-better:
|
||||
keeping activations in bf16 instead of Q8_1 is strictly more precise, and the
|
||||
greedy output is byte-identical. This matches the scope's prediction
|
||||
(KLD(dequant-bf16 || f16) <= KLD(FP4-MMQ || f16)).
|
||||
|
||||
## 2. Performance (REGRESSION - the lever loses on GB10)
|
||||
|
||||
S_PP (prefill tokens/s), q36-27b dense, A/B `LLAMA_FP4_PREFILL_M` off vs on:
|
||||
|
||||
| prefill ubatch M | npl | base S_PP (MMQ) | lever S_PP (bf16 cuBLAS) | delta |
|
||||
|---|---|---|---|---|
|
||||
| 512 | 32 | 958.99 | 486.65 | -49% |
|
||||
| 1024 | 8 | 1013.65 | 587.27 | -42% |
|
||||
| 2048 | 8 | 918.46 | 649.42 | -29% |
|
||||
|
||||
Default-off control (no env): S_PP 966.98 == base (within noise) -> the patch is
|
||||
inert by default.
|
||||
|
||||
## 3. Why it loses (the scope premise was wrong for GB10)
|
||||
|
||||
The scope assumed FP4-MMQ is register-bound to ~3% of FP4 peak at large M, so a
|
||||
vendor large-M kernel would win. **Measured, FP4-MMQ at M=512..2048 beats
|
||||
dequant->bf16 cuBLAS by 29-49%.** Two compounding reasons:
|
||||
|
||||
1. **bf16 tensor-core peak is ~half FP4 peak on GB10.** Even a perfect bf16 GEMM
|
||||
caps at ~half the throughput the FP4-MMA path can reach.
|
||||
2. **The dequant tax is an un-amortized memory pass.** Per prefill step the new
|
||||
path reads FP4 weights (~0.5 B/elt), writes bf16 (2 B/elt), then the GEMM
|
||||
reads bf16 (2 B/elt) = ~8x the weight byte traffic of the FP4-MMQ read
|
||||
(~0.5 B/elt). The dequant write is M-independent, so it only amortizes as M
|
||||
grows: the gap shrinks 49% -> 42% -> 29% from M=512 -> 2048 but never crosses
|
||||
even at M=2048 (above the default n_ubatch).
|
||||
|
||||
This is also consistent with the README decode finding that the dense path was
|
||||
already ~96-97% of vLLM - the dense GEMM was never the bottleneck the way the
|
||||
prefill ground-truth (measured on the MoE decision model) implied.
|
||||
|
||||
## 4. Status of the phases
|
||||
|
||||
- **Phase 1 (dense): REJECTED on GB10**, landed default-off as a validated,
|
||||
env-gated scaffold (mechanism + bit-exact gate reusable by option (b) and by
|
||||
non-GB10 hardware where bf16 may fare differently).
|
||||
- **Phase 2 (MoE grouped large-M): NOT implemented.** It inherits the same
|
||||
bf16-peak < FP4-peak ceiling plus a per-expert dequant, so a grouped
|
||||
bf16-cuBLAS would regress for the same reason; the MoE id-path also has the
|
||||
graph-safety catch (a false `should_use_mmq` falls to the host-sync sorted
|
||||
loop, not CUDA-graph-safe). Not worth the multi-day grouped-cuBLAS + graph
|
||||
work on a path the dense A/B already shows loses.
|
||||
- **The only route to a real prefill GEMM win is option (b)** - a native
|
||||
Blackwell FP4-MMA large-M kernel (multi-week), to greenlight only if the
|
||||
prefill regime is funded. The committed scaffold gives option (b) its
|
||||
M-threshold routing and its bit-exact gate for free.
|
||||
@@ -0,0 +1,174 @@
|
||||
From 0033003300330033003300330033003300330033 Mon Sep 17 00:00:00 2001
|
||||
From: Ettore Di Giacinto <mudler@localai.io>
|
||||
Date: Sun, 28 Jun 2026 19:35:00 +0200
|
||||
Subject: [PATCH] feat(paged): FP4 prefill large-M dequant->bf16 cuBLAS scaffold
|
||||
(default-off, rejected on GB10) (patch 0033)
|
||||
|
||||
Option (a) of docs/PREFILL_GEMM_SCOPE.md: route large-M (prefill) NVFP4 dense
|
||||
weight GEMMs OFF the decode-tuned FP4-MMQ kernel and through the dequant->bf16
|
||||
cuBLAS (nvjet) tensor-core path. This lands the validated, bit-exact-gated
|
||||
mechanism and records the honest result: on GB10 (sm_121) the lever is a
|
||||
REGRESSION, so it is kept default-OFF (byte-identical to stock), mirroring the
|
||||
patch-0017 default-off discipline.
|
||||
|
||||
Mechanism (all three edits are the integration scaffold, no new kernel):
|
||||
- ggml/src/ggml-cuda/mmq.cu (ggml_cuda_should_use_mmq): NVFP4 + Blackwell +
|
||||
dense (n_experts==0) + M > LLAMA_FP4_PREFILL_M returns false, so the dense
|
||||
dispatch falls through to ggml_cuda_op_mul_mat_cublas. -D / env
|
||||
LLAMA_FP4_PREFILL_M tunable; default 0 == disabled == stock. Decode and
|
||||
small batches (M <= threshold) stay on FP4-MMQ.
|
||||
- ggml/src/ggml-cuda/ggml-cuda.cu (ggml_cuda_op_mul_mat_cublas): new NVFP4
|
||||
branch dequants the FP4 weights to a TRANSIENT bf16 pool buffer (not cached,
|
||||
so the model stays FP4-resident) and runs cublasGemmEx CUDA_R_16BF /
|
||||
COMPUTE_32F (tensor cores) instead of the f32 cublasSgemm fallback (no
|
||||
tensor cores) that NVFP4 would otherwise hit.
|
||||
- ggml/src/ggml-cuda/convert.cu (ggml_get_to_bf16_cuda): add the NVFP4 case
|
||||
(the dequant kernel is dst-type generic; bf16 preserves the model's native
|
||||
activation range vs f16). nullptr-by-default for other types is unchanged.
|
||||
|
||||
Bit-exact / numeric gate (PASS, divergence benign):
|
||||
- test-backend-ops MUL_MAT 1146/1146, MUL_MAT_ID 806/806 at default; and with
|
||||
the path FORCED (LLAMA_FP4_PREFILL_M=64) the NVFP4 large-M cases are green
|
||||
CUDA-vs-CPU (the bf16 path is numerically within the project tolerance).
|
||||
- greedy md5 (q36-27b dense, "The capital of France is", -n 48, temp 0):
|
||||
lever == base == 5951a5b4d624ce891e22ab5fca9bc439 (the documented dense
|
||||
reference) for short prefill (decode byte-untouched), AND identical for a
|
||||
>threshold prefill that exercises the new bf16 path (5f3967df...): the new
|
||||
FP path does not flip a single greedy argmax. As predicted by the scope,
|
||||
bf16 activations are strictly more precise than the FP4-MMQ Q8_1 path, so
|
||||
this is precision-neutral-to-better, not a regression.
|
||||
|
||||
Honest performance result (S_PP t/s, q36-27b dense, llama-batched-bench
|
||||
-fa on -ngl 99, A/B via env), see docs/PREFILL_GEMM_RESULTS.md:
|
||||
-npp 512 -npl 32 : base(MMQ) 958.99 -> lever 486.65 (-49%)
|
||||
-npp 1024 -npl 8 : base(MMQ)1013.65 -> lever 587.27 (-42%)
|
||||
-npp 2048 -npl 8 : base(MMQ) 918.46 -> lever 649.42 (-29%)
|
||||
The scope premise (FP4-MMQ ~3% of FP4 peak at large M) is FALSE on GB10:
|
||||
FP4-MMQ at M=512..2048 beats dequant->bf16 cuBLAS, because bf16 tensor-core peak
|
||||
is ~half FP4 peak AND the per-step weight dequant + 4x bf16 weight traffic
|
||||
(~8x total vs the FP4 read) dominate, only partially amortizing as M grows
|
||||
(gap shrinks 49%->29%, never crosses). Default-off keeps stock S_PP (966.98,
|
||||
within noise of base).
|
||||
|
||||
Phase 2 (MoE grouped large-M) is NOT implemented: it inherits the same
|
||||
bf16-peak < FP4-peak ceiling plus a per-expert dequant, so grouped bf16-cuBLAS
|
||||
would regress for the same reason. The only route to a real prefill GEMM win is
|
||||
option (b) - a native FP4-MMA large-M kernel (multi-week). This patch is the
|
||||
validated, env-gated scaffold that option (b) / non-GB10 hardware can reuse for
|
||||
the M-threshold routing + bit-exact gate.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
---
|
||||
diff --git a/ggml/src/ggml-cuda/convert.cu b/ggml/src/ggml-cuda/convert.cu
|
||||
index 61630a3..f0273c1 100644
|
||||
--- a/ggml/src/ggml-cuda/convert.cu
|
||||
+++ b/ggml/src/ggml-cuda/convert.cu
|
||||
@@ -704,6 +704,15 @@ to_bf16_cuda_t ggml_get_to_bf16_cuda(ggml_type type) {
|
||||
return convert_unary_cont_cuda<float>;
|
||||
case GGML_TYPE_F16:
|
||||
return convert_unary_cont_cuda<half>;
|
||||
+ // Paged prefill lever (patch 0033): NVFP4 -> bf16 dequant for the large-M
|
||||
+ // dequant->bf16 cuBLAS (nvjet) prefill GEMM path in
|
||||
+ // ggml_cuda_op_mul_mat_cublas. The dequant kernel is dst-type generic, so
|
||||
+ // this instantiates the bf16 variant; bf16 (not f16) preserves the model's
|
||||
+ // native bf16 activation range and avoids f16 overflow on large prefill
|
||||
+ // activations. Only the new prefill path consumes this; nullptr-by-default
|
||||
+ // for all other types is unchanged.
|
||||
+ case GGML_TYPE_NVFP4:
|
||||
+ return dequantize_row_nvfp4_cuda;
|
||||
default:
|
||||
return nullptr;
|
||||
}
|
||||
diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu
|
||||
index 9933fa6..2dcaaab 100644
|
||||
--- a/ggml/src/ggml-cuda/mmq.cu
|
||||
+++ b/ggml/src/ggml-cuda/mmq.cu
|
||||
@@ -321,6 +321,33 @@ bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11, int64_t
|
||||
return false;
|
||||
}
|
||||
|
||||
+ // Paged prefill lever (patch 0033): OPTION-(a) route large-M NVFP4 dense GEMMs
|
||||
+ // OFF the FP4-MMQ kernel and through the dequant->bf16 cuBLAS (nvjet)
|
||||
+ // tensor-core path (ggml_cuda_op_mul_mat_cublas, NVFP4 bf16 branch). The
|
||||
+ // scope premise was that FP4-MMQ is register-bound to ~3% of FP4 peak at
|
||||
+ // large M. MEASURED ON GB10 THIS IS FALSE: FP4-MMQ at M=512..2048 beats
|
||||
+ // dequant->bf16 cuBLAS by 29-49% (S_PP A/B in docs/PREFILL_GEMM_RESULTS.md),
|
||||
+ // because bf16 tensor-core peak is ~half FP4 peak AND the per-step weight
|
||||
+ // dequant + 4x bf16 weight traffic (~8x total vs the FP4 read) dominate and
|
||||
+ // only partially amortize as M grows. The path is NUMERICALLY VALID and
|
||||
+ // benign (greedy md5 byte-identical to FP4-MMQ; test-backend-ops passes), so
|
||||
+ // it is kept as a validated, env-gated scaffold (for option-(b) native FP4
|
||||
+ // large-M kernels and non-GB10 hardware), but DEFAULT-DISABLED (== stock).
|
||||
+ // Set -D LLAMA_FP4_PREFILL_M=<M> or env LLAMA_FP4_PREFILL_M=<M> to A/B it;
|
||||
+ // 0 (default) disables. Dense only (n_experts == 0).
|
||||
+#ifndef LLAMA_FP4_PREFILL_M
|
||||
+#define LLAMA_FP4_PREFILL_M 0
|
||||
+#endif // LLAMA_FP4_PREFILL_M
|
||||
+ if (type == GGML_TYPE_NVFP4 && n_experts == 0 && blackwell_mma_available(cc)) {
|
||||
+ static const int64_t fp4_prefill_m = [] {
|
||||
+ const char * e = getenv("LLAMA_FP4_PREFILL_M");
|
||||
+ return e != nullptr ? (int64_t) atoll(e) : (int64_t) LLAMA_FP4_PREFILL_M;
|
||||
+ }();
|
||||
+ if (fp4_prefill_m > 0 && ne11 > fp4_prefill_m) {
|
||||
+ return false;
|
||||
+ }
|
||||
+ }
|
||||
+
|
||||
if (turing_mma_available(cc)) {
|
||||
return true;
|
||||
}
|
||||
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
|
||||
index 0dad6e1..6476d46 100644
|
||||
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
|
||||
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
|
||||
@@ -1660,7 +1660,47 @@ static void ggml_cuda_op_mul_mat_cublas(
|
||||
row_diff == src0->ne[1] &&
|
||||
dst->op_params[0] == GGML_PREC_DEFAULT;
|
||||
|
||||
- if (supports_bf16 && src0->type == GGML_TYPE_BF16 && ggml_is_contiguous(src0) && row_diff == src0->ne[1]) {
|
||||
+ if (supports_bf16 && src0->type == GGML_TYPE_NVFP4 && ggml_is_contiguous(src0) && row_diff == src0->ne[1]) {
|
||||
+ // Paged prefill lever (patch 0033): NVFP4 only reaches cuBLAS when
|
||||
+ // ggml_cuda_should_use_mmq() returned false (large-M dense prefill).
|
||||
+ // Dequant the FP4 weights to a TRANSIENT bf16 pool buffer and run a
|
||||
+ // tensor-core bf16 GEMM (nvjet) instead of the f32 cublasSgemm fallback
|
||||
+ // (no tensor cores) that the final else-branch would otherwise use. The
|
||||
+ // weights are NOT cached as bf16 (pool scratch, freed at step end) so the
|
||||
+ // model stays FP4-resident and the backend keeps its memory advantage.
|
||||
+ ggml_cuda_pool_alloc<nv_bfloat16> src0_as_bf16(ctx.pool(id), row_diff*ne00);
|
||||
+ const to_bf16_cuda_t to_bf16_cuda_src0 = ggml_get_to_bf16_cuda(GGML_TYPE_NVFP4);
|
||||
+ GGML_ASSERT(to_bf16_cuda_src0 != nullptr);
|
||||
+ to_bf16_cuda_src0(src0_dd_i, src0_as_bf16.get(), row_diff*ne00, stream);
|
||||
+
|
||||
+ ggml_cuda_pool_alloc<nv_bfloat16> src1_as_bf16(ctx.pool(id));
|
||||
+ if (src1->type != GGML_TYPE_BF16) {
|
||||
+ const to_bf16_cuda_t to_bf16_cuda = ggml_get_to_bf16_cuda(src1->type);
|
||||
+ GGML_ASSERT(to_bf16_cuda != nullptr);
|
||||
+ size_t ne = src1_ncols*ne10;
|
||||
+ src1_as_bf16.alloc(ne);
|
||||
+ to_bf16_cuda(src1_ddf_i, src1_as_bf16.get(), ne, stream);
|
||||
+ }
|
||||
+ const nv_bfloat16 * src1_ptr = src1->type == GGML_TYPE_BF16 ? (const nv_bfloat16 *) src1_ddf_i : src1_as_bf16.get();
|
||||
+ const nv_bfloat16 * src0_ptr = src0_as_bf16.get();
|
||||
+ ggml_cuda_pool_alloc<nv_bfloat16> dst_bf16(ctx.pool(id), row_diff*src1_ncols);
|
||||
+
|
||||
+ const float alpha_f32 = 1.0f;
|
||||
+ const float beta_f32 = 0.0f;
|
||||
+
|
||||
+ CUBLAS_CHECK(cublasSetStream(ctx.cublas_handle(id), stream));
|
||||
+ CUBLAS_CHECK(
|
||||
+ cublasGemmEx(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N,
|
||||
+ row_diff, src1_ncols, ne10,
|
||||
+ &alpha_f32, src0_ptr, CUDA_R_16BF, ne00,
|
||||
+ src1_ptr, CUDA_R_16BF, ne10,
|
||||
+ &beta_f32, dst_bf16.get(), CUDA_R_16BF, ldc,
|
||||
+ CUBLAS_COMPUTE_32F,
|
||||
+ CUBLAS_GEMM_DEFAULT_TENSOR_OP));
|
||||
+
|
||||
+ const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(GGML_TYPE_BF16);
|
||||
+ to_fp32_cuda(dst_bf16.get(), dst_dd_i, row_diff*src1_ncols, stream);
|
||||
+ } else if (supports_bf16 && src0->type == GGML_TYPE_BF16 && ggml_is_contiguous(src0) && row_diff == src0->ne[1]) {
|
||||
ggml_cuda_pool_alloc<nv_bfloat16> src1_as_bf16(ctx.pool(id));
|
||||
if (src1->type != GGML_TYPE_BF16) {
|
||||
const to_bf16_cuda_t to_bf16_cuda = ggml_get_to_bf16_cuda(src1->type);
|
||||
--
|
||||
2.43.0
|
||||
Reference in New Issue
Block a user