mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-24 16:49:06 -04:00
docs(paged): mirror FP4 decode-GEMM track-B P0 gate + P1 kill-gate results (patch 0017)
Mirror of llama.cpp dev-tree commit 089f78d. Track B P0 (bit-exact NVFP4 dense decode-shape MUL_MAT parity gate) + P1 (default-off occupancy levers) for the GB10 dense FP4 weight GEMM. P1 kill-gate TRIPPED: the cheap host/occupancy levers do not lift decode_agg on GB10 (sm_121). DENSE q36-27b-nvfp4 @npl128 149.5 -> minblocks2 147.9 (-1.1%) -> dense mmq_x=64 144.3 (-3.5%); MoE q36-35b-a3b mmq_x-down regresses (TILE16 -3.7%, TILE8 -5.9%, reproduces patch 0015). nsys: the FP4 GEMM mul_mat_q<NVFP4,128,0> went 2.782s->3.025s (+8.7% slower) under register-capping (spilling). The dense M=128 tile is already weight-read/one-read-optimal; the only untested lever is the structural mmq_y-down (nwarps=4 warp-remap, blocked by nwarps*tile_C::I==mmq_y), deferred to P2. All levers default-off => default build byte-identical to stock. See THROUGHPUT_B_P1_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
@@ -0,0 +1,245 @@
|
||||
From 089f78d2a2c04465a566d499dbe0a67c008435a8 Mon Sep 17 00:00:00 2001
|
||||
From: Ettore Di Giacinto <mudler@localai.io>
|
||||
Date: Wed, 24 Jun 2026 19:56:05 +0200
|
||||
Subject: [PATCH] feat(paged): FP4 decode GEMM track-B P0 gate + default-off
|
||||
occupancy instrumentation (patch 0017)
|
||||
|
||||
Track B targets the dense NVFP4 weight GEMM (~59% of the GB10 decode step). This lands the P0
|
||||
bit-exact parity gate and the P1 occupancy levers (default-off / byte-identical) and records the
|
||||
honest P1 result: the cheap host/occupancy tuning does NOT lift decode_agg on GB10 (sm_121) - the
|
||||
kill-gate tripped - so nothing is enabled by default.
|
||||
|
||||
P0 gate (tests/test-backend-ops.cpp): NVFP4/MXFP4 dense decode-shape MUL_MAT cases at the weight-
|
||||
row tiling boundary (m in {2048,1600,2050} = exact + ragged vs mmq_y 64/128, n in {32,128} = decode
|
||||
M, k=2048), so the bit-exact CPU-vs-CUDA oracle covers the mmq_y / min-blocks paths. Green at
|
||||
default and with every lever on: MUL_MAT 1115/1115, MUL_MAT_ID 805/805, NVFP4 0 fail.
|
||||
|
||||
P1 levers (ggml/src/ggml-cuda/mmq.cuh), all default-off => default build byte-identical to stock:
|
||||
- GGML_CUDA_FP4_MMQ_Y (default 128): type-aware get_mmq_y_host/device plumbing for an NVFP4
|
||||
weight-row tile override. mmq_y is rigidly nwarps*tile_C::I (=8*16=128, the mmq.cuh static_
|
||||
assert), so mmq_y<128 also needs nwarps-down (a warp-remap through the shared vec_dot/loader),
|
||||
left as the P2 kernel change; the host/device plumbing is in place and inert.
|
||||
- GGML_CUDA_FP4_MINBLOCKS (default 1): NVFP4-only __launch_bounds__ min-resident-CTAs lever
|
||||
(register-cap the FP4-MMA kernel so >1 CTA co-resides) - the bounded occupancy probe.
|
||||
- GGML_CUDA_FP4_DENSE_MMQ_X (env, default off): dense col-tile re-read occupancy diagnostic.
|
||||
|
||||
Measured GB10 (llama-batched-bench -fa on -npp 128 -ntg 128 -npl 32,128), decode_agg (S_TG):
|
||||
DENSE q36-27b-nvfp4 @npl128: P0 149.5 -> MINBLOCKS=2 147.9 (-1.1%) -> DENSE_MMQ_X=64 144.3
|
||||
(-3.5%) -> =32 141.7 (-5.2%). Every occupancy probe regresses.
|
||||
MoE q36-35b-a3b-nvfp4 @npl128: stock 336.3, MINBLOCKS=2 337.7 (+0.4%, noise), TILE16 324.0
|
||||
(-3.7%), TILE8 316.6 (-5.9%). mmq_x-down regresses (reproduces patch 0015; GDN/BW-bound).
|
||||
|
||||
nsys (kill-gate evidence): the decode FP4 GEMM mul_mat_q<NVFP4,128,0> went 2.782s -> 3.025s
|
||||
(avg 608us -> 661us, +8.7% slower) under MINBLOCKS=2 - register-capping spills, so occupancy did
|
||||
not usefully rise. Verdict: the dense M=128 tile is already weight-read/one-read-optimal at
|
||||
mmq_x=128, NOT occupancy-starved via the cheap levers; the only untested lever is the structural
|
||||
mmq_y-down (nwarps=4 warp-remap), deferred to P2. Bit-exact gate holds throughout.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||||
---
|
||||
ggml/src/ggml-cuda/mmq.cuh | 85 ++++++++++++++++++++++++++++++++++----
|
||||
tests/test-backend-ops.cpp | 16 +++++++
|
||||
2 files changed, 92 insertions(+), 9 deletions(-)
|
||||
|
||||
diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
|
||||
index 9718b12..b53e38a 100644
|
||||
--- a/ggml/src/ggml-cuda/mmq.cuh
|
||||
+++ b/ggml/src/ggml-cuda/mmq.cuh
|
||||
@@ -140,7 +140,24 @@ static constexpr __device__ int get_mmq_x_max_device() {
|
||||
#endif // defined(AMD_MFMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE)
|
||||
}
|
||||
|
||||
-static int get_mmq_y_host(const int cc) {
|
||||
+// [paged patch 0017 / track B] Dense NVFP4 decode mmq_y (weight-row tile) override.
|
||||
+// mmq_y tiles the N (weight-row) dimension of the FP4-MMA weight GEMM. Lowering it raises the
|
||||
+// number of resident CTAs (smaller per-CTA shared footprint + smaller per-thread accumulator) to
|
||||
+// hide LPDDR5x weight-load latency at the M=128 decode tile, WITHOUT re-reading weights: every
|
||||
+// weight row lives in exactly one row-tile, so total weight traffic is unchanged (bandwidth-
|
||||
+// neutral) - the dense-decode occupancy lever from FP4_GEMM_SCOPE_B.md s3/s4.1. mmq_y is a PURE
|
||||
+// N-row tiling knob: the per-output reduction over K is identical for any mmq_y, so the result
|
||||
+// stays BIT-EXACT (gated by test-backend-ops MUL_MAT NVFP4 decode shapes). Default 128 == exact
|
||||
+// stock behaviour (a default build is byte-identical to stock); build -DGGML_CUDA_FP4_MMQ_Y=64
|
||||
+// (or 96) to enable the tune. Applies ONLY to NVFP4 on Blackwell; every other type/arch untouched.
|
||||
+#ifndef GGML_CUDA_FP4_MMQ_Y
|
||||
+#define GGML_CUDA_FP4_MMQ_Y 128
|
||||
+#endif
|
||||
+
|
||||
+static int get_mmq_y_host(const int cc, const ggml_type type = GGML_TYPE_COUNT) {
|
||||
+ if (GGML_CUDA_FP4_MMQ_Y != 128 && type == GGML_TYPE_NVFP4 && blackwell_mma_available(cc)) {
|
||||
+ return GGML_CUDA_FP4_MMQ_Y;
|
||||
+ }
|
||||
return GGML_CUDA_CC_IS_AMD(cc) ? (GGML_CUDA_CC_IS_RDNA1(cc) ? 64 : 128) :
|
||||
((GGML_CUDA_CC_IS_NVIDIA(cc) && ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_VOLTA) ? 128 : 64);
|
||||
}
|
||||
@@ -154,7 +171,13 @@ if (type == GGML_TYPE_NVFP4 || type == GGML_TYPE_MXFP4) {
|
||||
return MMQ_ITER_K;
|
||||
}
|
||||
|
||||
+template <ggml_type type = GGML_TYPE_COUNT>
|
||||
static constexpr __device__ int get_mmq_y_device() {
|
||||
+#if defined(BLACKWELL_MMA_AVAILABLE)
|
||||
+ if (type == GGML_TYPE_NVFP4 && GGML_CUDA_FP4_MMQ_Y != 128) {
|
||||
+ return GGML_CUDA_FP4_MMQ_Y;
|
||||
+ }
|
||||
+#endif // defined(BLACKWELL_MMA_AVAILABLE)
|
||||
#if defined(GGML_USE_HIP)
|
||||
#if defined(RDNA1)
|
||||
return 64;
|
||||
@@ -170,6 +193,28 @@ static constexpr __device__ int get_mmq_y_device() {
|
||||
#endif // defined(GGML_USE_HIP)
|
||||
}
|
||||
|
||||
+// [paged patch 0017 / track B] Dense NVFP4 decode occupancy lever: min resident CTAs per SM.
|
||||
+// The FP4-MMA mul_mat_q is REGISTER-bound to 1 CTA/SM (__launch_bounds__(256,1) => ~255 regs/thread
|
||||
+// => one resident block, the under-occupancy that strands the kernel at ~3% of FP4 peak at M=128).
|
||||
+// Raising the __launch_bounds__ min-blocks operand register-caps the compiler so N CTAs co-reside,
|
||||
+// hiding LPDDR5x weight-load latency by CTA-parallelism (the scope s4.1 occupancy goal) WITHOUT a
|
||||
+// structural mmq_y/nwarps change and WITHOUT extra weight reads (each weight tile still read once).
|
||||
+// Register allocation cannot change results => BIT-EXACT (gated by test-backend-ops MUL_MAT NVFP4).
|
||||
+// Default 1 == exact stock behaviour (byte-identical); build -DGGML_CUDA_FP4_MINBLOCKS=2 to enable.
|
||||
+// Applies ONLY to NVFP4 on Blackwell; every other type/arch keeps the stock min-blocks.
|
||||
+#ifndef GGML_CUDA_FP4_MINBLOCKS
|
||||
+#define GGML_CUDA_FP4_MINBLOCKS 1
|
||||
+#endif
|
||||
+template <ggml_type type = GGML_TYPE_COUNT>
|
||||
+static constexpr __device__ int mmq_get_min_blocks_device(const int stock) {
|
||||
+#if defined(BLACKWELL_MMA_AVAILABLE)
|
||||
+ if (type == GGML_TYPE_NVFP4 && GGML_CUDA_FP4_MINBLOCKS != 1) {
|
||||
+ return GGML_CUDA_FP4_MINBLOCKS;
|
||||
+ }
|
||||
+#endif // defined(BLACKWELL_MMA_AVAILABLE)
|
||||
+ return stock;
|
||||
+}
|
||||
+
|
||||
// Decouple shared memory tile sizes from WARP_SIZE to allow for different warp sizes.
|
||||
// The K dimension of the tiles has either,
|
||||
// 1*MMQ_TILE_NE_K==32 (always for TILE_Y_K) or 2*MMQ_TILE_NE_K==64 (typically for TILE_X_K),
|
||||
@@ -3454,7 +3499,7 @@ static __device__ __forceinline__ void mul_mat_q_process_tile(
|
||||
constexpr int warp_size = ggml_cuda_get_physical_warp_size();
|
||||
constexpr int nwarps = mmq_get_nwarps_device();
|
||||
constexpr int qk = ggml_cuda_type_traits<type>::qk;
|
||||
- constexpr int mmq_y = get_mmq_y_device();
|
||||
+ constexpr int mmq_y = get_mmq_y_device<type>();
|
||||
constexpr load_tiles_mmq_t load_tiles = mmq_type_traits<mmq_x, mmq_y, need_check, type>::load_tiles;
|
||||
|
||||
extern __shared__ int data_mul_mat_q[];
|
||||
@@ -3531,13 +3576,13 @@ static __device__ __forceinline__ void mul_mat_q_process_tile(
|
||||
template <ggml_type type, int mmq_x, bool need_check>
|
||||
#if defined(GGML_USE_HIP)
|
||||
#if defined(RDNA4) || defined(RDNA3) || defined(RDNA2) || defined(CDNA) || defined(GCN)
|
||||
- __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 2)
|
||||
+ __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(2))
|
||||
#endif // defined(RDNA4) || defined(RDNA3) || defined(RDNA2) || defined(CDNA) || defined(GCN)
|
||||
#else
|
||||
#if __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
|
||||
- __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 1)
|
||||
+ __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(1))
|
||||
#else
|
||||
- __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 2)
|
||||
+ __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(2))
|
||||
#endif // __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
|
||||
#endif // defined(GGML_USE_HIP)
|
||||
static __global__ void mul_mat_q(
|
||||
@@ -3558,7 +3603,7 @@ static __global__ void mul_mat_q(
|
||||
constexpr int warp_size = ggml_cuda_get_physical_warp_size();
|
||||
|
||||
constexpr int qk = ggml_cuda_type_traits<type>::qk;
|
||||
- constexpr int mmq_y = get_mmq_y_device();
|
||||
+ constexpr int mmq_y = get_mmq_y_device<type>();
|
||||
|
||||
const uint32_t nty = (nrows_x + mmq_y - 1) / mmq_y; // Number of tiles y
|
||||
|
||||
@@ -3790,7 +3835,7 @@ static __global__ void mul_mat_q_stream_k_fixup(
|
||||
float * __restrict__ tmp_last_tile, const uint3 blocks_per_ne00, const int nrows_x, const int ncols_dst,
|
||||
const int stride_col_dst, const uint3 nchannels_y, const int stride_channel_dst, const uint3 nsamples_y,
|
||||
const int stride_sample_dst, const uint3 ntx) {
|
||||
- constexpr int mmq_y = get_mmq_y_device();
|
||||
+ constexpr int mmq_y = get_mmq_y_device<type>();
|
||||
constexpr int qk = ggml_cuda_type_traits<type>::qk;
|
||||
constexpr int ITER_K = get_iter_k(type);
|
||||
constexpr int blocks_per_iter = ITER_K / qk;
|
||||
@@ -3947,7 +3992,7 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
|
||||
const int nsm = ggml_cuda_info().devices[id].nsm;
|
||||
const int warp_size = ggml_cuda_info().devices[id].warp_size;
|
||||
const int nwarps = mmq_get_nwarps_host(cc, warp_size);
|
||||
- const int mmq_y = get_mmq_y_host(cc);
|
||||
+ const int mmq_y = get_mmq_y_host(cc, type);
|
||||
|
||||
const dim3 block_dims(warp_size, nwarps, 1);
|
||||
|
||||
@@ -4103,6 +4148,21 @@ static inline int ggml_cuda_moe_density_max() {
|
||||
return d;
|
||||
}
|
||||
|
||||
+// [paged patch 0017 / track B] DENSE NVFP4 decode mmq_x re-read occupancy DIAGNOSTIC (env, default off).
|
||||
+// GGML_CUDA_FP4_DENSE_MMQ_X=<n> caps the dense (non-MoE) NVFP4 col-tile to <n>, splitting the M=128
|
||||
+// decode ubatch into ceil(128/n) col-tiles. Each col-tile re-reads the full weight set (fatal cost
|
||||
+// in the BW-bound regime) but multiplies resident CTAs. This is the scope s4.1 A/B probe: if
|
||||
+// decode_agg RISES with cap=64 despite the 2x weight read, occupancy is badly broken (the kernel is
|
||||
+// compute/occupancy-bound, so mmq_y-down / min-blocks has large upside); if it FALLS, the tile is
|
||||
+// already bandwidth-saturated and the occupancy ceiling is lower. Unset/<=0 => stock selection.
|
||||
+static inline int ggml_cuda_fp4_dense_mmq_x_cap() {
|
||||
+ static const int c = []() -> int {
|
||||
+ const char * s = getenv("GGML_CUDA_FP4_DENSE_MMQ_X");
|
||||
+ return s ? atoi(s) : 0;
|
||||
+ }();
|
||||
+ return c;
|
||||
+}
|
||||
+
|
||||
template <ggml_type type>
|
||||
void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
|
||||
const int id = ggml_cuda_get_device();
|
||||
@@ -4112,7 +4172,7 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
|
||||
const int nwarps = mmq_get_nwarps_host(cc, warp_size);
|
||||
|
||||
const int mmq_x_max = get_mmq_x_max_host(cc);
|
||||
- const int mmq_y = get_mmq_y_host(cc);
|
||||
+ const int mmq_y = get_mmq_y_host(cc, type);
|
||||
|
||||
// [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select (DEFAULT-ON).
|
||||
// On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are tokens
|
||||
@@ -4145,6 +4205,13 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
|
||||
// - LLAMA_MOE_AUTO_TILE=0 : disable the auto-select (exact stock selection).
|
||||
// - LLAMA_MOE_DECODE_TILE=<n>, LLAMA_MOE_DENSITY_MAX=<n> : tune the tile / threshold.
|
||||
int mmq_x_lim = mmq_x_max;
|
||||
+ if (args.expert_bounds == nullptr && type == GGML_TYPE_NVFP4) {
|
||||
+ // dense NVFP4 decode mmq_x re-read occupancy diagnostic (see ggml_cuda_fp4_dense_mmq_x_cap).
|
||||
+ const int cap = ggml_cuda_fp4_dense_mmq_x_cap();
|
||||
+ if (cap > 0 && cap < mmq_x_max) {
|
||||
+ mmq_x_lim = cap < 8 ? 8 : cap;
|
||||
+ }
|
||||
+ }
|
||||
if (args.expert_bounds != nullptr) {
|
||||
const int moe_cap = ggml_cuda_moe_mmq_x_cap();
|
||||
if (moe_cap > 0) {
|
||||
diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
|
||||
index f219309..291c275 100644
|
||||
--- a/tests/test-backend-ops.cpp
|
||||
+++ b/tests/test-backend-ops.cpp
|
||||
@@ -8591,6 +8591,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
|
||||
}
|
||||
}
|
||||
|
||||
+ // [paged P0 / track B] NVFP4/MXFP4 dense decode-shape mmq_y-down bit-exact gate.
|
||||
+ // The dense FP4 weight GEMM is the track-B target; P1 lowers mmq_y (the weight-row tile) on the
|
||||
+ // NVFP4 decode path to raise resident-CTA occupancy. mmq_y is a pure N-row tiling knob, so a
|
||||
+ // smaller mmq_y must stay BIT-EXACT (identical per-output reduction over K) - this gate proves
|
||||
+ // it. m = weight rows (N, tiled by mmq_y): 2048 (exact at mmq_y 64 & 128), 1600 (ragged vs 128),
|
||||
+ // 2050 (ragged vs both 64 & 128 -> exercises the need_check last-row-tile at both). n = decode
|
||||
+ // token count M = 32 and 128 (the scope decode shapes, tiled by mmq_x). k = 2048 hidden. Must
|
||||
+ // pass with the default build (mmq_y=128) AND a mmq_y=64 build, CUDA-vs-CPU oracle, bit-exact.
|
||||
+ for (ggml_type type_a : {GGML_TYPE_MXFP4, GGML_TYPE_NVFP4}) {
|
||||
+ for (int64_t m : {2048, 1600, 2050}) {
|
||||
+ for (int64_t n : {32, 128}) {
|
||||
+ test_cases.emplace_back(new test_mul_mat(type_a, GGML_TYPE_F32, m, n, 2048, {1, 1}, {1, 1}));
|
||||
+ }
|
||||
+ }
|
||||
+ }
|
||||
+
|
||||
for (ggml_type type_a : all_types) {
|
||||
test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 4, 2, false, 64, 16, 3*ggml_blck_size(type_a)));
|
||||
}
|
||||
--
|
||||
2.43.0
|
||||
|
||||
126
backend/cpp/llama-cpp/patches/paged/THROUGHPUT_B_P1_RESULTS.md
Normal file
126
backend/cpp/llama-cpp/patches/paged/THROUGHPUT_B_P1_RESULTS.md
Normal file
@@ -0,0 +1,126 @@
|
||||
# Track B P0 + P1 results: the FP4-MMA decode-GEMM occupancy tune (GB10, sm_121)
|
||||
|
||||
Measured on the DGX (GB10 / DGX Spark, sm_121, `~/llama-paged-dev`, branch `paged`). Implements
|
||||
`FP4_GEMM_SCOPE_B.md` P0 (baseline + bit-exact gate) and P1 (the cheap host/occupancy tile tune).
|
||||
Dev-tree commit: **089f78d** (`feat(paged): FP4 decode GEMM track-B P0 gate + default-off occupancy
|
||||
instrumentation`). Patch artifact: `0017-fp4-gemm-decode-tile-tune.patch`.
|
||||
|
||||
**Headline verdict: the P1 occupancy kill-gate TRIPPED.** None of the cheap host/occupancy levers
|
||||
lift dense or MoE decode_agg on GB10; every dense probe regresses and the nsys evidence shows the
|
||||
FP4 GEMM kernel gets *slower* under register-capping. Nothing is enabled by default (the levers are
|
||||
compile-time/env gated and the default build is byte-identical to stock). The one untested lever is
|
||||
the structural `mmq_y`-down, which is **not** a host switch: it is coupled to `nwarps` by the
|
||||
`nwarps*tile_C::I == mmq_y` static_assert, so it requires an `nwarps=4` warp-remap (P2 kernel work).
|
||||
|
||||
All benches: `llama-batched-bench -fa on -c 32768 -ngl 99 -npp 128 -ntg 128 -npl 32,128`.
|
||||
`decode_agg = S_TG` (aggregate decode tok/s). 3 reps dense, 2 reps MoE; medians below.
|
||||
|
||||
## P0 baseline (mmq_y=128, minblocks=1 — stock)
|
||||
|
||||
### Bit-exact parity gate (CPU oracle vs CUDA, deterministic)
|
||||
- `test-backend-ops -o MUL_MAT -b CUDA0`: **1115/1115** (1103 stock + 12 new NVFP4/MXFP4 dense
|
||||
decode-shape cases), NVFP4 0 fail.
|
||||
- `test-backend-ops -o MUL_MAT_ID -b CUDA0`: **805/805**, NVFP4 0 fail.
|
||||
- New P0 cases exercise the weight-row (`mmq_y`) tiling boundary: `type_a ∈ {NVFP4, MXFP4}`,
|
||||
`m ∈ {2048 (exact at mmq_y 64/128), 1600 (ragged vs 128), 2050 (ragged vs both 64 & 128 →
|
||||
need_check last row-tile)}`, `n ∈ {32, 128}` (decode M), `k = 2048`. They make the oracle cover
|
||||
the `mmq_y`/min-blocks changes and stay bit-exact with every lever on.
|
||||
|
||||
### Decode throughput (decode_agg = S_TG)
|
||||
| model | npl32 | npl128 |
|
||||
|---|---:|---:|
|
||||
| DENSE q36-27b-nvfp4 | 117.3 | **149.5** |
|
||||
| MoE q36-35b-a3b-nvfp4 (stock mmq_x=128/expert) | 262.6 | **336.3** |
|
||||
|
||||
(For reference the scope §6 cites dense 161 / MoE 333 from a server harness; this is the cleaner
|
||||
batched-bench A/B baseline. The relative P0→P1 deltas below are what the kill-gate turns on.)
|
||||
|
||||
### nsys FP4 GEMM efficiency (dense, `-npp 64 -ntg 48 -npl 128`)
|
||||
The decode FP4 weight GEMM kernel = `mul_mat_q<NVFP4(40), mmq_x=128, need_check=0>`:
|
||||
- **33.2 %** of GPU kernel time, total **2.782 s** / 4576 inst, **avg 608 µs/launch**.
|
||||
- Plus `quantize_mmq_nvfp4` 9.1 % (the act-quant bucket — track A's target), `mul_mat_q<…,16,…>`
|
||||
5.8 % (prefill ubatch tiling), stream-k fixups ~0.5 %.
|
||||
|
||||
This is the locked baseline; P1 must lower the GEMM kernel time (raise FP4-eff) to pass.
|
||||
|
||||
## P1 — the cheap occupancy levers (all default-off, byte-identical when off)
|
||||
|
||||
Three bit-exact, gated levers were added (`mmq.cuh`):
|
||||
- `GGML_CUDA_FP4_MMQ_Y` (default 128): type-aware `get_mmq_y_host/device` plumbing for an NVFP4
|
||||
weight-row tile override. **Inert** — see "the mmq_y wall" below.
|
||||
- `GGML_CUDA_FP4_MINBLOCKS` (default 1): NVFP4-only `__launch_bounds__` min-resident-CTAs lever
|
||||
(register-caps the FP4-MMA kernel so >1 CTA co-resides). The bounded occupancy probe.
|
||||
- `GGML_CUDA_FP4_DENSE_MMQ_X` (env, default off): dense col-tile re-read occupancy diagnostic
|
||||
(the §4.1 A/B: does eating a 2× weight re-read at a smaller `mmq_x` buy net occupancy?).
|
||||
|
||||
P1 parity: with `MINBLOCKS=2` the gate stays **MUL_MAT 1115/1115, MUL_MAT_ID 805/805, NVFP4 0
|
||||
fail** — register allocation is result-neutral, so bit-exactness holds.
|
||||
|
||||
### DENSE decode_agg @ npl128 — every occupancy probe REGRESSES
|
||||
| config | npl32 | npl128 | Δ vs P0 @npl128 |
|
||||
|---|---:|---:|---:|
|
||||
| P0 stock (mmq_y=128, minblocks=1) | 117.3 | **149.5** | — |
|
||||
| MINBLOCKS=2 (2 resident CTAs via reg-cap) | 115.7 | 147.9 | **−1.1 %** |
|
||||
| DENSE_MMQ_X=64 (2 col-tiles, 2× weight re-read) | 115.3 | 144.3 | **−3.5 %** |
|
||||
| DENSE_MMQ_X=32 (4 col-tiles, 4× weight re-read) | 115.4 | 141.7 | **−5.2 %** |
|
||||
|
||||
### MoE decode_agg @ npl128 — mmq_x-down regresses; min-blocks neutral
|
||||
| config | npl32 | npl128 | Δ vs stock @npl128 |
|
||||
|---|---:|---:|---:|
|
||||
| stock (mmq_x=128/expert) | 262.6 | **336.3** | — |
|
||||
| TILE32 | 262.1 | 336.0 | −0.1 % |
|
||||
| TILE16 | 261.1 | 324.0 | **−3.7 %** |
|
||||
| TILE8 | 260.8 | 316.6 | **−5.9 %** |
|
||||
| MINBLOCKS=2 | 260.0 | 337.7 | +0.4 % (noise) |
|
||||
|
||||
The MoE result reproduces patch 0015 exactly: q36-35b-a3b (256 tiny experts, GDN linear attention)
|
||||
decode is GDN/bandwidth-bound, **not** col-tile-occupancy-bound, so tightening `mmq_x` below 64
|
||||
(the brief's "8–16 ideal") monotonically *loses*. 64 ≈ 32 ≈ stock is the floor.
|
||||
|
||||
### nsys kill-gate evidence (the decisive datum)
|
||||
`mul_mat_q<NVFP4,128,0>` under MINBLOCKS=2: **2.782 s → 3.025 s**, avg **608 µs → 661 µs
|
||||
(+8.7 % SLOWER)**. The FP4-MMA kernel needs >128 regs/thread; forcing 2 CTAs/SM register-caps it,
|
||||
which **spills to local memory**, so the GEMM does *more* work per launch — occupancy did not
|
||||
usefully rise, it inverted. FP4-eff went **down**, not up. Kill-gate tripped, with hard evidence.
|
||||
|
||||
## Why P1 can't lift it (and why mmq_y-down is P2, not P1)
|
||||
|
||||
The two orthogonal occupancy probes both regress: register-capping (minblocks↑) spills, and
|
||||
col-tile-shrinking (mmq_x↓) re-reads the 18 GB weight set. This says the **dense M=128 tile is
|
||||
already weight-read / one-read-optimal at mmq_x=128** — it is not occupancy-starved in a way the
|
||||
cheap levers can fix. This contradicts the scope's central "self-inflicted occupancy, recover it by
|
||||
raising resident CTAs" hypothesis *for the cheap levers*.
|
||||
|
||||
The only lever that raises resident CTAs **without** spilling and **without** extra weight reads is
|
||||
the structural `mmq_y`-down (smaller weight-row tile → smaller shared + smaller accumulator → more
|
||||
CTAs, weights still read once). But `mmq_y` is **rigidly** `nwarps * tile_C::I = 8 * 16 = 128`
|
||||
(the `mmq.cuh:3258` static_assert; `tile_C::I=16` is the fixed `m16n8k64` MMA shape). So
|
||||
`mmq_y=64` requires **`nwarps=4`** — a warp-remap, not a host switch. That remap threads `nwarps`
|
||||
through ~13 NVFP4-reachable sites including the **shared** `vec_dot_fp4_fp4_mma` (used by both NVFP4
|
||||
and MXFP4) and the loader/kernel nwarps lockstep, with real risk of a silent shared-mem/thread-block
|
||||
mismatch. It was scoped but **deferred to P2** (the scope's own phase table also places `mmq_y`-down
|
||||
at P2, after the P1 host-only knobs). The `get_mmq_y` host/device plumbing is committed and inert so
|
||||
P2 only has to add the `nwarps` half.
|
||||
|
||||
## Honest verdict vs the scope targets
|
||||
|
||||
- **DENSE:** P1 (host knobs + min-blocks + re-read diagnostic) does **not** move decode_agg toward
|
||||
the 391 target — it slightly *regresses* (149.5 → 147.9, 38 % of vLLM). The scope's P1 row
|
||||
(~177, "honest: small") was optimistic; on GB10 the cheap levers are net-negative. The remaining
|
||||
upside lives entirely in the P2 `mmq_y`-down (nwarps=4) kernel remap **plus** track A. Whether
|
||||
that clears the floor is now an *open, unproven* question — the cheap-lever evidence here leans
|
||||
*against* large occupancy upside (the tile already looks one-read-optimal), so the P2 ceiling is
|
||||
plausibly lower than the scope's 316–328.
|
||||
- **MoE:** the mmq_x-down lever (the brief's MoE P1) is a **confirmed dead-end on this model**
|
||||
(regresses; GDN/BW-bound, reproduces patch 0015). min-blocks is neutral. No host-level MoE win.
|
||||
|
||||
**Kill-gate: TRIPPED on both arms.** Per the brief this is *not* forced into a default-on change.
|
||||
Committed: the P0 bit-exact gate + the default-off instrumentation + this honest record. Not pushed.
|
||||
|
||||
## Reproduce
|
||||
```
|
||||
# default (byte-identical stock): build-cuda as-is -> MUL_MAT 1115/1115, MUL_MAT_ID 805/805
|
||||
# occupancy probe: cmake build with -DGGML_CUDA_FP4_MINBLOCKS=2 (or flip the macro default)
|
||||
# dense re-read A/B: GGML_CUDA_FP4_DENSE_MMQ_X=64 ./llama-batched-bench -m q36-27b-nvfp4.gguf ...
|
||||
# nsys: nsys profile --trace cuda ... ; nsys stats --report cuda_gpu_kern_sum (watch mul_mat_q<40,128,0>)
|
||||
```
|
||||
Reference in New Issue
Block a user