docs(paged): mirror FP4 decode-GEMM track-B P0 gate + P1 kill-gate results (patch 0017)

Mirror of llama.cpp dev-tree commit 089f78d. Track B P0 (bit-exact NVFP4 dense decode-shape MUL_MAT parity gate) + P1 (default-off occupancy levers) for the GB10 dense FP4 weight GEMM. P1 kill-gate TRIPPED: the cheap host/occupancy levers do not lift decode_agg on GB10 (sm_121). DENSE q36-27b-nvfp4 @npl128 149.5 -> minblocks2 147.9 (-1.1%) -> dense mmq_x=64 144.3 (-3.5%); MoE q36-35b-a3b mmq_x-down regresses (TILE16 -3.7%, TILE8 -5.9%, reproduces patch 0015). nsys: the FP4 GEMM mul_mat_q<NVFP4,128,0> went 2.782s->3.025s (+8.7% slower) under register-capping (spilling). The dense M=128 tile is already weight-read/one-read-optimal; the only untested lever is the structural mmq_y-down (nwarps=4 warp-remap, blocked by nwarps*tile_C::I==mmq_y), deferred to P2. All levers default-off => default build byte-identical to stock. See THROUGHPUT_B_P1_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-24 16:49:06 -04:00 · 2026-06-24 17:58:00 +00:00
parent 39e16cc2c4
commit 40f019e761
2 changed files with 371 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/0017-fp4-gemm-decode-tile-tune.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0017-fp4-gemm-decode-tile-tune.patch
@@ -0,0 +1,245 @@
+From 089f78d2a2c04465a566d499dbe0a67c008435a8 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Wed, 24 Jun 2026 19:56:05 +0200
+Subject: [PATCH] feat(paged): FP4 decode GEMM track-B P0 gate + default-off
+ occupancy instrumentation (patch 0017)
+
+Track B targets the dense NVFP4 weight GEMM (~59% of the GB10 decode step). This lands the P0
+bit-exact parity gate and the P1 occupancy levers (default-off / byte-identical) and records the
+honest P1 result: the cheap host/occupancy tuning does NOT lift decode_agg on GB10 (sm_121) - the
+kill-gate tripped - so nothing is enabled by default.
+
+P0 gate (tests/test-backend-ops.cpp): NVFP4/MXFP4 dense decode-shape MUL_MAT cases at the weight-
+row tiling boundary (m in {2048,1600,2050} = exact + ragged vs mmq_y 64/128, n in {32,128} = decode
+M, k=2048), so the bit-exact CPU-vs-CUDA oracle covers the mmq_y / min-blocks paths. Green at
+default and with every lever on: MUL_MAT 1115/1115, MUL_MAT_ID 805/805, NVFP4 0 fail.
+
+P1 levers (ggml/src/ggml-cuda/mmq.cuh), all default-off => default build byte-identical to stock:
+  - GGML_CUDA_FP4_MMQ_Y (default 128): type-aware get_mmq_y_host/device plumbing for an NVFP4
+    weight-row tile override. mmq_y is rigidly nwarps*tile_C::I (=8*16=128, the mmq.cuh static_
+    assert), so mmq_y<128 also needs nwarps-down (a warp-remap through the shared vec_dot/loader),
+    left as the P2 kernel change; the host/device plumbing is in place and inert.
+  - GGML_CUDA_FP4_MINBLOCKS (default 1): NVFP4-only __launch_bounds__ min-resident-CTAs lever
+    (register-cap the FP4-MMA kernel so >1 CTA co-resides) - the bounded occupancy probe.
+  - GGML_CUDA_FP4_DENSE_MMQ_X (env, default off): dense col-tile re-read occupancy diagnostic.
+
+Measured GB10 (llama-batched-bench -fa on -npp 128 -ntg 128 -npl 32,128), decode_agg (S_TG):
+  DENSE q36-27b-nvfp4 @npl128: P0 149.5 -> MINBLOCKS=2 147.9 (-1.1%) -> DENSE_MMQ_X=64 144.3
+    (-3.5%) -> =32 141.7 (-5.2%). Every occupancy probe regresses.
+  MoE q36-35b-a3b-nvfp4 @npl128: stock 336.3, MINBLOCKS=2 337.7 (+0.4%, noise), TILE16 324.0
+    (-3.7%), TILE8 316.6 (-5.9%). mmq_x-down regresses (reproduces patch 0015; GDN/BW-bound).
+
+nsys (kill-gate evidence): the decode FP4 GEMM mul_mat_q<NVFP4,128,0> went 2.782s -> 3.025s
+(avg 608us -> 661us, +8.7% slower) under MINBLOCKS=2 - register-capping spills, so occupancy did
+not usefully rise. Verdict: the dense M=128 tile is already weight-read/one-read-optimal at
+mmq_x=128, NOT occupancy-starved via the cheap levers; the only untested lever is the structural
+mmq_y-down (nwarps=4 warp-remap), deferred to P2. Bit-exact gate holds throughout.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/mmq.cuh | 85 ++++++++++++++++++++++++++++++++++----
+ tests/test-backend-ops.cpp | 16 +++++++
+ 2 files changed, 92 insertions(+), 9 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
+index 9718b12..b53e38a 100644
+--- a/ggml/src/ggml-cuda/mmq.cuh
+++ b/ggml/src/ggml-cuda/mmq.cuh
+@@ -140,7 +140,24 @@ static constexpr __device__ int get_mmq_x_max_device() {
+ #endif // defined(AMD_MFMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE)
+ }
+ 
+-static int get_mmq_y_host(const int cc) {
+// [paged patch 0017 / track B] Dense NVFP4 decode mmq_y (weight-row tile) override.
+// mmq_y tiles the N (weight-row) dimension of the FP4-MMA weight GEMM. Lowering it raises the
+// number of resident CTAs (smaller per-CTA shared footprint + smaller per-thread accumulator) to
+// hide LPDDR5x weight-load latency at the M=128 decode tile, WITHOUT re-reading weights: every
+// weight row lives in exactly one row-tile, so total weight traffic is unchanged (bandwidth-
+// neutral) - the dense-decode occupancy lever from FP4_GEMM_SCOPE_B.md s3/s4.1. mmq_y is a PURE
+// N-row tiling knob: the per-output reduction over K is identical for any mmq_y, so the result
+// stays BIT-EXACT (gated by test-backend-ops MUL_MAT NVFP4 decode shapes). Default 128 == exact
+// stock behaviour (a default build is byte-identical to stock); build -DGGML_CUDA_FP4_MMQ_Y=64
+// (or 96) to enable the tune. Applies ONLY to NVFP4 on Blackwell; every other type/arch untouched.
+#ifndef GGML_CUDA_FP4_MMQ_Y
+#define GGML_CUDA_FP4_MMQ_Y 128
+#endif
+
+static int get_mmq_y_host(const int cc, const ggml_type type = GGML_TYPE_COUNT) {
+    if (GGML_CUDA_FP4_MMQ_Y != 128 && type == GGML_TYPE_NVFP4 && blackwell_mma_available(cc)) {
+        return GGML_CUDA_FP4_MMQ_Y;
+    }
+     return GGML_CUDA_CC_IS_AMD(cc) ? (GGML_CUDA_CC_IS_RDNA1(cc) ? 64 : 128) :
+         ((GGML_CUDA_CC_IS_NVIDIA(cc) && ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_VOLTA) ? 128 : 64);
+ }
+@@ -154,7 +171,13 @@ if (type == GGML_TYPE_NVFP4 || type == GGML_TYPE_MXFP4) {
+     return MMQ_ITER_K;
+ }
+ 
+template <ggml_type type = GGML_TYPE_COUNT>
+ static constexpr __device__ int get_mmq_y_device() {
+#if defined(BLACKWELL_MMA_AVAILABLE)
+    if (type == GGML_TYPE_NVFP4 && GGML_CUDA_FP4_MMQ_Y != 128) {
+        return GGML_CUDA_FP4_MMQ_Y;
+    }
+#endif // defined(BLACKWELL_MMA_AVAILABLE)
+ #if defined(GGML_USE_HIP)
+ #if defined(RDNA1)
+     return 64;
+@@ -170,6 +193,28 @@ static constexpr __device__ int get_mmq_y_device() {
+ #endif // defined(GGML_USE_HIP)
+ }
+ 
+// [paged patch 0017 / track B] Dense NVFP4 decode occupancy lever: min resident CTAs per SM.
+// The FP4-MMA mul_mat_q is REGISTER-bound to 1 CTA/SM (__launch_bounds__(256,1) => ~255 regs/thread
+// => one resident block, the under-occupancy that strands the kernel at ~3% of FP4 peak at M=128).
+// Raising the __launch_bounds__ min-blocks operand register-caps the compiler so N CTAs co-reside,
+// hiding LPDDR5x weight-load latency by CTA-parallelism (the scope s4.1 occupancy goal) WITHOUT a
+// structural mmq_y/nwarps change and WITHOUT extra weight reads (each weight tile still read once).
+// Register allocation cannot change results => BIT-EXACT (gated by test-backend-ops MUL_MAT NVFP4).
+// Default 1 == exact stock behaviour (byte-identical); build -DGGML_CUDA_FP4_MINBLOCKS=2 to enable.
+// Applies ONLY to NVFP4 on Blackwell; every other type/arch keeps the stock min-blocks.
+#ifndef GGML_CUDA_FP4_MINBLOCKS
+#define GGML_CUDA_FP4_MINBLOCKS 1
+#endif
+template <ggml_type type = GGML_TYPE_COUNT>
+static constexpr __device__ int mmq_get_min_blocks_device(const int stock) {
+#if defined(BLACKWELL_MMA_AVAILABLE)
+    if (type == GGML_TYPE_NVFP4 && GGML_CUDA_FP4_MINBLOCKS != 1) {
+        return GGML_CUDA_FP4_MINBLOCKS;
+    }
+#endif // defined(BLACKWELL_MMA_AVAILABLE)
+    return stock;
+}
+
+ // Decouple shared memory tile sizes from WARP_SIZE to allow for different warp sizes.
+ // The K dimension of the tiles has either,
+ // 1*MMQ_TILE_NE_K==32 (always for TILE_Y_K) or 2*MMQ_TILE_NE_K==64 (typically for TILE_X_K),
+@@ -3454,7 +3499,7 @@ static __device__ __forceinline__ void mul_mat_q_process_tile(
+     constexpr int              warp_size  = ggml_cuda_get_physical_warp_size();
+     constexpr int              nwarps     = mmq_get_nwarps_device();
+     constexpr int              qk         = ggml_cuda_type_traits<type>::qk;
+-    constexpr int              mmq_y      = get_mmq_y_device();
+    constexpr int              mmq_y      = get_mmq_y_device<type>();
+     constexpr load_tiles_mmq_t load_tiles = mmq_type_traits<mmq_x, mmq_y, need_check, type>::load_tiles;
+ 
+     extern __shared__ int data_mul_mat_q[];
+@@ -3531,13 +3576,13 @@ static __device__ __forceinline__ void mul_mat_q_process_tile(
+ template <ggml_type type, int mmq_x, bool need_check>
+ #if defined(GGML_USE_HIP)
+ #if defined(RDNA4) || defined(RDNA3) || defined(RDNA2) || defined(CDNA) || defined(GCN)
+-    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 2)
+    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(2))
+ #endif // defined(RDNA4) || defined(RDNA3) || defined(RDNA2) || defined(CDNA) || defined(GCN)
+ #else
+ #if __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
+-    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 1)
+    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(1))
+ #else
+-    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 2)
+    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(2))
+ #endif // __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
+ #endif // defined(GGML_USE_HIP)
+ static __global__ void mul_mat_q(
+@@ -3558,7 +3603,7 @@ static __global__ void mul_mat_q(
+     constexpr int warp_size = ggml_cuda_get_physical_warp_size();
+ 
+     constexpr int qk    = ggml_cuda_type_traits<type>::qk;
+-    constexpr int mmq_y = get_mmq_y_device();
+    constexpr int mmq_y = get_mmq_y_device<type>();
+ 
+     const uint32_t nty = (nrows_x + mmq_y - 1) / mmq_y; // Number of tiles y
+ 
+@@ -3790,7 +3835,7 @@ static __global__ void mul_mat_q_stream_k_fixup(
+         float * __restrict__ tmp_last_tile, const uint3 blocks_per_ne00, const int nrows_x, const int ncols_dst,
+         const int stride_col_dst, const uint3 nchannels_y, const int stride_channel_dst, const uint3 nsamples_y,
+         const int stride_sample_dst, const uint3 ntx) {
+-    constexpr int mmq_y           = get_mmq_y_device();
+    constexpr int mmq_y           = get_mmq_y_device<type>();
+     constexpr int qk              = ggml_cuda_type_traits<type>::qk;
+     constexpr int ITER_K          = get_iter_k(type);
+     constexpr int blocks_per_iter = ITER_K / qk;
+@@ -3947,7 +3992,7 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
+     const int nsm = ggml_cuda_info().devices[id].nsm;
+     const int warp_size = ggml_cuda_info().devices[id].warp_size;
+     const int nwarps = mmq_get_nwarps_host(cc, warp_size);
+-    const int mmq_y = get_mmq_y_host(cc);
+    const int mmq_y = get_mmq_y_host(cc, type);
+ 
+     const dim3 block_dims(warp_size, nwarps, 1);
+ 
+@@ -4103,6 +4148,21 @@ static inline int ggml_cuda_moe_density_max() {
+     return d;
+ }
+ 
+// [paged patch 0017 / track B] DENSE NVFP4 decode mmq_x re-read occupancy DIAGNOSTIC (env, default off).
+// GGML_CUDA_FP4_DENSE_MMQ_X=<n> caps the dense (non-MoE) NVFP4 col-tile to <n>, splitting the M=128
+// decode ubatch into ceil(128/n) col-tiles. Each col-tile re-reads the full weight set (fatal cost
+// in the BW-bound regime) but multiplies resident CTAs. This is the scope s4.1 A/B probe: if
+// decode_agg RISES with cap=64 despite the 2x weight read, occupancy is badly broken (the kernel is
+// compute/occupancy-bound, so mmq_y-down / min-blocks has large upside); if it FALLS, the tile is
+// already bandwidth-saturated and the occupancy ceiling is lower. Unset/<=0 => stock selection.
+static inline int ggml_cuda_fp4_dense_mmq_x_cap() {
+    static const int c = []() -> int {
+        const char * s = getenv("GGML_CUDA_FP4_DENSE_MMQ_X");
+        return s ? atoi(s) : 0;
+    }();
+    return c;
+}
+
+ template <ggml_type type>
+ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
+     const int    id     = ggml_cuda_get_device();
+@@ -4112,7 +4172,7 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
+     const int nwarps    = mmq_get_nwarps_host(cc, warp_size);
+ 
+     const int mmq_x_max = get_mmq_x_max_host(cc);
+-    const int mmq_y = get_mmq_y_host(cc);
+    const int mmq_y = get_mmq_y_host(cc, type);
+ 
+     // [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select (DEFAULT-ON).
+     // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are tokens
+@@ -4145,6 +4205,13 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
+     //   - LLAMA_MOE_AUTO_TILE=0 : disable the auto-select (exact stock selection).
+     //   - LLAMA_MOE_DECODE_TILE=<n>, LLAMA_MOE_DENSITY_MAX=<n> : tune the tile / threshold.
+     int mmq_x_lim = mmq_x_max;
+    if (args.expert_bounds == nullptr && type == GGML_TYPE_NVFP4) {
+        // dense NVFP4 decode mmq_x re-read occupancy diagnostic (see ggml_cuda_fp4_dense_mmq_x_cap).
+        const int cap = ggml_cuda_fp4_dense_mmq_x_cap();
+        if (cap > 0 && cap < mmq_x_max) {
+            mmq_x_lim = cap < 8 ? 8 : cap;
+        }
+    }
+     if (args.expert_bounds != nullptr) {
+         const int moe_cap = ggml_cuda_moe_mmq_x_cap();
+         if (moe_cap > 0) {
+diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
+index f219309..291c275 100644
+--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
+@@ -8591,6 +8591,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
+         }
+     }
+ 
+    // [paged P0 / track B] NVFP4/MXFP4 dense decode-shape mmq_y-down bit-exact gate.
+    // The dense FP4 weight GEMM is the track-B target; P1 lowers mmq_y (the weight-row tile) on the
+    // NVFP4 decode path to raise resident-CTA occupancy. mmq_y is a pure N-row tiling knob, so a
+    // smaller mmq_y must stay BIT-EXACT (identical per-output reduction over K) - this gate proves
+    // it. m = weight rows (N, tiled by mmq_y): 2048 (exact at mmq_y 64 & 128), 1600 (ragged vs 128),
+    // 2050 (ragged vs both 64 & 128 -> exercises the need_check last-row-tile at both). n = decode
+    // token count M = 32 and 128 (the scope decode shapes, tiled by mmq_x). k = 2048 hidden. Must
+    // pass with the default build (mmq_y=128) AND a mmq_y=64 build, CUDA-vs-CPU oracle, bit-exact.
+    for (ggml_type type_a : {GGML_TYPE_MXFP4, GGML_TYPE_NVFP4}) {
+        for (int64_t m : {2048, 1600, 2050}) {
+            for (int64_t n : {32, 128}) {
+                test_cases.emplace_back(new test_mul_mat(type_a, GGML_TYPE_F32, m, n, 2048, {1, 1}, {1, 1}));
+            }
+        }
+    }
+
+     for (ggml_type type_a : all_types) {
+         test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 4, 2, false, 64, 16, 3*ggml_blck_size(type_a)));
+     }
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp/patches/paged/THROUGHPUT_B_P1_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/THROUGHPUT_B_P1_RESULTS.md
@@ -0,0 +1,126 @@
+# Track B P0 + P1 results: the FP4-MMA decode-GEMM occupancy tune (GB10, sm_121)
+
+Measured on the DGX (GB10 / DGX Spark, sm_121, `~/llama-paged-dev`, branch `paged`). Implements
+`FP4_GEMM_SCOPE_B.md` P0 (baseline + bit-exact gate) and P1 (the cheap host/occupancy tile tune).
+Dev-tree commit: **089f78d** (`feat(paged): FP4 decode GEMM track-B P0 gate + default-off occupancy
+instrumentation`). Patch artifact: `0017-fp4-gemm-decode-tile-tune.patch`.
+
+**Headline verdict: the P1 occupancy kill-gate TRIPPED.** None of the cheap host/occupancy levers
+lift dense or MoE decode_agg on GB10; every dense probe regresses and the nsys evidence shows the
+FP4 GEMM kernel gets *slower* under register-capping. Nothing is enabled by default (the levers are
+compile-time/env gated and the default build is byte-identical to stock). The one untested lever is
+the structural `mmq_y`-down, which is **not** a host switch: it is coupled to `nwarps` by the
+`nwarps*tile_C::I == mmq_y` static_assert, so it requires an `nwarps=4` warp-remap (P2 kernel work).
+
+All benches: `llama-batched-bench -fa on -c 32768 -ngl 99 -npp 128 -ntg 128 -npl 32,128`.
+`decode_agg = S_TG` (aggregate decode tok/s). 3 reps dense, 2 reps MoE; medians below.
+
+## P0 baseline (mmq_y=128, minblocks=1 — stock)
+
+### Bit-exact parity gate (CPU oracle vs CUDA, deterministic)
+- `test-backend-ops -o MUL_MAT  -b CUDA0`: **1115/1115** (1103 stock + 12 new NVFP4/MXFP4 dense
+  decode-shape cases), NVFP4 0 fail.
+- `test-backend-ops -o MUL_MAT_ID -b CUDA0`: **805/805**, NVFP4 0 fail.
+- New P0 cases exercise the weight-row (`mmq_y`) tiling boundary: `type_a ∈ {NVFP4, MXFP4}`,
+  `m ∈ {2048 (exact at mmq_y 64/128), 1600 (ragged vs 128), 2050 (ragged vs both 64 & 128 →
+  need_check last row-tile)}`, `n ∈ {32, 128}` (decode M), `k = 2048`. They make the oracle cover
+  the `mmq_y`/min-blocks changes and stay bit-exact with every lever on.
+
+### Decode throughput (decode_agg = S_TG)
+| model | npl32 | npl128 |
+|---|---:|---:|
+| DENSE q36-27b-nvfp4 | 117.3 | **149.5** |
+| MoE q36-35b-a3b-nvfp4 (stock mmq_x=128/expert) | 262.6 | **336.3** |
+
+(For reference the scope §6 cites dense 161 / MoE 333 from a server harness; this is the cleaner
+batched-bench A/B baseline. The relative P0→P1 deltas below are what the kill-gate turns on.)
+
+### nsys FP4 GEMM efficiency (dense, `-npp 64 -ntg 48 -npl 128`)
+The decode FP4 weight GEMM kernel = `mul_mat_q<NVFP4(40), mmq_x=128, need_check=0>`:
+- **33.2 %** of GPU kernel time, total **2.782 s** / 4576 inst, **avg 608 µs/launch**.
+- Plus `quantize_mmq_nvfp4` 9.1 % (the act-quant bucket — track A's target), `mul_mat_q<…,16,…>`
+  5.8 % (prefill ubatch tiling), stream-k fixups ~0.5 %.
+
+This is the locked baseline; P1 must lower the GEMM kernel time (raise FP4-eff) to pass.
+
+## P1 — the cheap occupancy levers (all default-off, byte-identical when off)
+
+Three bit-exact, gated levers were added (`mmq.cuh`):
+- `GGML_CUDA_FP4_MMQ_Y` (default 128): type-aware `get_mmq_y_host/device` plumbing for an NVFP4
+  weight-row tile override. **Inert** — see "the mmq_y wall" below.
+- `GGML_CUDA_FP4_MINBLOCKS` (default 1): NVFP4-only `__launch_bounds__` min-resident-CTAs lever
+  (register-caps the FP4-MMA kernel so >1 CTA co-resides). The bounded occupancy probe.
+- `GGML_CUDA_FP4_DENSE_MMQ_X` (env, default off): dense col-tile re-read occupancy diagnostic
+  (the §4.1 A/B: does eating a 2× weight re-read at a smaller `mmq_x` buy net occupancy?).
+
+P1 parity: with `MINBLOCKS=2` the gate stays **MUL_MAT 1115/1115, MUL_MAT_ID 805/805, NVFP4 0
+fail** — register allocation is result-neutral, so bit-exactness holds.
+
+### DENSE decode_agg @ npl128 — every occupancy probe REGRESSES
+| config | npl32 | npl128 | Δ vs P0 @npl128 |
+|---|---:|---:|---:|
+| P0 stock (mmq_y=128, minblocks=1) | 117.3 | **149.5** | — |
+| MINBLOCKS=2 (2 resident CTAs via reg-cap) | 115.7 | 147.9 | **−1.1 %** |
+| DENSE_MMQ_X=64 (2 col-tiles, 2× weight re-read) | 115.3 | 144.3 | **−3.5 %** |
+| DENSE_MMQ_X=32 (4 col-tiles, 4× weight re-read) | 115.4 | 141.7 | **−5.2 %** |
+
+### MoE decode_agg @ npl128 — mmq_x-down regresses; min-blocks neutral
+| config | npl32 | npl128 | Δ vs stock @npl128 |
+|---|---:|---:|---:|
+| stock (mmq_x=128/expert) | 262.6 | **336.3** | — |
+| TILE32 | 262.1 | 336.0 | −0.1 % |
+| TILE16 | 261.1 | 324.0 | **−3.7 %** |
+| TILE8 | 260.8 | 316.6 | **−5.9 %** |
+| MINBLOCKS=2 | 260.0 | 337.7 | +0.4 % (noise) |
+
+The MoE result reproduces patch 0015 exactly: q36-35b-a3b (256 tiny experts, GDN linear attention)
+decode is GDN/bandwidth-bound, **not** col-tile-occupancy-bound, so tightening `mmq_x` below 64
+(the brief's "8–16 ideal") monotonically *loses*. 64 ≈ 32 ≈ stock is the floor.
+
+### nsys kill-gate evidence (the decisive datum)
+`mul_mat_q<NVFP4,128,0>` under MINBLOCKS=2: **2.782 s → 3.025 s**, avg **608 µs → 661 µs
+(+8.7 % SLOWER)**. The FP4-MMA kernel needs >128 regs/thread; forcing 2 CTAs/SM register-caps it,
+which **spills to local memory**, so the GEMM does *more* work per launch — occupancy did not
+usefully rise, it inverted. FP4-eff went **down**, not up. Kill-gate tripped, with hard evidence.
+
+## Why P1 can't lift it (and why mmq_y-down is P2, not P1)
+
+The two orthogonal occupancy probes both regress: register-capping (minblocks↑) spills, and
+col-tile-shrinking (mmq_x↓) re-reads the 18 GB weight set. This says the **dense M=128 tile is
+already weight-read / one-read-optimal at mmq_x=128** — it is not occupancy-starved in a way the
+cheap levers can fix. This contradicts the scope's central "self-inflicted occupancy, recover it by
+raising resident CTAs" hypothesis *for the cheap levers*.
+
+The only lever that raises resident CTAs **without** spilling and **without** extra weight reads is
+the structural `mmq_y`-down (smaller weight-row tile → smaller shared + smaller accumulator → more
+CTAs, weights still read once). But `mmq_y` is **rigidly** `nwarps * tile_C::I = 8 * 16 = 128`
+(the `mmq.cuh:3258` static_assert; `tile_C::I=16` is the fixed `m16n8k64` MMA shape). So
+`mmq_y=64` requires **`nwarps=4`** — a warp-remap, not a host switch. That remap threads `nwarps`
+through ~13 NVFP4-reachable sites including the **shared** `vec_dot_fp4_fp4_mma` (used by both NVFP4
+and MXFP4) and the loader/kernel nwarps lockstep, with real risk of a silent shared-mem/thread-block
+mismatch. It was scoped but **deferred to P2** (the scope's own phase table also places `mmq_y`-down
+at P2, after the P1 host-only knobs). The `get_mmq_y` host/device plumbing is committed and inert so
+P2 only has to add the `nwarps` half.
+
+## Honest verdict vs the scope targets
+
+- **DENSE:** P1 (host knobs + min-blocks + re-read diagnostic) does **not** move decode_agg toward
+  the 391 target — it slightly *regresses* (149.5 → 147.9, 38 % of vLLM). The scope's P1 row
+  (~177, "honest: small") was optimistic; on GB10 the cheap levers are net-negative. The remaining
+  upside lives entirely in the P2 `mmq_y`-down (nwarps=4) kernel remap **plus** track A. Whether
+  that clears the floor is now an *open, unproven* question — the cheap-lever evidence here leans
+  *against* large occupancy upside (the tile already looks one-read-optimal), so the P2 ceiling is
+  plausibly lower than the scope's 316–328.
+- **MoE:** the mmq_x-down lever (the brief's MoE P1) is a **confirmed dead-end on this model**
+  (regresses; GDN/BW-bound, reproduces patch 0015). min-blocks is neutral. No host-level MoE win.
+
+**Kill-gate: TRIPPED on both arms.** Per the brief this is *not* forced into a default-on change.
+Committed: the P0 bit-exact gate + the default-off instrumentation + this honest record. Not pushed.
+
+## Reproduce
+```
+# default (byte-identical stock): build-cuda as-is -> MUL_MAT 1115/1115, MUL_MAT_ID 805/805
+# occupancy probe:  cmake build with -DGGML_CUDA_FP4_MINBLOCKS=2  (or flip the macro default)
+# dense re-read A/B: GGML_CUDA_FP4_DENSE_MMQ_X=64 ./llama-batched-bench -m q36-27b-nvfp4.gguf ...
+# nsys: nsys profile --trace cuda ... ; nsys stats --report cuda_gpu_kern_sum  (watch mul_mat_q<40,128,0>)
+```