diff --git a/backend/cpp/llama-cpp/patches/paged/0017-fp4-gemm-decode-tile-tune.patch b/backend/cpp/llama-cpp/patches/paged/0017-fp4-gemm-decode-tile-tune.patch new file mode 100644 index 000000000..19960ed81 --- /dev/null +++ b/backend/cpp/llama-cpp/patches/paged/0017-fp4-gemm-decode-tile-tune.patch @@ -0,0 +1,245 @@ +From 089f78d2a2c04465a566d499dbe0a67c008435a8 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Wed, 24 Jun 2026 19:56:05 +0200 +Subject: [PATCH] feat(paged): FP4 decode GEMM track-B P0 gate + default-off + occupancy instrumentation (patch 0017) + +Track B targets the dense NVFP4 weight GEMM (~59% of the GB10 decode step). This lands the P0 +bit-exact parity gate and the P1 occupancy levers (default-off / byte-identical) and records the +honest P1 result: the cheap host/occupancy tuning does NOT lift decode_agg on GB10 (sm_121) - the +kill-gate tripped - so nothing is enabled by default. + +P0 gate (tests/test-backend-ops.cpp): NVFP4/MXFP4 dense decode-shape MUL_MAT cases at the weight- +row tiling boundary (m in {2048,1600,2050} = exact + ragged vs mmq_y 64/128, n in {32,128} = decode +M, k=2048), so the bit-exact CPU-vs-CUDA oracle covers the mmq_y / min-blocks paths. Green at +default and with every lever on: MUL_MAT 1115/1115, MUL_MAT_ID 805/805, NVFP4 0 fail. + +P1 levers (ggml/src/ggml-cuda/mmq.cuh), all default-off => default build byte-identical to stock: + - GGML_CUDA_FP4_MMQ_Y (default 128): type-aware get_mmq_y_host/device plumbing for an NVFP4 + weight-row tile override. mmq_y is rigidly nwarps*tile_C::I (=8*16=128, the mmq.cuh static_ + assert), so mmq_y<128 also needs nwarps-down (a warp-remap through the shared vec_dot/loader), + left as the P2 kernel change; the host/device plumbing is in place and inert. + - GGML_CUDA_FP4_MINBLOCKS (default 1): NVFP4-only __launch_bounds__ min-resident-CTAs lever + (register-cap the FP4-MMA kernel so >1 CTA co-resides) - the bounded occupancy probe. + - GGML_CUDA_FP4_DENSE_MMQ_X (env, default off): dense col-tile re-read occupancy diagnostic. + +Measured GB10 (llama-batched-bench -fa on -npp 128 -ntg 128 -npl 32,128), decode_agg (S_TG): + DENSE q36-27b-nvfp4 @npl128: P0 149.5 -> MINBLOCKS=2 147.9 (-1.1%) -> DENSE_MMQ_X=64 144.3 + (-3.5%) -> =32 141.7 (-5.2%). Every occupancy probe regresses. + MoE q36-35b-a3b-nvfp4 @npl128: stock 336.3, MINBLOCKS=2 337.7 (+0.4%, noise), TILE16 324.0 + (-3.7%), TILE8 316.6 (-5.9%). mmq_x-down regresses (reproduces patch 0015; GDN/BW-bound). + +nsys (kill-gate evidence): the decode FP4 GEMM mul_mat_q went 2.782s -> 3.025s +(avg 608us -> 661us, +8.7% slower) under MINBLOCKS=2 - register-capping spills, so occupancy did +not usefully rise. Verdict: the dense M=128 tile is already weight-read/one-read-optimal at +mmq_x=128, NOT occupancy-starved via the cheap levers; the only untested lever is the structural +mmq_y-down (nwarps=4 warp-remap), deferred to P2. Bit-exact gate holds throughout. + +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/src/ggml-cuda/mmq.cuh | 85 ++++++++++++++++++++++++++++++++++---- + tests/test-backend-ops.cpp | 16 +++++++ + 2 files changed, 92 insertions(+), 9 deletions(-) + +diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh +index 9718b12..b53e38a 100644 +--- a/ggml/src/ggml-cuda/mmq.cuh ++++ b/ggml/src/ggml-cuda/mmq.cuh +@@ -140,7 +140,24 @@ static constexpr __device__ int get_mmq_x_max_device() { + #endif // defined(AMD_MFMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE) + } + +-static int get_mmq_y_host(const int cc) { ++// [paged patch 0017 / track B] Dense NVFP4 decode mmq_y (weight-row tile) override. ++// mmq_y tiles the N (weight-row) dimension of the FP4-MMA weight GEMM. Lowering it raises the ++// number of resident CTAs (smaller per-CTA shared footprint + smaller per-thread accumulator) to ++// hide LPDDR5x weight-load latency at the M=128 decode tile, WITHOUT re-reading weights: every ++// weight row lives in exactly one row-tile, so total weight traffic is unchanged (bandwidth- ++// neutral) - the dense-decode occupancy lever from FP4_GEMM_SCOPE_B.md s3/s4.1. mmq_y is a PURE ++// N-row tiling knob: the per-output reduction over K is identical for any mmq_y, so the result ++// stays BIT-EXACT (gated by test-backend-ops MUL_MAT NVFP4 decode shapes). Default 128 == exact ++// stock behaviour (a default build is byte-identical to stock); build -DGGML_CUDA_FP4_MMQ_Y=64 ++// (or 96) to enable the tune. Applies ONLY to NVFP4 on Blackwell; every other type/arch untouched. ++#ifndef GGML_CUDA_FP4_MMQ_Y ++#define GGML_CUDA_FP4_MMQ_Y 128 ++#endif ++ ++static int get_mmq_y_host(const int cc, const ggml_type type = GGML_TYPE_COUNT) { ++ if (GGML_CUDA_FP4_MMQ_Y != 128 && type == GGML_TYPE_NVFP4 && blackwell_mma_available(cc)) { ++ return GGML_CUDA_FP4_MMQ_Y; ++ } + return GGML_CUDA_CC_IS_AMD(cc) ? (GGML_CUDA_CC_IS_RDNA1(cc) ? 64 : 128) : + ((GGML_CUDA_CC_IS_NVIDIA(cc) && ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_VOLTA) ? 128 : 64); + } +@@ -154,7 +171,13 @@ if (type == GGML_TYPE_NVFP4 || type == GGML_TYPE_MXFP4) { + return MMQ_ITER_K; + } + ++template + static constexpr __device__ int get_mmq_y_device() { ++#if defined(BLACKWELL_MMA_AVAILABLE) ++ if (type == GGML_TYPE_NVFP4 && GGML_CUDA_FP4_MMQ_Y != 128) { ++ return GGML_CUDA_FP4_MMQ_Y; ++ } ++#endif // defined(BLACKWELL_MMA_AVAILABLE) + #if defined(GGML_USE_HIP) + #if defined(RDNA1) + return 64; +@@ -170,6 +193,28 @@ static constexpr __device__ int get_mmq_y_device() { + #endif // defined(GGML_USE_HIP) + } + ++// [paged patch 0017 / track B] Dense NVFP4 decode occupancy lever: min resident CTAs per SM. ++// The FP4-MMA mul_mat_q is REGISTER-bound to 1 CTA/SM (__launch_bounds__(256,1) => ~255 regs/thread ++// => one resident block, the under-occupancy that strands the kernel at ~3% of FP4 peak at M=128). ++// Raising the __launch_bounds__ min-blocks operand register-caps the compiler so N CTAs co-reside, ++// hiding LPDDR5x weight-load latency by CTA-parallelism (the scope s4.1 occupancy goal) WITHOUT a ++// structural mmq_y/nwarps change and WITHOUT extra weight reads (each weight tile still read once). ++// Register allocation cannot change results => BIT-EXACT (gated by test-backend-ops MUL_MAT NVFP4). ++// Default 1 == exact stock behaviour (byte-identical); build -DGGML_CUDA_FP4_MINBLOCKS=2 to enable. ++// Applies ONLY to NVFP4 on Blackwell; every other type/arch keeps the stock min-blocks. ++#ifndef GGML_CUDA_FP4_MINBLOCKS ++#define GGML_CUDA_FP4_MINBLOCKS 1 ++#endif ++template ++static constexpr __device__ int mmq_get_min_blocks_device(const int stock) { ++#if defined(BLACKWELL_MMA_AVAILABLE) ++ if (type == GGML_TYPE_NVFP4 && GGML_CUDA_FP4_MINBLOCKS != 1) { ++ return GGML_CUDA_FP4_MINBLOCKS; ++ } ++#endif // defined(BLACKWELL_MMA_AVAILABLE) ++ return stock; ++} ++ + // Decouple shared memory tile sizes from WARP_SIZE to allow for different warp sizes. + // The K dimension of the tiles has either, + // 1*MMQ_TILE_NE_K==32 (always for TILE_Y_K) or 2*MMQ_TILE_NE_K==64 (typically for TILE_X_K), +@@ -3454,7 +3499,7 @@ static __device__ __forceinline__ void mul_mat_q_process_tile( + constexpr int warp_size = ggml_cuda_get_physical_warp_size(); + constexpr int nwarps = mmq_get_nwarps_device(); + constexpr int qk = ggml_cuda_type_traits::qk; +- constexpr int mmq_y = get_mmq_y_device(); ++ constexpr int mmq_y = get_mmq_y_device(); + constexpr load_tiles_mmq_t load_tiles = mmq_type_traits::load_tiles; + + extern __shared__ int data_mul_mat_q[]; +@@ -3531,13 +3576,13 @@ static __device__ __forceinline__ void mul_mat_q_process_tile( + template + #if defined(GGML_USE_HIP) + #if defined(RDNA4) || defined(RDNA3) || defined(RDNA2) || defined(CDNA) || defined(GCN) +- __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 2) ++ __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device(2)) + #endif // defined(RDNA4) || defined(RDNA3) || defined(RDNA2) || defined(CDNA) || defined(GCN) + #else + #if __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA +- __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 1) ++ __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device(1)) + #else +- __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 2) ++ __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device(2)) + #endif // __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA + #endif // defined(GGML_USE_HIP) + static __global__ void mul_mat_q( +@@ -3558,7 +3603,7 @@ static __global__ void mul_mat_q( + constexpr int warp_size = ggml_cuda_get_physical_warp_size(); + + constexpr int qk = ggml_cuda_type_traits::qk; +- constexpr int mmq_y = get_mmq_y_device(); ++ constexpr int mmq_y = get_mmq_y_device(); + + const uint32_t nty = (nrows_x + mmq_y - 1) / mmq_y; // Number of tiles y + +@@ -3790,7 +3835,7 @@ static __global__ void mul_mat_q_stream_k_fixup( + float * __restrict__ tmp_last_tile, const uint3 blocks_per_ne00, const int nrows_x, const int ncols_dst, + const int stride_col_dst, const uint3 nchannels_y, const int stride_channel_dst, const uint3 nsamples_y, + const int stride_sample_dst, const uint3 ntx) { +- constexpr int mmq_y = get_mmq_y_device(); ++ constexpr int mmq_y = get_mmq_y_device(); + constexpr int qk = ggml_cuda_type_traits::qk; + constexpr int ITER_K = get_iter_k(type); + constexpr int blocks_per_iter = ITER_K / qk; +@@ -3947,7 +3992,7 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a + const int nsm = ggml_cuda_info().devices[id].nsm; + const int warp_size = ggml_cuda_info().devices[id].warp_size; + const int nwarps = mmq_get_nwarps_host(cc, warp_size); +- const int mmq_y = get_mmq_y_host(cc); ++ const int mmq_y = get_mmq_y_host(cc, type); + + const dim3 block_dims(warp_size, nwarps, 1); + +@@ -4103,6 +4148,21 @@ static inline int ggml_cuda_moe_density_max() { + return d; + } + ++// [paged patch 0017 / track B] DENSE NVFP4 decode mmq_x re-read occupancy DIAGNOSTIC (env, default off). ++// GGML_CUDA_FP4_DENSE_MMQ_X= caps the dense (non-MoE) NVFP4 col-tile to , splitting the M=128 ++// decode ubatch into ceil(128/n) col-tiles. Each col-tile re-reads the full weight set (fatal cost ++// in the BW-bound regime) but multiplies resident CTAs. This is the scope s4.1 A/B probe: if ++// decode_agg RISES with cap=64 despite the 2x weight read, occupancy is badly broken (the kernel is ++// compute/occupancy-bound, so mmq_y-down / min-blocks has large upside); if it FALLS, the tile is ++// already bandwidth-saturated and the occupancy ceiling is lower. Unset/<=0 => stock selection. ++static inline int ggml_cuda_fp4_dense_mmq_x_cap() { ++ static const int c = []() -> int { ++ const char * s = getenv("GGML_CUDA_FP4_DENSE_MMQ_X"); ++ return s ? atoi(s) : 0; ++ }(); ++ return c; ++} ++ + template + void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) { + const int id = ggml_cuda_get_device(); +@@ -4112,7 +4172,7 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda + const int nwarps = mmq_get_nwarps_host(cc, warp_size); + + const int mmq_x_max = get_mmq_x_max_host(cc); +- const int mmq_y = get_mmq_y_host(cc); ++ const int mmq_y = get_mmq_y_host(cc, type); + + // [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select (DEFAULT-ON). + // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are tokens +@@ -4145,6 +4205,13 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda + // - LLAMA_MOE_AUTO_TILE=0 : disable the auto-select (exact stock selection). + // - LLAMA_MOE_DECODE_TILE=, LLAMA_MOE_DENSITY_MAX= : tune the tile / threshold. + int mmq_x_lim = mmq_x_max; ++ if (args.expert_bounds == nullptr && type == GGML_TYPE_NVFP4) { ++ // dense NVFP4 decode mmq_x re-read occupancy diagnostic (see ggml_cuda_fp4_dense_mmq_x_cap). ++ const int cap = ggml_cuda_fp4_dense_mmq_x_cap(); ++ if (cap > 0 && cap < mmq_x_max) { ++ mmq_x_lim = cap < 8 ? 8 : cap; ++ } ++ } + if (args.expert_bounds != nullptr) { + const int moe_cap = ggml_cuda_moe_mmq_x_cap(); + if (moe_cap > 0) { +diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp +index f219309..291c275 100644 +--- a/tests/test-backend-ops.cpp ++++ b/tests/test-backend-ops.cpp +@@ -8591,6 +8591,22 @@ static std::vector> make_test_cases_eval() { + } + } + ++ // [paged P0 / track B] NVFP4/MXFP4 dense decode-shape mmq_y-down bit-exact gate. ++ // The dense FP4 weight GEMM is the track-B target; P1 lowers mmq_y (the weight-row tile) on the ++ // NVFP4 decode path to raise resident-CTA occupancy. mmq_y is a pure N-row tiling knob, so a ++ // smaller mmq_y must stay BIT-EXACT (identical per-output reduction over K) - this gate proves ++ // it. m = weight rows (N, tiled by mmq_y): 2048 (exact at mmq_y 64 & 128), 1600 (ragged vs 128), ++ // 2050 (ragged vs both 64 & 128 -> exercises the need_check last-row-tile at both). n = decode ++ // token count M = 32 and 128 (the scope decode shapes, tiled by mmq_x). k = 2048 hidden. Must ++ // pass with the default build (mmq_y=128) AND a mmq_y=64 build, CUDA-vs-CPU oracle, bit-exact. ++ for (ggml_type type_a : {GGML_TYPE_MXFP4, GGML_TYPE_NVFP4}) { ++ for (int64_t m : {2048, 1600, 2050}) { ++ for (int64_t n : {32, 128}) { ++ test_cases.emplace_back(new test_mul_mat(type_a, GGML_TYPE_F32, m, n, 2048, {1, 1}, {1, 1})); ++ } ++ } ++ } ++ + for (ggml_type type_a : all_types) { + test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 4, 2, false, 64, 16, 3*ggml_blck_size(type_a))); + } +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp/patches/paged/THROUGHPUT_B_P1_RESULTS.md b/backend/cpp/llama-cpp/patches/paged/THROUGHPUT_B_P1_RESULTS.md new file mode 100644 index 000000000..2a541f7ef --- /dev/null +++ b/backend/cpp/llama-cpp/patches/paged/THROUGHPUT_B_P1_RESULTS.md @@ -0,0 +1,126 @@ +# Track B P0 + P1 results: the FP4-MMA decode-GEMM occupancy tune (GB10, sm_121) + +Measured on the DGX (GB10 / DGX Spark, sm_121, `~/llama-paged-dev`, branch `paged`). Implements +`FP4_GEMM_SCOPE_B.md` P0 (baseline + bit-exact gate) and P1 (the cheap host/occupancy tile tune). +Dev-tree commit: **089f78d** (`feat(paged): FP4 decode GEMM track-B P0 gate + default-off occupancy +instrumentation`). Patch artifact: `0017-fp4-gemm-decode-tile-tune.patch`. + +**Headline verdict: the P1 occupancy kill-gate TRIPPED.** None of the cheap host/occupancy levers +lift dense or MoE decode_agg on GB10; every dense probe regresses and the nsys evidence shows the +FP4 GEMM kernel gets *slower* under register-capping. Nothing is enabled by default (the levers are +compile-time/env gated and the default build is byte-identical to stock). The one untested lever is +the structural `mmq_y`-down, which is **not** a host switch: it is coupled to `nwarps` by the +`nwarps*tile_C::I == mmq_y` static_assert, so it requires an `nwarps=4` warp-remap (P2 kernel work). + +All benches: `llama-batched-bench -fa on -c 32768 -ngl 99 -npp 128 -ntg 128 -npl 32,128`. +`decode_agg = S_TG` (aggregate decode tok/s). 3 reps dense, 2 reps MoE; medians below. + +## P0 baseline (mmq_y=128, minblocks=1 — stock) + +### Bit-exact parity gate (CPU oracle vs CUDA, deterministic) +- `test-backend-ops -o MUL_MAT -b CUDA0`: **1115/1115** (1103 stock + 12 new NVFP4/MXFP4 dense + decode-shape cases), NVFP4 0 fail. +- `test-backend-ops -o MUL_MAT_ID -b CUDA0`: **805/805**, NVFP4 0 fail. +- New P0 cases exercise the weight-row (`mmq_y`) tiling boundary: `type_a ∈ {NVFP4, MXFP4}`, + `m ∈ {2048 (exact at mmq_y 64/128), 1600 (ragged vs 128), 2050 (ragged vs both 64 & 128 → + need_check last row-tile)}`, `n ∈ {32, 128}` (decode M), `k = 2048`. They make the oracle cover + the `mmq_y`/min-blocks changes and stay bit-exact with every lever on. + +### Decode throughput (decode_agg = S_TG) +| model | npl32 | npl128 | +|---|---:|---:| +| DENSE q36-27b-nvfp4 | 117.3 | **149.5** | +| MoE q36-35b-a3b-nvfp4 (stock mmq_x=128/expert) | 262.6 | **336.3** | + +(For reference the scope §6 cites dense 161 / MoE 333 from a server harness; this is the cleaner +batched-bench A/B baseline. The relative P0→P1 deltas below are what the kill-gate turns on.) + +### nsys FP4 GEMM efficiency (dense, `-npp 64 -ntg 48 -npl 128`) +The decode FP4 weight GEMM kernel = `mul_mat_q`: +- **33.2 %** of GPU kernel time, total **2.782 s** / 4576 inst, **avg 608 µs/launch**. +- Plus `quantize_mmq_nvfp4` 9.1 % (the act-quant bucket — track A's target), `mul_mat_q<…,16,…>` + 5.8 % (prefill ubatch tiling), stream-k fixups ~0.5 %. + +This is the locked baseline; P1 must lower the GEMM kernel time (raise FP4-eff) to pass. + +## P1 — the cheap occupancy levers (all default-off, byte-identical when off) + +Three bit-exact, gated levers were added (`mmq.cuh`): +- `GGML_CUDA_FP4_MMQ_Y` (default 128): type-aware `get_mmq_y_host/device` plumbing for an NVFP4 + weight-row tile override. **Inert** — see "the mmq_y wall" below. +- `GGML_CUDA_FP4_MINBLOCKS` (default 1): NVFP4-only `__launch_bounds__` min-resident-CTAs lever + (register-caps the FP4-MMA kernel so >1 CTA co-resides). The bounded occupancy probe. +- `GGML_CUDA_FP4_DENSE_MMQ_X` (env, default off): dense col-tile re-read occupancy diagnostic + (the §4.1 A/B: does eating a 2× weight re-read at a smaller `mmq_x` buy net occupancy?). + +P1 parity: with `MINBLOCKS=2` the gate stays **MUL_MAT 1115/1115, MUL_MAT_ID 805/805, NVFP4 0 +fail** — register allocation is result-neutral, so bit-exactness holds. + +### DENSE decode_agg @ npl128 — every occupancy probe REGRESSES +| config | npl32 | npl128 | Δ vs P0 @npl128 | +|---|---:|---:|---:| +| P0 stock (mmq_y=128, minblocks=1) | 117.3 | **149.5** | — | +| MINBLOCKS=2 (2 resident CTAs via reg-cap) | 115.7 | 147.9 | **−1.1 %** | +| DENSE_MMQ_X=64 (2 col-tiles, 2× weight re-read) | 115.3 | 144.3 | **−3.5 %** | +| DENSE_MMQ_X=32 (4 col-tiles, 4× weight re-read) | 115.4 | 141.7 | **−5.2 %** | + +### MoE decode_agg @ npl128 — mmq_x-down regresses; min-blocks neutral +| config | npl32 | npl128 | Δ vs stock @npl128 | +|---|---:|---:|---:| +| stock (mmq_x=128/expert) | 262.6 | **336.3** | — | +| TILE32 | 262.1 | 336.0 | −0.1 % | +| TILE16 | 261.1 | 324.0 | **−3.7 %** | +| TILE8 | 260.8 | 316.6 | **−5.9 %** | +| MINBLOCKS=2 | 260.0 | 337.7 | +0.4 % (noise) | + +The MoE result reproduces patch 0015 exactly: q36-35b-a3b (256 tiny experts, GDN linear attention) +decode is GDN/bandwidth-bound, **not** col-tile-occupancy-bound, so tightening `mmq_x` below 64 +(the brief's "8–16 ideal") monotonically *loses*. 64 ≈ 32 ≈ stock is the floor. + +### nsys kill-gate evidence (the decisive datum) +`mul_mat_q` under MINBLOCKS=2: **2.782 s → 3.025 s**, avg **608 µs → 661 µs +(+8.7 % SLOWER)**. The FP4-MMA kernel needs >128 regs/thread; forcing 2 CTAs/SM register-caps it, +which **spills to local memory**, so the GEMM does *more* work per launch — occupancy did not +usefully rise, it inverted. FP4-eff went **down**, not up. Kill-gate tripped, with hard evidence. + +## Why P1 can't lift it (and why mmq_y-down is P2, not P1) + +The two orthogonal occupancy probes both regress: register-capping (minblocks↑) spills, and +col-tile-shrinking (mmq_x↓) re-reads the 18 GB weight set. This says the **dense M=128 tile is +already weight-read / one-read-optimal at mmq_x=128** — it is not occupancy-starved in a way the +cheap levers can fix. This contradicts the scope's central "self-inflicted occupancy, recover it by +raising resident CTAs" hypothesis *for the cheap levers*. + +The only lever that raises resident CTAs **without** spilling and **without** extra weight reads is +the structural `mmq_y`-down (smaller weight-row tile → smaller shared + smaller accumulator → more +CTAs, weights still read once). But `mmq_y` is **rigidly** `nwarps * tile_C::I = 8 * 16 = 128` +(the `mmq.cuh:3258` static_assert; `tile_C::I=16` is the fixed `m16n8k64` MMA shape). So +`mmq_y=64` requires **`nwarps=4`** — a warp-remap, not a host switch. That remap threads `nwarps` +through ~13 NVFP4-reachable sites including the **shared** `vec_dot_fp4_fp4_mma` (used by both NVFP4 +and MXFP4) and the loader/kernel nwarps lockstep, with real risk of a silent shared-mem/thread-block +mismatch. It was scoped but **deferred to P2** (the scope's own phase table also places `mmq_y`-down +at P2, after the P1 host-only knobs). The `get_mmq_y` host/device plumbing is committed and inert so +P2 only has to add the `nwarps` half. + +## Honest verdict vs the scope targets + +- **DENSE:** P1 (host knobs + min-blocks + re-read diagnostic) does **not** move decode_agg toward + the 391 target — it slightly *regresses* (149.5 → 147.9, 38 % of vLLM). The scope's P1 row + (~177, "honest: small") was optimistic; on GB10 the cheap levers are net-negative. The remaining + upside lives entirely in the P2 `mmq_y`-down (nwarps=4) kernel remap **plus** track A. Whether + that clears the floor is now an *open, unproven* question — the cheap-lever evidence here leans + *against* large occupancy upside (the tile already looks one-read-optimal), so the P2 ceiling is + plausibly lower than the scope's 316–328. +- **MoE:** the mmq_x-down lever (the brief's MoE P1) is a **confirmed dead-end on this model** + (regresses; GDN/BW-bound, reproduces patch 0015). min-blocks is neutral. No host-level MoE win. + +**Kill-gate: TRIPPED on both arms.** Per the brief this is *not* forced into a default-on change. +Committed: the P0 bit-exact gate + the default-off instrumentation + this honest record. Not pushed. + +## Reproduce +``` +# default (byte-identical stock): build-cuda as-is -> MUL_MAT 1115/1115, MUL_MAT_ID 805/805 +# occupancy probe: cmake build with -DGGML_CUDA_FP4_MINBLOCKS=2 (or flip the macro default) +# dense re-read A/B: GGML_CUDA_FP4_DENSE_MMQ_X=64 ./llama-batched-bench -m q36-27b-nvfp4.gguf ... +# nsys: nsys profile --trace cuda ... ; nsys stats --report cuda_gpu_kern_sum (watch mul_mat_q<40,128,0>) +```