feat(paged): qwen35moe NVFP4 activation-quantize de-dup (patch 0023)

Mirror patch 0023 + results into the paged series. Bit-exact MoE decode/prefill lever: ggml mul_mat_id re-quantizes each token's activation once per expert for the broadcast up/gate proj (ne11==1); quantize_mmq_nvfp4 has no cross-thread reduction, so the gathered blocks are byte-identical across experts. The lever quantizes the ne12 unique tokens once and gathers the block_fp4_mmq rows into the expert-gathered layout with a coalesced uint4 copy (144 B = 9 uint4); the GEMM is untouched and down_proj keeps the stock path. Measured (DGX GB10, on top of patch 0022, q36-35b-a3b-nvfp4): decode S_TG npl128 745.2 -> 758.1 t/s (+1.73%), npl32 +0.6%, prefill T_PP -4%; dense q36-27b-nvfp4 byte-flat. nsys: quantize_mmq_nvfp4 868 -> 457 ms, gather +32 ms (net -379 ms). Bit-exact: q36-27b 5951a5b4..., q36-35b-a3b 07db32c2... (on == off == 0022); test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805. On by default; GGML_CUDA_MOE_QUANT_DEDUP=0 restores stock. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 09:26:55 -04:00 · 2026-06-25 21:49:15 +00:00
parent 3c1ed67b4b
commit 02cbae5ea9
3 changed files with 538 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/0023-qwen35moe-nvfp4-quant-dedup.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0023-qwen35moe-nvfp4-quant-dedup.patch
@@ -0,0 +1,144 @@
+From f7409c2de2868a6a048d3c333329468b4cc9e483 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Thu, 25 Jun 2026 23:47:25 +0200
+Subject: [PATCH] feat(paged): qwen35moe NVFP4 activation-quantize de-dup
+ (patch 0023)
+
+Bit-exact decode/prefill lever for the MoE (qwen3.5moe) NVFP4 path. ggml`s
+mul_mat_id quantizes the EXPERT-GATHERED activation rows (ne11_flat =
+ne12*n_expert_used). For the broadcast up/gate projections (ne11 == 1) every
+expert of a token receives the SAME token activation, so the stock path
+re-quantizes each token n_expert_used times. quantize_mmq_nvfp4 produces each
+block as a pure per-thread function of its 16 consecutive inputs (no cross-thread
+reduction), so the gathered blocks are byte-identical across the experts.
+
+Lever: when ne11 == 1, quantize the ne12 UNIQUE token activations once, then
+gather the resulting block_fp4_mmq rows into the expert-gathered layout keyed by
+ids_src1 with a coalesced uint4 copy (block_fp4_mmq == 9 uint4 == 144 B). Pure
+byte copy of identical blocks, so the gathered buffer is byte-for-byte identical
+to re-quantizing each gathered row; the GEMM is untouched. down_proj
+(ne11 == n_expert_used, distinct per expert) keeps the stock path.
+
+Measured GB10 (sm_121a), on top of HEAD 8a3229f (patch 0022), q36-35b-a3b-nvfp4:
+- nsys decode-isolated: quantize_mmq_nvfp4 868 -> 457 ms/run (-411 ms), new
+  gather_mmq_fp4 +32 ms; net -379 ms of decode GPU-time.
+- S_TG npl128 745.2 -> 758.1 t/s (+1.73%), npl32 +0.6%; prefill T_PP -4%.
+- Dense q36-27b-nvfp4 byte-flat (no mul_mat_id): 373.24 t/s unchanged.
+
+Bit-exact gate (greedy --temp 0 --seed 1 md5, byte-identical to 0022):
+  q36-27b-nvfp4     5951a5b4d624ce891e22ab5fca9bc439 (unchanged)
+  q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd (de-dup on == off)
+  test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805.
+
+On by default; GGML_CUDA_MOE_QUANT_DEDUP=0 restores the stock re-quantize path.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/mmq.cu       | 21 +++++++++++++++++--
+ ggml/src/ggml-cuda/quantize.cu  | 37 +++++++++++++++++++++++++++++++++
+ ggml/src/ggml-cuda/quantize.cuh |  4 ++++
+ 3 files changed, 60 insertions(+), 2 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu
+index e1add5e..9933fa6 100644
+--- a/ggml/src/ggml-cuda/mmq.cu
+++ b/ggml/src/ggml-cuda/mmq.cu
+@@ -1,3 +1,4 @@
+#include <cstdlib>
+ #include "common.cuh"
+ #include "mmq.cuh"
+ #include "quantize.cuh"
+@@ -197,8 +198,24 @@ void ggml_cuda_mul_mat_q(
+         const int64_t s13 = src1->nb[3] / ts_src1;
+ 
+         if (use_native_fp4) {
+-            quantize_mmq_fp4_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
+-                                    ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
+            // 0023: de-dup the broadcast (up/gate) quantize. ne11==1 means src1 is shared
+            // across experts, so quantize the ne12 unique tokens once and gather the blocks.
+            static const bool moe_quant_dedup = []{
+                const char * e = getenv("GGML_CUDA_MOE_QUANT_DEDUP");
+                return e ? atoi(e) != 0 : true;  // 0023: on by default; GGML_CUDA_MOE_QUANT_DEDUP=0 disables
+            }();
+            if (moe_quant_dedup && ne11 == 1) {
+                const size_t nbytes_unique = ne12*ne10_padded * sizeof(block_q8_1)/QK8_1 +
+                    get_mmq_x_max_host(cc)*sizeof(block_q8_1_mmq);
+                ggml_cuda_pool_alloc<char> src1_unique(ctx.pool(), nbytes_unique);
+                quantize_mmq_fp4_cuda(src1_d, nullptr, src1_unique.get(), src0->type, ne10, s12, 0, 0,
+                                        ne10_padded, ne12, 1, 1, stream);
+                gather_mmq_fp4_cuda(src1_unique.get(), ids_src1.get(), src1_q8_1.get(),
+                                    ne11_flat, ne12, ne10_padded, stream);
+            } else {
+                quantize_mmq_fp4_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
+                                        ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
+            }
+         } else {
+             quantize_mmq_q8_1_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
+                                    ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
+diff --git a/ggml/src/ggml-cuda/quantize.cu b/ggml/src/ggml-cuda/quantize.cu
+index 39a500a..a7fd86f 100644
+--- a/ggml/src/ggml-cuda/quantize.cu
+++ b/ggml/src/ggml-cuda/quantize.cu
+@@ -419,6 +419,43 @@ void quantize_mmq_q8_1_cuda(
+     }
+ }
+ 
+// MoE NVFP4 quantize de-dup (0023): for the broadcast (up/gate) expert matmuls every
+// gathered row references one of ne12 unique token activations, so the stock path
+// re-quantizes each token n_expert_used times. Quantize the unique tokens once, then copy
+// the resulting block_fp4_mmq rows into the expert-gathered layout keyed by ids. This is a
+// pure byte copy of identical blocks => the gathered buffer is byte-identical to stock.
+static __global__ void gather_mmq_fp4(
+        const uint4 * __restrict__ unique, const int32_t * __restrict__ ids,
+        uint4 * __restrict__ gathered, const int ne11_flat, const int ne12_unique,
+        const int64_t total_words) {
+    constexpr int W = (int) (sizeof(block_fp4_mmq) / sizeof(uint4)); // 9 uint4 per 144B block
+    const int64_t t = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
+    if (t >= total_words) {
+        return;
+    }
+    const int     w   = (int) (t % W);
+    const int64_t ib  = t / W;                 // destination block index = kb*ne11_flat + j
+    const int     j   = (int) (ib % ne11_flat);
+    const int     kb  = (int) (ib / ne11_flat);
+    const int     src = ids[j];
+    const int64_t ib_u = (int64_t) kb * ne12_unique + src;
+    gathered[t] = unique[ib_u * W + w];
+}
+
+void gather_mmq_fp4_cuda(
+        const void * unique, const int32_t * ids, void * gathered,
+        int64_t ne11_flat, int64_t ne12_unique, int64_t ne0_padded, cudaStream_t stream) {
+    const int     blocks_per_col = (int) ((ne0_padded + QK_K - 1) / QK_K);
+    constexpr int W = (int) (sizeof(block_fp4_mmq) / sizeof(uint4));
+    const int64_t total_words = ne11_flat * (int64_t) blocks_per_col * W;
+    const int     bs = 256;
+    const dim3    block_size(bs, 1, 1);
+    const dim3    num_blocks((unsigned) ((total_words + bs - 1) / bs), 1, 1);
+    gather_mmq_fp4<<<num_blocks, block_size, 0, stream>>>(
+        (const uint4 *) unique, ids, (uint4 *) gathered,
+        (int) ne11_flat, (int) ne12_unique, total_words);
+}
+
+ void quantize_mmq_fp4_cuda(
+         const float * x, const int32_t * ids, void * vy, const ggml_type type_src0,
+         const int64_t ne00, const int64_t s01, const int64_t s02, const int64_t s03,
+diff --git a/ggml/src/ggml-cuda/quantize.cuh b/ggml/src/ggml-cuda/quantize.cuh
+index 768a3ae..7f64069 100644
+--- a/ggml/src/ggml-cuda/quantize.cuh
+++ b/ggml/src/ggml-cuda/quantize.cuh
+@@ -26,6 +26,10 @@ void quantize_mmq_q8_1_cuda(
+         ggml_type type_src0, int64_t ne00, int64_t s01, int64_t s02, int64_t s03,
+         int64_t ne0, int64_t ne1, int64_t ne2, int64_t ne3, cudaStream_t stream);
+ 
+void gather_mmq_fp4_cuda(const void * unique, const int32_t * ids, void * gathered,
+                         int64_t ne11_flat, int64_t ne12_unique, int64_t ne0_padded,
+                         cudaStream_t stream);
+
+ void quantize_mmq_fp4_cuda(const float *   x,
+                              const int32_t * ids,
+                              void *          vy,
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp/patches/paged/MOE_QUANT_DEDUP_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_QUANT_DEDUP_RESULTS.md
@@ -0,0 +1,71 @@
+# MOE_QUANT_DEDUP_RESULTS.md - patch 0023 (qwen35moe NVFP4 activation-quantize de-dup)
+
+Bit-exact MoE decode/prefill lever. Built + measured on DGX GB10 (sm_121a) on top of HEAD
+8a3229f (patch 0022). Companion analysis: NONRECURRENCE_BITEXACT.md (section "nonrec-build").
+
+## What
+
+ggml `mul_mat_id` quantizes the EXPERT-GATHERED activation rows: it allocates
+`ne11_flat = ne12 * n_expert_used` rows and quantizes each via `quantize_mmq_nvfp4(..., ids_src1)`.
+For the broadcast up/gate projections the activation is the per-token hidden state, the SAME for
+every expert that token routes to (`ne11 == 1`). So the stock path re-quantizes each token
+`n_expert_used` times (4x for q36-35b-a3b).
+
+`quantize_mmq_nvfp4` computes each `block_fp4_mmq` as a pure per-thread function of its 16
+consecutive inputs (per-thread amax, the +/-2 ue4m3 search, the e2m1 packing - NO cross-thread
+shfl/reduction). So the quantized block for a given token is byte-identical no matter which
+expert slot it lands in.
+
+## Lever
+
+When `ne11 == 1` (broadcast up/gate):
+1. Quantize the `ne12` UNIQUE token activations once into a compact buffer
+   (`quantize_mmq_fp4_cuda(src1_d, nullptr, ..., ne12, 1, 1)`, row stride `s12`).
+2. Gather the `block_fp4_mmq` rows into the expert-gathered layout keyed by `ids_src1`
+   (`gather_mmq_fp4`): `block_fp4_mmq == 9 * uint4 == 144 B`, copied with a coalesced uint4
+   kernel whose output is written fully contiguously (`gathered[t] = unique[ib_u*9 + w]`).
+
+Pure byte copy of identical blocks => the gathered buffer is byte-for-byte identical to
+re-quantizing each gathered row. The MMQ GEMM is UNTOUCHED. `down_proj`
+(`ne11 == n_expert_used`, distinct per expert) keeps the stock re-quantize path.
+
+The first gather draft (one thread copies one 144 B struct, scattered) was uncoalesced and cost
+478 ms - it ate 84% of the quantize saving and decode stayed flat. The shipped coalesced-uint4
+gather costs 32 ms.
+
+## Measurements (q36-35b-a3b-nvfp4 dense=q36-27b-nvfp4, -fa on, -npp 128 -ntg 128)
+
+nsys decode-isolated (`--cuda-graph-trace=node`, npp8 ntg128 npl128), per-run kernel sums:
+| kernel                | dedup off | dedup on |
+|-----------------------|-----------|----------|
+| quantize_mmq_nvfp4    | 868 ms    | 457 ms   |
+| gather_mmq_fp4        | -         | 32 ms    |
+| net quantize path     | 868 ms    | 489 ms   |  (-379 ms decode GPU-time)
+| gated_delta_net (50%) | unchanged | unchanged |
+| mul_mat_q<NVFP4>      | unchanged | unchanged |
+
+Decode S_TG (t/s), back-to-back same-build A/B (default-on vs GGML_CUDA_MOE_QUANT_DEDUP=0):
+| model           | npl32 off->on    | npl128 off->on        |
+|-----------------|------------------|-----------------------|
+| MoE q36-35b-a3b | 440.3 -> 442.8 (+0.6%) | 745.2 -> 758.1 (+1.73%) |
+| dense q36-27b   | 207.4 -> 206.9 (flat)  | 373.28 -> 373.24 (byte-flat) |
+
+Prefill: MoE T_PP 7.69 -> 7.38 s (~ -4% time). Dense unaffected (no `mul_mat_id`).
+
+## Bit-exact gate (greedy --temp 0 --seed 1 md5, byte-identical to 0022)
+
+| model            | md5 (default on)                     | == 0022 |
+|------------------|--------------------------------------|---------|
+| q36-27b-nvfp4    | 5951a5b4d624ce891e22ab5fca9bc439     | yes (dense untouched) |
+| q36-35b-a3b-nvfp4| 07db32c2bcb78d17a43ed18bc22705cd     | yes (on == off == 0022) |
+
+test-backend-ops: MUL_MAT 1115/1115, MUL_MAT_ID 805/805 (default on).
+
+## Knob
+
+On by default. `GGML_CUDA_MOE_QUANT_DEDUP=0` restores the stock per-expert re-quantize path
+(byte-identical output, used as the A/B baseline).
+
+Commits: DGX dev tree f7409c2; worktree patch `0023-qwen35moe-nvfp4-quant-dedup.patch`.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/NONRECURRENCE_BITEXACT.md
+++ b/backend/cpp/llama-cpp/patches/paged/NONRECURRENCE_BITEXACT.md
@@ -0,0 +1,323 @@
+# NONRECURRENCE_BITEXACT.md - bit-exact non-recurrence decode levers (label nonrec-design, READ-ONLY, no GPU)
+
+Post-0022 the gated-DeltaNet recurrence is at 84.6% BW = 102.6% of vLLM (3.488 ms/call), past parity.
+The remaining ~5% to vLLM lives in the non-recurrence path. Per the node-level decode trace (nsys
+`--cuda-graph-trace=node`, clean build, q36-27b-nvfp4 dense, npl128) the decode step is ONE replayed
+CUDA graph, ALL kernels on a SINGLE stream (stream 14), strictly serial, 99.94% GPU-busy, 0.06% idle.
+That single-stream-99.94%-busy fact is load-bearing for everything below: there is NO overlap, so any
+kernel GPU-time genuinely removed (or any kernel folded away) cuts wall-clock 1:1; and conversely, if a
+"faster kernel" leaves wall-clock flat, then the kernel did NOT actually get faster at the decode shape.
+
+Post-recurrence-fix kernel mix of the ~367 ms decode step (was 380.4 pre-0022; recurrence now smaller):
+- `mul_mat_q` FP4 GEMM (496 calls/step) ~24% (the biggest non-recurrence bucket)
+- `quantize_mmq_nvfp4` (496/step) ~4.5%
+- `nvjet` lm_head GEMM ~3.1%
+- `flash_attn_ext_f16` (16 attn layers) ~3.1%
+- elementwise glue: k_bin_bcast (gate mul+add) ~1.7%, unary_gated silu/sigmoid ~1.4%, rms_norm ~0.9%,
+  l2_norm ~0.2%, plus conv-state concat_cont/cpy (Lever-1 territory, not in this scope).
+
+Files read on the DGX 0022 tree (HEAD 8a3229f): `mmq.cuh`, `mmq.cu`, `quantize.cu`, `gated_delta_net.cu`,
+`fattn.cu`, `fattn-common.cuh`.
+
+---
+
+## RESOLUTION of the P2a puzzle (load-bearing) - mmmq_y=64 / minblocks: bit-exact but FLAT on decode
+
+The existing P2a machinery is two NVFP4-gated, default-stock flags in `mmq.cuh`:
+- `GGML_CUDA_FP4_MMQ_Y` (L143-163): overrides the weight-row N-tile `mmq_y` 128 -> 64/96 for NVFP4 on
+  Blackwell. mmq_y tiles N (output rows); each weight row lives in exactly one row-tile, so total weight
+  traffic is unchanged. **Bit-exact**: the per-output K-reduction is the `for frag` loop in
+  `vec_dot_fp4_fp4_mma` (L1097-1108, `sum[...] += C.x[l]`), whose order is independent of mmq_y. md5-
+  verified in prior runs (1115/805 gate, byte-identical).
+- `GGML_CUDA_FP4_MINBLOCKS` (L205-216): raises the `__launch_bounds__` min-blocks operand (L3579-3585)
+  for NVFP4 so >1 CTA co-resides per SM. **Bit-exact**: register allocation / occupancy cannot change
+  results.
+
+The paradox restated: P2a made a standalone `mul_mat_q<NVFP4,m=128>` -24.7% faster (bit-exact), yet
+decode was FLAT (335->336 post-0020). The trace says decode is 99.94% single-stream busy and mul_mat_q
+is ~24% of it, so a -24.7% cut should give ~+6%. RESOLUTION (airtight, from the single-stream fact):
+
+> On a 99.94%-busy single stream, freed kernel GPU-time MUST lower the wall 1:1. Decode is flat =>
+> mmq_y=64 did NOT free per-call GPU-time at the DECODE shapes => the -24.7% was measured at a
+> NON-decode shape (a single large-N or prefill-M GEMM that runs enough waves to reach asymptotic
+> throughput). There is no contradiction; the two measurements are at different GEMM shapes.
+
+Mechanism (grounded in the launch path, `launch_mul_mat_q` L3989-4088): decode runs ONE `mul_mat_q` per
+weight with mmq_x=128 fused tokens => ntx=1, and the grid is `nty = N / mmq_y` CTAs (xy-tiling, or
+stream-k at nsm=48 when `tiles_efficiency_percent < 90`, L4044-4047). The 496 decode GEMMs have small N:
+- FFN up/gate N=17408 -> nty=136 CTAs (mmq_y=128) = ceil(136/48)=3 waves, last wave 40/48=83% full
+- FFN down / qkv / o-proj N~5120-6144 -> nty=40-48 CTAs = 1 wave (and eff<90 => stream-k at 48 CTAs)
+
+So EVERY decode GEMM is a 1-3 wave, 40-136 CTA kernel: it is **ramp + tail (wave-quantization) bound**,
+dominated by the first-wave weight-load latency before any MMA can start plus the fractional last wave -
+NOT by steady-state occupancy. mmq_y=64 doubles the grid (272 CTAs, 6 waves for the fat FFN) which only
+helps the ASYMPTOTIC achieved-BW the microbench measures; at 1-3 waves there is no steady state for it
+to act over, and each CTA now carries half the arithmetic-per-weight-load so the ramp is relatively MORE
+exposed. minblocks=2 is worse: the FP4 MMA is register-bound at ~255 regs/thread (the `(256,1)` bound),
+so forcing 2 CTAs/SM register-caps to ~128 regs => heavy spill => net-negative. Both are the in-wave
+occupancy lever, and the decode GEMM has no in-wave occupancy problem - it has a too-few-waves problem.
+
+VERDICT: re-test P2a (mmq_y=64, and 96) and minblocks=2 ON TOP of 0022 because it is a FREE one-build
+re-test (flags already exist, default stock). **Design prediction: still ~flat (maybe +1-2% from the
+one fat-FFN N=17408 GEMM that has 3->6 waves of room; ~0% from the 1-wave thin GEMMs).** The decisive
+measurement for the reprofile agent is NOT a standalone microbench - it is the PER-CALL `mul_mat_q`
+GPU-time at the REAL decode shapes (the 496 calls), flag on vs off, summed. If per-call decode time
+drops, it ships (free bit-exact win). If per-call decode time is ~unchanged (predicted), the -24.7%
+was a large-N artifact and the GEMM has no bit-exact occupancy lever - confirming the structural wall.
+
+WHY the decode GEMM has no high-value bit-exact lever: its bottleneck is wave-quantization at a small
+grid. The only knobs that change the grid are (a) mmq_y-down [bit-exact, flat per above], (b) mmq_x-down
+[FORBIDDEN: re-reads the 18 GB weights ntiles_x times, strictly worse, and pins one-read], (c) the
+stream-k-vs-tiling threshold [FORBIDDEN for bit-exactness: stream-k splits each output tile's K-sum
+across CTAs and re-adds via the fixup kernel - a DIFFERENT K-accumulation order than one-CTA-full-K
+tiling, so flipping the L4047 threshold changes which path a GEMM takes and breaks md5 vs the 0022
+baseline]. So at the bandwidth/wave-quant floor for these tiny grids, 3% FP4 efficiency is structural;
+no order-preserving change moves it.
+
+---
+
+## RANKED bit-exact non-recurrence levers
+
+Ranked by expected bit-exact decode gain. "Bit-exact-safe" = keeps the exact reduction/FMA order; the
+gate is md5-identity to llama 0022 f32 output on both models (dense + MoE), greedy temp0.
+
+### 1. Quantize producer-fold (Track A) - bit-exact-safe - ceiling 4.5%, realistic ~2-2.5%
+Fold `quantize_mmq_nvfp4` (4.5%, ~17 ms, 496/step) into the PRODUCER epilogue (the rms_norm / silu that
+emits each GEMM's activation), so the f32 activation is quantized to `block_fp4_mmq` directly from the
+producer's registers instead of being written to HBM as f32 and re-read by a standalone quantize kernel.
+- **Bit-exactness: SAFE, and unusually clean.** `quantize_mmq_nvfp4` (quantize.cu:78-171) computes
+  `amax_raw` PER-THREAD over the thread's own QK_NVFP4_SUB=16 values (L108-118) with NO cross-thread
+  shfl/reduction (unlike `quantize_mmq_q8_1` which does a warp shfl_xor). Each thread independently runs
+  the +/-2 ue4m3 scale search (L120-150) and `ggml_cuda_float_to_fp4_e2m1` packing (L155-166). So the
+  output block is a pure per-thread function of its 16 inputs. Copy that arithmetic VERBATIM into the
+  producer epilogue and the `block_fp4_mmq` bytes are identical => md5-safe. The only requirement is the
+  producer thread-layout owns contiguous 16-element K-sub-blocks (feasible for an rms_norm/silu epilogue).
+- **Expected gain:** the win is removing the standalone kernel's f32 activation READ (the producer already
+  holds the f32); the quant compute + fp4 write still happen (now folded). So ~the read-half of the 17 ms,
+  ~2-2.5% of the step, and it is REAL because the step is single-stream 99.94% busy (no overlap to hide
+  the removed kernel).
+- **Trap / caveat:** the SPENT "Lever-2" was a DIFFERENT fusion (quantize -> GEMM *consumer* prologue,
+  measured net-zero because the GEMM still reads the same activation bytes). Track A is the *producer*
+  fold and removes a true f32 round-trip, so it is not subject to that flatness - but it needs real
+  producer-kernel surgery + the frozen `block_fp4_mmq` ABI (mmq.cuh:53), more plumbing than the others.
+- Ranked #1: largest cleanly-bit-exact non-GEMM bucket, no reduction trap (per-thread quant).
+
+### 2. Activation / op fold - POINTWISE subset only - bit-exact-safe - realistic ~1.5-2.5%
+Fold the pure pointwise glue off the single-stream chain into the adjacent kernel's epilogue/prologue:
+the GDN residual ADDs and gate MULs (`k_bin_bcast`, ~1.7%), the `silu`/`sigmoid` (`unary_gated`, ~1.4%,
+the part that is the output gate, not FFN), and the post-GDN gate MUL after the output rms_norm.
+- **Bit-exactness: SAFE for the pointwise ops only.** Add/mul/silu/sigmoid are elementwise fp32 with the
+  same formula and the same op order whether standalone or folded => byte-identical. This is the bit-exact
+  half of the prior Lever-3 design.
+- **THE TRAP (FORBIDDEN half):** the `rms_norm`/`l2_norm` REDUCTIONS must NOT be re-folded with a
+  different reduction tree. The standalone `l2_norm_f32<32>`/`rms_norm_f32` use a specific warp/block
+  reduction; folding the norm into a kernel with a different `warp_reduce_sum` width or eps placement
+  (`x*rsqrt(sumsq+eps)` vs `x/max(sqrt(sumsq),eps)`) changes the last ULP => breaks md5. Fold the MUL that
+  FOLLOWS the norm (pointwise, safe); do NOT fold the norm's reduction. (This is the direct analog of the
+  f32x4 lane-remap trap that blocked the recurrence's vectorized state loads: any change to a reduction's
+  grouping is forbidden.)
+- **Expected gain:** ceiling ~3.3% (the Lever-3 slice), realistic ~1.5-2.5% once the norm reductions are
+  excluded. Real (single-stream, no overlap), bounded, lower plumbing than #1 (no new ABI).
+- Ranked #2: smaller than #1 and the high-value pieces (norms) are off-limits.
+
+### 3. mul_mat_q occupancy retune (existing P2a: mmq_y=64/96, minblocks=2) - bit-exact-safe - ~FLAT
+See the P2a resolution above. Bit-exact-safe (N-tiling / register-cap preserve the K-reduction order;
+md5-verified). Design prediction FLAT on decode (decode GEMMs are 40-136 CTA, 1-3 wave, ramp/tail-bound;
+the -24.7% was an asymptotic large-N number). **Worth the one-build re-test only because it is free**
+(flags exist, default stock). Possible marginal +1-2% from the single N=17408 fat-FFN GEMM (3->6 waves).
+Measure PER-CALL decode-shape `mul_mat_q` time, not a microbench. Ranked #3: zero plumbing, but low/zero
+expected gain - it is the diagnostic that confirms the GEMM wall is structural, not a shippable lever.
+
+### 4. Attention occupancy (flash_attn_ext_f16) - NO bit-exact lever - NO-GO
+`flash_attn_ext_f16` is ~3.1% (11.67 ms, 16 attn layers), grid 48 CTAs = exactly ONE full wave on 48
+SMs (trace). There is no occupancy headroom (already 1 wave, perfectly filled, no tail) and no in-wave
+under-occupancy to fix. The only knobs that change the attention grid are split-KV / parallel_blocks /
+a different KV-tile (the `ncols1`/`ncols2`/`cols_per_block` selection in `fattn.cu`), and EVERY one of
+them changes the online-softmax running-max/sum RESCALING ORDER across KV blocks => NOT bit-exact
+(forbidden, the softmax-rescale analog of the reduction-tree trap). At 3.1% with one full wave the
+attention is effectively at floor. Ranked last: no bit-exact lever exists; do not pursue.
+
+---
+
+## FORBIDDEN levers (require a precision or accumulation-order change - excluded by the gate)
+- Stream-k vs plain-tiling threshold flip for the GEMM wave-quant tail: splits + re-adds the K-sum across
+  CTAs => different f32 accumulation order than one-CTA-full-K tiling => breaks md5.
+- Vectorized / lane-remapped tile loads in the GEMM (`load_tiles_nvfp4_nvfp4` / `load_ldmatrix`): any
+  remap of which lane holds which K-element changes the MMA fragment->accumulator mapping => can change
+  the per-output sum grouping => forbidden (the f32x4 lane-remap trap, same class that blocked the
+  recurrence's vectorized state loads).
+- mmq_x-down at dense decode: re-reads the 18 GB weights `ntiles_x` times. Order-preserving but strictly
+  slower and breaks the one-read invariant; not a lever.
+- Folding rms_norm / l2_norm with a different reduction tree or eps placement: last-ULP change => md5 break.
+- flash-attn split-KV / KV-retile: changes the online-softmax rescale order => not bit-exact.
+- bf16 state / bf16 anything: precision change, SHELVED, forbidden by the gate.
+
+---
+
+## One-line summary for the parent
+The remaining non-recurrence decode gap has NO single big bit-exact lever. The largest cleanly bit-exact
+win is the **quantize producer-fold (Track A, ~2-2.5%, the per-16 NVFP4 quant has no cross-thread
+reduction so it copies verbatim into the rms_norm/silu epilogue)**; second is the **pointwise activation
+fold (~1.5-2.5%, fold the residual adds / gate muls / silu but NOT the norm reductions)**; the
+**mul_mat_q occupancy retune (P2a mmq_y/minblocks) is bit-exact but predicted FLAT** (decode GEMMs are
+small-grid wave-quant/ramp-bound, so the -24.7% asymptotic number does not apply per-call - confirmed by
+the airtight single-stream-99.94%-busy logic, re-test only because the flag is free); and **attention has
+NO bit-exact lever** (already one full wave; every grid knob changes the softmax rescale order). The
+P2a puzzle is resolved: not a contradiction - the -24.7% and the flat decode are simply at different GEMM
+shapes (large-N asymptotic vs 1-3-wave decode per-call).
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+
+---
+
+# EMPIRICAL P2a RE-TEST ON 0022 (label reprofile-puzzle, GPU agent) - measured, build + bench + nsys
+
+The design section above PREDICTED P2a flat from the single-stream logic. This section is the GPU
+measurement that CONFIRMS it byte-for-byte, plus one load-bearing correction: an early "+11% decode"
+A/B was a STALE-BASELINE artifact, not the flag. Box: DGX GB10 (sm_121a), HEAD 8a3229f (patch 0022),
+SM+MEM clock pinned 2190 MHz (verified via `nvidia-smi dmon`, identical base vs flag - NOT a clock story).
+
+## (1) Fresh node-level decode decomposition (nsys --cuda-graph-trace=node, dense q36-27b-nvfp4, npl128)
+Per-instance trace windowed to one steady decode step (103 steady steps, step = 48 GDN-layer boundaries):
+
+  Committed-default build (build-cuda-base, 336 t/s @128) -- step span 383.1 ms, kernel-busy 99.24-99.30%:
+    gated_delta_net (SSM recurrence)   193.97 ms/step   51.0%   <- BINDING KERNEL
+    mul_mat_q<NVFP4,m=128,nc=0>         93.64 ms/step   24.6%   <- the P2a target
+    quantize_mmq_nvfp4                  16.77 ms/step    4.4%
+    nvjet (cublas lm_head GEMM)         12.07 ms/step    3.2%
+    flash_attn_ext_f16                  11.69 ms/step    3.1%
+    concat_cont 8.14 / cpy_scalar 7.49 / k_get_rows 7.29 / ssm_conv 6.55 / silu 5.32 / k_bin_bcast 4.67
+    mul_mat_q_stream_k_fixup 3.95 / rms_norm 3.56 / ... ; SUM 380.1 ms = 99.24% of the 383.1 ms wall.
+
+  conv-inplace + GDN(16,8) build (the 374 t/s state) -- step span 345.3 ms, kernel-busy 99.0%:
+    gated_delta_net 167.99 (49.2%), mul_mat_q<NVFP4,128,0> 93.79 (27.5%), quantize 17.66 (5.2%),
+    nvjet 12.05 (3.5%), flash_attn 11.66 (3.4%), ssm_conv(fused update) 8.44 (2.5%), k_get_rows 7.32 (2.1%).
+
+  BINDING KERNEL = gated_delta_net (~49-51% of the step) in BOTH; mul_mat_q<NVFP4,m=128> is #2 (~25-27.5%).
+  Decode is ~99.0-99.3% GPU-busy single-stream (confirms the 99.94% claim; ~0 idle, strictly serial).
+
+## (2) P2a A/B - the -DGGML_CUDA_FP4_MMQ_Y=64 nwarps-remap, re-applied + built + bit-exact-gated on 0022
+The committed 0022 machinery was PARTIAL (patch 0017 templated get_mmq_y_device<type> but left
+mmq_get_nwarps_device() stock -> mmq_y=64 + nwarps=8 fails static_assert nwarps*tile_C::I==mmq_y at
+mmq.cuh:3280). Re-derived the full threading: templated mmq_get_nwarps_device<type>() -> mmq_y/16 (=4)
+for NVFP4+flag; type-aware mmq_get_nwarps_host(...,type); threaded <type> through the NVFP4 loader (998),
+write_back_mma (3266), process_tile (3500), mul_mat_q launch_bounds (3579/83/85) + body (3602),
+stream_k_fixup launch_bounds (3832) + body (3843), 2 host launch sites (3994/4172). Reverted after.
+
+  cuobjdump proof the flag took effect: mul_mat_q<NVFP4,m=128,nc=0> STACK 112 -> 56 (256-thr/8-warp CTA
+  -> 128-thr/4-warp CTA => 1 -> 2 resident CTAs/SM). REG 255 (HW-capped), no new spill.
+  BIT-EXACT GATE (HELD): test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805; greedy md5 base==flag
+  IDENTICAL = 5951a5b4d624ce891e22ab5fca9bc439 (matches the prior P2a gate hash). Byte-identical output.
+
+  CLEAN A/B (same build dir, ONLY mmq.cuh toggled => non-mmq .o byte-identical; back-to-back, pinned clocks)
+  S_TG t/s, llama-batched-bench -fa on -npp128 -ntg128:
+    DENSE q36-27b:   npl 32  208.02 -> 207.51 (-0.2%)   npl 128  374.30 -> 373.19 (-0.3%)   FLAT
+    MoE  q36-35b-a3b: npl 32  438.83 -> 439.30 (+0.1%)   npl 128  745.71 -> 745.07 (-0.1%)   FLAT
+  Prefill S_PP also flat at 0022 (npp128 1056->1050; npp2048/npl1 1028.85->1024.19).
+
+## (3) RESOLUTION - why FLAT, where the GEMM time goes, and a correction to the prior "-24.7%->+6%" logic
+Decode-isolated per-kernel A/B (node trace, same-source toggle, identical non-mmq code):
+    gated_delta_net          167.99 -> 167.89 ms/step  (IDENTICAL - it is byte-identical code, untouched)
+    mul_mat_q<NVFP4,128,0>    93.79 ->  92.74 ms/step  (-1.1%, FLAT)            <- the P2a target, decode shape
+    mul_mat_q_stream_k_fixup   3.96 ->   5.65 ms/step  (+1.7ms, REGRESSES at nwarps/2=2)
+  => the decode mmq FAMILY is flat-to-slightly-WORSE; the flag delivers ~nothing at the m=128 decode shape.
+
+The "-24.7%" is REAL but it is a PREFILL-shape number. Full-run aggregate (npp128 ntg128, prefill+decode)
+mul_mat_q<NVFP4,128>: 19630 -> 17569 ms = -10.5%; subtracting the flat decode portion (93.8x128 vs
+92.7x128) leaves the prefill-shape portion at 7625 -> 5699 ms = -25.3% (matches the prior -24.7%). So the
+occupancy lever genuinely cuts the COMPUTE/occupancy-bound prefill-shape GEMM ~25%, and ~0 of the
+BANDWIDTH-bound m=128 decode-shape GEMM (it reads the full NVFP4 weight matrix from 273 GB/s LPDDR5x; the
+mmq_y knob is deliberately bandwidth-neutral - every weight row still read once - so it cannot move a
+bandwidth-bound wall). Confirmed at the SOURCE-of-decode level, not inferred.
+
+Reconciling with "99.94% busy single stream => a -24.7% cut should give ~+6%": the PREMISE is false. The
+flag does NOT cut the decode mul_mat_q by 24.7% (it cuts it 1.1%). There is therefore NO freed time on the
+99% busy stream - so the "where does the freed time go (idle gaps?)" question is moot: no time is freed at
+the decode shape. The contradiction dissolves: mul_mat_q IS on the critical path AND single-stream-busy, but
+the lever simply doesn't accelerate the decode-shape invocation. (Net it slightly hurts via stream_k_fixup.)
+
+CORRECTION to an earlier in-session A/B (recorded so the parent does not chase it): a first pass showed
+build-cuda-base 334.6 -> "flag" 372 (+11%). That was a STALE-BASELINE artifact, NOT the flag. build-cuda-base
+(binaries 18:46) was compiled from a pre-0021 source - it has NO ssm_conv_update_f32 (cuobjdump symbol count
+0 vs 4 in the 0022 build) and the un-retuned GDN default (gated_delta_net 194 vs 168 ms/step). Those ~40 ms
+of non-mmq differences (conv fuse ~14 ms + GDN ~26 ms) are the entire 334.6->373 gap. With a correct
+same-source baseline (toggle ONLY mmq.cuh in one build dir) the flag is flat (373.19 vs 374.30). Lesson:
+the only valid P2a A/B holds every non-mmq .o byte-identical; comparing two independently-built trees mixes
+in whatever other flag/patch state each was built from.
+
+## VERDICT
+P2a (mmq_y=64 nwarps-remap) is BIT-EXACT (md5-identical, 1115/805) and a genuine ~25% PREFILL-shape FP4-GEMM
+kernel win, but it is FLAT on decode (dense and MoE, npl 32 and 128) on 0022, AND flat on end-to-end prefill
+S_PP at 0022 (prefill is GDN/other-bound at these sizes, not mmq-bound). It is NOT a decode-parity lever and
+the decode commit-gate (lift decode_agg) is NOT met -> do NOT ship for decode. The binding decode kernel is
+gated_delta_net (~50%); the only decode levers left are the bit-exact folds in the design section above
+(quantize producer-fold ~2-2.5%, pointwise activation fold ~1.5-2.5%) and the GDN-region launch/fusion that
+vLLM already has. The mmq P2a machinery was reverted; the 0022 tree is left git-clean.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+
+---
+
+# nonrec-build (GPU agent) - built + measured. Lever shipped: MoE NVFP4 quantize de-dup (patch 0023)
+
+Box: DGX GB10 (sm_121a), baseline = clean rebuild of HEAD 8a3229f (patch 0022) in build-cuda
+(verified: mmq.cu.o rebuilt from clean source; the A/B-left binary was stale). md5 references
+locked: q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439, q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd.
+Baseline decode S_TG: dense 208.7/373.6, MoE 441/746 (npl 32/128). ncu unavailable (no
+GPU-counter permission, no sudo) -> all verdicts are nsys + back-to-back same-build A/B.
+
+## Levers EVALUATED
+
+### A. quantize_mmq_nvfp4 occupancy retune (token-packing) - BIT-EXACT, FLAT -> not shipped
+The decode quantize at the K=2048 shape is grid (128,1,1) = 128 CTAs = ~2.67 waves on 48 SMs.
+Unlike mul_mat_q (bandwidth-bound on LPDDR5x, so P2a was flat), quantize moves trivial memory,
+so I tried packing TPB token-rows per CTA (blockDim.y) to cut wave-quant - each thread still
+quantizes its own 16 consecutive values, so byte-identical (md5 5951a5b4/07db32c2 held at TPB
+1/2/4, after fixing the output ib index to use the token i1 not blockIdx.x). Result: DENSE npl128
+DEAD-FLAT 373.25 across TPB 1/2/4; npl32 and MoE flat-to-slightly-WORSE at TPB>1. The decode
+quantize is at its best config already (TPB=1 = max CTA parallelism = best latency hiding;
+fewer/bigger CTAs hurt). Second bit-exact occupancy lever (after P2a) proven flat. Reverted.
+
+### B. skip-ALL-quantize probe (NON-bit-exact, diagnostic) - the +30% MoE number is an ARTIFACT
+Skipping quantize_mmq_fp4_cuda entirely (garbage buffer, FP4-MMA timing data-independent) showed
+DENSE +2.7%/+3.7% (npl128/32) but MoE +29.9%/+43.9%. The MoE figure is NOT a valid ceiling: the
+garbage activation also corrupts the router (ffn_gate_inp) quantize -> degenerate topk expert
+selection -> less / better-localized expert work -> artificially fast. The authoritative
+decode decomposition (nsys --cuda-graph-trace=node, npp8 ntg128 npl128) shows quantize is only
+3.7% of MoE decode GPU-time, not 23%. Dense +2.7% IS real (rms_norm-fold territory, see D).
+
+### C. SHIPPED - MoE NVFP4 activation-quantize de-dup (patch 0023) - BIT-EXACT, lifts decode+prefill
+ggml mul_mat_id quantizes the gathered rows ne11_flat = ne12*n_expert_used. For the broadcast
+up/gate proj (ne11==1) every expert of a token sees the SAME token activation, so stock
+re-quantizes each token n_expert_used (=4 here) times. quantize_mmq_nvfp4 has NO cross-thread
+reduction (per-16-element per-thread), so the gathered blocks are byte-identical across experts.
+Lever: quantize the ne12 unique tokens once, then gather the block_fp4_mmq rows into the
+expert-gathered layout with a coalesced uint4 copy (block_fp4_mmq = 9 uint4 = 144 B). GEMM
+untouched; down_proj (ne11==n_expert_used, distinct) keeps stock.
+- Gather v1 (per-thread 144 B struct copy) was UNCOALESCED: gather 478 ms ate 84% of the 570 ms
+  quantize saving -> flat. Gather v2 (coalesced uint4, output written contiguously) = 32 ms.
+- nsys decode-isolated: quantize_mmq_nvfp4 868 -> 457 ms/run (-411 ms), gather +32 ms, net -379 ms.
+- DECODE S_TG: MoE npl128 745.2 -> 758.1 (+1.73%), npl32 +0.6%. PREFILL T_PP -4%. DENSE byte-flat.
+- BIT-EXACT GATE (default on): q36-27b 5951a5b4 (unchanged), q36-35b-a3b 07db32c2 (on==off==0022);
+  test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805. On by default; GGML_CUDA_MOE_QUANT_DEDUP=0
+  restores stock. Committed: DGX f7409c2 + worktree patch 0023.
+
+### D. NOT built - dense quantize producer-fold (rms_norm -> fp4) - real but ~2.7%, needs graph fusion
+Dense decode quantize is ~2.7% (skip B, real). Folding it into the rms_norm+mul producer is
+bit-exact-feasible (keep the strided sumsq reduction byte-identical, re-partition only the
+writeback to 16-consecutive-per-thread + the verbatim per-thread quant) but requires a 3-op
+{RMS_NORM,MUL,MUL_MAT(NVFP4)} graph fusion hoisting the GEMM into the producer node and a
+mul_mat_q pre-quantized-src1 path (the scratch is a per-call pool buffer). High plumbing for
+~2.7% dense only; left for a follow-up. mul_mat_q (bandwidth wall), flash_attn (softmax rescale
+order), lm_head (cublas) have NO bit-exact lever.
+
+## Verdict
+The non-recurrence path has ONE shippable bit-exact decode lever found and built: the MoE
+quantize de-dup (0023, +1.73% MoE npl128 decode + 4% prefill, dense untouched, byte-identical).
+It is the only redundant-work bucket; the rest of the non-recurrence kernels are at their
+bit-exact floor (mul_mat_q bandwidth-bound, quantize occupancy-flat, attention softmax-locked).
+The remaining bit-exact headroom is the dense rms_norm->fp4 producer-fold (~2.7% dense, graph-
+fusion surgery, not built) and then bf16 state (precision change, shelved) - no other bit-exact
+lever moves the LPDDR5x-bandwidth-bound, recurrence-dominated (~50%, past vLLM parity) decode wall.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]