feat(paged): qwen35moe NVFP4 activation-quantize de-dup (patch 0023)

Mirror patch 0023 + results into the paged series. Bit-exact MoE decode/prefill
lever: ggml mul_mat_id re-quantizes each token's activation once per expert for
the broadcast up/gate proj (ne11==1); quantize_mmq_nvfp4 has no cross-thread
reduction, so the gathered blocks are byte-identical across experts. The lever
quantizes the ne12 unique tokens once and gathers the block_fp4_mmq rows into the
expert-gathered layout with a coalesced uint4 copy (144 B = 9 uint4); the GEMM is
untouched and down_proj keeps the stock path.

Measured (DGX GB10, on top of patch 0022, q36-35b-a3b-nvfp4): decode S_TG npl128
745.2 -> 758.1 t/s (+1.73%), npl32 +0.6%, prefill T_PP -4%; dense q36-27b-nvfp4
byte-flat. nsys: quantize_mmq_nvfp4 868 -> 457 ms, gather +32 ms (net -379 ms).
Bit-exact: q36-27b 5951a5b4..., q36-35b-a3b 07db32c2... (on == off == 0022);
test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805. On by default;
GGML_CUDA_MOE_QUANT_DEDUP=0 restores stock.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2026-06-25 21:49:15 +00:00
parent 3c1ed67b4b
commit 02cbae5ea9
3 changed files with 538 additions and 0 deletions

View File

@@ -0,0 +1,144 @@
From f7409c2de2868a6a048d3c333329468b4cc9e483 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Thu, 25 Jun 2026 23:47:25 +0200
Subject: [PATCH] feat(paged): qwen35moe NVFP4 activation-quantize de-dup
(patch 0023)
Bit-exact decode/prefill lever for the MoE (qwen3.5moe) NVFP4 path. ggml`s
mul_mat_id quantizes the EXPERT-GATHERED activation rows (ne11_flat =
ne12*n_expert_used). For the broadcast up/gate projections (ne11 == 1) every
expert of a token receives the SAME token activation, so the stock path
re-quantizes each token n_expert_used times. quantize_mmq_nvfp4 produces each
block as a pure per-thread function of its 16 consecutive inputs (no cross-thread
reduction), so the gathered blocks are byte-identical across the experts.
Lever: when ne11 == 1, quantize the ne12 UNIQUE token activations once, then
gather the resulting block_fp4_mmq rows into the expert-gathered layout keyed by
ids_src1 with a coalesced uint4 copy (block_fp4_mmq == 9 uint4 == 144 B). Pure
byte copy of identical blocks, so the gathered buffer is byte-for-byte identical
to re-quantizing each gathered row; the GEMM is untouched. down_proj
(ne11 == n_expert_used, distinct per expert) keeps the stock path.
Measured GB10 (sm_121a), on top of HEAD 8a3229f (patch 0022), q36-35b-a3b-nvfp4:
- nsys decode-isolated: quantize_mmq_nvfp4 868 -> 457 ms/run (-411 ms), new
gather_mmq_fp4 +32 ms; net -379 ms of decode GPU-time.
- S_TG npl128 745.2 -> 758.1 t/s (+1.73%), npl32 +0.6%; prefill T_PP -4%.
- Dense q36-27b-nvfp4 byte-flat (no mul_mat_id): 373.24 t/s unchanged.
Bit-exact gate (greedy --temp 0 --seed 1 md5, byte-identical to 0022):
q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439 (unchanged)
q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd (de-dup on == off)
test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805.
On by default; GGML_CUDA_MOE_QUANT_DEDUP=0 restores the stock re-quantize path.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/src/ggml-cuda/mmq.cu | 21 +++++++++++++++++--
ggml/src/ggml-cuda/quantize.cu | 37 +++++++++++++++++++++++++++++++++
ggml/src/ggml-cuda/quantize.cuh | 4 ++++
3 files changed, 60 insertions(+), 2 deletions(-)
diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu
index e1add5e..9933fa6 100644
--- a/ggml/src/ggml-cuda/mmq.cu
+++ b/ggml/src/ggml-cuda/mmq.cu
@@ -1,3 +1,4 @@
+#include <cstdlib>
#include "common.cuh"
#include "mmq.cuh"
#include "quantize.cuh"
@@ -197,8 +198,24 @@ void ggml_cuda_mul_mat_q(
const int64_t s13 = src1->nb[3] / ts_src1;
if (use_native_fp4) {
- quantize_mmq_fp4_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
- ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
+ // 0023: de-dup the broadcast (up/gate) quantize. ne11==1 means src1 is shared
+ // across experts, so quantize the ne12 unique tokens once and gather the blocks.
+ static const bool moe_quant_dedup = []{
+ const char * e = getenv("GGML_CUDA_MOE_QUANT_DEDUP");
+ return e ? atoi(e) != 0 : true; // 0023: on by default; GGML_CUDA_MOE_QUANT_DEDUP=0 disables
+ }();
+ if (moe_quant_dedup && ne11 == 1) {
+ const size_t nbytes_unique = ne12*ne10_padded * sizeof(block_q8_1)/QK8_1 +
+ get_mmq_x_max_host(cc)*sizeof(block_q8_1_mmq);
+ ggml_cuda_pool_alloc<char> src1_unique(ctx.pool(), nbytes_unique);
+ quantize_mmq_fp4_cuda(src1_d, nullptr, src1_unique.get(), src0->type, ne10, s12, 0, 0,
+ ne10_padded, ne12, 1, 1, stream);
+ gather_mmq_fp4_cuda(src1_unique.get(), ids_src1.get(), src1_q8_1.get(),
+ ne11_flat, ne12, ne10_padded, stream);
+ } else {
+ quantize_mmq_fp4_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
+ ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
+ }
} else {
quantize_mmq_q8_1_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
diff --git a/ggml/src/ggml-cuda/quantize.cu b/ggml/src/ggml-cuda/quantize.cu
index 39a500a..a7fd86f 100644
--- a/ggml/src/ggml-cuda/quantize.cu
+++ b/ggml/src/ggml-cuda/quantize.cu
@@ -419,6 +419,43 @@ void quantize_mmq_q8_1_cuda(
}
}
+// MoE NVFP4 quantize de-dup (0023): for the broadcast (up/gate) expert matmuls every
+// gathered row references one of ne12 unique token activations, so the stock path
+// re-quantizes each token n_expert_used times. Quantize the unique tokens once, then copy
+// the resulting block_fp4_mmq rows into the expert-gathered layout keyed by ids. This is a
+// pure byte copy of identical blocks => the gathered buffer is byte-identical to stock.
+static __global__ void gather_mmq_fp4(
+ const uint4 * __restrict__ unique, const int32_t * __restrict__ ids,
+ uint4 * __restrict__ gathered, const int ne11_flat, const int ne12_unique,
+ const int64_t total_words) {
+ constexpr int W = (int) (sizeof(block_fp4_mmq) / sizeof(uint4)); // 9 uint4 per 144B block
+ const int64_t t = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
+ if (t >= total_words) {
+ return;
+ }
+ const int w = (int) (t % W);
+ const int64_t ib = t / W; // destination block index = kb*ne11_flat + j
+ const int j = (int) (ib % ne11_flat);
+ const int kb = (int) (ib / ne11_flat);
+ const int src = ids[j];
+ const int64_t ib_u = (int64_t) kb * ne12_unique + src;
+ gathered[t] = unique[ib_u * W + w];
+}
+
+void gather_mmq_fp4_cuda(
+ const void * unique, const int32_t * ids, void * gathered,
+ int64_t ne11_flat, int64_t ne12_unique, int64_t ne0_padded, cudaStream_t stream) {
+ const int blocks_per_col = (int) ((ne0_padded + QK_K - 1) / QK_K);
+ constexpr int W = (int) (sizeof(block_fp4_mmq) / sizeof(uint4));
+ const int64_t total_words = ne11_flat * (int64_t) blocks_per_col * W;
+ const int bs = 256;
+ const dim3 block_size(bs, 1, 1);
+ const dim3 num_blocks((unsigned) ((total_words + bs - 1) / bs), 1, 1);
+ gather_mmq_fp4<<<num_blocks, block_size, 0, stream>>>(
+ (const uint4 *) unique, ids, (uint4 *) gathered,
+ (int) ne11_flat, (int) ne12_unique, total_words);
+}
+
void quantize_mmq_fp4_cuda(
const float * x, const int32_t * ids, void * vy, const ggml_type type_src0,
const int64_t ne00, const int64_t s01, const int64_t s02, const int64_t s03,
diff --git a/ggml/src/ggml-cuda/quantize.cuh b/ggml/src/ggml-cuda/quantize.cuh
index 768a3ae..7f64069 100644
--- a/ggml/src/ggml-cuda/quantize.cuh
+++ b/ggml/src/ggml-cuda/quantize.cuh
@@ -26,6 +26,10 @@ void quantize_mmq_q8_1_cuda(
ggml_type type_src0, int64_t ne00, int64_t s01, int64_t s02, int64_t s03,
int64_t ne0, int64_t ne1, int64_t ne2, int64_t ne3, cudaStream_t stream);
+void gather_mmq_fp4_cuda(const void * unique, const int32_t * ids, void * gathered,
+ int64_t ne11_flat, int64_t ne12_unique, int64_t ne0_padded,
+ cudaStream_t stream);
+
void quantize_mmq_fp4_cuda(const float * x,
const int32_t * ids,
void * vy,
--
2.43.0

View File

@@ -0,0 +1,71 @@
# MOE_QUANT_DEDUP_RESULTS.md - patch 0023 (qwen35moe NVFP4 activation-quantize de-dup)
Bit-exact MoE decode/prefill lever. Built + measured on DGX GB10 (sm_121a) on top of HEAD
8a3229f (patch 0022). Companion analysis: NONRECURRENCE_BITEXACT.md (section "nonrec-build").
## What
ggml `mul_mat_id` quantizes the EXPERT-GATHERED activation rows: it allocates
`ne11_flat = ne12 * n_expert_used` rows and quantizes each via `quantize_mmq_nvfp4(..., ids_src1)`.
For the broadcast up/gate projections the activation is the per-token hidden state, the SAME for
every expert that token routes to (`ne11 == 1`). So the stock path re-quantizes each token
`n_expert_used` times (4x for q36-35b-a3b).
`quantize_mmq_nvfp4` computes each `block_fp4_mmq` as a pure per-thread function of its 16
consecutive inputs (per-thread amax, the +/-2 ue4m3 search, the e2m1 packing - NO cross-thread
shfl/reduction). So the quantized block for a given token is byte-identical no matter which
expert slot it lands in.
## Lever
When `ne11 == 1` (broadcast up/gate):
1. Quantize the `ne12` UNIQUE token activations once into a compact buffer
(`quantize_mmq_fp4_cuda(src1_d, nullptr, ..., ne12, 1, 1)`, row stride `s12`).
2. Gather the `block_fp4_mmq` rows into the expert-gathered layout keyed by `ids_src1`
(`gather_mmq_fp4`): `block_fp4_mmq == 9 * uint4 == 144 B`, copied with a coalesced uint4
kernel whose output is written fully contiguously (`gathered[t] = unique[ib_u*9 + w]`).
Pure byte copy of identical blocks => the gathered buffer is byte-for-byte identical to
re-quantizing each gathered row. The MMQ GEMM is UNTOUCHED. `down_proj`
(`ne11 == n_expert_used`, distinct per expert) keeps the stock re-quantize path.
The first gather draft (one thread copies one 144 B struct, scattered) was uncoalesced and cost
478 ms - it ate 84% of the quantize saving and decode stayed flat. The shipped coalesced-uint4
gather costs 32 ms.
## Measurements (q36-35b-a3b-nvfp4 dense=q36-27b-nvfp4, -fa on, -npp 128 -ntg 128)
nsys decode-isolated (`--cuda-graph-trace=node`, npp8 ntg128 npl128), per-run kernel sums:
| kernel | dedup off | dedup on |
|-----------------------|-----------|----------|
| quantize_mmq_nvfp4 | 868 ms | 457 ms |
| gather_mmq_fp4 | - | 32 ms |
| net quantize path | 868 ms | 489 ms | (-379 ms decode GPU-time)
| gated_delta_net (50%) | unchanged | unchanged |
| mul_mat_q<NVFP4> | unchanged | unchanged |
Decode S_TG (t/s), back-to-back same-build A/B (default-on vs GGML_CUDA_MOE_QUANT_DEDUP=0):
| model | npl32 off->on | npl128 off->on |
|-----------------|------------------|-----------------------|
| MoE q36-35b-a3b | 440.3 -> 442.8 (+0.6%) | 745.2 -> 758.1 (+1.73%) |
| dense q36-27b | 207.4 -> 206.9 (flat) | 373.28 -> 373.24 (byte-flat) |
Prefill: MoE T_PP 7.69 -> 7.38 s (~ -4% time). Dense unaffected (no `mul_mat_id`).
## Bit-exact gate (greedy --temp 0 --seed 1 md5, byte-identical to 0022)
| model | md5 (default on) | == 0022 |
|------------------|--------------------------------------|---------|
| q36-27b-nvfp4 | 5951a5b4d624ce891e22ab5fca9bc439 | yes (dense untouched) |
| q36-35b-a3b-nvfp4| 07db32c2bcb78d17a43ed18bc22705cd | yes (on == off == 0022) |
test-backend-ops: MUL_MAT 1115/1115, MUL_MAT_ID 805/805 (default on).
## Knob
On by default. `GGML_CUDA_MOE_QUANT_DEDUP=0` restores the stock per-expert re-quantize path
(byte-identical output, used as the A/B baseline).
Commits: DGX dev tree f7409c2; worktree patch `0023-qwen35moe-nvfp4-quant-dedup.patch`.
Assisted-by: Claude:opus-4.8 [Claude Code]

View File

@@ -0,0 +1,323 @@
# NONRECURRENCE_BITEXACT.md - bit-exact non-recurrence decode levers (label nonrec-design, READ-ONLY, no GPU)
Post-0022 the gated-DeltaNet recurrence is at 84.6% BW = 102.6% of vLLM (3.488 ms/call), past parity.
The remaining ~5% to vLLM lives in the non-recurrence path. Per the node-level decode trace (nsys
`--cuda-graph-trace=node`, clean build, q36-27b-nvfp4 dense, npl128) the decode step is ONE replayed
CUDA graph, ALL kernels on a SINGLE stream (stream 14), strictly serial, 99.94% GPU-busy, 0.06% idle.
That single-stream-99.94%-busy fact is load-bearing for everything below: there is NO overlap, so any
kernel GPU-time genuinely removed (or any kernel folded away) cuts wall-clock 1:1; and conversely, if a
"faster kernel" leaves wall-clock flat, then the kernel did NOT actually get faster at the decode shape.
Post-recurrence-fix kernel mix of the ~367 ms decode step (was 380.4 pre-0022; recurrence now smaller):
- `mul_mat_q` FP4 GEMM (496 calls/step) ~24% (the biggest non-recurrence bucket)
- `quantize_mmq_nvfp4` (496/step) ~4.5%
- `nvjet` lm_head GEMM ~3.1%
- `flash_attn_ext_f16` (16 attn layers) ~3.1%
- elementwise glue: k_bin_bcast (gate mul+add) ~1.7%, unary_gated silu/sigmoid ~1.4%, rms_norm ~0.9%,
l2_norm ~0.2%, plus conv-state concat_cont/cpy (Lever-1 territory, not in this scope).
Files read on the DGX 0022 tree (HEAD 8a3229f): `mmq.cuh`, `mmq.cu`, `quantize.cu`, `gated_delta_net.cu`,
`fattn.cu`, `fattn-common.cuh`.
---
## RESOLUTION of the P2a puzzle (load-bearing) - mmmq_y=64 / minblocks: bit-exact but FLAT on decode
The existing P2a machinery is two NVFP4-gated, default-stock flags in `mmq.cuh`:
- `GGML_CUDA_FP4_MMQ_Y` (L143-163): overrides the weight-row N-tile `mmq_y` 128 -> 64/96 for NVFP4 on
Blackwell. mmq_y tiles N (output rows); each weight row lives in exactly one row-tile, so total weight
traffic is unchanged. **Bit-exact**: the per-output K-reduction is the `for frag` loop in
`vec_dot_fp4_fp4_mma` (L1097-1108, `sum[...] += C.x[l]`), whose order is independent of mmq_y. md5-
verified in prior runs (1115/805 gate, byte-identical).
- `GGML_CUDA_FP4_MINBLOCKS` (L205-216): raises the `__launch_bounds__` min-blocks operand (L3579-3585)
for NVFP4 so >1 CTA co-resides per SM. **Bit-exact**: register allocation / occupancy cannot change
results.
The paradox restated: P2a made a standalone `mul_mat_q<NVFP4,m=128>` -24.7% faster (bit-exact), yet
decode was FLAT (335->336 post-0020). The trace says decode is 99.94% single-stream busy and mul_mat_q
is ~24% of it, so a -24.7% cut should give ~+6%. RESOLUTION (airtight, from the single-stream fact):
> On a 99.94%-busy single stream, freed kernel GPU-time MUST lower the wall 1:1. Decode is flat =>
> mmq_y=64 did NOT free per-call GPU-time at the DECODE shapes => the -24.7% was measured at a
> NON-decode shape (a single large-N or prefill-M GEMM that runs enough waves to reach asymptotic
> throughput). There is no contradiction; the two measurements are at different GEMM shapes.
Mechanism (grounded in the launch path, `launch_mul_mat_q` L3989-4088): decode runs ONE `mul_mat_q` per
weight with mmq_x=128 fused tokens => ntx=1, and the grid is `nty = N / mmq_y` CTAs (xy-tiling, or
stream-k at nsm=48 when `tiles_efficiency_percent < 90`, L4044-4047). The 496 decode GEMMs have small N:
- FFN up/gate N=17408 -> nty=136 CTAs (mmq_y=128) = ceil(136/48)=3 waves, last wave 40/48=83% full
- FFN down / qkv / o-proj N~5120-6144 -> nty=40-48 CTAs = 1 wave (and eff<90 => stream-k at 48 CTAs)
So EVERY decode GEMM is a 1-3 wave, 40-136 CTA kernel: it is **ramp + tail (wave-quantization) bound**,
dominated by the first-wave weight-load latency before any MMA can start plus the fractional last wave -
NOT by steady-state occupancy. mmq_y=64 doubles the grid (272 CTAs, 6 waves for the fat FFN) which only
helps the ASYMPTOTIC achieved-BW the microbench measures; at 1-3 waves there is no steady state for it
to act over, and each CTA now carries half the arithmetic-per-weight-load so the ramp is relatively MORE
exposed. minblocks=2 is worse: the FP4 MMA is register-bound at ~255 regs/thread (the `(256,1)` bound),
so forcing 2 CTAs/SM register-caps to ~128 regs => heavy spill => net-negative. Both are the in-wave
occupancy lever, and the decode GEMM has no in-wave occupancy problem - it has a too-few-waves problem.
VERDICT: re-test P2a (mmq_y=64, and 96) and minblocks=2 ON TOP of 0022 because it is a FREE one-build
re-test (flags already exist, default stock). **Design prediction: still ~flat (maybe +1-2% from the
one fat-FFN N=17408 GEMM that has 3->6 waves of room; ~0% from the 1-wave thin GEMMs).** The decisive
measurement for the reprofile agent is NOT a standalone microbench - it is the PER-CALL `mul_mat_q`
GPU-time at the REAL decode shapes (the 496 calls), flag on vs off, summed. If per-call decode time
drops, it ships (free bit-exact win). If per-call decode time is ~unchanged (predicted), the -24.7%
was a large-N artifact and the GEMM has no bit-exact occupancy lever - confirming the structural wall.
WHY the decode GEMM has no high-value bit-exact lever: its bottleneck is wave-quantization at a small
grid. The only knobs that change the grid are (a) mmq_y-down [bit-exact, flat per above], (b) mmq_x-down
[FORBIDDEN: re-reads the 18 GB weights ntiles_x times, strictly worse, and pins one-read], (c) the
stream-k-vs-tiling threshold [FORBIDDEN for bit-exactness: stream-k splits each output tile's K-sum
across CTAs and re-adds via the fixup kernel - a DIFFERENT K-accumulation order than one-CTA-full-K
tiling, so flipping the L4047 threshold changes which path a GEMM takes and breaks md5 vs the 0022
baseline]. So at the bandwidth/wave-quant floor for these tiny grids, 3% FP4 efficiency is structural;
no order-preserving change moves it.
---
## RANKED bit-exact non-recurrence levers
Ranked by expected bit-exact decode gain. "Bit-exact-safe" = keeps the exact reduction/FMA order; the
gate is md5-identity to llama 0022 f32 output on both models (dense + MoE), greedy temp0.
### 1. Quantize producer-fold (Track A) - bit-exact-safe - ceiling 4.5%, realistic ~2-2.5%
Fold `quantize_mmq_nvfp4` (4.5%, ~17 ms, 496/step) into the PRODUCER epilogue (the rms_norm / silu that
emits each GEMM's activation), so the f32 activation is quantized to `block_fp4_mmq` directly from the
producer's registers instead of being written to HBM as f32 and re-read by a standalone quantize kernel.
- **Bit-exactness: SAFE, and unusually clean.** `quantize_mmq_nvfp4` (quantize.cu:78-171) computes
`amax_raw` PER-THREAD over the thread's own QK_NVFP4_SUB=16 values (L108-118) with NO cross-thread
shfl/reduction (unlike `quantize_mmq_q8_1` which does a warp shfl_xor). Each thread independently runs
the +/-2 ue4m3 scale search (L120-150) and `ggml_cuda_float_to_fp4_e2m1` packing (L155-166). So the
output block is a pure per-thread function of its 16 inputs. Copy that arithmetic VERBATIM into the
producer epilogue and the `block_fp4_mmq` bytes are identical => md5-safe. The only requirement is the
producer thread-layout owns contiguous 16-element K-sub-blocks (feasible for an rms_norm/silu epilogue).
- **Expected gain:** the win is removing the standalone kernel's f32 activation READ (the producer already
holds the f32); the quant compute + fp4 write still happen (now folded). So ~the read-half of the 17 ms,
~2-2.5% of the step, and it is REAL because the step is single-stream 99.94% busy (no overlap to hide
the removed kernel).
- **Trap / caveat:** the SPENT "Lever-2" was a DIFFERENT fusion (quantize -> GEMM *consumer* prologue,
measured net-zero because the GEMM still reads the same activation bytes). Track A is the *producer*
fold and removes a true f32 round-trip, so it is not subject to that flatness - but it needs real
producer-kernel surgery + the frozen `block_fp4_mmq` ABI (mmq.cuh:53), more plumbing than the others.
- Ranked #1: largest cleanly-bit-exact non-GEMM bucket, no reduction trap (per-thread quant).
### 2. Activation / op fold - POINTWISE subset only - bit-exact-safe - realistic ~1.5-2.5%
Fold the pure pointwise glue off the single-stream chain into the adjacent kernel's epilogue/prologue:
the GDN residual ADDs and gate MULs (`k_bin_bcast`, ~1.7%), the `silu`/`sigmoid` (`unary_gated`, ~1.4%,
the part that is the output gate, not FFN), and the post-GDN gate MUL after the output rms_norm.
- **Bit-exactness: SAFE for the pointwise ops only.** Add/mul/silu/sigmoid are elementwise fp32 with the
same formula and the same op order whether standalone or folded => byte-identical. This is the bit-exact
half of the prior Lever-3 design.
- **THE TRAP (FORBIDDEN half):** the `rms_norm`/`l2_norm` REDUCTIONS must NOT be re-folded with a
different reduction tree. The standalone `l2_norm_f32<32>`/`rms_norm_f32` use a specific warp/block
reduction; folding the norm into a kernel with a different `warp_reduce_sum` width or eps placement
(`x*rsqrt(sumsq+eps)` vs `x/max(sqrt(sumsq),eps)`) changes the last ULP => breaks md5. Fold the MUL that
FOLLOWS the norm (pointwise, safe); do NOT fold the norm's reduction. (This is the direct analog of the
f32x4 lane-remap trap that blocked the recurrence's vectorized state loads: any change to a reduction's
grouping is forbidden.)
- **Expected gain:** ceiling ~3.3% (the Lever-3 slice), realistic ~1.5-2.5% once the norm reductions are
excluded. Real (single-stream, no overlap), bounded, lower plumbing than #1 (no new ABI).
- Ranked #2: smaller than #1 and the high-value pieces (norms) are off-limits.
### 3. mul_mat_q occupancy retune (existing P2a: mmq_y=64/96, minblocks=2) - bit-exact-safe - ~FLAT
See the P2a resolution above. Bit-exact-safe (N-tiling / register-cap preserve the K-reduction order;
md5-verified). Design prediction FLAT on decode (decode GEMMs are 40-136 CTA, 1-3 wave, ramp/tail-bound;
the -24.7% was an asymptotic large-N number). **Worth the one-build re-test only because it is free**
(flags exist, default stock). Possible marginal +1-2% from the single N=17408 fat-FFN GEMM (3->6 waves).
Measure PER-CALL decode-shape `mul_mat_q` time, not a microbench. Ranked #3: zero plumbing, but low/zero
expected gain - it is the diagnostic that confirms the GEMM wall is structural, not a shippable lever.
### 4. Attention occupancy (flash_attn_ext_f16) - NO bit-exact lever - NO-GO
`flash_attn_ext_f16` is ~3.1% (11.67 ms, 16 attn layers), grid 48 CTAs = exactly ONE full wave on 48
SMs (trace). There is no occupancy headroom (already 1 wave, perfectly filled, no tail) and no in-wave
under-occupancy to fix. The only knobs that change the attention grid are split-KV / parallel_blocks /
a different KV-tile (the `ncols1`/`ncols2`/`cols_per_block` selection in `fattn.cu`), and EVERY one of
them changes the online-softmax running-max/sum RESCALING ORDER across KV blocks => NOT bit-exact
(forbidden, the softmax-rescale analog of the reduction-tree trap). At 3.1% with one full wave the
attention is effectively at floor. Ranked last: no bit-exact lever exists; do not pursue.
---
## FORBIDDEN levers (require a precision or accumulation-order change - excluded by the gate)
- Stream-k vs plain-tiling threshold flip for the GEMM wave-quant tail: splits + re-adds the K-sum across
CTAs => different f32 accumulation order than one-CTA-full-K tiling => breaks md5.
- Vectorized / lane-remapped tile loads in the GEMM (`load_tiles_nvfp4_nvfp4` / `load_ldmatrix`): any
remap of which lane holds which K-element changes the MMA fragment->accumulator mapping => can change
the per-output sum grouping => forbidden (the f32x4 lane-remap trap, same class that blocked the
recurrence's vectorized state loads).
- mmq_x-down at dense decode: re-reads the 18 GB weights `ntiles_x` times. Order-preserving but strictly
slower and breaks the one-read invariant; not a lever.
- Folding rms_norm / l2_norm with a different reduction tree or eps placement: last-ULP change => md5 break.
- flash-attn split-KV / KV-retile: changes the online-softmax rescale order => not bit-exact.
- bf16 state / bf16 anything: precision change, SHELVED, forbidden by the gate.
---
## One-line summary for the parent
The remaining non-recurrence decode gap has NO single big bit-exact lever. The largest cleanly bit-exact
win is the **quantize producer-fold (Track A, ~2-2.5%, the per-16 NVFP4 quant has no cross-thread
reduction so it copies verbatim into the rms_norm/silu epilogue)**; second is the **pointwise activation
fold (~1.5-2.5%, fold the residual adds / gate muls / silu but NOT the norm reductions)**; the
**mul_mat_q occupancy retune (P2a mmq_y/minblocks) is bit-exact but predicted FLAT** (decode GEMMs are
small-grid wave-quant/ramp-bound, so the -24.7% asymptotic number does not apply per-call - confirmed by
the airtight single-stream-99.94%-busy logic, re-test only because the flag is free); and **attention has
NO bit-exact lever** (already one full wave; every grid knob changes the softmax rescale order). The
P2a puzzle is resolved: not a contradiction - the -24.7% and the flat decode are simply at different GEMM
shapes (large-N asymptotic vs 1-3-wave decode per-call).
Assisted-by: Claude:opus-4.8 [Claude Code]
---
# EMPIRICAL P2a RE-TEST ON 0022 (label reprofile-puzzle, GPU agent) - measured, build + bench + nsys
The design section above PREDICTED P2a flat from the single-stream logic. This section is the GPU
measurement that CONFIRMS it byte-for-byte, plus one load-bearing correction: an early "+11% decode"
A/B was a STALE-BASELINE artifact, not the flag. Box: DGX GB10 (sm_121a), HEAD 8a3229f (patch 0022),
SM+MEM clock pinned 2190 MHz (verified via `nvidia-smi dmon`, identical base vs flag - NOT a clock story).
## (1) Fresh node-level decode decomposition (nsys --cuda-graph-trace=node, dense q36-27b-nvfp4, npl128)
Per-instance trace windowed to one steady decode step (103 steady steps, step = 48 GDN-layer boundaries):
Committed-default build (build-cuda-base, 336 t/s @128) -- step span 383.1 ms, kernel-busy 99.24-99.30%:
gated_delta_net (SSM recurrence) 193.97 ms/step 51.0% <- BINDING KERNEL
mul_mat_q<NVFP4,m=128,nc=0> 93.64 ms/step 24.6% <- the P2a target
quantize_mmq_nvfp4 16.77 ms/step 4.4%
nvjet (cublas lm_head GEMM) 12.07 ms/step 3.2%
flash_attn_ext_f16 11.69 ms/step 3.1%
concat_cont 8.14 / cpy_scalar 7.49 / k_get_rows 7.29 / ssm_conv 6.55 / silu 5.32 / k_bin_bcast 4.67
mul_mat_q_stream_k_fixup 3.95 / rms_norm 3.56 / ... ; SUM 380.1 ms = 99.24% of the 383.1 ms wall.
conv-inplace + GDN(16,8) build (the 374 t/s state) -- step span 345.3 ms, kernel-busy 99.0%:
gated_delta_net 167.99 (49.2%), mul_mat_q<NVFP4,128,0> 93.79 (27.5%), quantize 17.66 (5.2%),
nvjet 12.05 (3.5%), flash_attn 11.66 (3.4%), ssm_conv(fused update) 8.44 (2.5%), k_get_rows 7.32 (2.1%).
BINDING KERNEL = gated_delta_net (~49-51% of the step) in BOTH; mul_mat_q<NVFP4,m=128> is #2 (~25-27.5%).
Decode is ~99.0-99.3% GPU-busy single-stream (confirms the 99.94% claim; ~0 idle, strictly serial).
## (2) P2a A/B - the -DGGML_CUDA_FP4_MMQ_Y=64 nwarps-remap, re-applied + built + bit-exact-gated on 0022
The committed 0022 machinery was PARTIAL (patch 0017 templated get_mmq_y_device<type> but left
mmq_get_nwarps_device() stock -> mmq_y=64 + nwarps=8 fails static_assert nwarps*tile_C::I==mmq_y at
mmq.cuh:3280). Re-derived the full threading: templated mmq_get_nwarps_device<type>() -> mmq_y/16 (=4)
for NVFP4+flag; type-aware mmq_get_nwarps_host(...,type); threaded <type> through the NVFP4 loader (998),
write_back_mma (3266), process_tile (3500), mul_mat_q launch_bounds (3579/83/85) + body (3602),
stream_k_fixup launch_bounds (3832) + body (3843), 2 host launch sites (3994/4172). Reverted after.
cuobjdump proof the flag took effect: mul_mat_q<NVFP4,m=128,nc=0> STACK 112 -> 56 (256-thr/8-warp CTA
-> 128-thr/4-warp CTA => 1 -> 2 resident CTAs/SM). REG 255 (HW-capped), no new spill.
BIT-EXACT GATE (HELD): test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805; greedy md5 base==flag
IDENTICAL = 5951a5b4d624ce891e22ab5fca9bc439 (matches the prior P2a gate hash). Byte-identical output.
CLEAN A/B (same build dir, ONLY mmq.cuh toggled => non-mmq .o byte-identical; back-to-back, pinned clocks)
S_TG t/s, llama-batched-bench -fa on -npp128 -ntg128:
DENSE q36-27b: npl 32 208.02 -> 207.51 (-0.2%) npl 128 374.30 -> 373.19 (-0.3%) FLAT
MoE q36-35b-a3b: npl 32 438.83 -> 439.30 (+0.1%) npl 128 745.71 -> 745.07 (-0.1%) FLAT
Prefill S_PP also flat at 0022 (npp128 1056->1050; npp2048/npl1 1028.85->1024.19).
## (3) RESOLUTION - why FLAT, where the GEMM time goes, and a correction to the prior "-24.7%->+6%" logic
Decode-isolated per-kernel A/B (node trace, same-source toggle, identical non-mmq code):
gated_delta_net 167.99 -> 167.89 ms/step (IDENTICAL - it is byte-identical code, untouched)
mul_mat_q<NVFP4,128,0> 93.79 -> 92.74 ms/step (-1.1%, FLAT) <- the P2a target, decode shape
mul_mat_q_stream_k_fixup 3.96 -> 5.65 ms/step (+1.7ms, REGRESSES at nwarps/2=2)
=> the decode mmq FAMILY is flat-to-slightly-WORSE; the flag delivers ~nothing at the m=128 decode shape.
The "-24.7%" is REAL but it is a PREFILL-shape number. Full-run aggregate (npp128 ntg128, prefill+decode)
mul_mat_q<NVFP4,128>: 19630 -> 17569 ms = -10.5%; subtracting the flat decode portion (93.8x128 vs
92.7x128) leaves the prefill-shape portion at 7625 -> 5699 ms = -25.3% (matches the prior -24.7%). So the
occupancy lever genuinely cuts the COMPUTE/occupancy-bound prefill-shape GEMM ~25%, and ~0 of the
BANDWIDTH-bound m=128 decode-shape GEMM (it reads the full NVFP4 weight matrix from 273 GB/s LPDDR5x; the
mmq_y knob is deliberately bandwidth-neutral - every weight row still read once - so it cannot move a
bandwidth-bound wall). Confirmed at the SOURCE-of-decode level, not inferred.
Reconciling with "99.94% busy single stream => a -24.7% cut should give ~+6%": the PREMISE is false. The
flag does NOT cut the decode mul_mat_q by 24.7% (it cuts it 1.1%). There is therefore NO freed time on the
99% busy stream - so the "where does the freed time go (idle gaps?)" question is moot: no time is freed at
the decode shape. The contradiction dissolves: mul_mat_q IS on the critical path AND single-stream-busy, but
the lever simply doesn't accelerate the decode-shape invocation. (Net it slightly hurts via stream_k_fixup.)
CORRECTION to an earlier in-session A/B (recorded so the parent does not chase it): a first pass showed
build-cuda-base 334.6 -> "flag" 372 (+11%). That was a STALE-BASELINE artifact, NOT the flag. build-cuda-base
(binaries 18:46) was compiled from a pre-0021 source - it has NO ssm_conv_update_f32 (cuobjdump symbol count
0 vs 4 in the 0022 build) and the un-retuned GDN default (gated_delta_net 194 vs 168 ms/step). Those ~40 ms
of non-mmq differences (conv fuse ~14 ms + GDN ~26 ms) are the entire 334.6->373 gap. With a correct
same-source baseline (toggle ONLY mmq.cuh in one build dir) the flag is flat (373.19 vs 374.30). Lesson:
the only valid P2a A/B holds every non-mmq .o byte-identical; comparing two independently-built trees mixes
in whatever other flag/patch state each was built from.
## VERDICT
P2a (mmq_y=64 nwarps-remap) is BIT-EXACT (md5-identical, 1115/805) and a genuine ~25% PREFILL-shape FP4-GEMM
kernel win, but it is FLAT on decode (dense and MoE, npl 32 and 128) on 0022, AND flat on end-to-end prefill
S_PP at 0022 (prefill is GDN/other-bound at these sizes, not mmq-bound). It is NOT a decode-parity lever and
the decode commit-gate (lift decode_agg) is NOT met -> do NOT ship for decode. The binding decode kernel is
gated_delta_net (~50%); the only decode levers left are the bit-exact folds in the design section above
(quantize producer-fold ~2-2.5%, pointwise activation fold ~1.5-2.5%) and the GDN-region launch/fusion that
vLLM already has. The mmq P2a machinery was reverted; the 0022 tree is left git-clean.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
# nonrec-build (GPU agent) - built + measured. Lever shipped: MoE NVFP4 quantize de-dup (patch 0023)
Box: DGX GB10 (sm_121a), baseline = clean rebuild of HEAD 8a3229f (patch 0022) in build-cuda
(verified: mmq.cu.o rebuilt from clean source; the A/B-left binary was stale). md5 references
locked: q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439, q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd.
Baseline decode S_TG: dense 208.7/373.6, MoE 441/746 (npl 32/128). ncu unavailable (no
GPU-counter permission, no sudo) -> all verdicts are nsys + back-to-back same-build A/B.
## Levers EVALUATED
### A. quantize_mmq_nvfp4 occupancy retune (token-packing) - BIT-EXACT, FLAT -> not shipped
The decode quantize at the K=2048 shape is grid (128,1,1) = 128 CTAs = ~2.67 waves on 48 SMs.
Unlike mul_mat_q (bandwidth-bound on LPDDR5x, so P2a was flat), quantize moves trivial memory,
so I tried packing TPB token-rows per CTA (blockDim.y) to cut wave-quant - each thread still
quantizes its own 16 consecutive values, so byte-identical (md5 5951a5b4/07db32c2 held at TPB
1/2/4, after fixing the output ib index to use the token i1 not blockIdx.x). Result: DENSE npl128
DEAD-FLAT 373.25 across TPB 1/2/4; npl32 and MoE flat-to-slightly-WORSE at TPB>1. The decode
quantize is at its best config already (TPB=1 = max CTA parallelism = best latency hiding;
fewer/bigger CTAs hurt). Second bit-exact occupancy lever (after P2a) proven flat. Reverted.
### B. skip-ALL-quantize probe (NON-bit-exact, diagnostic) - the +30% MoE number is an ARTIFACT
Skipping quantize_mmq_fp4_cuda entirely (garbage buffer, FP4-MMA timing data-independent) showed
DENSE +2.7%/+3.7% (npl128/32) but MoE +29.9%/+43.9%. The MoE figure is NOT a valid ceiling: the
garbage activation also corrupts the router (ffn_gate_inp) quantize -> degenerate topk expert
selection -> less / better-localized expert work -> artificially fast. The authoritative
decode decomposition (nsys --cuda-graph-trace=node, npp8 ntg128 npl128) shows quantize is only
3.7% of MoE decode GPU-time, not 23%. Dense +2.7% IS real (rms_norm-fold territory, see D).
### C. SHIPPED - MoE NVFP4 activation-quantize de-dup (patch 0023) - BIT-EXACT, lifts decode+prefill
ggml mul_mat_id quantizes the gathered rows ne11_flat = ne12*n_expert_used. For the broadcast
up/gate proj (ne11==1) every expert of a token sees the SAME token activation, so stock
re-quantizes each token n_expert_used (=4 here) times. quantize_mmq_nvfp4 has NO cross-thread
reduction (per-16-element per-thread), so the gathered blocks are byte-identical across experts.
Lever: quantize the ne12 unique tokens once, then gather the block_fp4_mmq rows into the
expert-gathered layout with a coalesced uint4 copy (block_fp4_mmq = 9 uint4 = 144 B). GEMM
untouched; down_proj (ne11==n_expert_used, distinct) keeps stock.
- Gather v1 (per-thread 144 B struct copy) was UNCOALESCED: gather 478 ms ate 84% of the 570 ms
quantize saving -> flat. Gather v2 (coalesced uint4, output written contiguously) = 32 ms.
- nsys decode-isolated: quantize_mmq_nvfp4 868 -> 457 ms/run (-411 ms), gather +32 ms, net -379 ms.
- DECODE S_TG: MoE npl128 745.2 -> 758.1 (+1.73%), npl32 +0.6%. PREFILL T_PP -4%. DENSE byte-flat.
- BIT-EXACT GATE (default on): q36-27b 5951a5b4 (unchanged), q36-35b-a3b 07db32c2 (on==off==0022);
test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805. On by default; GGML_CUDA_MOE_QUANT_DEDUP=0
restores stock. Committed: DGX f7409c2 + worktree patch 0023.
### D. NOT built - dense quantize producer-fold (rms_norm -> fp4) - real but ~2.7%, needs graph fusion
Dense decode quantize is ~2.7% (skip B, real). Folding it into the rms_norm+mul producer is
bit-exact-feasible (keep the strided sumsq reduction byte-identical, re-partition only the
writeback to 16-consecutive-per-thread + the verbatim per-thread quant) but requires a 3-op
{RMS_NORM,MUL,MUL_MAT(NVFP4)} graph fusion hoisting the GEMM into the producer node and a
mul_mat_q pre-quantized-src1 path (the scratch is a per-call pool buffer). High plumbing for
~2.7% dense only; left for a follow-up. mul_mat_q (bandwidth wall), flash_attn (softmax rescale
order), lm_head (cublas) have NO bit-exact lever.
## Verdict
The non-recurrence path has ONE shippable bit-exact decode lever found and built: the MoE
quantize de-dup (0023, +1.73% MoE npl128 decode + 4% prefill, dense untouched, byte-identical).
It is the only redundant-work bucket; the rest of the non-recurrence kernels are at their
bit-exact floor (mul_mat_q bandwidth-bound, quantize occupancy-flat, attention softmax-locked).
The remaining bit-exact headroom is the dense rms_norm->fp4 producer-fold (~2.7% dense, graph-
fusion surgery, not built) and then bf16 state (precision change, shelved) - no other bit-exact
lever moves the LPDDR5x-bandwidth-bound, recurrence-dominated (~50%, past vLLM parity) decode wall.
Assisted-by: Claude:opus-4.8 [Claude Code]