docs(paged): build-ready track-B FP4-GEMM scope - kernel decision + per-phase decode_agg

Rewrite the track-B scope into the definitive build-ready plan for the NVFP4 FP4-MMA decode GEMM toward vLLM GB10 parity. Source-read of the mmq.cuh/mma.cuh/quantize.cu FP4 path on the dgx paged dev tree settles two load-bearing facts the prior draft got partly wrong: - llama's dense path is already TRUE W4A4 (block_fp4_mmq packs 256 e2m1 values + ue4m3 scales; the MMA is kind::mxf4nvf4 e2m1.e2m1...ue4m3), so there is no activation-bit-width work to do; the whole dense deficit is scheduling/occupancy. - the mmq_x selector minimizes ntiles_x, which PINS dense decode at mmq_x=128 (weights read once). Shrinking mmq_x re-reads the 18 GB weights, so the dense occupancy lever is mmq_y-down (BW-neutral), NOT mmq_x-down; MoE's free lever is the per-expert mmq_x-down (patch 0015). Adds the explicit kernel-approach decision (tune the existing FP4-MMA mul_mat_q; reject the cutlass-SM120 rewrite, dead on GB10 and broken on sm_121; reject the BF16-Marlin descent), the concrete build-ready changes (mmq_y/granularity/stream-k knobs, FP4-MMA fragment invariants, the ue4m3 scale path, and the block_fp4_mmq y-tile ABI contract for the track-A act-quant fusion handoff), the GB10-fit rules, the bit-exact test-backend-ops gate with decode-shape + ragged-M cases, and per-phase expected decode_agg tables. Verdict (honest, roofline-grounded): the decode GEMM is bandwidth-bound on the hardware roofline (M=128 << crossover 611; weight-read floors 4-6x above vLLM) but compute-bound in practice at ~3% FP4 eff, so 273 GB/s is not the wall. DENSE: GO (conditional) - B+A reaches 376-394 tok/s = 90-103% of vLLM 391, gated by a P2 occupancy kill-gate (<15% FP4 eff -> parity off). MoE: PARTIAL/NO-GO - ceiling ~76% of 811 (618) from the GEMM alone; full MoE parity needs the non-GEMM tracks too. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-25 17:12:10 -04:00 · 2026-06-24 14:21:48 +00:00
parent c1d7f336cb
commit 7434d64c75
1 changed files with 314 additions and 223 deletions
--- a/backend/cpp/llama-cpp/patches/paged/FP4_GEMM_SCOPE_B.md
+++ b/backend/cpp/llama-cpp/patches/paged/FP4_GEMM_SCOPE_B.md
@@ -1,271 +1,362 @@
-# Track B: the FP4-MMA weight-GEMM for GB10 decode parity with vLLM — roofline + go/no-go
+# Track B: the FP4-MMA weight-GEMM for GB10 decode parity with vLLM — build-ready scope + honest go/no-go

-Scope only (build-ready plan + honest verdict). **Not implemented in this workflow.** This is the
-residual-kernel track after track A (fuse the standalone `quantize_mmq_nvfp4` activation-requant,
-the 8.2% bucket) is handled separately. Track B asks the load-bearing question and answers it
-quantitatively: at the decode batch shape (M≈128 tokens, NVFP4 weights), is the weight GEMM
-**compute-bound** (FP4-MMA throughput is the lever → parity reachable with a better kernel) or
-**bandwidth-bound** (273 GB/s weight-read is a hard floor → parity capped)? And given the prior
-GB10 occupancy history, can a better FP4-MMA decode GEMM actually reach vLLM's 391 (dense) / 811
-(MoE) tok/s, or only partway?
+Scope only (build-ready plan + honest verdict). **Not implemented in this workflow.** Track B is the
+residual-kernel track after track A (fuse the standalone `quantize_mmq_fp4` activation-requant, the
+8.2% decode bucket — tasks 38-41, the fused `rms_norm+mul+nvfp4-quant` producer + prequantized-MMQ
+consumer) is handled separately. Track B owns the **weight GEMM**, the ~59% bucket.
+
+**The load-bearing question, restated:** at the decode batch shape (M≈128 tokens fused into one
+ubatch, NVFP4 weights), is the weight GEMM **compute-bound** (FP4-MMA throughput is the lever →
+parity reachable with a better kernel) or **bandwidth-bound** (273 GB/s weight-read is a hard floor →
+parity capped)? And given the GB10 occupancy history, can a better FP4-MMA decode GEMM actually reach
+vLLM's **391 (dense) / 811 (MoE)** decode-agg tok/s @npl128, or only partway?

 Hardware: NVIDIA GB10 / DGX Spark, sm_121 (CC 1210 = `GGML_CUDA_CC_DGX_SPARK`), unified LPDDR5x.
 Dev tree `~/llama-paged-dev` (branch `paged`, build-cuda sm_121). All numbers are reasoned from the
-committed nsys decomposition + measured GB10 specs; **no new GPU benchmarks were run** (track A is on
-the box).
+committed nsys decomposition + measured GB10 specs + a source read of the FP4-MMA kernel; **no new GPU
+benchmarks were run** (track A is on the box).

-## 0. The grounded inputs (measured, committed)
+## 0. Grounded inputs (measured, committed)

 | quantity | value | source |
 |---|---|---|
 | LPDDR5x bandwidth (spec) | **273 GB/s** | `BLACKWELL_KERNEL_GAPS.md`, `VLLM_DECODE_GROUNDING.md` |
-| LPDDR5x bandwidth (achieved, batch-1) | **~216 GB/s** (19 GB / ~88 ms irreducible) | prior batch-1 weight-read study |
-| FP4 (NVFP4/MXFP4) dense peak | **~427–500 TFLOP/s** (2× BF16; GB10 is 1:1:2 BF16:INT8:FP4) | `BLACKWELL_KERNEL_GAPS.md` §2 (measured) |
-| BF16 peak | ~213 TFLOP/s | same |
-| Demonstrated GB10 FP4-MMA efficiency | **~17%** of FP4 peak at prefill M=512 (MXFP4 dense 1153 t/s); ~3–7% at decode; ~5% MoE | `BLACKWELL_KERNEL_GAPS.md` §6, `GDN_DECODE_VERIFY.md` |
-| Demonstrated GB10 INT8-MMQ efficiency | ~21% of BF16 peak | `BLACKWELL_KERNEL_GAPS.md` §3 |
-| Dense Qwen3.6-27B NVFP4 weights | **18.8 GB** file (`q36-27b-nvfp4.gguf`); ~18 GB matmul tensors | `du` on DGX |
-| MoE Qwen3.6-35B-A3B NVFP4 weights | **23.85 GB** file; ~22 GB read/step at npl128 (≈98% experts hit) | `du` on DGX |
-| Decode step decomposition (dense npl128, nsys, GPU 92.7% busy) | GEMM_weight **59.2%**, act_quant 8.2%, GDN(recurrent+conv) 10.4%, full-attn 1.8%, elementwise/norm/rope 13.5%, embed 2.9%, copy 1.8% | `GDN_DECODE_VERIFY.md` §3a |
-| Measured per-step times @npl128 | dense **~795 ms** (llama) → **~328 ms** (vLLM); MoE **~384 ms** → **~158 ms** | `VLLM_DECODE_GROUNDING.md` |
-| Aggregate decode @npl128 | dense 161 (llama) vs **391** (vLLM); MoE 333 vs **811** | `QWEN36_NVFP4_BENCH.md` |
+| LPDDR5x bandwidth (achieved, batch-1 weight read) | **~216 GB/s** (19 GB / ~88 ms irreducible) | prior batch-1 study |
+| FP4 (NVFP4/MXFP4) dense peak | **~427–500 TFLOP/s** (2× BF16; GB10 is 1:1:2 BF16:INT8:FP4) | `BLACKWELL_KERNEL_GAPS.md` §2 |
+| BF16 / INT8 peak | ~213 TFLOP/s / ~215 TOPS (INT8 == BF16 on GB10) | same §2 |
+| Demonstrated GB10 FP4-MMA efficiency | **~17%** of FP4 peak at prefill M=512 (MXFP4 dense 1153 t/s); **~3% dense / ~35%-of-BW MoE at decode** | `BLACKWELL_KERNEL_GAPS.md` §6, `GDN_DECODE_VERIFY.md` |
+| Dense Qwen3.6-27B NVFP4 weights | **18.8 GB** file; ~18 GB matmul tensors | `du` on DGX |
+| MoE Qwen3.6-35B-A3B NVFP4 weights | **23.85 GB** file; ~22 GB read/step @npl128 (~98% experts hit) | `du` on DGX |
+| Decode step decomposition (dense npl128, nsys, GPU 92.7% busy) | GEMM_weight **59.2%**, act_quant 8.2%, GDN 10.4%, full-attn 1.8%, elementwise/norm/rope 13.5%, embed 2.9%, copy 1.8% | `GDN_DECODE_VERIFY.md` §3a |
+| Measured per-step @npl128 | dense **~795 ms** (llama) → **~328 ms** (vLLM); MoE **~384 ms** → **~158 ms** | `VLLM_DECODE_GROUNDING.md` |
+| Aggregate decode @npl128 (the parity scoreboard) | dense **161** (llama) vs **391** (vLLM); MoE **333** vs **811** | `QWEN36_NVFP4_BENCH.md` |

-Crossover formula used throughout (per-GEMM and whole-model are identical):
-`M* = b · peak / (2 · BW)` where `b` = bytes per weight element. Below `M*` the GEMM is
-bandwidth-bound; above it, compute-bound.
+`decode_agg = npl / step_s = 128 / step_s`. Crossover formula throughout:
+`M* = b · peak / (2 · BW)`, `b` = bytes per weight element. Below `M*` bandwidth-bound, above it
+compute-bound.

 ---

-## 1. DENSE Qwen3.6-27B — the roofline at decode M=128
+## 1. The kernel-approach decision: TUNE the existing FP4-MMA `mul_mat_q`, do NOT write a cutlass kernel

-`b = 18e9 B / 27e9 params = 0.667 B/param`. FLOPs/step `= 2·M·P = 2·128·27e9 = 6.91 TFLOP`.
+This is the first thing track B must settle, and the evidence settles it decisively.

-**(a) Weight-read floor** (weights read ONCE for all 128 tokens):
- @273 GB/s: 18 GB / 273 = **65.9 ms/step → 1,942 tok/s ceiling**
- @216 GB/s (achieved): 18 / 216 = **83 ms/step → 1,542 tok/s**
+| option | verdict | why |
+|---|---|---|
+| **(A) Tune the existing `mul_mat_q<NVFP4>` FP4-MMA path** | **CHOSEN — the tractable spine** | The kernel already exists, is **bit-exact** (`test-backend-ops MUL_MAT` 1103/1103), is genuine **W4A4** (below), and already **beats vLLM at batch-1 prefill** (MXFP4 1153 t/s vs vLLM's 800 W4A16 — vLLM has no FP4 cubins on sm_121). The deficit is **decode-shape scheduling**, not the math op. Host-side selection + a bounded occupancy tune respects the GB10 lessons and is build-ready against known files/lines. |
+| **(B) New cutlass-style SM120 FP4 collective** | **REJECTED** | Repeats the **proven GB10 dead-end**: the from-scratch W4A16 BF16 GEMM hit only ~9–15 TFLOP/s (¼ of MMQ) and was **STOPPED** (`W4A16_MARLIN_KERNEL_PLAN.md`) because deep `cp.async` + XOR-swizzle **collapse GB10 occupancy**. Worse, **CUTLASS's own SM120 grouped block-scaled FP4 GEMM is broken on consumer Blackwell** (garbage/init-fail — CUTLASS #3096/#2800) — it is the exact reason vLLM falls back to **BF16 Marlin** for its MoE on sm_121. "Port cutlass" is not even a working option for the MoE arm. |
+| **(C) Marlin-style W4A16 (FP4→BF16 dequant + BF16 HMMA)** | **REJECTED for the win, noted for context** | This is what **vLLM's MoE actually runs** on sm_121 (W4A16, BF16 activations, dequant-in-mainloop). On GB10 **INT8 == BF16 == ½ FP4 rate**, so a BF16-HMMA path concedes the 2× FP4 advantage llama already has. We do not want to *descend* to vLLM's slower arithmetic class; we want to keep the FP4-MMA class and schedule it better. |

-**(b) Compute floor:**
- @FP4 peak 500 TF/s: 6.91 / 500 = **13.8 ms → 9,275 tok/s**
- @17% FP4 (85 TF/s, the demonstrated prefill ceiling): 81 ms → 1,580 tok/s
- @5% FP4 (25 TF/s, measured decode regime): 276 ms → 464 tok/s
+**Decision: track B = tune `mul_mat_q<NVFP4>` (dense, `mmq.cu`/`mmq.cuh`) + the grouped `mul_mat_q`
+id-branch (MoE, `mmid.cu` + the same `mmq.cuh`).** No new kernel, no rewrite, no descent to BF16.
+The win is kernel *engineering around an FP4-MMA llama already possesses*, so there is **no
+hardware-instruction wall** — but it is gated by whether MMQ's occupancy-bound design can be pushed
+to the bandwidth floor at the thin decode M-tile.

-**(c) Crossover:**
- At FP4 **peak**: `M* = 0.667·500e12 / (2·273e9) = 611`. **M=128 ≪ 611 → an ideal FP4 GEMM at decode is BANDWIDTH-BOUND.**
- At the kernel's **achieved** efficiency the effective peak collapses, dragging `M*` down: 17% → M*≈104; 5% → M*≈30. So **at its current ~3–7% efficiency the kernel is COMPUTE-BOUND at M=128** (limited by its own poor FP4-MMA throughput), even though the hardware says it should be bandwidth-bound.
+### What "the existing path" actually is (source-read, DGX `ggml/src/ggml-cuda/`)

-**Where llama actually sits:** GEMM = 59.2% × 795 ms = **471 ms**. Achieved = 6.91e12 / 0.471 =
-**14.7 TFLOP/s = 2.9% of FP4 peak**. That is **7.1× slower than the 66 ms weight-read floor** and
-matches the ~3–7% decode-efficiency band. The 471 ms is not a hardware bandwidth wall — it is the
-FP4-MMA kernel running deep in compute-bound territory at single-digit efficiency.
+Decode runs **one `mul_mat_q` per weight, M=128** (all 128 slots' single tokens fused into one
+ubatch — confirmed `mul_mat_q(M=128)` in `GDN_DECODE_VERIFY.md`, not 128× M=1). The NVFP4 path:
+`mmq.cu` `use_native_fp4` gate (L125) → `quantize_mmq_fp4_cuda` act-quant (L138 dense / L200 id;
+**track A's fuse target**) → `mul_mat_q` → `vec_dot_fp4_fp4_mma` (`mmq.cuh:997`) →
+`mma_block_scaled_fp4` (`mma.cuh:1126`).

-**Where vLLM sits:** step 328 ms → if its native-FP4 cutlass GEMM is at the ~66 ms BW floor, the
-GEMM is only ~20% of vLLM's step; the rest (~262 ms) is GDN + full-attn + host. vLLM's **whole step
-(328 ms) ≈ llama's GEMM bucket alone (471 ms)** minus a bit. The entire 2.42× gap is the GEMM.
+**Confirmed W4A4 (this corrects an earlier "A is 8-bit-class" framing):** `block_fp4_mmq`
+(`mmq.cuh:53`) is `uint32_t d4[4]` (four `ue4m3` block scales) + `int8_t qs[4*32]` = **256 FP4 (e2m1)
+values packed 2-per-byte**. `quantize_mmq_fp4_cuda` (`quantize.cu:422`) emits FP4 via
+`ggml_cuda_float_to_fp4_e2m1`. The MMA is
+`mma.sync.aligned.kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue4m3`
+(`mma.cuh:1145`) — **both operands e2m1, ue4m3 block scales**. So llama's dense FP4-MMA path is
+already the *same arithmetic class as vLLM's cutlass W4A4 dense*. The `sizeof(block_fp4_mmq) ==
+sizeof(block_q8_1_mmq)` static_assert is a shared-tile-footprint convention, **not** an 8-bit
+activation. **Consequence: there is no "make activations 4-bit" work to do and no activation-traffic
+halving to win — that is already banked. The entire dense deficit is scheduling/occupancy.**

-**Dense parity arithmetic** (795 ms = GEMM 471 + act 65 + GDN 83 + attn 14 + rest 162):
- B alone (GEMM → 66 ms BW floor, requires ~21% FP4 eff): step 728→… = 66+65+83+14+162 = **390 ms → 328 tok/s = 84% of vLLM**.
- **B + A** (GEMM 66 ms floor **and** act-quant fused away): 66+83+14+162 = **325 ms → 394 tok/s = 101% of vLLM → PARITY/BEAT.**
- B+A at the softer 17% FP4 (GEMM 81 ms, the *demonstrated* prefill ceiling, not the 21% floor): 340 ms → **376 tok/s = 96% of vLLM.**
-
-**Dense robust band: 90–103% of vLLM**, and it is insensitive to the 273-vs-216 GB/s uncertainty
-(at 216 GB/s the floor is 83 ms → step 357 ms → 359 tok/s = 92%). The conclusion holds.
-
---
-
-## 2. MoE Qwen3.6-35B-A3B — the roofline at decode M=128
-
-At npl128, 128 tokens × top-8 over 256 experts ⇒ P(expert unused) = (1−8/256)^128 ≈ 1.7%, so
-**~98% of experts are read** → ~22 GB/step (essentially the full weight set), the same
-weight-read regime as dense. The grouped GEMM (`mmid.cu` / `mul_mat_q` id-branch) reads each
-routed expert's weight **once** for the ~128·8/256 = **4 tokens/expert** on average.
-
-**(a) Weight-read floor:**
- @273 GB/s: 22 / 273 = **80.6 ms → 1,588 tok/s**
- @216 GB/s: 102 ms → 1,255 tok/s
-
-**(b) Compute floor:** only ~3B active params/token → FLOPs = 2·128·3e9 = 0.77 TFLOP → 1.5 ms @peak.
-**Trivial.** MoE decode is **purely bandwidth/occupancy bound**, never compute-bound. The hard part
-is that per-expert M ≈ 4: the grouped GEMM must saturate ~273 GB/s while feeding tiny ragged M-tiles
-— the regime where ggml's dense-tuned `mmq_x=128` underfills (see `MOE_GROUPED_GEMM_SCOPE.md`).
-
-**Where llama sits:** GEMM = 59% × 384 = **227 ms** → effective BW 22 GB / 0.227 s =
-**97 GB/s = 35% of 273** (less compute-bound than dense, but only 1/3 of peak bandwidth — an
-occupancy/tile-fill loss, exactly the `MOE_GROUPED_GEMM_SCOPE.md` M-tile finding).
-
-**Where vLLM sits:** step 158 ms ≈ GEMM at the ~80 ms floor (grouped Marlin-NvFp4, 51% of its step)
-+ ~78 ms non-GEMM. So vLLM is already pushing the MoE bandwidth floor.
-
-**MoE parity arithmetic** (384 ms = GEMM 227 + act 31 + GDN 38 + attn 8 + rest 81):
- B + A, GEMM → 80 ms floor + act fused: 80+38+8+81 = **207 ms → 618 tok/s = 76% of vLLM.**
- This is the **ceiling from the GEMM track**: even with a *perfect* MoE weight-read-floor GEMM,
-  llama's non-GEMM (GDN 38 + attn 8 + rest 81 = 127 ms) is **1.6× vLLM's whole non-GEMM (~78 ms)**,
-  so the step cannot drop below ~207 ms. To reach vLLM's 158 ms needs the non-GEMM buckets too
-  (GDN state I/O is intrinsic and vLLM pays it identically — `GDN_DECODE_VERIFY.md` — so the
-  remaining ~49 ms is elementwise + host loop, **outside track B**).
-
-**MoE band from B+A: ~60–76% of vLLM.** Full MoE parity is **not reachable from the GEMM alone.**
-
---
-
-## 2b. The precise code-level inefficiencies (source-read, the "why slower than vLLM")
-
-Decode runs **one `mul_mat_q` per weight, M=128** (all 128 slots' single tokens are fused into one
-ubatch — confirmed `mul_mat_q(M=128)` in `GDN_DECODE_VERIFY.md`, not 128 × M=1). The NVFP4 path:
-`mmq.cu` `use_native_fp4` gate → `quantize_mmq_fp4_cuda` (act-quant) → `mul_mat_q` →
-`vec_dot_fp4_fp4_mma` (`mmq.cuh:997`) → `mma_block_scaled_fp4` (`mma.cuh:1126`, PTX
-`mma.sync...kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue4m3`).
-Geometry: `get_mmq_x_max=128`, `mmq_y=128`, `nwarps=256/32=8`, `iter_k=MMQ_ITER_K_FP4=512`. Tiles:
+Geometry (`vec_dot_fp4_fp4_mma`): `MMQ_NWARPS=8`, `iter_k=MMQ_ITER_K_FP4=512`, tiles
 `tile_A<16,8,int>` (weights, 16 N-rows × 64 FP4-in-K), `tile_B<8,8,int>` (acts, 8 M-cols × 64
-FP4-in-K), `tile_C<16,8,float>` (16 N-rows × 8 M-cols), `nfrags=32/8=4`.
-
-1. **Separate activation-quant pass (track A's target).** `quantize_mmq_fp4_cuda` writes the *entire*
-   activation tensor to `block_fp4_mmq` in a standalone kernel before `mul_mat_q`. vLLM fuses
-   `scaled_fp4_quant` into the preceding RMSNorm/SiLU epilogue (`rms_quant_fusion`/`act_quant_fusion`)
-   — no separate pass, no extra activation read+write+launch. 8.2% of the npl128 step. **B must consume
-   A's in-place `block_fp4_mmq` y-tile** so the fusion saves the round-trip, not just the launch.
-
-2. **No weight-load software pipeline → exposed latency at thin M (the #1 kernel lever).**
-   `load_tiles_nvfp4_nvfp4` does plain shared stores → `__syncthreads` → `vec_dot_fp4_fp4_mma`
-   (`load_ldmatrix` + MMAs): a **load→sync→compute→repeat** cadence with **no `cp.async`
-   double-buffering** overlapping the next K-block weight load with the current MMA. At M=128 the per-
-   tile MMA work is small (8 M-cols per `tile_C::J`), so serialized weight-load latency dominates →
-   the ~3% (dense) / 35%-of-BW (MoE) result. vLLM's Marlin runs a 4-stage `cp.async` pipeline. **The
-   defining caveat:** a *deep* pipeline + XOR-swizzle collapses GB10 occupancy
-   (`W4A16_MARLIN_KERNEL_PLAN.md`); the fix is a **shallow 2-stage prefetch + skew-pad**, not Marlin's 4.
-
-3. **`mmq_x` selector maximizes the M-tile — the opposite of the GB10 occupancy rule.**
-   `mul_mat_q_case` picks `mmq_x` by *minimizing* `ntiles_x = ceil(ncols_max/mmq_x)`, so it always
-   takes the *largest* tile that fits shared. Dense decode → `mmq_x=128`, `mmq_y=128`: a heavy 128×128
-   tile (8 warps) → low occupancy on the occupancy-dominated GB10. No padding waste and no redundant
-   weight read (`ntx=1` → each weight row-tile read once), so the loss is pure occupancy; a smaller
-   `mmq_x` with more resident CTAs may hide load latency better (P1 host-only sweep, zero kernel risk).
-
-4. **MoE per-expert M-tile waste (the structural MoE gap).** Stock applies the 128-wide tile *per
-   expert*; per-expert density is ~4 tokens (top-8 of 256 @npl128), so the 128-wide accumulator is
-   ~3% filled and only ~1 `tile_C` N-fragment is live (`tile_C::J=8`), the rest masked `need_check`
-   tails. Patch 0015 (`MOE_DENSITY_AUTO_TILE.md`) auto-caps to 64 at decode, but the ideal is
-   ~tokens/expert ≈ 8 — even 64 is ~8× too big. vLLM uses a small per-expert `BLOCK_SIZE_M` (16/32).
-   At ≤1 col-tile/expert a smaller tile costs **no** extra weight re-read → strictly occupancy-positive.
-   (Inefficiency 4 is the MoE arm of 3; at dense M=128, 128/8=16 N-frags are fully used — no dense
-   M-waste.)
-
-5. **`iter_k=512` (FP4) vs 256 couples to occupancy.** The FP4 main loop stages 512 K-elements/iter →
-   larger shared footprint → fewer iters but more pressure on the occupancy-bound part. A P5 knob.
-
-**Ruled out (so B does not chase them):** redundant weight reads (none — dense `ntx=1`, MoE ≤1
-col-tile/expert; the low effective BW is latency/occupancy, not re-reads); stream-K fixup (it *helps*
-fill the small GB10 grid, cheap at thin M); raw FP4-MMA peak rate (the path already beats Q4-MMQ and
-is BW-bound at batch 1 — at M=128 latency-hiding binds first, not MMA throughput).
+FP4-in-K), `tile_C<16,8,float>` (16 N-rows × 8 M-cols), `nfrags = MMQ_TILE_NE_K/tile_A::J`. The M loop
+is `for (j0=0; j0<mmq_x; j0 += ntx*tile_C::J)` — M tiled in steps of `tile_C::J=8`.

 ---

-## 3. The load-bearing verdict
+## 2. The roofline — answering the load-bearing question

-**Q: compute-bound or bandwidth-bound at M=128?**
-At the **hardware** roofline the decode GEMM is **bandwidth-bound** (M=128 ≪ crossover 515–611).
-At the **current kernel's** ~3–7% FP4 efficiency it is **compute-bound by its own inefficiency**
-(effective M*≈30). The two weight-read floors — **dense ~1,940 tok/s, MoE ~1,590 tok/s** — both sit
-**4–6× ABOVE vLLM's 391/811.** So **the 273 GB/s bandwidth is NOT the wall at the parity target.**
-There is large bandwidth headroom; the gap is the FP4-MMA kernel achieving single-digit % of peak
-where the roofline permits ~20%+ before bandwidth even binds.
+**Answer: BANDWIDTH-bound on the hardware roofline, but COMPUTE-bound in practice by the kernel's own
+under-occupancy. The 273 GB/s is NOT the wall at the parity target.**

-**Q: can a better FP4-MMA GEMM reach vLLM — TRUE PARITY?**
+### 2a. DENSE Qwen3.6-27B, M=128

- **DENSE: parity is PLAUSIBLY REACHABLE, but at the edge of the demonstrated envelope.** The entire
-  2.42× gap is the GEMM bucket; its ideal floor (66 ms) is 7× below the current 471 ms and is
-  bandwidth-bound, not hardware-capped. **B (GEMM → BW floor) + A (act-fuse) lands 376–394 tok/s ≈
-  vLLM's 391 (90–103%).** The catch: hitting the floor needs **~21% FP4-MMA efficiency at decode
-  M=128**, and GB10 has only ever demonstrated ~17% (and that at prefill M=512, a *larger, easier*
-  tile). Decode M=128 is a smaller M than prefill, so the same kernel must hold efficiency at a
-  thinner tile. This is a **reach, not a lock**: parity is on the table but with **no comfortable
-  margin** and **contingent on track A landing too**.
+`b = 18e9/27e9 = 0.667 B/param`; FLOPs/step `= 2·128·27e9 = 6.91 TFLOP`.

- **MoE: full parity is NOT reachable from track B.** Realistic ceiling **~60–76% of vLLM** (618 vs
-  811) even with a perfect weight-read-floor grouped GEMM, because (1) the MoE GEMM floor at M≈4/expert
-  demands near-**full** BW saturation in the hardest grouped-GEMM regime, where llama is at 35% of peak
-  BW and vLLM ships a purpose-built grouped Marlin-NvFp4, and (2) ~24% of the residual is non-GEMM
-  (elementwise + host loop) outside track B. MoE parity needs B **plus** the non-GEMM tracks.
+- **Weight-read floor** (18 GB read ONCE for all 128 tokens): @273 GB/s = **65.9 ms → 1,942 tok/s**;
+  @216 GB/s = 83 ms → 1,542 tok/s.
+- **Crossover** at FP4 peak: `M* = 0.667·500e12/(2·273e9) = 611`. **M=128 ≪ 611 → an ideal FP4 GEMM
+  at decode is BANDWIDTH-bound.** At the kernel's *achieved* ~3% efficiency the effective peak
+  collapses and drags M* to ≈30, putting the *current* kernel in self-inflicted compute-bound
+  territory.
+- **Where llama sits:** GEMM = 59.2% × 795 ms = **471 ms = 14.7 TFLOP/s = 2.9% of FP4 peak = 7.1×
+  slower than the 66 ms weight-read floor.** Not a bandwidth wall — a kernel running deep in
+  compute-bound territory at single-digit efficiency.
+- **Where vLLM sits:** step 328 ms ≈ llama's GEMM bucket (471 ms) alone. The **entire 2.42× gap is
+  the GEMM.**

-**Q: the GB10 occupancy wall — does it cap this?** Yes, it is the binding constraint, not bandwidth.
-History (`W4A16_MARLIN_KERNEL_PLAN.md`, `BLACKWELL_KERNEL_GAPS.md`): the from-scratch W4A16 BF16 GEMM
-hit only ~9–15 TFLOP/s (¼ of MMQ) because deep `cp.async` pipelines + XOR-swizzle **collapse GB10
-occupancy**; skew-pad + small-shared + high-occupancy won. **Crucially, decode M=128 is a different
-regime from that dead path:** it is bandwidth/occupancy-bound, not compute-throughput-bound, so the
-lever is **saturating LPDDR5x at a thin M-tile via occupancy**, not packing MMAs. The existing
-FP4-MMA path (`block_fp4_mmq` / `vec_dot_fp4_fp4_mma`) is **already at the BW floor at batch 1**
-(88 ms irreducible) — so the kernel *can* saturate bandwidth at M=1; the work is keeping it
-bandwidth-bound as M grows to 128 instead of degrading to compute-bound at 3% efficiency. That is a
-**tune/fix of a working path**, not the dead greenfield W4A16 rewrite.
+### 2b. MoE Qwen3.6-35B-A3B, M=128

-### Go / No-Go
+@npl128, 128 tok × top-8 / 256 experts ⇒ ~98% experts read ⇒ ~22 GB/step (the full weight set), per-
+expert M ≈ **4 tokens**.

- **DENSE — GO (conditional).** Build track B as a **decode-M-tile tune of the existing
-  `mul_mat_q<NVFP4>` FP4-MMA kernel**, co-delivered with track A. Honest expectation: **90–103% of
-  vLLM (parity within error), not a guaranteed beat.** Go condition: it is contingent on reaching
-  ~17–21% FP4 efficiency at M=128 (top of the demonstrated GB10 envelope) — set a P2 kill-gate
-  (below).
- **MoE — PARTIAL / NO-GO for parity-from-B.** Track B (the M-tile work already scoped in
-  `MOE_GROUPED_GEMM_SCOPE.md`) buys MoE → ~60–76% of vLLM and is worth doing, but **cannot deliver
-  MoE parity by itself**; do not promise 811. Full MoE parity requires B + the non-GEMM tracks
-  (elementwise/host CUDA-graph, GDN state I/O bf16) and is a multi-track effort.
+- **Weight-read floor:** 22/273 = **80.6 ms → 1,588 tok/s** (@216: 102 ms → 1,255).
+- **Compute floor:** only ~3B active params ⇒ 0.77 TFLOP ⇒ 1.5 ms @peak — **trivial. MoE decode is
+  purely bandwidth/occupancy-bound, never compute-bound.** The hard part is saturating 273 GB/s while
+  feeding ragged M≈4 tiles.
+- **Where llama sits:** GEMM = 59% × 384 = **227 ms = 97 GB/s = 35% of peak BW** (occupancy/tile-fill
+  loss, not compute).
+- **Where vLLM sits:** step 158 ms ≈ grouped Marlin-NvFp4 at the ~80 ms floor + ~78 ms non-GEMM —
+  already pushing the MoE BW floor.

-**Bottom line for the "TRUE PARITY" ask:** GB10 **can** plausibly deliver **dense** decode parity
-with vLLM via a tuned FP4-MMA decode GEMM **+ track A**, at the edge of the demonstrated efficiency
-envelope and with no margin. GB10 **cannot** deliver **MoE** decode parity from the GEMM track
-alone (ceiling ~76%); MoE parity is a B-plus-non-GEMM program. The hardware (273 GB/s) is **not** the
-ceiling — the GB10 FP4-MMA occupancy efficiency is, and it is a "reach" for dense and a "partial" for
-MoE.
+**Both weight-read floors (dense ~1,940, MoE ~1,590 tok/s) sit 4–6× ABOVE vLLM's 391/811. Bandwidth
+is not the wall; the GB10 FP4-MMA occupancy efficiency is.**

 ---

-## 4. Build-ready plan (do NOT implement here)
+## 3. The code-level inefficiencies, and the M-tile asymmetry that drives the whole plan

-The kernels already exist; track B is a **tune + fuse of the FP4-MMA `mul_mat_q` path at the decode
-M-tile**, not a new kernel. This respects every GB10 occupancy lesson (small shared, high occupancy,
-skew-pad, stay on `block_fp4_mmq`; never deep `cp.async` / XOR-swizzle).
+The selection is `mul_mat_q_case` (`mmq.cuh:4108`): it loops `mmq_x = 8..mmq_x_max(=128) step 8` and
+keeps the `mmq_x` that **minimizes `ntiles_x = ceil(ncols_max/mmq_x)`**, stopping at `ntiles_x==1`.
+`mmq_y` (the weight-row tile) is pinned at **128** by `get_mmq_y_host` (L143). This produces the
+single most important structural fact for track B:

-### Files (DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`)
- `mmq.cuh` — `block_fp4_mmq` (L53), `load_tiles_nvfp4_nvfp4` (L948), `vec_dot_fp4_fp4_mma` (L997),
-  the stream-k `mul_mat_q` kernel + `mul_mat_q_case` / `launch_mul_mat_q` tile selection (~L3320–4055,
-  all under `BLACKWELL_MMA_AVAILABLE`).
- `mmq.cu` — dense + id dispatch; `use_native_fp4` gate (L125), `quantize_mmq_fp4_cuda` act-quant
-  (L138/L200 — **track A's fuse target**).
- `mmid.cu` — `mm_ids_helper` MoE token-sort (the MoE M-tile lever, scoped in `MOE_GROUPED_GEMM_SCOPE.md`).
+> **`mmq_x` tiles M (tokens / output columns) — shrinking it RE-READS the weights `ntiles_x` times.
+> `mmq_y` tiles N (weight rows / output rows) — shrinking it does NOT re-read weights (each weight row
+> lives in exactly one row-tile); it only lowers shared footprint and raises occupancy.** The two
+> regimes pick opposite knobs:

-### Phases (each ends with: `test-backend-ops -o MUL_MAT[/_ID] -b CUDA0` bit-exact + a decode bench)
+| | dense decode (M=128, no `expert_bounds`) | MoE decode (per-expert M≈4) |
+|---|---|---|
+| selection picks | `mmq_x=128` → `ntiles_x=1` → **weights read ONCE** (the one-read optimum) | `mmq_x=128` applied **per expert** → tile ~3% filled |
+| shrink `mmq_x`? | **NO — re-reads 18 GB ×`ntiles_x`**, fatal in the BW-bound regime | **YES, FREE** — 1 col-tile/expert regardless, no re-read → strictly occupancy-positive |
+| FP4-MMA M-frag fill | **full** (128/`tile_C::J`=16 frag-groups, all live) → no fragment waste | **wasted** (~1 of 8/16 frag-groups live, rest masked tails) |
+| BW-neutral occupancy lever | **`mmq_y`↓** (more resident CTAs, weights still read once) — kernel-structure change | **`mmq_x`↓** (toward density ≈8) — host-side template switch |
+| dominant loss | **occupancy** at the heavy 128×128 tile (exposed weight-load latency) | **tile-fill** (dense-tuned M-tile applied to ragged M≈4) |

-| Phase | Work | Expected payoff | Risk |
-|---|---|---|---|
-| **P0** harness | Capture per-shape baseline at the **decode shape** (`test-backend-ops perf -o MUL_MAT`, type NVFP4, **n=128**, FFN K/N) + nsys decode window. Lock 1103/1103 parity + the 14.7 TFLOP/s baseline. Decode-M is the canonical target, not prefill n=512. | None (gate). | Low |
-| **P1** decode M-tile selection (dense) | In `mul_mat_q_case`/`launch_mul_mat_q`, pick `mmq_x`/`mmq_y` from the **decode M=128** shape rather than the prefill-tuned config. M=128 with FP4 N-frag 8 wants a small, occupancy-friendly tile; the prefill `mmq_x=128` likely underfills SM occupancy at decode. Host-side template selection, **zero new kernel**, mirrors `MOE_GROUPED_GEMM_SCOPE.md` [1]. | Lift dense FP4 eff from ~3% toward 10–17%; no extra weight read (one col-tile). | Low |
-| **P2** occupancy/pipeline tune | Sweep warps/tile/skew-pad on the FP4-MMA decode kernel to push toward the **66 ms BW floor (~21% FP4 eff)**. Honor GB10 rules: small shared, high occupancy, skew-pad +4, **no** deep cp.async / XOR-swizzle. **KILL-GATE:** if decode FP4 eff plateaus < ~15% (GEMM > ~110 ms) after the sweep, dense parity is off — stop and report partial. | The dense parity make-or-break. Target GEMM 471→66–81 ms. | **Med-high** (the occupancy wall is real; ncu unavailable on DGX → empirical sweep only) |
-| **P3** co-land track A | Verify the fused act-quant (track A) composes with the tuned GEMM (the requant folds into the FP4 GEMM prologue, removing the 8.2% bucket). | Dense 376–394 tok/s = 90–103% vLLM. | Low (track A owns the fuse) |
-| **P4** MoE M-tile | Land the `MOE_GROUPED_GEMM_SCOPE.md` expert-aware `mmq_x` ([1]) + block-pad align ([2]). | MoE → ~60–76% vLLM (not parity). | Med |
+This asymmetry is the spine of the plan: **MoE's lever is host-only `mmq_x`↓ (already landed as patch
+0015 auto-cap→64; ideal ≈8–16); dense's lever is `mmq_y`↓ + occupancy, a bounded kernel change.**

-### Parity gate (every phase)
-`GGML_CUDA_*` flag set and unset → `test-backend-ops test -o MUL_MAT -b CUDA0` = **1103/1103**,
-byte-identical when unset. Add **decode-shape (n=128) + ragged small-M** cases if absent. End-to-end:
-`llama-batched-bench -fa on -npp 512 -ntg 256 -npl 128` on `q36-27b-nvfp4.gguf`, confirm decode
-agg climbs toward ~376–394 and stays bit-stable vs the CPU oracle (within the GB10 greedy-decode
-non-determinism band). All bench/parity scripts **dev-tree-only**.
+The five inefficiencies, ranked:
+
+1. **Separate activation-quant pass (track A's bucket, 8.2%).** `quantize_mmq_fp4_cuda` writes the
+   whole activation tensor to `block_fp4_mmq` in a standalone kernel; vLLM fuses `scaled_fp4_quant`
+   into the preceding RMSNorm/SiLU epilogue. **Handoff (track A → B):** B must consume A's prequantized
+   `block_fp4_mmq` y-tile in place of calling `quantize_mmq_fp4_cuda`, so the fusion saves the
+   activation round-trip, not just the launch (see §4.4).
+
+2. **No weight-load software pipeline → exposed latency at thin M (the #1 dense kernel lever).**
+   `load_tiles_nvfp4_nvfp4` (`mmq.cuh:946`) does plain global→shared stores → `__syncthreads` →
+   `vec_dot_fp4_fp4_mma` (`load_ldmatrix` of A + MMA): a **load→sync→compute→repeat** cadence with **no
+   `cp.async` double-buffering** overlapping the next k-block weight load with the current MMA. At
+   M=128 the per-tile MMA work is small, so serialized weight-load latency dominates → 2.9% (dense) /
+   35%-of-BW (MoE). **Caveat (the GB10 wall):** a *deep* pipeline + XOR-swizzle collapses GB10
+   occupancy (`W4A16_MARLIN_KERNEL_PLAN.md`). The fix is **occupancy-first** (raise resident CTAs to
+   hide latency via CTA-parallelism), **shallow 2-stage prefetch second**, never Marlin's 4-stage.
+
+3. **`mmq_x` maximized for dense = occupancy-heavy, but pinned by the one-read constraint.** At dense
+   decode the 128×128 tile (8 warps, large shared) is low-occupancy on the occupancy-dominated GB10 —
+   but you cannot shrink `mmq_x` without doubling the 18 GB weight read. So the dense occupancy fix is
+   **`mmq_y`↓** (BW-neutral), not `mmq_x`↓.
+
+4. **MoE per-expert M-tile waste (the structural MoE gap).** The 128-wide (or patch-0015 64-wide)
+   tile is applied per expert at density ≈4, so the accumulator is ~3–6% filled and ~1 `tile_C` frag-
+   group is live, the rest masked `need_check` tails. Ideal `mmq_x` ≈ tokens/expert ≈ 8 (= `tile_C::J`).
+   At ≤1 col-tile/expert this costs **no** extra weight read → strictly occupancy-positive. (This is
+   the MoE arm of inefficiency 3; scoped in `MOE_GROUPED_GEMM_SCOPE.md`.)
+
+5. **`iter_k=512` (FP4) couples to occupancy.** The FP4 main loop stages 512 K-elements/iter → larger
+   shared footprint → adverse in the occupancy-bound regime. A P2 tuning knob.
+
+**Ruled out (do not chase):** redundant weight reads on the *current* selection (none — dense
+`ntiles_x=1`, MoE ≤1 col-tile/expert); stream-K fixup (it *helps* fill the small GB10 grid at thin M);
+raw FP4-MMA peak rate (already beats Q4-MMQ and is BW-bound at batch 1 — latency-hiding binds first).
+
+---
+
+## 4. The specific build-ready changes
+
+All against DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`. Every change is gated and defaults to exact
+stock behavior until proven.
+
+### 4.1 Dense M-tile / occupancy (the make-or-break)
+
+- **Keep `mmq_x=128` at dense decode** (the one-weight-read optimum; do **not** shrink it — that
+  re-reads 18 GB). Lock this as an invariant in P0.
+- **Make `mmq_y` decode-selectable** (`get_mmq_y_host`/`get_mmq_y_device`, L143/L157). Today pinned
+  128; try **64** (and 96) at decode. `mmq_y` is coupled to `nwarps × tile_C::I` via the MMQ
+  static_assert, so this is a **warp/fragment remap** (bounded kernel change), not a pure host switch:
+  fewer N-frags per warp or fewer warps → smaller per-CTA shared → **more resident CTAs → latency
+  hidden by CTA-parallelism**, with **weights still read once** (BW-neutral). This is the primary
+  dense occupancy lever and respects every GB10 rule.
+- **Host-only knobs first (P1, zero kernel):** the `mmq_get_granularity_host` choice (L274 — sets
+  `rows_per_warp=2·granularity`, `ntx`), and the stream-k-vs-xy-tiling threshold (`launch_mul_mat_q`
+  ~L3954, `tiles_efficiency_percent` L4001). Plus one **empirical A/B**: does eating a 2× weight
+  re-read at `mmq_x=64` buy enough occupancy to net positive? (Diagnostic: if yes, occupancy is badly
+  broken and P2 `mmq_y`↓ has large upside; if no, the tile is already BW-saturated and P2's ceiling is
+  lower.) All behind `GGML_CUDA_FP4_MMQ_Y` / `GGML_CUDA_FP4_GRAN` / `GGML_CUDA_FP4_FORCE_STREAMK`.
+
+### 4.2 FP4-MMA fragment usage
+
+- Fragments stay `tile_A<16,8,int>` / `tile_B<8,8,int>` / `tile_C<16,8,float>` — these match the
+  `m16n8k64` block-scaled FP4 MMA and must not change (they are the instruction shape). At dense M=128
+  all 16 `tile_C::J`-groups are live → **no dense fragment work needed**. The lever is *how many of
+  these tiles are resident per SM* (occupancy), set by `mmq_y`/`nwarps`/granularity, not the fragment
+  shape.
+- MoE: shrink `mmq_x` toward `tile_C::J`=8 so the live frag-group count matches density (§4.3).
+
+### 4.3 MoE M-tile (`MOE_GROUPED_GEMM_SCOPE.md`, partly landed)
+
+- **Patch 0015 already auto-caps `mmq_x`→64 at decode** via per-expert density in `mul_mat_q_case`
+  (the `expert_bounds != nullptr` block, L4118-4165; env `LLAMA_MOE_DECODE_TILE`,
+  `LLAMA_MOE_DENSITY_MAX`). Tighten the decode tile toward **8–16** (= density) and sweep.
+- **Optional [2]: block-padded `mm_ids_helper`** (`mmid.cu`) — pad each expert segment to a multiple
+  of the tile, removing `need_check` masked tails and tightening the stream-k schedule. Medium risk
+  (scatter + write-back masking); behind `LLAMA_MOE_BLOCK_ALIGN`.
+
+### 4.4 Scale handling + the act-quant fusion handoff (the track A → B ABI contract)
+
+- **Weight scales** (`ue4m3`, one per 16 weights) load in `load_tiles_nvfp4_nvfp4` into `x_sc`
+  (`x_u32 + 64 + kbx`), consumed as `scaleA` in `vec_dot_fp4_fp4_mma` and passed as the block-scale
+  operand to `mma_block_scaled_fp4`. **No change** — already a first-class MMA scale operand.
+- **Activation scales** (`ue4m3`) live in the `block_fp4_mmq` y-tile `d4[4]`, consumed as `scaleB`.
+- **The handoff contract:** track B must hold the **`block_fp4_mmq` y-tile layout invariant**
+  (`uint32_t d4[4]` ue4m3 scales + `int8_t qs[128]` = 256 packed FP4, `mmq.cuh:53`). Track A's fused
+  `rms_norm+mul+nvfp4-quant` producer (task 39) writes exactly this struct; track B's "prequantized
+  MMQ consumer" (task 40) makes `mul_mat_q` accept a prebuilt `src1_q8_1` buffer and **skip the
+  `quantize_mmq_fp4_cuda` call** (`mmq.cu:138`/`200`). The numerics must be **bit-identical** to the
+  unfused path (same `e2m1` rounding, same `ue4m3` block scale per 16) so the parity gate stays green
+  with the fusion on or off. B owns the consumer seam; A owns the producer kernel; the `block_fp4_mmq`
+  struct is the frozen interface between them.
+
+### 4.5 GB10-fit rules (binding constraints on every kernel change)
+
+- **Small shared mem + high occupancy.** Do **not** add deep `cp.async` stages or XOR-swizzle shared
+  layouts — they are exactly what collapsed W4A16 on GB10 (`W4A16_MARLIN_KERNEL_PLAN.md`: a 16 KB
+  XOR-swizzle dropped q4_K from 6.63→2.84 TFLOPS).
+- **Preserve the skew-pad** (`MMQ_MMA_TILE_X_K_FP4 = 2·MMQ_TILE_NE_K + 8 + 4`, the `% 8 == 4`
+  padding, `mmq.cuh:221/233`) — conflict-free `ldmatrix` at ~zero shared cost.
+- **Stay on the FP4-MMA path** (`block_fp4_mmq` / `mma_block_scaled_fp4`) — the only path at GB10's
+  FP4 = 2× INT8/BF16 rate. Never descend to BF16/INT8 (1:1 on GB10).
+- **Occupancy beats a conflict-free-but-wide layout.** Buy latency-hiding with *more resident CTAs*
+  (smaller `mmq_y`, smaller shared), not a deeper pipeline.
+- Tuning is **empirical** — `nsys` (throughput) is available, **`ncu` is not** on the DGX (no driver
+  perms). Sweep configs, measure decode_agg, bracket thermals (same-session cold A/B only).
+
+---
+
+## 5. Correctness / parity gate (every phase)
+
+- **Primary, bit-exact:** `test-backend-ops test -o MUL_MAT -b CUDA0` and
+  `test-backend-ops test -o MUL_MAT_ID -b CUDA0` must stay **1103/1103** with the flag set **and**
+  unset, and **byte-identical** when unset. The CPU reference is the deterministic oracle; the op test
+  is exact (the GB10 greedy-decode non-determinism band applies only to end-to-end, never to the op
+  test).
+- **Add decode-shape cases if absent:** `type_a ∈ {NVFP4, MXFP4}`, `type_b = F32`, dense **n=128** at
+  the real FFN K/N; for `_ID`, `n_mats=128, n_expert_used=8, n_tokens ∈ {8,32,64,128}` **plus ragged
+  small-M** (experts with 0/1/2 tokens, `n_tokens` not a multiple of `mmq_x`) — exactly where `mmq_x`/
+  `mmq_y` changes and block-pad masking can leak.
+- **Fusion-handoff parity (P3):** with track A's fused producer on, the prequantized-consumer path
+  must produce dst **identical** to the unfused `quantize_mmq_fp4_cuda` path (same `e2m1`/`ue4m3`
+  rounding).
+- **End-to-end:** `llama-batched-bench -fa on -npp 512 -ntg 256 -npl 128` on `q36-27b-nvfp4.gguf`
+  (dense) and `q36-35b-a3b-nvfp4.gguf` (MoE); confirm decode_agg climbs per §6 and output stays within
+  the documented CUDA batch-shape non-determinism band vs the CPU oracle. All scripts **dev-tree-only**.
+
+---
+
+## 6. Phased plan, with expected decode_agg at each phase
+
+Per-step model used (ms @npl128): **dense 795** = GEMM 471 + act 65 + GDN 83 + attn 14 + rest 162;
+**MoE 384** = GEMM 227 + act 31 + GDN 38 + attn 8 + rest 81. `decode_agg = 128 / step_s`.
+
+### DENSE (parity target 391)
+
+| phase | work | GEMM ms | step ms | **decode_agg** | **% of vLLM 391** | risk |
+|---|---|---:|---:|---:|---:|---|
+| **P0** harness | Lock baseline: 1103/1103, decode n=128 perf, nsys window, the 471 ms / 2.9% eff datum. Pin `mmq_x=128` one-read invariant. | 471 | 795 | **161** | 41% | low |
+| **P1** host-only tile/grid + re-read A/B | granularity + stream-k threshold sweep; the `mmq_x=64` re-read-vs-occupancy diagnostic. **Honest: small** — `mmq_x` is pinned, so this mostly de-risks P2. | ~400 | ~724 | **~177** | ~45% | low |
+| **P2** `mmq_y`↓ + occupancy/shallow-prefetch | The make-or-break: raise resident CTAs (`mmq_y` 128→64, granularity, shallow 2-stage weight prefetch, skew-pad), push GEMM toward the **66–81 ms BW floor (17–21% FP4 eff)**. **KILL-GATE: if eff plateaus <15% (GEMM >110 ms) → dense parity OFF, report partial.** | **66–81** | 390–405 | **316–328** | **81–84%** | **med-high** |
+| **P3** co-land track A | Consume A's prequantized `block_fp4_mmq` y-tile; the 65 ms act bucket folds away. | 66–81 | **325–340** | **376–394** | **96–101%** | low |
+
+Dense climb: **161 → ~177 → 316–328 → 376–394** tok/s = **41% → 45% → 81–84% → 96–101% of vLLM 391.**
+Robust to the 273-vs-216 GB/s uncertainty (@216 GB/s P3 → ~359 tok/s = 92%). **Parity within error,
+contingent on P2 clearing the kill-gate and on A landing.**
+
+### MoE (parity target 811)
+
+| phase | work | GEMM ms | step ms | **decode_agg** | **% of vLLM 811** | risk |
+|---|---|---:|---:|---:|---:|---|
+| **P0** harness | Lock 1103/1103 + the monotonic `85→1771` batched-bench curve + 227 ms / 35%-BW datum. | 227 | 384 | **333** | 41% | low |
+| **P1/P4** MoE `mmq_x`↓ (patch 0015 → tighten to 8–16) | Free per-expert tile shrink (no re-read); reclaim the 3–6% fill waste, raise occupancy. | ~140 | ~297 | **~431** | ~53% | low |
+| **P2** block-pad align + occupancy | Remove `need_check` tails, tighten stream-k; push toward the 80 ms floor. | ~100 | ~257 | **~498** | ~61% | med |
+| **P3** co-land track A | act bucket (31 ms) folds away; GEMM at the ~80 ms floor. | 80 | **207** | **618** | **76% — CEILING** | low |
+
+MoE climb: **333 → ~431 → ~498 → 618** tok/s = **41% → 53% → 61% → 76% of vLLM 811.** **The 76% is the
+hard ceiling from the GEMM track:** even a *perfect* weight-read-floor grouped GEMM leaves llama's
+non-GEMM (GDN 38 + attn 8 + rest 81 = 127 ms) at **1.6× vLLM's whole ~78 ms non-GEMM**, so the step
+cannot drop below ~207 ms. The remaining ~49 ms to vLLM's 158 ms step is elementwise + host-loop
+(GDN state I/O is intrinsic and vLLM pays it identically — `GDN_DECODE_VERIFY.md`), **outside track B.**

 ### Explicitly NOT in scope (and why)
- A from-scratch W4A16 / CUTLASS collective — the FP4-MMA path already exists and is BW-optimal at
-  batch 1; rewriting repeats the W4A16 occupancy dead-end (`W4A16_MARLIN_KERNEL_PLAN.md`: STOPPED).
- Deep multi-stage `cp.async` / XOR-swizzle shared layouts — proven to collapse GB10 occupancy.
- The non-GEMM MoE residual (elementwise, host CUDA-graph, GDN bf16 state) — needed for MoE parity
-  but **separate tracks**; track B owns the GEMM only.
+
+- A from-scratch W4A16 / CUTLASS SM120 collective — repeats the STOPPED occupancy dead-end and
+  CUTLASS's grouped FP4 is broken on sm_121.
+- Deep multi-stage `cp.async` / XOR-swizzle — proven to collapse GB10 occupancy.
+- "Make activations 4-bit" — already W4A4; no work, no win there.
+- The non-GEMM MoE residual (elementwise, host CUDA-graph, GDN bf16 state) — needed for MoE parity but
+  **separate tracks**; B owns the GEMM only.

 ---

-## 5. Honest one-paragraph summary
+## 7. The honest ceiling — does B reach TRUE PARITY?
+
+- **DENSE: TRUE PARITY is PLAUSIBLY REACHABLE, conditional, no margin.** The entire 2.42× gap is the
+  GEMM bucket; its ideal floor (66 ms) is 7× below the current 471 ms and is **bandwidth-bound, not
+  hardware-capped**. **B (GEMM → BW floor) + A (act-fuse) lands 376–394 tok/s = 90–103% of vLLM 391.**
+  The catch: it needs **~17–21% FP4-MMA efficiency at decode M=128**, and GB10 has only demonstrated
+  ~17% — and that at the *easier* prefill M=512 tile. It is a **reach, not a lock**, gated by the P2
+  occupancy kill-gate and contingent on track A. **GO (conditional).**
+
+- **MoE: full parity is NOT reachable from track B.** Realistic ceiling **~76% of vLLM (618 vs 811)**
+  even with a perfect weight-read-floor grouped GEMM, because (1) the MoE floor is the hardest
+  grouped-GEMM regime (M≈4/expert, vLLM ships purpose-built Marlin-NvFp4) and (2) ~24% of the step is
+  non-GEMM outside this track. Worth doing (333 → ~618, a 1.85× and a real win), but it **cannot
+  deliver 811 alone.** **PARTIAL / NO-GO for parity-from-B.**
+
+- **The 273 GB/s is not the ceiling — the GB10 FP4-MMA occupancy efficiency is.** Decode M=128 is a
+  *different* regime from the dead W4A16 path: bandwidth/occupancy-bound (saturate LPDDR5x at a thin
+  M-tile via resident CTAs), not compute-throughput-bound (pack MMAs). The existing path is already at
+  the BW floor at batch 1 (88 ms), so the work is **keeping it bandwidth-bound as M grows to 128**
+  (occupancy via `mmq_y`↓ + shallow prefetch), a **tune of a working path**, not the greenfield
+  rewrite. The binding risk is whether that occupancy can be bought without tripping the GB10 wall —
+  which is exactly what the P2 kill-gate measures.
+
+**Bottom line for the "TRUE PARITY" ask:** GB10 **can** plausibly deliver **dense** decode parity with
+vLLM via a tuned FP4-MMA decode GEMM **+ track A**, at the top of the demonstrated efficiency envelope
+with no margin. GB10 **cannot** deliver **MoE** decode parity from the GEMM track alone (ceiling ~76%);
+MoE parity is a B-plus-non-GEMM program. **Verdict: GO for dense (conditional, B+A, kill-gated),
+PARTIAL for MoE.**
+
+---
+
+## 8. One-paragraph summary

 The decode GEMM at M=128 is **bandwidth-bound on paper** (crossover M*≈611 ≫ 128) with weight-read
 floors 4–6× above vLLM, so **273 GB/s is not the wall** — but llama's FP4-MMA kernel runs at ~3% of
-FP4 peak, putting it in **self-inflicted compute-bound territory** (471 ms vs a 66 ms floor). Closing
-that is the entire dense gap: **track B (tune the FP4-MMA decode M-tile to the BW floor) + track A
-(fuse act-quant)** plausibly reaches **90–103% of vLLM dense (391)** — TRUE PARITY is on the table for
-dense, but only at the **top of the demonstrated GB10 FP4-efficiency envelope (~17–21%)** and with
-**no margin**, gated by the occupancy wall. **MoE parity is not reachable from the GEMM alone**
-(ceiling ~60–76% of 811), because its floor sits in the hardest grouped-GEMM regime and ~24% of its
-step is non-GEMM work outside this track. Verdict: **GO for dense (conditional, B+A), PARTIAL for MoE.**
+FP4 peak, in **self-inflicted compute-bound territory** (471 ms vs a 66 ms floor). The path is already
+**W4A4** and already **beats vLLM at batch-1 prefill**, so the fix is **tuning the existing
+`mul_mat_q<NVFP4>`**, not a cutlass rewrite (a proven GB10 dead-end, and broken on sm_121 anyway). The
+M-tile asymmetry sets the levers: **dense** is pinned at `mmq_x=128` (one weight read) so its occupancy
+win is **`mmq_y`↓ + shallow prefetch** (BW-neutral), while **MoE**'s win is the free per-expert
+**`mmq_x`↓** (patch 0015). **Track B (GEMM → BW floor) + track A (fuse act-quant)** plausibly reaches
+**90–103% of vLLM dense (391)** — TRUE PARITY on the table for dense, but only at the **top of the
+demonstrated GB10 FP4-efficiency envelope (~17–21%)**, with **no margin**, gated by the P2 occupancy
+kill-gate. **MoE parity is not reachable from the GEMM alone** (ceiling ~76% of 811), because its floor
+sits in the hardest grouped-GEMM regime and ~24% of its step is non-GEMM. **Verdict: GO for dense
+(conditional, B+A), PARTIAL for MoE.**