diff --git a/backend/cpp/llama-cpp/patches/paged/FP4_GEMM_SCOPE_B.md b/backend/cpp/llama-cpp/patches/paged/FP4_GEMM_SCOPE_B.md index 83f2e3dfb..8475abb92 100644 --- a/backend/cpp/llama-cpp/patches/paged/FP4_GEMM_SCOPE_B.md +++ b/backend/cpp/llama-cpp/patches/paged/FP4_GEMM_SCOPE_B.md @@ -1,271 +1,362 @@ -# Track B: the FP4-MMA weight-GEMM for GB10 decode parity with vLLM — roofline + go/no-go +# Track B: the FP4-MMA weight-GEMM for GB10 decode parity with vLLM — build-ready scope + honest go/no-go -Scope only (build-ready plan + honest verdict). **Not implemented in this workflow.** This is the -residual-kernel track after track A (fuse the standalone `quantize_mmq_nvfp4` activation-requant, -the 8.2% bucket) is handled separately. Track B asks the load-bearing question and answers it -quantitatively: at the decode batch shape (M≈128 tokens, NVFP4 weights), is the weight GEMM -**compute-bound** (FP4-MMA throughput is the lever → parity reachable with a better kernel) or -**bandwidth-bound** (273 GB/s weight-read is a hard floor → parity capped)? And given the prior -GB10 occupancy history, can a better FP4-MMA decode GEMM actually reach vLLM's 391 (dense) / 811 -(MoE) tok/s, or only partway? +Scope only (build-ready plan + honest verdict). **Not implemented in this workflow.** Track B is the +residual-kernel track after track A (fuse the standalone `quantize_mmq_fp4` activation-requant, the +8.2% decode bucket — tasks 38-41, the fused `rms_norm+mul+nvfp4-quant` producer + prequantized-MMQ +consumer) is handled separately. Track B owns the **weight GEMM**, the ~59% bucket. + +**The load-bearing question, restated:** at the decode batch shape (M≈128 tokens fused into one +ubatch, NVFP4 weights), is the weight GEMM **compute-bound** (FP4-MMA throughput is the lever → +parity reachable with a better kernel) or **bandwidth-bound** (273 GB/s weight-read is a hard floor → +parity capped)? And given the GB10 occupancy history, can a better FP4-MMA decode GEMM actually reach +vLLM's **391 (dense) / 811 (MoE)** decode-agg tok/s @npl128, or only partway? Hardware: NVIDIA GB10 / DGX Spark, sm_121 (CC 1210 = `GGML_CUDA_CC_DGX_SPARK`), unified LPDDR5x. Dev tree `~/llama-paged-dev` (branch `paged`, build-cuda sm_121). All numbers are reasoned from the -committed nsys decomposition + measured GB10 specs; **no new GPU benchmarks were run** (track A is on -the box). +committed nsys decomposition + measured GB10 specs + a source read of the FP4-MMA kernel; **no new GPU +benchmarks were run** (track A is on the box). -## 0. The grounded inputs (measured, committed) +## 0. Grounded inputs (measured, committed) | quantity | value | source | |---|---|---| | LPDDR5x bandwidth (spec) | **273 GB/s** | `BLACKWELL_KERNEL_GAPS.md`, `VLLM_DECODE_GROUNDING.md` | -| LPDDR5x bandwidth (achieved, batch-1) | **~216 GB/s** (19 GB / ~88 ms irreducible) | prior batch-1 weight-read study | -| FP4 (NVFP4/MXFP4) dense peak | **~427–500 TFLOP/s** (2× BF16; GB10 is 1:1:2 BF16:INT8:FP4) | `BLACKWELL_KERNEL_GAPS.md` §2 (measured) | -| BF16 peak | ~213 TFLOP/s | same | -| Demonstrated GB10 FP4-MMA efficiency | **~17%** of FP4 peak at prefill M=512 (MXFP4 dense 1153 t/s); ~3–7% at decode; ~5% MoE | `BLACKWELL_KERNEL_GAPS.md` §6, `GDN_DECODE_VERIFY.md` | -| Demonstrated GB10 INT8-MMQ efficiency | ~21% of BF16 peak | `BLACKWELL_KERNEL_GAPS.md` §3 | -| Dense Qwen3.6-27B NVFP4 weights | **18.8 GB** file (`q36-27b-nvfp4.gguf`); ~18 GB matmul tensors | `du` on DGX | -| MoE Qwen3.6-35B-A3B NVFP4 weights | **23.85 GB** file; ~22 GB read/step at npl128 (≈98% experts hit) | `du` on DGX | -| Decode step decomposition (dense npl128, nsys, GPU 92.7% busy) | GEMM_weight **59.2%**, act_quant 8.2%, GDN(recurrent+conv) 10.4%, full-attn 1.8%, elementwise/norm/rope 13.5%, embed 2.9%, copy 1.8% | `GDN_DECODE_VERIFY.md` §3a | -| Measured per-step times @npl128 | dense **~795 ms** (llama) → **~328 ms** (vLLM); MoE **~384 ms** → **~158 ms** | `VLLM_DECODE_GROUNDING.md` | -| Aggregate decode @npl128 | dense 161 (llama) vs **391** (vLLM); MoE 333 vs **811** | `QWEN36_NVFP4_BENCH.md` | +| LPDDR5x bandwidth (achieved, batch-1 weight read) | **~216 GB/s** (19 GB / ~88 ms irreducible) | prior batch-1 study | +| FP4 (NVFP4/MXFP4) dense peak | **~427–500 TFLOP/s** (2× BF16; GB10 is 1:1:2 BF16:INT8:FP4) | `BLACKWELL_KERNEL_GAPS.md` §2 | +| BF16 / INT8 peak | ~213 TFLOP/s / ~215 TOPS (INT8 == BF16 on GB10) | same §2 | +| Demonstrated GB10 FP4-MMA efficiency | **~17%** of FP4 peak at prefill M=512 (MXFP4 dense 1153 t/s); **~3% dense / ~35%-of-BW MoE at decode** | `BLACKWELL_KERNEL_GAPS.md` §6, `GDN_DECODE_VERIFY.md` | +| Dense Qwen3.6-27B NVFP4 weights | **18.8 GB** file; ~18 GB matmul tensors | `du` on DGX | +| MoE Qwen3.6-35B-A3B NVFP4 weights | **23.85 GB** file; ~22 GB read/step @npl128 (~98% experts hit) | `du` on DGX | +| Decode step decomposition (dense npl128, nsys, GPU 92.7% busy) | GEMM_weight **59.2%**, act_quant 8.2%, GDN 10.4%, full-attn 1.8%, elementwise/norm/rope 13.5%, embed 2.9%, copy 1.8% | `GDN_DECODE_VERIFY.md` §3a | +| Measured per-step @npl128 | dense **~795 ms** (llama) → **~328 ms** (vLLM); MoE **~384 ms** → **~158 ms** | `VLLM_DECODE_GROUNDING.md` | +| Aggregate decode @npl128 (the parity scoreboard) | dense **161** (llama) vs **391** (vLLM); MoE **333** vs **811** | `QWEN36_NVFP4_BENCH.md` | -Crossover formula used throughout (per-GEMM and whole-model are identical): -`M* = b · peak / (2 · BW)` where `b` = bytes per weight element. Below `M*` the GEMM is -bandwidth-bound; above it, compute-bound. +`decode_agg = npl / step_s = 128 / step_s`. Crossover formula throughout: +`M* = b · peak / (2 · BW)`, `b` = bytes per weight element. Below `M*` bandwidth-bound, above it +compute-bound. --- -## 1. DENSE Qwen3.6-27B — the roofline at decode M=128 +## 1. The kernel-approach decision: TUNE the existing FP4-MMA `mul_mat_q`, do NOT write a cutlass kernel -`b = 18e9 B / 27e9 params = 0.667 B/param`. FLOPs/step `= 2·M·P = 2·128·27e9 = 6.91 TFLOP`. +This is the first thing track B must settle, and the evidence settles it decisively. -**(a) Weight-read floor** (weights read ONCE for all 128 tokens): -- @273 GB/s: 18 GB / 273 = **65.9 ms/step → 1,942 tok/s ceiling** -- @216 GB/s (achieved): 18 / 216 = **83 ms/step → 1,542 tok/s** +| option | verdict | why | +|---|---|---| +| **(A) Tune the existing `mul_mat_q` FP4-MMA path** | **CHOSEN — the tractable spine** | The kernel already exists, is **bit-exact** (`test-backend-ops MUL_MAT` 1103/1103), is genuine **W4A4** (below), and already **beats vLLM at batch-1 prefill** (MXFP4 1153 t/s vs vLLM's 800 W4A16 — vLLM has no FP4 cubins on sm_121). The deficit is **decode-shape scheduling**, not the math op. Host-side selection + a bounded occupancy tune respects the GB10 lessons and is build-ready against known files/lines. | +| **(B) New cutlass-style SM120 FP4 collective** | **REJECTED** | Repeats the **proven GB10 dead-end**: the from-scratch W4A16 BF16 GEMM hit only ~9–15 TFLOP/s (¼ of MMQ) and was **STOPPED** (`W4A16_MARLIN_KERNEL_PLAN.md`) because deep `cp.async` + XOR-swizzle **collapse GB10 occupancy**. Worse, **CUTLASS's own SM120 grouped block-scaled FP4 GEMM is broken on consumer Blackwell** (garbage/init-fail — CUTLASS #3096/#2800) — it is the exact reason vLLM falls back to **BF16 Marlin** for its MoE on sm_121. "Port cutlass" is not even a working option for the MoE arm. | +| **(C) Marlin-style W4A16 (FP4→BF16 dequant + BF16 HMMA)** | **REJECTED for the win, noted for context** | This is what **vLLM's MoE actually runs** on sm_121 (W4A16, BF16 activations, dequant-in-mainloop). On GB10 **INT8 == BF16 == ½ FP4 rate**, so a BF16-HMMA path concedes the 2× FP4 advantage llama already has. We do not want to *descend* to vLLM's slower arithmetic class; we want to keep the FP4-MMA class and schedule it better. | -**(b) Compute floor:** -- @FP4 peak 500 TF/s: 6.91 / 500 = **13.8 ms → 9,275 tok/s** -- @17% FP4 (85 TF/s, the demonstrated prefill ceiling): 81 ms → 1,580 tok/s -- @5% FP4 (25 TF/s, measured decode regime): 276 ms → 464 tok/s +**Decision: track B = tune `mul_mat_q` (dense, `mmq.cu`/`mmq.cuh`) + the grouped `mul_mat_q` +id-branch (MoE, `mmid.cu` + the same `mmq.cuh`).** No new kernel, no rewrite, no descent to BF16. +The win is kernel *engineering around an FP4-MMA llama already possesses*, so there is **no +hardware-instruction wall** — but it is gated by whether MMQ's occupancy-bound design can be pushed +to the bandwidth floor at the thin decode M-tile. -**(c) Crossover:** -- At FP4 **peak**: `M* = 0.667·500e12 / (2·273e9) = 611`. **M=128 ≪ 611 → an ideal FP4 GEMM at decode is BANDWIDTH-BOUND.** -- At the kernel's **achieved** efficiency the effective peak collapses, dragging `M*` down: 17% → M*≈104; 5% → M*≈30. So **at its current ~3–7% efficiency the kernel is COMPUTE-BOUND at M=128** (limited by its own poor FP4-MMA throughput), even though the hardware says it should be bandwidth-bound. +### What "the existing path" actually is (source-read, DGX `ggml/src/ggml-cuda/`) -**Where llama actually sits:** GEMM = 59.2% × 795 ms = **471 ms**. Achieved = 6.91e12 / 0.471 = -**14.7 TFLOP/s = 2.9% of FP4 peak**. That is **7.1× slower than the 66 ms weight-read floor** and -matches the ~3–7% decode-efficiency band. The 471 ms is not a hardware bandwidth wall — it is the -FP4-MMA kernel running deep in compute-bound territory at single-digit efficiency. +Decode runs **one `mul_mat_q` per weight, M=128** (all 128 slots' single tokens fused into one +ubatch — confirmed `mul_mat_q(M=128)` in `GDN_DECODE_VERIFY.md`, not 128× M=1). The NVFP4 path: +`mmq.cu` `use_native_fp4` gate (L125) → `quantize_mmq_fp4_cuda` act-quant (L138 dense / L200 id; +**track A's fuse target**) → `mul_mat_q` → `vec_dot_fp4_fp4_mma` (`mmq.cuh:997`) → +`mma_block_scaled_fp4` (`mma.cuh:1126`). -**Where vLLM sits:** step 328 ms → if its native-FP4 cutlass GEMM is at the ~66 ms BW floor, the -GEMM is only ~20% of vLLM's step; the rest (~262 ms) is GDN + full-attn + host. vLLM's **whole step -(328 ms) ≈ llama's GEMM bucket alone (471 ms)** minus a bit. The entire 2.42× gap is the GEMM. +**Confirmed W4A4 (this corrects an earlier "A is 8-bit-class" framing):** `block_fp4_mmq` +(`mmq.cuh:53`) is `uint32_t d4[4]` (four `ue4m3` block scales) + `int8_t qs[4*32]` = **256 FP4 (e2m1) +values packed 2-per-byte**. `quantize_mmq_fp4_cuda` (`quantize.cu:422`) emits FP4 via +`ggml_cuda_float_to_fp4_e2m1`. The MMA is +`mma.sync.aligned.kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue4m3` +(`mma.cuh:1145`) — **both operands e2m1, ue4m3 block scales**. So llama's dense FP4-MMA path is +already the *same arithmetic class as vLLM's cutlass W4A4 dense*. The `sizeof(block_fp4_mmq) == +sizeof(block_q8_1_mmq)` static_assert is a shared-tile-footprint convention, **not** an 8-bit +activation. **Consequence: there is no "make activations 4-bit" work to do and no activation-traffic +halving to win — that is already banked. The entire dense deficit is scheduling/occupancy.** -**Dense parity arithmetic** (795 ms = GEMM 471 + act 65 + GDN 83 + attn 14 + rest 162): -- B alone (GEMM → 66 ms BW floor, requires ~21% FP4 eff): step 728→… = 66+65+83+14+162 = **390 ms → 328 tok/s = 84% of vLLM**. -- **B + A** (GEMM 66 ms floor **and** act-quant fused away): 66+83+14+162 = **325 ms → 394 tok/s = 101% of vLLM → PARITY/BEAT.** -- B+A at the softer 17% FP4 (GEMM 81 ms, the *demonstrated* prefill ceiling, not the 21% floor): 340 ms → **376 tok/s = 96% of vLLM.** - -**Dense robust band: 90–103% of vLLM**, and it is insensitive to the 273-vs-216 GB/s uncertainty -(at 216 GB/s the floor is 83 ms → step 357 ms → 359 tok/s = 92%). The conclusion holds. - ---- - -## 2. MoE Qwen3.6-35B-A3B — the roofline at decode M=128 - -At npl128, 128 tokens × top-8 over 256 experts ⇒ P(expert unused) = (1−8/256)^128 ≈ 1.7%, so -**~98% of experts are read** → ~22 GB/step (essentially the full weight set), the same -weight-read regime as dense. The grouped GEMM (`mmid.cu` / `mul_mat_q` id-branch) reads each -routed expert's weight **once** for the ~128·8/256 = **4 tokens/expert** on average. - -**(a) Weight-read floor:** -- @273 GB/s: 22 / 273 = **80.6 ms → 1,588 tok/s** -- @216 GB/s: 102 ms → 1,255 tok/s - -**(b) Compute floor:** only ~3B active params/token → FLOPs = 2·128·3e9 = 0.77 TFLOP → 1.5 ms @peak. -**Trivial.** MoE decode is **purely bandwidth/occupancy bound**, never compute-bound. The hard part -is that per-expert M ≈ 4: the grouped GEMM must saturate ~273 GB/s while feeding tiny ragged M-tiles -— the regime where ggml's dense-tuned `mmq_x=128` underfills (see `MOE_GROUPED_GEMM_SCOPE.md`). - -**Where llama sits:** GEMM = 59% × 384 = **227 ms** → effective BW 22 GB / 0.227 s = -**97 GB/s = 35% of 273** (less compute-bound than dense, but only 1/3 of peak bandwidth — an -occupancy/tile-fill loss, exactly the `MOE_GROUPED_GEMM_SCOPE.md` M-tile finding). - -**Where vLLM sits:** step 158 ms ≈ GEMM at the ~80 ms floor (grouped Marlin-NvFp4, 51% of its step) -+ ~78 ms non-GEMM. So vLLM is already pushing the MoE bandwidth floor. - -**MoE parity arithmetic** (384 ms = GEMM 227 + act 31 + GDN 38 + attn 8 + rest 81): -- B + A, GEMM → 80 ms floor + act fused: 80+38+8+81 = **207 ms → 618 tok/s = 76% of vLLM.** -- This is the **ceiling from the GEMM track**: even with a *perfect* MoE weight-read-floor GEMM, - llama's non-GEMM (GDN 38 + attn 8 + rest 81 = 127 ms) is **1.6× vLLM's whole non-GEMM (~78 ms)**, - so the step cannot drop below ~207 ms. To reach vLLM's 158 ms needs the non-GEMM buckets too - (GDN state I/O is intrinsic and vLLM pays it identically — `GDN_DECODE_VERIFY.md` — so the - remaining ~49 ms is elementwise + host loop, **outside track B**). - -**MoE band from B+A: ~60–76% of vLLM.** Full MoE parity is **not reachable from the GEMM alone.** - ---- - -## 2b. The precise code-level inefficiencies (source-read, the "why slower than vLLM") - -Decode runs **one `mul_mat_q` per weight, M=128** (all 128 slots' single tokens are fused into one -ubatch — confirmed `mul_mat_q(M=128)` in `GDN_DECODE_VERIFY.md`, not 128 × M=1). The NVFP4 path: -`mmq.cu` `use_native_fp4` gate → `quantize_mmq_fp4_cuda` (act-quant) → `mul_mat_q` → -`vec_dot_fp4_fp4_mma` (`mmq.cuh:997`) → `mma_block_scaled_fp4` (`mma.cuh:1126`, PTX -`mma.sync...kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue4m3`). -Geometry: `get_mmq_x_max=128`, `mmq_y=128`, `nwarps=256/32=8`, `iter_k=MMQ_ITER_K_FP4=512`. Tiles: +Geometry (`vec_dot_fp4_fp4_mma`): `MMQ_NWARPS=8`, `iter_k=MMQ_ITER_K_FP4=512`, tiles `tile_A<16,8,int>` (weights, 16 N-rows × 64 FP4-in-K), `tile_B<8,8,int>` (acts, 8 M-cols × 64 -FP4-in-K), `tile_C<16,8,float>` (16 N-rows × 8 M-cols), `nfrags=32/8=4`. - -1. **Separate activation-quant pass (track A's target).** `quantize_mmq_fp4_cuda` writes the *entire* - activation tensor to `block_fp4_mmq` in a standalone kernel before `mul_mat_q`. vLLM fuses - `scaled_fp4_quant` into the preceding RMSNorm/SiLU epilogue (`rms_quant_fusion`/`act_quant_fusion`) - — no separate pass, no extra activation read+write+launch. 8.2% of the npl128 step. **B must consume - A's in-place `block_fp4_mmq` y-tile** so the fusion saves the round-trip, not just the launch. - -2. **No weight-load software pipeline → exposed latency at thin M (the #1 kernel lever).** - `load_tiles_nvfp4_nvfp4` does plain shared stores → `__syncthreads` → `vec_dot_fp4_fp4_mma` - (`load_ldmatrix` + MMAs): a **load→sync→compute→repeat** cadence with **no `cp.async` - double-buffering** overlapping the next K-block weight load with the current MMA. At M=128 the per- - tile MMA work is small (8 M-cols per `tile_C::J`), so serialized weight-load latency dominates → - the ~3% (dense) / 35%-of-BW (MoE) result. vLLM's Marlin runs a 4-stage `cp.async` pipeline. **The - defining caveat:** a *deep* pipeline + XOR-swizzle collapses GB10 occupancy - (`W4A16_MARLIN_KERNEL_PLAN.md`); the fix is a **shallow 2-stage prefetch + skew-pad**, not Marlin's 4. - -3. **`mmq_x` selector maximizes the M-tile — the opposite of the GB10 occupancy rule.** - `mul_mat_q_case` picks `mmq_x` by *minimizing* `ntiles_x = ceil(ncols_max/mmq_x)`, so it always - takes the *largest* tile that fits shared. Dense decode → `mmq_x=128`, `mmq_y=128`: a heavy 128×128 - tile (8 warps) → low occupancy on the occupancy-dominated GB10. No padding waste and no redundant - weight read (`ntx=1` → each weight row-tile read once), so the loss is pure occupancy; a smaller - `mmq_x` with more resident CTAs may hide load latency better (P1 host-only sweep, zero kernel risk). - -4. **MoE per-expert M-tile waste (the structural MoE gap).** Stock applies the 128-wide tile *per - expert*; per-expert density is ~4 tokens (top-8 of 256 @npl128), so the 128-wide accumulator is - ~3% filled and only ~1 `tile_C` N-fragment is live (`tile_C::J=8`), the rest masked `need_check` - tails. Patch 0015 (`MOE_DENSITY_AUTO_TILE.md`) auto-caps to 64 at decode, but the ideal is - ~tokens/expert ≈ 8 — even 64 is ~8× too big. vLLM uses a small per-expert `BLOCK_SIZE_M` (16/32). - At ≤1 col-tile/expert a smaller tile costs **no** extra weight re-read → strictly occupancy-positive. - (Inefficiency 4 is the MoE arm of 3; at dense M=128, 128/8=16 N-frags are fully used — no dense - M-waste.) - -5. **`iter_k=512` (FP4) vs 256 couples to occupancy.** The FP4 main loop stages 512 K-elements/iter → - larger shared footprint → fewer iters but more pressure on the occupancy-bound part. A P5 knob. - -**Ruled out (so B does not chase them):** redundant weight reads (none — dense `ntx=1`, MoE ≤1 -col-tile/expert; the low effective BW is latency/occupancy, not re-reads); stream-K fixup (it *helps* -fill the small GB10 grid, cheap at thin M); raw FP4-MMA peak rate (the path already beats Q4-MMQ and -is BW-bound at batch 1 — at M=128 latency-hiding binds first, not MMA throughput). +FP4-in-K), `tile_C<16,8,float>` (16 N-rows × 8 M-cols), `nfrags = MMQ_TILE_NE_K/tile_A::J`. The M loop +is `for (j0=0; j0` FP4-MMA kernel**, co-delivered with track A. Honest expectation: **90–103% of - vLLM (parity within error), not a guaranteed beat.** Go condition: it is contingent on reaching - ~17–21% FP4 efficiency at M=128 (top of the demonstrated GB10 envelope) — set a P2 kill-gate - (below). -- **MoE — PARTIAL / NO-GO for parity-from-B.** Track B (the M-tile work already scoped in - `MOE_GROUPED_GEMM_SCOPE.md`) buys MoE → ~60–76% of vLLM and is worth doing, but **cannot deliver - MoE parity by itself**; do not promise 811. Full MoE parity requires B + the non-GEMM tracks - (elementwise/host CUDA-graph, GDN state I/O bf16) and is a multi-track effort. +- **Weight-read floor:** 22/273 = **80.6 ms → 1,588 tok/s** (@216: 102 ms → 1,255). +- **Compute floor:** only ~3B active params ⇒ 0.77 TFLOP ⇒ 1.5 ms @peak — **trivial. MoE decode is + purely bandwidth/occupancy-bound, never compute-bound.** The hard part is saturating 273 GB/s while + feeding ragged M≈4 tiles. +- **Where llama sits:** GEMM = 59% × 384 = **227 ms = 97 GB/s = 35% of peak BW** (occupancy/tile-fill + loss, not compute). +- **Where vLLM sits:** step 158 ms ≈ grouped Marlin-NvFp4 at the ~80 ms floor + ~78 ms non-GEMM — + already pushing the MoE BW floor. -**Bottom line for the "TRUE PARITY" ask:** GB10 **can** plausibly deliver **dense** decode parity -with vLLM via a tuned FP4-MMA decode GEMM **+ track A**, at the edge of the demonstrated efficiency -envelope and with no margin. GB10 **cannot** deliver **MoE** decode parity from the GEMM track -alone (ceiling ~76%); MoE parity is a B-plus-non-GEMM program. The hardware (273 GB/s) is **not** the -ceiling — the GB10 FP4-MMA occupancy efficiency is, and it is a "reach" for dense and a "partial" for -MoE. +**Both weight-read floors (dense ~1,940, MoE ~1,590 tok/s) sit 4–6× ABOVE vLLM's 391/811. Bandwidth +is not the wall; the GB10 FP4-MMA occupancy efficiency is.** --- -## 4. Build-ready plan (do NOT implement here) +## 3. The code-level inefficiencies, and the M-tile asymmetry that drives the whole plan -The kernels already exist; track B is a **tune + fuse of the FP4-MMA `mul_mat_q` path at the decode -M-tile**, not a new kernel. This respects every GB10 occupancy lesson (small shared, high occupancy, -skew-pad, stay on `block_fp4_mmq`; never deep `cp.async` / XOR-swizzle). +The selection is `mul_mat_q_case` (`mmq.cuh:4108`): it loops `mmq_x = 8..mmq_x_max(=128) step 8` and +keeps the `mmq_x` that **minimizes `ntiles_x = ceil(ncols_max/mmq_x)`**, stopping at `ntiles_x==1`. +`mmq_y` (the weight-row tile) is pinned at **128** by `get_mmq_y_host` (L143). This produces the +single most important structural fact for track B: -### Files (DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`) -- `mmq.cuh` — `block_fp4_mmq` (L53), `load_tiles_nvfp4_nvfp4` (L948), `vec_dot_fp4_fp4_mma` (L997), - the stream-k `mul_mat_q` kernel + `mul_mat_q_case` / `launch_mul_mat_q` tile selection (~L3320–4055, - all under `BLACKWELL_MMA_AVAILABLE`). -- `mmq.cu` — dense + id dispatch; `use_native_fp4` gate (L125), `quantize_mmq_fp4_cuda` act-quant - (L138/L200 — **track A's fuse target**). -- `mmid.cu` — `mm_ids_helper` MoE token-sort (the MoE M-tile lever, scoped in `MOE_GROUPED_GEMM_SCOPE.md`). +> **`mmq_x` tiles M (tokens / output columns) — shrinking it RE-READS the weights `ntiles_x` times. +> `mmq_y` tiles N (weight rows / output rows) — shrinking it does NOT re-read weights (each weight row +> lives in exactly one row-tile); it only lowers shared footprint and raises occupancy.** The two +> regimes pick opposite knobs: -### Phases (each ends with: `test-backend-ops -o MUL_MAT[/_ID] -b CUDA0` bit-exact + a decode bench) +| | dense decode (M=128, no `expert_bounds`) | MoE decode (per-expert M≈4) | +|---|---|---| +| selection picks | `mmq_x=128` → `ntiles_x=1` → **weights read ONCE** (the one-read optimum) | `mmq_x=128` applied **per expert** → tile ~3% filled | +| shrink `mmq_x`? | **NO — re-reads 18 GB ×`ntiles_x`**, fatal in the BW-bound regime | **YES, FREE** — 1 col-tile/expert regardless, no re-read → strictly occupancy-positive | +| FP4-MMA M-frag fill | **full** (128/`tile_C::J`=16 frag-groups, all live) → no fragment waste | **wasted** (~1 of 8/16 frag-groups live, rest masked tails) | +| BW-neutral occupancy lever | **`mmq_y`↓** (more resident CTAs, weights still read once) — kernel-structure change | **`mmq_x`↓** (toward density ≈8) — host-side template switch | +| dominant loss | **occupancy** at the heavy 128×128 tile (exposed weight-load latency) | **tile-fill** (dense-tuned M-tile applied to ragged M≈4) | -| Phase | Work | Expected payoff | Risk | -|---|---|---|---| -| **P0** harness | Capture per-shape baseline at the **decode shape** (`test-backend-ops perf -o MUL_MAT`, type NVFP4, **n=128**, FFN K/N) + nsys decode window. Lock 1103/1103 parity + the 14.7 TFLOP/s baseline. Decode-M is the canonical target, not prefill n=512. | None (gate). | Low | -| **P1** decode M-tile selection (dense) | In `mul_mat_q_case`/`launch_mul_mat_q`, pick `mmq_x`/`mmq_y` from the **decode M=128** shape rather than the prefill-tuned config. M=128 with FP4 N-frag 8 wants a small, occupancy-friendly tile; the prefill `mmq_x=128` likely underfills SM occupancy at decode. Host-side template selection, **zero new kernel**, mirrors `MOE_GROUPED_GEMM_SCOPE.md` [1]. | Lift dense FP4 eff from ~3% toward 10–17%; no extra weight read (one col-tile). | Low | -| **P2** occupancy/pipeline tune | Sweep warps/tile/skew-pad on the FP4-MMA decode kernel to push toward the **66 ms BW floor (~21% FP4 eff)**. Honor GB10 rules: small shared, high occupancy, skew-pad +4, **no** deep cp.async / XOR-swizzle. **KILL-GATE:** if decode FP4 eff plateaus < ~15% (GEMM > ~110 ms) after the sweep, dense parity is off — stop and report partial. | The dense parity make-or-break. Target GEMM 471→66–81 ms. | **Med-high** (the occupancy wall is real; ncu unavailable on DGX → empirical sweep only) | -| **P3** co-land track A | Verify the fused act-quant (track A) composes with the tuned GEMM (the requant folds into the FP4 GEMM prologue, removing the 8.2% bucket). | Dense 376–394 tok/s = 90–103% vLLM. | Low (track A owns the fuse) | -| **P4** MoE M-tile | Land the `MOE_GROUPED_GEMM_SCOPE.md` expert-aware `mmq_x` ([1]) + block-pad align ([2]). | MoE → ~60–76% vLLM (not parity). | Med | +This asymmetry is the spine of the plan: **MoE's lever is host-only `mmq_x`↓ (already landed as patch +0015 auto-cap→64; ideal ≈8–16); dense's lever is `mmq_y`↓ + occupancy, a bounded kernel change.** -### Parity gate (every phase) -`GGML_CUDA_*` flag set and unset → `test-backend-ops test -o MUL_MAT -b CUDA0` = **1103/1103**, -byte-identical when unset. Add **decode-shape (n=128) + ragged small-M** cases if absent. End-to-end: -`llama-batched-bench -fa on -npp 512 -ntg 256 -npl 128` on `q36-27b-nvfp4.gguf`, confirm decode -agg climbs toward ~376–394 and stays bit-stable vs the CPU oracle (within the GB10 greedy-decode -non-determinism band). All bench/parity scripts **dev-tree-only**. +The five inefficiencies, ranked: + +1. **Separate activation-quant pass (track A's bucket, 8.2%).** `quantize_mmq_fp4_cuda` writes the + whole activation tensor to `block_fp4_mmq` in a standalone kernel; vLLM fuses `scaled_fp4_quant` + into the preceding RMSNorm/SiLU epilogue. **Handoff (track A → B):** B must consume A's prequantized + `block_fp4_mmq` y-tile in place of calling `quantize_mmq_fp4_cuda`, so the fusion saves the + activation round-trip, not just the launch (see §4.4). + +2. **No weight-load software pipeline → exposed latency at thin M (the #1 dense kernel lever).** + `load_tiles_nvfp4_nvfp4` (`mmq.cuh:946`) does plain global→shared stores → `__syncthreads` → + `vec_dot_fp4_fp4_mma` (`load_ldmatrix` of A + MMA): a **load→sync→compute→repeat** cadence with **no + `cp.async` double-buffering** overlapping the next k-block weight load with the current MMA. At + M=128 the per-tile MMA work is small, so serialized weight-load latency dominates → 2.9% (dense) / + 35%-of-BW (MoE). **Caveat (the GB10 wall):** a *deep* pipeline + XOR-swizzle collapses GB10 + occupancy (`W4A16_MARLIN_KERNEL_PLAN.md`). The fix is **occupancy-first** (raise resident CTAs to + hide latency via CTA-parallelism), **shallow 2-stage prefetch second**, never Marlin's 4-stage. + +3. **`mmq_x` maximized for dense = occupancy-heavy, but pinned by the one-read constraint.** At dense + decode the 128×128 tile (8 warps, large shared) is low-occupancy on the occupancy-dominated GB10 — + but you cannot shrink `mmq_x` without doubling the 18 GB weight read. So the dense occupancy fix is + **`mmq_y`↓** (BW-neutral), not `mmq_x`↓. + +4. **MoE per-expert M-tile waste (the structural MoE gap).** The 128-wide (or patch-0015 64-wide) + tile is applied per expert at density ≈4, so the accumulator is ~3–6% filled and ~1 `tile_C` frag- + group is live, the rest masked `need_check` tails. Ideal `mmq_x` ≈ tokens/expert ≈ 8 (= `tile_C::J`). + At ≤1 col-tile/expert this costs **no** extra weight read → strictly occupancy-positive. (This is + the MoE arm of inefficiency 3; scoped in `MOE_GROUPED_GEMM_SCOPE.md`.) + +5. **`iter_k=512` (FP4) couples to occupancy.** The FP4 main loop stages 512 K-elements/iter → larger + shared footprint → adverse in the occupancy-bound regime. A P2 tuning knob. + +**Ruled out (do not chase):** redundant weight reads on the *current* selection (none — dense +`ntiles_x=1`, MoE ≤1 col-tile/expert); stream-K fixup (it *helps* fill the small GB10 grid at thin M); +raw FP4-MMA peak rate (already beats Q4-MMQ and is BW-bound at batch 1 — latency-hiding binds first). + +--- + +## 4. The specific build-ready changes + +All against DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`. Every change is gated and defaults to exact +stock behavior until proven. + +### 4.1 Dense M-tile / occupancy (the make-or-break) + +- **Keep `mmq_x=128` at dense decode** (the one-weight-read optimum; do **not** shrink it — that + re-reads 18 GB). Lock this as an invariant in P0. +- **Make `mmq_y` decode-selectable** (`get_mmq_y_host`/`get_mmq_y_device`, L143/L157). Today pinned + 128; try **64** (and 96) at decode. `mmq_y` is coupled to `nwarps × tile_C::I` via the MMQ + static_assert, so this is a **warp/fragment remap** (bounded kernel change), not a pure host switch: + fewer N-frags per warp or fewer warps → smaller per-CTA shared → **more resident CTAs → latency + hidden by CTA-parallelism**, with **weights still read once** (BW-neutral). This is the primary + dense occupancy lever and respects every GB10 rule. +- **Host-only knobs first (P1, zero kernel):** the `mmq_get_granularity_host` choice (L274 — sets + `rows_per_warp=2·granularity`, `ntx`), and the stream-k-vs-xy-tiling threshold (`launch_mul_mat_q` + ~L3954, `tiles_efficiency_percent` L4001). Plus one **empirical A/B**: does eating a 2× weight + re-read at `mmq_x=64` buy enough occupancy to net positive? (Diagnostic: if yes, occupancy is badly + broken and P2 `mmq_y`↓ has large upside; if no, the tile is already BW-saturated and P2's ceiling is + lower.) All behind `GGML_CUDA_FP4_MMQ_Y` / `GGML_CUDA_FP4_GRAN` / `GGML_CUDA_FP4_FORCE_STREAMK`. + +### 4.2 FP4-MMA fragment usage + +- Fragments stay `tile_A<16,8,int>` / `tile_B<8,8,int>` / `tile_C<16,8,float>` — these match the + `m16n8k64` block-scaled FP4 MMA and must not change (they are the instruction shape). At dense M=128 + all 16 `tile_C::J`-groups are live → **no dense fragment work needed**. The lever is *how many of + these tiles are resident per SM* (occupancy), set by `mmq_y`/`nwarps`/granularity, not the fragment + shape. +- MoE: shrink `mmq_x` toward `tile_C::J`=8 so the live frag-group count matches density (§4.3). + +### 4.3 MoE M-tile (`MOE_GROUPED_GEMM_SCOPE.md`, partly landed) + +- **Patch 0015 already auto-caps `mmq_x`→64 at decode** via per-expert density in `mul_mat_q_case` + (the `expert_bounds != nullptr` block, L4118-4165; env `LLAMA_MOE_DECODE_TILE`, + `LLAMA_MOE_DENSITY_MAX`). Tighten the decode tile toward **8–16** (= density) and sweep. +- **Optional [2]: block-padded `mm_ids_helper`** (`mmid.cu`) — pad each expert segment to a multiple + of the tile, removing `need_check` masked tails and tightening the stream-k schedule. Medium risk + (scatter + write-back masking); behind `LLAMA_MOE_BLOCK_ALIGN`. + +### 4.4 Scale handling + the act-quant fusion handoff (the track A → B ABI contract) + +- **Weight scales** (`ue4m3`, one per 16 weights) load in `load_tiles_nvfp4_nvfp4` into `x_sc` + (`x_u32 + 64 + kbx`), consumed as `scaleA` in `vec_dot_fp4_fp4_mma` and passed as the block-scale + operand to `mma_block_scaled_fp4`. **No change** — already a first-class MMA scale operand. +- **Activation scales** (`ue4m3`) live in the `block_fp4_mmq` y-tile `d4[4]`, consumed as `scaleB`. +- **The handoff contract:** track B must hold the **`block_fp4_mmq` y-tile layout invariant** + (`uint32_t d4[4]` ue4m3 scales + `int8_t qs[128]` = 256 packed FP4, `mmq.cuh:53`). Track A's fused + `rms_norm+mul+nvfp4-quant` producer (task 39) writes exactly this struct; track B's "prequantized + MMQ consumer" (task 40) makes `mul_mat_q` accept a prebuilt `src1_q8_1` buffer and **skip the + `quantize_mmq_fp4_cuda` call** (`mmq.cu:138`/`200`). The numerics must be **bit-identical** to the + unfused path (same `e2m1` rounding, same `ue4m3` block scale per 16) so the parity gate stays green + with the fusion on or off. B owns the consumer seam; A owns the producer kernel; the `block_fp4_mmq` + struct is the frozen interface between them. + +### 4.5 GB10-fit rules (binding constraints on every kernel change) + +- **Small shared mem + high occupancy.** Do **not** add deep `cp.async` stages or XOR-swizzle shared + layouts — they are exactly what collapsed W4A16 on GB10 (`W4A16_MARLIN_KERNEL_PLAN.md`: a 16 KB + XOR-swizzle dropped q4_K from 6.63→2.84 TFLOPS). +- **Preserve the skew-pad** (`MMQ_MMA_TILE_X_K_FP4 = 2·MMQ_TILE_NE_K + 8 + 4`, the `% 8 == 4` + padding, `mmq.cuh:221/233`) — conflict-free `ldmatrix` at ~zero shared cost. +- **Stay on the FP4-MMA path** (`block_fp4_mmq` / `mma_block_scaled_fp4`) — the only path at GB10's + FP4 = 2× INT8/BF16 rate. Never descend to BF16/INT8 (1:1 on GB10). +- **Occupancy beats a conflict-free-but-wide layout.** Buy latency-hiding with *more resident CTAs* + (smaller `mmq_y`, smaller shared), not a deeper pipeline. +- Tuning is **empirical** — `nsys` (throughput) is available, **`ncu` is not** on the DGX (no driver + perms). Sweep configs, measure decode_agg, bracket thermals (same-session cold A/B only). + +--- + +## 5. Correctness / parity gate (every phase) + +- **Primary, bit-exact:** `test-backend-ops test -o MUL_MAT -b CUDA0` and + `test-backend-ops test -o MUL_MAT_ID -b CUDA0` must stay **1103/1103** with the flag set **and** + unset, and **byte-identical** when unset. The CPU reference is the deterministic oracle; the op test + is exact (the GB10 greedy-decode non-determinism band applies only to end-to-end, never to the op + test). +- **Add decode-shape cases if absent:** `type_a ∈ {NVFP4, MXFP4}`, `type_b = F32`, dense **n=128** at + the real FFN K/N; for `_ID`, `n_mats=128, n_expert_used=8, n_tokens ∈ {8,32,64,128}` **plus ragged + small-M** (experts with 0/1/2 tokens, `n_tokens` not a multiple of `mmq_x`) — exactly where `mmq_x`/ + `mmq_y` changes and block-pad masking can leak. +- **Fusion-handoff parity (P3):** with track A's fused producer on, the prequantized-consumer path + must produce dst **identical** to the unfused `quantize_mmq_fp4_cuda` path (same `e2m1`/`ue4m3` + rounding). +- **End-to-end:** `llama-batched-bench -fa on -npp 512 -ntg 256 -npl 128` on `q36-27b-nvfp4.gguf` + (dense) and `q36-35b-a3b-nvfp4.gguf` (MoE); confirm decode_agg climbs per §6 and output stays within + the documented CUDA batch-shape non-determinism band vs the CPU oracle. All scripts **dev-tree-only**. + +--- + +## 6. Phased plan, with expected decode_agg at each phase + +Per-step model used (ms @npl128): **dense 795** = GEMM 471 + act 65 + GDN 83 + attn 14 + rest 162; +**MoE 384** = GEMM 227 + act 31 + GDN 38 + attn 8 + rest 81. `decode_agg = 128 / step_s`. + +### DENSE (parity target 391) + +| phase | work | GEMM ms | step ms | **decode_agg** | **% of vLLM 391** | risk | +|---|---|---:|---:|---:|---:|---| +| **P0** harness | Lock baseline: 1103/1103, decode n=128 perf, nsys window, the 471 ms / 2.9% eff datum. Pin `mmq_x=128` one-read invariant. | 471 | 795 | **161** | 41% | low | +| **P1** host-only tile/grid + re-read A/B | granularity + stream-k threshold sweep; the `mmq_x=64` re-read-vs-occupancy diagnostic. **Honest: small** — `mmq_x` is pinned, so this mostly de-risks P2. | ~400 | ~724 | **~177** | ~45% | low | +| **P2** `mmq_y`↓ + occupancy/shallow-prefetch | The make-or-break: raise resident CTAs (`mmq_y` 128→64, granularity, shallow 2-stage weight prefetch, skew-pad), push GEMM toward the **66–81 ms BW floor (17–21% FP4 eff)**. **KILL-GATE: if eff plateaus <15% (GEMM >110 ms) → dense parity OFF, report partial.** | **66–81** | 390–405 | **316–328** | **81–84%** | **med-high** | +| **P3** co-land track A | Consume A's prequantized `block_fp4_mmq` y-tile; the 65 ms act bucket folds away. | 66–81 | **325–340** | **376–394** | **96–101%** | low | + +Dense climb: **161 → ~177 → 316–328 → 376–394** tok/s = **41% → 45% → 81–84% → 96–101% of vLLM 391.** +Robust to the 273-vs-216 GB/s uncertainty (@216 GB/s P3 → ~359 tok/s = 92%). **Parity within error, +contingent on P2 clearing the kill-gate and on A landing.** + +### MoE (parity target 811) + +| phase | work | GEMM ms | step ms | **decode_agg** | **% of vLLM 811** | risk | +|---|---|---:|---:|---:|---:|---| +| **P0** harness | Lock 1103/1103 + the monotonic `85→1771` batched-bench curve + 227 ms / 35%-BW datum. | 227 | 384 | **333** | 41% | low | +| **P1/P4** MoE `mmq_x`↓ (patch 0015 → tighten to 8–16) | Free per-expert tile shrink (no re-read); reclaim the 3–6% fill waste, raise occupancy. | ~140 | ~297 | **~431** | ~53% | low | +| **P2** block-pad align + occupancy | Remove `need_check` tails, tighten stream-k; push toward the 80 ms floor. | ~100 | ~257 | **~498** | ~61% | med | +| **P3** co-land track A | act bucket (31 ms) folds away; GEMM at the ~80 ms floor. | 80 | **207** | **618** | **76% — CEILING** | low | + +MoE climb: **333 → ~431 → ~498 → 618** tok/s = **41% → 53% → 61% → 76% of vLLM 811.** **The 76% is the +hard ceiling from the GEMM track:** even a *perfect* weight-read-floor grouped GEMM leaves llama's +non-GEMM (GDN 38 + attn 8 + rest 81 = 127 ms) at **1.6× vLLM's whole ~78 ms non-GEMM**, so the step +cannot drop below ~207 ms. The remaining ~49 ms to vLLM's 158 ms step is elementwise + host-loop +(GDN state I/O is intrinsic and vLLM pays it identically — `GDN_DECODE_VERIFY.md`), **outside track B.** ### Explicitly NOT in scope (and why) -- A from-scratch W4A16 / CUTLASS collective — the FP4-MMA path already exists and is BW-optimal at - batch 1; rewriting repeats the W4A16 occupancy dead-end (`W4A16_MARLIN_KERNEL_PLAN.md`: STOPPED). -- Deep multi-stage `cp.async` / XOR-swizzle shared layouts — proven to collapse GB10 occupancy. -- The non-GEMM MoE residual (elementwise, host CUDA-graph, GDN bf16 state) — needed for MoE parity - but **separate tracks**; track B owns the GEMM only. + +- A from-scratch W4A16 / CUTLASS SM120 collective — repeats the STOPPED occupancy dead-end and + CUTLASS's grouped FP4 is broken on sm_121. +- Deep multi-stage `cp.async` / XOR-swizzle — proven to collapse GB10 occupancy. +- "Make activations 4-bit" — already W4A4; no work, no win there. +- The non-GEMM MoE residual (elementwise, host CUDA-graph, GDN bf16 state) — needed for MoE parity but + **separate tracks**; B owns the GEMM only. --- -## 5. Honest one-paragraph summary +## 7. The honest ceiling — does B reach TRUE PARITY? + +- **DENSE: TRUE PARITY is PLAUSIBLY REACHABLE, conditional, no margin.** The entire 2.42× gap is the + GEMM bucket; its ideal floor (66 ms) is 7× below the current 471 ms and is **bandwidth-bound, not + hardware-capped**. **B (GEMM → BW floor) + A (act-fuse) lands 376–394 tok/s = 90–103% of vLLM 391.** + The catch: it needs **~17–21% FP4-MMA efficiency at decode M=128**, and GB10 has only demonstrated + ~17% — and that at the *easier* prefill M=512 tile. It is a **reach, not a lock**, gated by the P2 + occupancy kill-gate and contingent on track A. **GO (conditional).** + +- **MoE: full parity is NOT reachable from track B.** Realistic ceiling **~76% of vLLM (618 vs 811)** + even with a perfect weight-read-floor grouped GEMM, because (1) the MoE floor is the hardest + grouped-GEMM regime (M≈4/expert, vLLM ships purpose-built Marlin-NvFp4) and (2) ~24% of the step is + non-GEMM outside this track. Worth doing (333 → ~618, a 1.85× and a real win), but it **cannot + deliver 811 alone.** **PARTIAL / NO-GO for parity-from-B.** + +- **The 273 GB/s is not the ceiling — the GB10 FP4-MMA occupancy efficiency is.** Decode M=128 is a + *different* regime from the dead W4A16 path: bandwidth/occupancy-bound (saturate LPDDR5x at a thin + M-tile via resident CTAs), not compute-throughput-bound (pack MMAs). The existing path is already at + the BW floor at batch 1 (88 ms), so the work is **keeping it bandwidth-bound as M grows to 128** + (occupancy via `mmq_y`↓ + shallow prefetch), a **tune of a working path**, not the greenfield + rewrite. The binding risk is whether that occupancy can be bought without tripping the GB10 wall — + which is exactly what the P2 kill-gate measures. + +**Bottom line for the "TRUE PARITY" ask:** GB10 **can** plausibly deliver **dense** decode parity with +vLLM via a tuned FP4-MMA decode GEMM **+ track A**, at the top of the demonstrated efficiency envelope +with no margin. GB10 **cannot** deliver **MoE** decode parity from the GEMM track alone (ceiling ~76%); +MoE parity is a B-plus-non-GEMM program. **Verdict: GO for dense (conditional, B+A, kill-gated), +PARTIAL for MoE.** + +--- + +## 8. One-paragraph summary The decode GEMM at M=128 is **bandwidth-bound on paper** (crossover M*≈611 ≫ 128) with weight-read floors 4–6× above vLLM, so **273 GB/s is not the wall** — but llama's FP4-MMA kernel runs at ~3% of -FP4 peak, putting it in **self-inflicted compute-bound territory** (471 ms vs a 66 ms floor). Closing -that is the entire dense gap: **track B (tune the FP4-MMA decode M-tile to the BW floor) + track A -(fuse act-quant)** plausibly reaches **90–103% of vLLM dense (391)** — TRUE PARITY is on the table for -dense, but only at the **top of the demonstrated GB10 FP4-efficiency envelope (~17–21%)** and with -**no margin**, gated by the occupancy wall. **MoE parity is not reachable from the GEMM alone** -(ceiling ~60–76% of 811), because its floor sits in the hardest grouped-GEMM regime and ~24% of its -step is non-GEMM work outside this track. Verdict: **GO for dense (conditional, B+A), PARTIAL for MoE.** +FP4 peak, in **self-inflicted compute-bound territory** (471 ms vs a 66 ms floor). The path is already +**W4A4** and already **beats vLLM at batch-1 prefill**, so the fix is **tuning the existing +`mul_mat_q`**, not a cutlass rewrite (a proven GB10 dead-end, and broken on sm_121 anyway). The +M-tile asymmetry sets the levers: **dense** is pinned at `mmq_x=128` (one weight read) so its occupancy +win is **`mmq_y`↓ + shallow prefetch** (BW-neutral), while **MoE**'s win is the free per-expert +**`mmq_x`↓** (patch 0015). **Track B (GEMM → BW floor) + track A (fuse act-quant)** plausibly reaches +**90–103% of vLLM dense (391)** — TRUE PARITY on the table for dense, but only at the **top of the +demonstrated GB10 FP4-efficiency envelope (~17–21%)**, with **no margin**, gated by the P2 occupancy +kill-gate. **MoE parity is not reachable from the GEMM alone** (ceiling ~76% of 811), because its floor +sits in the hardest grouped-GEMM regime and ~24% of its step is non-GEMM. **Verdict: GO for dense +(conditional, B+A), PARTIAL for MoE.**