mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-25 17:12:10 -04:00
docs(paged): build-ready track-B FP4-GEMM scope - kernel decision + per-phase decode_agg
Rewrite the track-B scope into the definitive build-ready plan for the
NVFP4 FP4-MMA decode GEMM toward vLLM GB10 parity. Source-read of the
mmq.cuh/mma.cuh/quantize.cu FP4 path on the dgx paged dev tree settles two
load-bearing facts the prior draft got partly wrong:
- llama's dense path is already TRUE W4A4 (block_fp4_mmq packs 256 e2m1
values + ue4m3 scales; the MMA is kind::mxf4nvf4 e2m1.e2m1...ue4m3), so
there is no activation-bit-width work to do; the whole dense deficit is
scheduling/occupancy.
- the mmq_x selector minimizes ntiles_x, which PINS dense decode at
mmq_x=128 (weights read once). Shrinking mmq_x re-reads the 18 GB
weights, so the dense occupancy lever is mmq_y-down (BW-neutral), NOT
mmq_x-down; MoE's free lever is the per-expert mmq_x-down (patch 0015).
Adds the explicit kernel-approach decision (tune the existing FP4-MMA
mul_mat_q; reject the cutlass-SM120 rewrite, dead on GB10 and broken on
sm_121; reject the BF16-Marlin descent), the concrete build-ready changes
(mmq_y/granularity/stream-k knobs, FP4-MMA fragment invariants, the
ue4m3 scale path, and the block_fp4_mmq y-tile ABI contract for the
track-A act-quant fusion handoff), the GB10-fit rules, the bit-exact
test-backend-ops gate with decode-shape + ragged-M cases, and per-phase
expected decode_agg tables.
Verdict (honest, roofline-grounded): the decode GEMM is bandwidth-bound on
the hardware roofline (M=128 << crossover 611; weight-read floors 4-6x
above vLLM) but compute-bound in practice at ~3% FP4 eff, so 273 GB/s is
not the wall. DENSE: GO (conditional) - B+A reaches 376-394 tok/s =
90-103% of vLLM 391, gated by a P2 occupancy kill-gate (<15% FP4 eff ->
parity off). MoE: PARTIAL/NO-GO - ceiling ~76% of 811 (618) from the GEMM
alone; full MoE parity needs the non-GEMM tracks too.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
@@ -1,271 +1,362 @@
|
||||
# Track B: the FP4-MMA weight-GEMM for GB10 decode parity with vLLM — roofline + go/no-go
|
||||
# Track B: the FP4-MMA weight-GEMM for GB10 decode parity with vLLM — build-ready scope + honest go/no-go
|
||||
|
||||
Scope only (build-ready plan + honest verdict). **Not implemented in this workflow.** This is the
|
||||
residual-kernel track after track A (fuse the standalone `quantize_mmq_nvfp4` activation-requant,
|
||||
the 8.2% bucket) is handled separately. Track B asks the load-bearing question and answers it
|
||||
quantitatively: at the decode batch shape (M≈128 tokens, NVFP4 weights), is the weight GEMM
|
||||
**compute-bound** (FP4-MMA throughput is the lever → parity reachable with a better kernel) or
|
||||
**bandwidth-bound** (273 GB/s weight-read is a hard floor → parity capped)? And given the prior
|
||||
GB10 occupancy history, can a better FP4-MMA decode GEMM actually reach vLLM's 391 (dense) / 811
|
||||
(MoE) tok/s, or only partway?
|
||||
Scope only (build-ready plan + honest verdict). **Not implemented in this workflow.** Track B is the
|
||||
residual-kernel track after track A (fuse the standalone `quantize_mmq_fp4` activation-requant, the
|
||||
8.2% decode bucket — tasks 38-41, the fused `rms_norm+mul+nvfp4-quant` producer + prequantized-MMQ
|
||||
consumer) is handled separately. Track B owns the **weight GEMM**, the ~59% bucket.
|
||||
|
||||
**The load-bearing question, restated:** at the decode batch shape (M≈128 tokens fused into one
|
||||
ubatch, NVFP4 weights), is the weight GEMM **compute-bound** (FP4-MMA throughput is the lever →
|
||||
parity reachable with a better kernel) or **bandwidth-bound** (273 GB/s weight-read is a hard floor →
|
||||
parity capped)? And given the GB10 occupancy history, can a better FP4-MMA decode GEMM actually reach
|
||||
vLLM's **391 (dense) / 811 (MoE)** decode-agg tok/s @npl128, or only partway?
|
||||
|
||||
Hardware: NVIDIA GB10 / DGX Spark, sm_121 (CC 1210 = `GGML_CUDA_CC_DGX_SPARK`), unified LPDDR5x.
|
||||
Dev tree `~/llama-paged-dev` (branch `paged`, build-cuda sm_121). All numbers are reasoned from the
|
||||
committed nsys decomposition + measured GB10 specs; **no new GPU benchmarks were run** (track A is on
|
||||
the box).
|
||||
committed nsys decomposition + measured GB10 specs + a source read of the FP4-MMA kernel; **no new GPU
|
||||
benchmarks were run** (track A is on the box).
|
||||
|
||||
## 0. The grounded inputs (measured, committed)
|
||||
## 0. Grounded inputs (measured, committed)
|
||||
|
||||
| quantity | value | source |
|
||||
|---|---|---|
|
||||
| LPDDR5x bandwidth (spec) | **273 GB/s** | `BLACKWELL_KERNEL_GAPS.md`, `VLLM_DECODE_GROUNDING.md` |
|
||||
| LPDDR5x bandwidth (achieved, batch-1) | **~216 GB/s** (19 GB / ~88 ms irreducible) | prior batch-1 weight-read study |
|
||||
| FP4 (NVFP4/MXFP4) dense peak | **~427–500 TFLOP/s** (2× BF16; GB10 is 1:1:2 BF16:INT8:FP4) | `BLACKWELL_KERNEL_GAPS.md` §2 (measured) |
|
||||
| BF16 peak | ~213 TFLOP/s | same |
|
||||
| Demonstrated GB10 FP4-MMA efficiency | **~17%** of FP4 peak at prefill M=512 (MXFP4 dense 1153 t/s); ~3–7% at decode; ~5% MoE | `BLACKWELL_KERNEL_GAPS.md` §6, `GDN_DECODE_VERIFY.md` |
|
||||
| Demonstrated GB10 INT8-MMQ efficiency | ~21% of BF16 peak | `BLACKWELL_KERNEL_GAPS.md` §3 |
|
||||
| Dense Qwen3.6-27B NVFP4 weights | **18.8 GB** file (`q36-27b-nvfp4.gguf`); ~18 GB matmul tensors | `du` on DGX |
|
||||
| MoE Qwen3.6-35B-A3B NVFP4 weights | **23.85 GB** file; ~22 GB read/step at npl128 (≈98% experts hit) | `du` on DGX |
|
||||
| Decode step decomposition (dense npl128, nsys, GPU 92.7% busy) | GEMM_weight **59.2%**, act_quant 8.2%, GDN(recurrent+conv) 10.4%, full-attn 1.8%, elementwise/norm/rope 13.5%, embed 2.9%, copy 1.8% | `GDN_DECODE_VERIFY.md` §3a |
|
||||
| Measured per-step times @npl128 | dense **~795 ms** (llama) → **~328 ms** (vLLM); MoE **~384 ms** → **~158 ms** | `VLLM_DECODE_GROUNDING.md` |
|
||||
| Aggregate decode @npl128 | dense 161 (llama) vs **391** (vLLM); MoE 333 vs **811** | `QWEN36_NVFP4_BENCH.md` |
|
||||
| LPDDR5x bandwidth (achieved, batch-1 weight read) | **~216 GB/s** (19 GB / ~88 ms irreducible) | prior batch-1 study |
|
||||
| FP4 (NVFP4/MXFP4) dense peak | **~427–500 TFLOP/s** (2× BF16; GB10 is 1:1:2 BF16:INT8:FP4) | `BLACKWELL_KERNEL_GAPS.md` §2 |
|
||||
| BF16 / INT8 peak | ~213 TFLOP/s / ~215 TOPS (INT8 == BF16 on GB10) | same §2 |
|
||||
| Demonstrated GB10 FP4-MMA efficiency | **~17%** of FP4 peak at prefill M=512 (MXFP4 dense 1153 t/s); **~3% dense / ~35%-of-BW MoE at decode** | `BLACKWELL_KERNEL_GAPS.md` §6, `GDN_DECODE_VERIFY.md` |
|
||||
| Dense Qwen3.6-27B NVFP4 weights | **18.8 GB** file; ~18 GB matmul tensors | `du` on DGX |
|
||||
| MoE Qwen3.6-35B-A3B NVFP4 weights | **23.85 GB** file; ~22 GB read/step @npl128 (~98% experts hit) | `du` on DGX |
|
||||
| Decode step decomposition (dense npl128, nsys, GPU 92.7% busy) | GEMM_weight **59.2%**, act_quant 8.2%, GDN 10.4%, full-attn 1.8%, elementwise/norm/rope 13.5%, embed 2.9%, copy 1.8% | `GDN_DECODE_VERIFY.md` §3a |
|
||||
| Measured per-step @npl128 | dense **~795 ms** (llama) → **~328 ms** (vLLM); MoE **~384 ms** → **~158 ms** | `VLLM_DECODE_GROUNDING.md` |
|
||||
| Aggregate decode @npl128 (the parity scoreboard) | dense **161** (llama) vs **391** (vLLM); MoE **333** vs **811** | `QWEN36_NVFP4_BENCH.md` |
|
||||
|
||||
Crossover formula used throughout (per-GEMM and whole-model are identical):
|
||||
`M* = b · peak / (2 · BW)` where `b` = bytes per weight element. Below `M*` the GEMM is
|
||||
bandwidth-bound; above it, compute-bound.
|
||||
`decode_agg = npl / step_s = 128 / step_s`. Crossover formula throughout:
|
||||
`M* = b · peak / (2 · BW)`, `b` = bytes per weight element. Below `M*` bandwidth-bound, above it
|
||||
compute-bound.
|
||||
|
||||
---
|
||||
|
||||
## 1. DENSE Qwen3.6-27B — the roofline at decode M=128
|
||||
## 1. The kernel-approach decision: TUNE the existing FP4-MMA `mul_mat_q`, do NOT write a cutlass kernel
|
||||
|
||||
`b = 18e9 B / 27e9 params = 0.667 B/param`. FLOPs/step `= 2·M·P = 2·128·27e9 = 6.91 TFLOP`.
|
||||
This is the first thing track B must settle, and the evidence settles it decisively.
|
||||
|
||||
**(a) Weight-read floor** (weights read ONCE for all 128 tokens):
|
||||
- @273 GB/s: 18 GB / 273 = **65.9 ms/step → 1,942 tok/s ceiling**
|
||||
- @216 GB/s (achieved): 18 / 216 = **83 ms/step → 1,542 tok/s**
|
||||
| option | verdict | why |
|
||||
|---|---|---|
|
||||
| **(A) Tune the existing `mul_mat_q<NVFP4>` FP4-MMA path** | **CHOSEN — the tractable spine** | The kernel already exists, is **bit-exact** (`test-backend-ops MUL_MAT` 1103/1103), is genuine **W4A4** (below), and already **beats vLLM at batch-1 prefill** (MXFP4 1153 t/s vs vLLM's 800 W4A16 — vLLM has no FP4 cubins on sm_121). The deficit is **decode-shape scheduling**, not the math op. Host-side selection + a bounded occupancy tune respects the GB10 lessons and is build-ready against known files/lines. |
|
||||
| **(B) New cutlass-style SM120 FP4 collective** | **REJECTED** | Repeats the **proven GB10 dead-end**: the from-scratch W4A16 BF16 GEMM hit only ~9–15 TFLOP/s (¼ of MMQ) and was **STOPPED** (`W4A16_MARLIN_KERNEL_PLAN.md`) because deep `cp.async` + XOR-swizzle **collapse GB10 occupancy**. Worse, **CUTLASS's own SM120 grouped block-scaled FP4 GEMM is broken on consumer Blackwell** (garbage/init-fail — CUTLASS #3096/#2800) — it is the exact reason vLLM falls back to **BF16 Marlin** for its MoE on sm_121. "Port cutlass" is not even a working option for the MoE arm. |
|
||||
| **(C) Marlin-style W4A16 (FP4→BF16 dequant + BF16 HMMA)** | **REJECTED for the win, noted for context** | This is what **vLLM's MoE actually runs** on sm_121 (W4A16, BF16 activations, dequant-in-mainloop). On GB10 **INT8 == BF16 == ½ FP4 rate**, so a BF16-HMMA path concedes the 2× FP4 advantage llama already has. We do not want to *descend* to vLLM's slower arithmetic class; we want to keep the FP4-MMA class and schedule it better. |
|
||||
|
||||
**(b) Compute floor:**
|
||||
- @FP4 peak 500 TF/s: 6.91 / 500 = **13.8 ms → 9,275 tok/s**
|
||||
- @17% FP4 (85 TF/s, the demonstrated prefill ceiling): 81 ms → 1,580 tok/s
|
||||
- @5% FP4 (25 TF/s, measured decode regime): 276 ms → 464 tok/s
|
||||
**Decision: track B = tune `mul_mat_q<NVFP4>` (dense, `mmq.cu`/`mmq.cuh`) + the grouped `mul_mat_q`
|
||||
id-branch (MoE, `mmid.cu` + the same `mmq.cuh`).** No new kernel, no rewrite, no descent to BF16.
|
||||
The win is kernel *engineering around an FP4-MMA llama already possesses*, so there is **no
|
||||
hardware-instruction wall** — but it is gated by whether MMQ's occupancy-bound design can be pushed
|
||||
to the bandwidth floor at the thin decode M-tile.
|
||||
|
||||
**(c) Crossover:**
|
||||
- At FP4 **peak**: `M* = 0.667·500e12 / (2·273e9) = 611`. **M=128 ≪ 611 → an ideal FP4 GEMM at decode is BANDWIDTH-BOUND.**
|
||||
- At the kernel's **achieved** efficiency the effective peak collapses, dragging `M*` down: 17% → M*≈104; 5% → M*≈30. So **at its current ~3–7% efficiency the kernel is COMPUTE-BOUND at M=128** (limited by its own poor FP4-MMA throughput), even though the hardware says it should be bandwidth-bound.
|
||||
### What "the existing path" actually is (source-read, DGX `ggml/src/ggml-cuda/`)
|
||||
|
||||
**Where llama actually sits:** GEMM = 59.2% × 795 ms = **471 ms**. Achieved = 6.91e12 / 0.471 =
|
||||
**14.7 TFLOP/s = 2.9% of FP4 peak**. That is **7.1× slower than the 66 ms weight-read floor** and
|
||||
matches the ~3–7% decode-efficiency band. The 471 ms is not a hardware bandwidth wall — it is the
|
||||
FP4-MMA kernel running deep in compute-bound territory at single-digit efficiency.
|
||||
Decode runs **one `mul_mat_q` per weight, M=128** (all 128 slots' single tokens fused into one
|
||||
ubatch — confirmed `mul_mat_q(M=128)` in `GDN_DECODE_VERIFY.md`, not 128× M=1). The NVFP4 path:
|
||||
`mmq.cu` `use_native_fp4` gate (L125) → `quantize_mmq_fp4_cuda` act-quant (L138 dense / L200 id;
|
||||
**track A's fuse target**) → `mul_mat_q` → `vec_dot_fp4_fp4_mma` (`mmq.cuh:997`) →
|
||||
`mma_block_scaled_fp4` (`mma.cuh:1126`).
|
||||
|
||||
**Where vLLM sits:** step 328 ms → if its native-FP4 cutlass GEMM is at the ~66 ms BW floor, the
|
||||
GEMM is only ~20% of vLLM's step; the rest (~262 ms) is GDN + full-attn + host. vLLM's **whole step
|
||||
(328 ms) ≈ llama's GEMM bucket alone (471 ms)** minus a bit. The entire 2.42× gap is the GEMM.
|
||||
**Confirmed W4A4 (this corrects an earlier "A is 8-bit-class" framing):** `block_fp4_mmq`
|
||||
(`mmq.cuh:53`) is `uint32_t d4[4]` (four `ue4m3` block scales) + `int8_t qs[4*32]` = **256 FP4 (e2m1)
|
||||
values packed 2-per-byte**. `quantize_mmq_fp4_cuda` (`quantize.cu:422`) emits FP4 via
|
||||
`ggml_cuda_float_to_fp4_e2m1`. The MMA is
|
||||
`mma.sync.aligned.kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue4m3`
|
||||
(`mma.cuh:1145`) — **both operands e2m1, ue4m3 block scales**. So llama's dense FP4-MMA path is
|
||||
already the *same arithmetic class as vLLM's cutlass W4A4 dense*. The `sizeof(block_fp4_mmq) ==
|
||||
sizeof(block_q8_1_mmq)` static_assert is a shared-tile-footprint convention, **not** an 8-bit
|
||||
activation. **Consequence: there is no "make activations 4-bit" work to do and no activation-traffic
|
||||
halving to win — that is already banked. The entire dense deficit is scheduling/occupancy.**
|
||||
|
||||
**Dense parity arithmetic** (795 ms = GEMM 471 + act 65 + GDN 83 + attn 14 + rest 162):
|
||||
- B alone (GEMM → 66 ms BW floor, requires ~21% FP4 eff): step 728→… = 66+65+83+14+162 = **390 ms → 328 tok/s = 84% of vLLM**.
|
||||
- **B + A** (GEMM 66 ms floor **and** act-quant fused away): 66+83+14+162 = **325 ms → 394 tok/s = 101% of vLLM → PARITY/BEAT.**
|
||||
- B+A at the softer 17% FP4 (GEMM 81 ms, the *demonstrated* prefill ceiling, not the 21% floor): 340 ms → **376 tok/s = 96% of vLLM.**
|
||||
|
||||
**Dense robust band: 90–103% of vLLM**, and it is insensitive to the 273-vs-216 GB/s uncertainty
|
||||
(at 216 GB/s the floor is 83 ms → step 357 ms → 359 tok/s = 92%). The conclusion holds.
|
||||
|
||||
---
|
||||
|
||||
## 2. MoE Qwen3.6-35B-A3B — the roofline at decode M=128
|
||||
|
||||
At npl128, 128 tokens × top-8 over 256 experts ⇒ P(expert unused) = (1−8/256)^128 ≈ 1.7%, so
|
||||
**~98% of experts are read** → ~22 GB/step (essentially the full weight set), the same
|
||||
weight-read regime as dense. The grouped GEMM (`mmid.cu` / `mul_mat_q` id-branch) reads each
|
||||
routed expert's weight **once** for the ~128·8/256 = **4 tokens/expert** on average.
|
||||
|
||||
**(a) Weight-read floor:**
|
||||
- @273 GB/s: 22 / 273 = **80.6 ms → 1,588 tok/s**
|
||||
- @216 GB/s: 102 ms → 1,255 tok/s
|
||||
|
||||
**(b) Compute floor:** only ~3B active params/token → FLOPs = 2·128·3e9 = 0.77 TFLOP → 1.5 ms @peak.
|
||||
**Trivial.** MoE decode is **purely bandwidth/occupancy bound**, never compute-bound. The hard part
|
||||
is that per-expert M ≈ 4: the grouped GEMM must saturate ~273 GB/s while feeding tiny ragged M-tiles
|
||||
— the regime where ggml's dense-tuned `mmq_x=128` underfills (see `MOE_GROUPED_GEMM_SCOPE.md`).
|
||||
|
||||
**Where llama sits:** GEMM = 59% × 384 = **227 ms** → effective BW 22 GB / 0.227 s =
|
||||
**97 GB/s = 35% of 273** (less compute-bound than dense, but only 1/3 of peak bandwidth — an
|
||||
occupancy/tile-fill loss, exactly the `MOE_GROUPED_GEMM_SCOPE.md` M-tile finding).
|
||||
|
||||
**Where vLLM sits:** step 158 ms ≈ GEMM at the ~80 ms floor (grouped Marlin-NvFp4, 51% of its step)
|
||||
+ ~78 ms non-GEMM. So vLLM is already pushing the MoE bandwidth floor.
|
||||
|
||||
**MoE parity arithmetic** (384 ms = GEMM 227 + act 31 + GDN 38 + attn 8 + rest 81):
|
||||
- B + A, GEMM → 80 ms floor + act fused: 80+38+8+81 = **207 ms → 618 tok/s = 76% of vLLM.**
|
||||
- This is the **ceiling from the GEMM track**: even with a *perfect* MoE weight-read-floor GEMM,
|
||||
llama's non-GEMM (GDN 38 + attn 8 + rest 81 = 127 ms) is **1.6× vLLM's whole non-GEMM (~78 ms)**,
|
||||
so the step cannot drop below ~207 ms. To reach vLLM's 158 ms needs the non-GEMM buckets too
|
||||
(GDN state I/O is intrinsic and vLLM pays it identically — `GDN_DECODE_VERIFY.md` — so the
|
||||
remaining ~49 ms is elementwise + host loop, **outside track B**).
|
||||
|
||||
**MoE band from B+A: ~60–76% of vLLM.** Full MoE parity is **not reachable from the GEMM alone.**
|
||||
|
||||
---
|
||||
|
||||
## 2b. The precise code-level inefficiencies (source-read, the "why slower than vLLM")
|
||||
|
||||
Decode runs **one `mul_mat_q` per weight, M=128** (all 128 slots' single tokens are fused into one
|
||||
ubatch — confirmed `mul_mat_q(M=128)` in `GDN_DECODE_VERIFY.md`, not 128 × M=1). The NVFP4 path:
|
||||
`mmq.cu` `use_native_fp4` gate → `quantize_mmq_fp4_cuda` (act-quant) → `mul_mat_q` →
|
||||
`vec_dot_fp4_fp4_mma` (`mmq.cuh:997`) → `mma_block_scaled_fp4` (`mma.cuh:1126`, PTX
|
||||
`mma.sync...kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue4m3`).
|
||||
Geometry: `get_mmq_x_max=128`, `mmq_y=128`, `nwarps=256/32=8`, `iter_k=MMQ_ITER_K_FP4=512`. Tiles:
|
||||
Geometry (`vec_dot_fp4_fp4_mma`): `MMQ_NWARPS=8`, `iter_k=MMQ_ITER_K_FP4=512`, tiles
|
||||
`tile_A<16,8,int>` (weights, 16 N-rows × 64 FP4-in-K), `tile_B<8,8,int>` (acts, 8 M-cols × 64
|
||||
FP4-in-K), `tile_C<16,8,float>` (16 N-rows × 8 M-cols), `nfrags=32/8=4`.
|
||||
|
||||
1. **Separate activation-quant pass (track A's target).** `quantize_mmq_fp4_cuda` writes the *entire*
|
||||
activation tensor to `block_fp4_mmq` in a standalone kernel before `mul_mat_q`. vLLM fuses
|
||||
`scaled_fp4_quant` into the preceding RMSNorm/SiLU epilogue (`rms_quant_fusion`/`act_quant_fusion`)
|
||||
— no separate pass, no extra activation read+write+launch. 8.2% of the npl128 step. **B must consume
|
||||
A's in-place `block_fp4_mmq` y-tile** so the fusion saves the round-trip, not just the launch.
|
||||
|
||||
2. **No weight-load software pipeline → exposed latency at thin M (the #1 kernel lever).**
|
||||
`load_tiles_nvfp4_nvfp4` does plain shared stores → `__syncthreads` → `vec_dot_fp4_fp4_mma`
|
||||
(`load_ldmatrix` + MMAs): a **load→sync→compute→repeat** cadence with **no `cp.async`
|
||||
double-buffering** overlapping the next K-block weight load with the current MMA. At M=128 the per-
|
||||
tile MMA work is small (8 M-cols per `tile_C::J`), so serialized weight-load latency dominates →
|
||||
the ~3% (dense) / 35%-of-BW (MoE) result. vLLM's Marlin runs a 4-stage `cp.async` pipeline. **The
|
||||
defining caveat:** a *deep* pipeline + XOR-swizzle collapses GB10 occupancy
|
||||
(`W4A16_MARLIN_KERNEL_PLAN.md`); the fix is a **shallow 2-stage prefetch + skew-pad**, not Marlin's 4.
|
||||
|
||||
3. **`mmq_x` selector maximizes the M-tile — the opposite of the GB10 occupancy rule.**
|
||||
`mul_mat_q_case` picks `mmq_x` by *minimizing* `ntiles_x = ceil(ncols_max/mmq_x)`, so it always
|
||||
takes the *largest* tile that fits shared. Dense decode → `mmq_x=128`, `mmq_y=128`: a heavy 128×128
|
||||
tile (8 warps) → low occupancy on the occupancy-dominated GB10. No padding waste and no redundant
|
||||
weight read (`ntx=1` → each weight row-tile read once), so the loss is pure occupancy; a smaller
|
||||
`mmq_x` with more resident CTAs may hide load latency better (P1 host-only sweep, zero kernel risk).
|
||||
|
||||
4. **MoE per-expert M-tile waste (the structural MoE gap).** Stock applies the 128-wide tile *per
|
||||
expert*; per-expert density is ~4 tokens (top-8 of 256 @npl128), so the 128-wide accumulator is
|
||||
~3% filled and only ~1 `tile_C` N-fragment is live (`tile_C::J=8`), the rest masked `need_check`
|
||||
tails. Patch 0015 (`MOE_DENSITY_AUTO_TILE.md`) auto-caps to 64 at decode, but the ideal is
|
||||
~tokens/expert ≈ 8 — even 64 is ~8× too big. vLLM uses a small per-expert `BLOCK_SIZE_M` (16/32).
|
||||
At ≤1 col-tile/expert a smaller tile costs **no** extra weight re-read → strictly occupancy-positive.
|
||||
(Inefficiency 4 is the MoE arm of 3; at dense M=128, 128/8=16 N-frags are fully used — no dense
|
||||
M-waste.)
|
||||
|
||||
5. **`iter_k=512` (FP4) vs 256 couples to occupancy.** The FP4 main loop stages 512 K-elements/iter →
|
||||
larger shared footprint → fewer iters but more pressure on the occupancy-bound part. A P5 knob.
|
||||
|
||||
**Ruled out (so B does not chase them):** redundant weight reads (none — dense `ntx=1`, MoE ≤1
|
||||
col-tile/expert; the low effective BW is latency/occupancy, not re-reads); stream-K fixup (it *helps*
|
||||
fill the small GB10 grid, cheap at thin M); raw FP4-MMA peak rate (the path already beats Q4-MMQ and
|
||||
is BW-bound at batch 1 — at M=128 latency-hiding binds first, not MMA throughput).
|
||||
FP4-in-K), `tile_C<16,8,float>` (16 N-rows × 8 M-cols), `nfrags = MMQ_TILE_NE_K/tile_A::J`. The M loop
|
||||
is `for (j0=0; j0<mmq_x; j0 += ntx*tile_C::J)` — M tiled in steps of `tile_C::J=8`.
|
||||
|
||||
---
|
||||
|
||||
## 3. The load-bearing verdict
|
||||
## 2. The roofline — answering the load-bearing question
|
||||
|
||||
**Q: compute-bound or bandwidth-bound at M=128?**
|
||||
At the **hardware** roofline the decode GEMM is **bandwidth-bound** (M=128 ≪ crossover 515–611).
|
||||
At the **current kernel's** ~3–7% FP4 efficiency it is **compute-bound by its own inefficiency**
|
||||
(effective M*≈30). The two weight-read floors — **dense ~1,940 tok/s, MoE ~1,590 tok/s** — both sit
|
||||
**4–6× ABOVE vLLM's 391/811.** So **the 273 GB/s bandwidth is NOT the wall at the parity target.**
|
||||
There is large bandwidth headroom; the gap is the FP4-MMA kernel achieving single-digit % of peak
|
||||
where the roofline permits ~20%+ before bandwidth even binds.
|
||||
**Answer: BANDWIDTH-bound on the hardware roofline, but COMPUTE-bound in practice by the kernel's own
|
||||
under-occupancy. The 273 GB/s is NOT the wall at the parity target.**
|
||||
|
||||
**Q: can a better FP4-MMA GEMM reach vLLM — TRUE PARITY?**
|
||||
### 2a. DENSE Qwen3.6-27B, M=128
|
||||
|
||||
- **DENSE: parity is PLAUSIBLY REACHABLE, but at the edge of the demonstrated envelope.** The entire
|
||||
2.42× gap is the GEMM bucket; its ideal floor (66 ms) is 7× below the current 471 ms and is
|
||||
bandwidth-bound, not hardware-capped. **B (GEMM → BW floor) + A (act-fuse) lands 376–394 tok/s ≈
|
||||
vLLM's 391 (90–103%).** The catch: hitting the floor needs **~21% FP4-MMA efficiency at decode
|
||||
M=128**, and GB10 has only ever demonstrated ~17% (and that at prefill M=512, a *larger, easier*
|
||||
tile). Decode M=128 is a smaller M than prefill, so the same kernel must hold efficiency at a
|
||||
thinner tile. This is a **reach, not a lock**: parity is on the table but with **no comfortable
|
||||
margin** and **contingent on track A landing too**.
|
||||
`b = 18e9/27e9 = 0.667 B/param`; FLOPs/step `= 2·128·27e9 = 6.91 TFLOP`.
|
||||
|
||||
- **MoE: full parity is NOT reachable from track B.** Realistic ceiling **~60–76% of vLLM** (618 vs
|
||||
811) even with a perfect weight-read-floor grouped GEMM, because (1) the MoE GEMM floor at M≈4/expert
|
||||
demands near-**full** BW saturation in the hardest grouped-GEMM regime, where llama is at 35% of peak
|
||||
BW and vLLM ships a purpose-built grouped Marlin-NvFp4, and (2) ~24% of the residual is non-GEMM
|
||||
(elementwise + host loop) outside track B. MoE parity needs B **plus** the non-GEMM tracks.
|
||||
- **Weight-read floor** (18 GB read ONCE for all 128 tokens): @273 GB/s = **65.9 ms → 1,942 tok/s**;
|
||||
@216 GB/s = 83 ms → 1,542 tok/s.
|
||||
- **Crossover** at FP4 peak: `M* = 0.667·500e12/(2·273e9) = 611`. **M=128 ≪ 611 → an ideal FP4 GEMM
|
||||
at decode is BANDWIDTH-bound.** At the kernel's *achieved* ~3% efficiency the effective peak
|
||||
collapses and drags M* to ≈30, putting the *current* kernel in self-inflicted compute-bound
|
||||
territory.
|
||||
- **Where llama sits:** GEMM = 59.2% × 795 ms = **471 ms = 14.7 TFLOP/s = 2.9% of FP4 peak = 7.1×
|
||||
slower than the 66 ms weight-read floor.** Not a bandwidth wall — a kernel running deep in
|
||||
compute-bound territory at single-digit efficiency.
|
||||
- **Where vLLM sits:** step 328 ms ≈ llama's GEMM bucket (471 ms) alone. The **entire 2.42× gap is
|
||||
the GEMM.**
|
||||
|
||||
**Q: the GB10 occupancy wall — does it cap this?** Yes, it is the binding constraint, not bandwidth.
|
||||
History (`W4A16_MARLIN_KERNEL_PLAN.md`, `BLACKWELL_KERNEL_GAPS.md`): the from-scratch W4A16 BF16 GEMM
|
||||
hit only ~9–15 TFLOP/s (¼ of MMQ) because deep `cp.async` pipelines + XOR-swizzle **collapse GB10
|
||||
occupancy**; skew-pad + small-shared + high-occupancy won. **Crucially, decode M=128 is a different
|
||||
regime from that dead path:** it is bandwidth/occupancy-bound, not compute-throughput-bound, so the
|
||||
lever is **saturating LPDDR5x at a thin M-tile via occupancy**, not packing MMAs. The existing
|
||||
FP4-MMA path (`block_fp4_mmq` / `vec_dot_fp4_fp4_mma`) is **already at the BW floor at batch 1**
|
||||
(88 ms irreducible) — so the kernel *can* saturate bandwidth at M=1; the work is keeping it
|
||||
bandwidth-bound as M grows to 128 instead of degrading to compute-bound at 3% efficiency. That is a
|
||||
**tune/fix of a working path**, not the dead greenfield W4A16 rewrite.
|
||||
### 2b. MoE Qwen3.6-35B-A3B, M=128
|
||||
|
||||
### Go / No-Go
|
||||
@npl128, 128 tok × top-8 / 256 experts ⇒ ~98% experts read ⇒ ~22 GB/step (the full weight set), per-
|
||||
expert M ≈ **4 tokens**.
|
||||
|
||||
- **DENSE — GO (conditional).** Build track B as a **decode-M-tile tune of the existing
|
||||
`mul_mat_q<NVFP4>` FP4-MMA kernel**, co-delivered with track A. Honest expectation: **90–103% of
|
||||
vLLM (parity within error), not a guaranteed beat.** Go condition: it is contingent on reaching
|
||||
~17–21% FP4 efficiency at M=128 (top of the demonstrated GB10 envelope) — set a P2 kill-gate
|
||||
(below).
|
||||
- **MoE — PARTIAL / NO-GO for parity-from-B.** Track B (the M-tile work already scoped in
|
||||
`MOE_GROUPED_GEMM_SCOPE.md`) buys MoE → ~60–76% of vLLM and is worth doing, but **cannot deliver
|
||||
MoE parity by itself**; do not promise 811. Full MoE parity requires B + the non-GEMM tracks
|
||||
(elementwise/host CUDA-graph, GDN state I/O bf16) and is a multi-track effort.
|
||||
- **Weight-read floor:** 22/273 = **80.6 ms → 1,588 tok/s** (@216: 102 ms → 1,255).
|
||||
- **Compute floor:** only ~3B active params ⇒ 0.77 TFLOP ⇒ 1.5 ms @peak — **trivial. MoE decode is
|
||||
purely bandwidth/occupancy-bound, never compute-bound.** The hard part is saturating 273 GB/s while
|
||||
feeding ragged M≈4 tiles.
|
||||
- **Where llama sits:** GEMM = 59% × 384 = **227 ms = 97 GB/s = 35% of peak BW** (occupancy/tile-fill
|
||||
loss, not compute).
|
||||
- **Where vLLM sits:** step 158 ms ≈ grouped Marlin-NvFp4 at the ~80 ms floor + ~78 ms non-GEMM —
|
||||
already pushing the MoE BW floor.
|
||||
|
||||
**Bottom line for the "TRUE PARITY" ask:** GB10 **can** plausibly deliver **dense** decode parity
|
||||
with vLLM via a tuned FP4-MMA decode GEMM **+ track A**, at the edge of the demonstrated efficiency
|
||||
envelope and with no margin. GB10 **cannot** deliver **MoE** decode parity from the GEMM track
|
||||
alone (ceiling ~76%); MoE parity is a B-plus-non-GEMM program. The hardware (273 GB/s) is **not** the
|
||||
ceiling — the GB10 FP4-MMA occupancy efficiency is, and it is a "reach" for dense and a "partial" for
|
||||
MoE.
|
||||
**Both weight-read floors (dense ~1,940, MoE ~1,590 tok/s) sit 4–6× ABOVE vLLM's 391/811. Bandwidth
|
||||
is not the wall; the GB10 FP4-MMA occupancy efficiency is.**
|
||||
|
||||
---
|
||||
|
||||
## 4. Build-ready plan (do NOT implement here)
|
||||
## 3. The code-level inefficiencies, and the M-tile asymmetry that drives the whole plan
|
||||
|
||||
The kernels already exist; track B is a **tune + fuse of the FP4-MMA `mul_mat_q` path at the decode
|
||||
M-tile**, not a new kernel. This respects every GB10 occupancy lesson (small shared, high occupancy,
|
||||
skew-pad, stay on `block_fp4_mmq`; never deep `cp.async` / XOR-swizzle).
|
||||
The selection is `mul_mat_q_case` (`mmq.cuh:4108`): it loops `mmq_x = 8..mmq_x_max(=128) step 8` and
|
||||
keeps the `mmq_x` that **minimizes `ntiles_x = ceil(ncols_max/mmq_x)`**, stopping at `ntiles_x==1`.
|
||||
`mmq_y` (the weight-row tile) is pinned at **128** by `get_mmq_y_host` (L143). This produces the
|
||||
single most important structural fact for track B:
|
||||
|
||||
### Files (DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`)
|
||||
- `mmq.cuh` — `block_fp4_mmq` (L53), `load_tiles_nvfp4_nvfp4` (L948), `vec_dot_fp4_fp4_mma` (L997),
|
||||
the stream-k `mul_mat_q` kernel + `mul_mat_q_case` / `launch_mul_mat_q` tile selection (~L3320–4055,
|
||||
all under `BLACKWELL_MMA_AVAILABLE`).
|
||||
- `mmq.cu` — dense + id dispatch; `use_native_fp4` gate (L125), `quantize_mmq_fp4_cuda` act-quant
|
||||
(L138/L200 — **track A's fuse target**).
|
||||
- `mmid.cu` — `mm_ids_helper` MoE token-sort (the MoE M-tile lever, scoped in `MOE_GROUPED_GEMM_SCOPE.md`).
|
||||
> **`mmq_x` tiles M (tokens / output columns) — shrinking it RE-READS the weights `ntiles_x` times.
|
||||
> `mmq_y` tiles N (weight rows / output rows) — shrinking it does NOT re-read weights (each weight row
|
||||
> lives in exactly one row-tile); it only lowers shared footprint and raises occupancy.** The two
|
||||
> regimes pick opposite knobs:
|
||||
|
||||
### Phases (each ends with: `test-backend-ops -o MUL_MAT[/_ID] -b CUDA0` bit-exact + a decode bench)
|
||||
| | dense decode (M=128, no `expert_bounds`) | MoE decode (per-expert M≈4) |
|
||||
|---|---|---|
|
||||
| selection picks | `mmq_x=128` → `ntiles_x=1` → **weights read ONCE** (the one-read optimum) | `mmq_x=128` applied **per expert** → tile ~3% filled |
|
||||
| shrink `mmq_x`? | **NO — re-reads 18 GB ×`ntiles_x`**, fatal in the BW-bound regime | **YES, FREE** — 1 col-tile/expert regardless, no re-read → strictly occupancy-positive |
|
||||
| FP4-MMA M-frag fill | **full** (128/`tile_C::J`=16 frag-groups, all live) → no fragment waste | **wasted** (~1 of 8/16 frag-groups live, rest masked tails) |
|
||||
| BW-neutral occupancy lever | **`mmq_y`↓** (more resident CTAs, weights still read once) — kernel-structure change | **`mmq_x`↓** (toward density ≈8) — host-side template switch |
|
||||
| dominant loss | **occupancy** at the heavy 128×128 tile (exposed weight-load latency) | **tile-fill** (dense-tuned M-tile applied to ragged M≈4) |
|
||||
|
||||
| Phase | Work | Expected payoff | Risk |
|
||||
|---|---|---|---|
|
||||
| **P0** harness | Capture per-shape baseline at the **decode shape** (`test-backend-ops perf -o MUL_MAT`, type NVFP4, **n=128**, FFN K/N) + nsys decode window. Lock 1103/1103 parity + the 14.7 TFLOP/s baseline. Decode-M is the canonical target, not prefill n=512. | None (gate). | Low |
|
||||
| **P1** decode M-tile selection (dense) | In `mul_mat_q_case`/`launch_mul_mat_q`, pick `mmq_x`/`mmq_y` from the **decode M=128** shape rather than the prefill-tuned config. M=128 with FP4 N-frag 8 wants a small, occupancy-friendly tile; the prefill `mmq_x=128` likely underfills SM occupancy at decode. Host-side template selection, **zero new kernel**, mirrors `MOE_GROUPED_GEMM_SCOPE.md` [1]. | Lift dense FP4 eff from ~3% toward 10–17%; no extra weight read (one col-tile). | Low |
|
||||
| **P2** occupancy/pipeline tune | Sweep warps/tile/skew-pad on the FP4-MMA decode kernel to push toward the **66 ms BW floor (~21% FP4 eff)**. Honor GB10 rules: small shared, high occupancy, skew-pad +4, **no** deep cp.async / XOR-swizzle. **KILL-GATE:** if decode FP4 eff plateaus < ~15% (GEMM > ~110 ms) after the sweep, dense parity is off — stop and report partial. | The dense parity make-or-break. Target GEMM 471→66–81 ms. | **Med-high** (the occupancy wall is real; ncu unavailable on DGX → empirical sweep only) |
|
||||
| **P3** co-land track A | Verify the fused act-quant (track A) composes with the tuned GEMM (the requant folds into the FP4 GEMM prologue, removing the 8.2% bucket). | Dense 376–394 tok/s = 90–103% vLLM. | Low (track A owns the fuse) |
|
||||
| **P4** MoE M-tile | Land the `MOE_GROUPED_GEMM_SCOPE.md` expert-aware `mmq_x` ([1]) + block-pad align ([2]). | MoE → ~60–76% vLLM (not parity). | Med |
|
||||
This asymmetry is the spine of the plan: **MoE's lever is host-only `mmq_x`↓ (already landed as patch
|
||||
0015 auto-cap→64; ideal ≈8–16); dense's lever is `mmq_y`↓ + occupancy, a bounded kernel change.**
|
||||
|
||||
### Parity gate (every phase)
|
||||
`GGML_CUDA_*` flag set and unset → `test-backend-ops test -o MUL_MAT -b CUDA0` = **1103/1103**,
|
||||
byte-identical when unset. Add **decode-shape (n=128) + ragged small-M** cases if absent. End-to-end:
|
||||
`llama-batched-bench -fa on -npp 512 -ntg 256 -npl 128` on `q36-27b-nvfp4.gguf`, confirm decode
|
||||
agg climbs toward ~376–394 and stays bit-stable vs the CPU oracle (within the GB10 greedy-decode
|
||||
non-determinism band). All bench/parity scripts **dev-tree-only**.
|
||||
The five inefficiencies, ranked:
|
||||
|
||||
1. **Separate activation-quant pass (track A's bucket, 8.2%).** `quantize_mmq_fp4_cuda` writes the
|
||||
whole activation tensor to `block_fp4_mmq` in a standalone kernel; vLLM fuses `scaled_fp4_quant`
|
||||
into the preceding RMSNorm/SiLU epilogue. **Handoff (track A → B):** B must consume A's prequantized
|
||||
`block_fp4_mmq` y-tile in place of calling `quantize_mmq_fp4_cuda`, so the fusion saves the
|
||||
activation round-trip, not just the launch (see §4.4).
|
||||
|
||||
2. **No weight-load software pipeline → exposed latency at thin M (the #1 dense kernel lever).**
|
||||
`load_tiles_nvfp4_nvfp4` (`mmq.cuh:946`) does plain global→shared stores → `__syncthreads` →
|
||||
`vec_dot_fp4_fp4_mma` (`load_ldmatrix` of A + MMA): a **load→sync→compute→repeat** cadence with **no
|
||||
`cp.async` double-buffering** overlapping the next k-block weight load with the current MMA. At
|
||||
M=128 the per-tile MMA work is small, so serialized weight-load latency dominates → 2.9% (dense) /
|
||||
35%-of-BW (MoE). **Caveat (the GB10 wall):** a *deep* pipeline + XOR-swizzle collapses GB10
|
||||
occupancy (`W4A16_MARLIN_KERNEL_PLAN.md`). The fix is **occupancy-first** (raise resident CTAs to
|
||||
hide latency via CTA-parallelism), **shallow 2-stage prefetch second**, never Marlin's 4-stage.
|
||||
|
||||
3. **`mmq_x` maximized for dense = occupancy-heavy, but pinned by the one-read constraint.** At dense
|
||||
decode the 128×128 tile (8 warps, large shared) is low-occupancy on the occupancy-dominated GB10 —
|
||||
but you cannot shrink `mmq_x` without doubling the 18 GB weight read. So the dense occupancy fix is
|
||||
**`mmq_y`↓** (BW-neutral), not `mmq_x`↓.
|
||||
|
||||
4. **MoE per-expert M-tile waste (the structural MoE gap).** The 128-wide (or patch-0015 64-wide)
|
||||
tile is applied per expert at density ≈4, so the accumulator is ~3–6% filled and ~1 `tile_C` frag-
|
||||
group is live, the rest masked `need_check` tails. Ideal `mmq_x` ≈ tokens/expert ≈ 8 (= `tile_C::J`).
|
||||
At ≤1 col-tile/expert this costs **no** extra weight read → strictly occupancy-positive. (This is
|
||||
the MoE arm of inefficiency 3; scoped in `MOE_GROUPED_GEMM_SCOPE.md`.)
|
||||
|
||||
5. **`iter_k=512` (FP4) couples to occupancy.** The FP4 main loop stages 512 K-elements/iter → larger
|
||||
shared footprint → adverse in the occupancy-bound regime. A P2 tuning knob.
|
||||
|
||||
**Ruled out (do not chase):** redundant weight reads on the *current* selection (none — dense
|
||||
`ntiles_x=1`, MoE ≤1 col-tile/expert); stream-K fixup (it *helps* fill the small GB10 grid at thin M);
|
||||
raw FP4-MMA peak rate (already beats Q4-MMQ and is BW-bound at batch 1 — latency-hiding binds first).
|
||||
|
||||
---
|
||||
|
||||
## 4. The specific build-ready changes
|
||||
|
||||
All against DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`. Every change is gated and defaults to exact
|
||||
stock behavior until proven.
|
||||
|
||||
### 4.1 Dense M-tile / occupancy (the make-or-break)
|
||||
|
||||
- **Keep `mmq_x=128` at dense decode** (the one-weight-read optimum; do **not** shrink it — that
|
||||
re-reads 18 GB). Lock this as an invariant in P0.
|
||||
- **Make `mmq_y` decode-selectable** (`get_mmq_y_host`/`get_mmq_y_device`, L143/L157). Today pinned
|
||||
128; try **64** (and 96) at decode. `mmq_y` is coupled to `nwarps × tile_C::I` via the MMQ
|
||||
static_assert, so this is a **warp/fragment remap** (bounded kernel change), not a pure host switch:
|
||||
fewer N-frags per warp or fewer warps → smaller per-CTA shared → **more resident CTAs → latency
|
||||
hidden by CTA-parallelism**, with **weights still read once** (BW-neutral). This is the primary
|
||||
dense occupancy lever and respects every GB10 rule.
|
||||
- **Host-only knobs first (P1, zero kernel):** the `mmq_get_granularity_host` choice (L274 — sets
|
||||
`rows_per_warp=2·granularity`, `ntx`), and the stream-k-vs-xy-tiling threshold (`launch_mul_mat_q`
|
||||
~L3954, `tiles_efficiency_percent` L4001). Plus one **empirical A/B**: does eating a 2× weight
|
||||
re-read at `mmq_x=64` buy enough occupancy to net positive? (Diagnostic: if yes, occupancy is badly
|
||||
broken and P2 `mmq_y`↓ has large upside; if no, the tile is already BW-saturated and P2's ceiling is
|
||||
lower.) All behind `GGML_CUDA_FP4_MMQ_Y` / `GGML_CUDA_FP4_GRAN` / `GGML_CUDA_FP4_FORCE_STREAMK`.
|
||||
|
||||
### 4.2 FP4-MMA fragment usage
|
||||
|
||||
- Fragments stay `tile_A<16,8,int>` / `tile_B<8,8,int>` / `tile_C<16,8,float>` — these match the
|
||||
`m16n8k64` block-scaled FP4 MMA and must not change (they are the instruction shape). At dense M=128
|
||||
all 16 `tile_C::J`-groups are live → **no dense fragment work needed**. The lever is *how many of
|
||||
these tiles are resident per SM* (occupancy), set by `mmq_y`/`nwarps`/granularity, not the fragment
|
||||
shape.
|
||||
- MoE: shrink `mmq_x` toward `tile_C::J`=8 so the live frag-group count matches density (§4.3).
|
||||
|
||||
### 4.3 MoE M-tile (`MOE_GROUPED_GEMM_SCOPE.md`, partly landed)
|
||||
|
||||
- **Patch 0015 already auto-caps `mmq_x`→64 at decode** via per-expert density in `mul_mat_q_case`
|
||||
(the `expert_bounds != nullptr` block, L4118-4165; env `LLAMA_MOE_DECODE_TILE`,
|
||||
`LLAMA_MOE_DENSITY_MAX`). Tighten the decode tile toward **8–16** (= density) and sweep.
|
||||
- **Optional [2]: block-padded `mm_ids_helper`** (`mmid.cu`) — pad each expert segment to a multiple
|
||||
of the tile, removing `need_check` masked tails and tightening the stream-k schedule. Medium risk
|
||||
(scatter + write-back masking); behind `LLAMA_MOE_BLOCK_ALIGN`.
|
||||
|
||||
### 4.4 Scale handling + the act-quant fusion handoff (the track A → B ABI contract)
|
||||
|
||||
- **Weight scales** (`ue4m3`, one per 16 weights) load in `load_tiles_nvfp4_nvfp4` into `x_sc`
|
||||
(`x_u32 + 64 + kbx`), consumed as `scaleA` in `vec_dot_fp4_fp4_mma` and passed as the block-scale
|
||||
operand to `mma_block_scaled_fp4`. **No change** — already a first-class MMA scale operand.
|
||||
- **Activation scales** (`ue4m3`) live in the `block_fp4_mmq` y-tile `d4[4]`, consumed as `scaleB`.
|
||||
- **The handoff contract:** track B must hold the **`block_fp4_mmq` y-tile layout invariant**
|
||||
(`uint32_t d4[4]` ue4m3 scales + `int8_t qs[128]` = 256 packed FP4, `mmq.cuh:53`). Track A's fused
|
||||
`rms_norm+mul+nvfp4-quant` producer (task 39) writes exactly this struct; track B's "prequantized
|
||||
MMQ consumer" (task 40) makes `mul_mat_q` accept a prebuilt `src1_q8_1` buffer and **skip the
|
||||
`quantize_mmq_fp4_cuda` call** (`mmq.cu:138`/`200`). The numerics must be **bit-identical** to the
|
||||
unfused path (same `e2m1` rounding, same `ue4m3` block scale per 16) so the parity gate stays green
|
||||
with the fusion on or off. B owns the consumer seam; A owns the producer kernel; the `block_fp4_mmq`
|
||||
struct is the frozen interface between them.
|
||||
|
||||
### 4.5 GB10-fit rules (binding constraints on every kernel change)
|
||||
|
||||
- **Small shared mem + high occupancy.** Do **not** add deep `cp.async` stages or XOR-swizzle shared
|
||||
layouts — they are exactly what collapsed W4A16 on GB10 (`W4A16_MARLIN_KERNEL_PLAN.md`: a 16 KB
|
||||
XOR-swizzle dropped q4_K from 6.63→2.84 TFLOPS).
|
||||
- **Preserve the skew-pad** (`MMQ_MMA_TILE_X_K_FP4 = 2·MMQ_TILE_NE_K + 8 + 4`, the `% 8 == 4`
|
||||
padding, `mmq.cuh:221/233`) — conflict-free `ldmatrix` at ~zero shared cost.
|
||||
- **Stay on the FP4-MMA path** (`block_fp4_mmq` / `mma_block_scaled_fp4`) — the only path at GB10's
|
||||
FP4 = 2× INT8/BF16 rate. Never descend to BF16/INT8 (1:1 on GB10).
|
||||
- **Occupancy beats a conflict-free-but-wide layout.** Buy latency-hiding with *more resident CTAs*
|
||||
(smaller `mmq_y`, smaller shared), not a deeper pipeline.
|
||||
- Tuning is **empirical** — `nsys` (throughput) is available, **`ncu` is not** on the DGX (no driver
|
||||
perms). Sweep configs, measure decode_agg, bracket thermals (same-session cold A/B only).
|
||||
|
||||
---
|
||||
|
||||
## 5. Correctness / parity gate (every phase)
|
||||
|
||||
- **Primary, bit-exact:** `test-backend-ops test -o MUL_MAT -b CUDA0` and
|
||||
`test-backend-ops test -o MUL_MAT_ID -b CUDA0` must stay **1103/1103** with the flag set **and**
|
||||
unset, and **byte-identical** when unset. The CPU reference is the deterministic oracle; the op test
|
||||
is exact (the GB10 greedy-decode non-determinism band applies only to end-to-end, never to the op
|
||||
test).
|
||||
- **Add decode-shape cases if absent:** `type_a ∈ {NVFP4, MXFP4}`, `type_b = F32`, dense **n=128** at
|
||||
the real FFN K/N; for `_ID`, `n_mats=128, n_expert_used=8, n_tokens ∈ {8,32,64,128}` **plus ragged
|
||||
small-M** (experts with 0/1/2 tokens, `n_tokens` not a multiple of `mmq_x`) — exactly where `mmq_x`/
|
||||
`mmq_y` changes and block-pad masking can leak.
|
||||
- **Fusion-handoff parity (P3):** with track A's fused producer on, the prequantized-consumer path
|
||||
must produce dst **identical** to the unfused `quantize_mmq_fp4_cuda` path (same `e2m1`/`ue4m3`
|
||||
rounding).
|
||||
- **End-to-end:** `llama-batched-bench -fa on -npp 512 -ntg 256 -npl 128` on `q36-27b-nvfp4.gguf`
|
||||
(dense) and `q36-35b-a3b-nvfp4.gguf` (MoE); confirm decode_agg climbs per §6 and output stays within
|
||||
the documented CUDA batch-shape non-determinism band vs the CPU oracle. All scripts **dev-tree-only**.
|
||||
|
||||
---
|
||||
|
||||
## 6. Phased plan, with expected decode_agg at each phase
|
||||
|
||||
Per-step model used (ms @npl128): **dense 795** = GEMM 471 + act 65 + GDN 83 + attn 14 + rest 162;
|
||||
**MoE 384** = GEMM 227 + act 31 + GDN 38 + attn 8 + rest 81. `decode_agg = 128 / step_s`.
|
||||
|
||||
### DENSE (parity target 391)
|
||||
|
||||
| phase | work | GEMM ms | step ms | **decode_agg** | **% of vLLM 391** | risk |
|
||||
|---|---|---:|---:|---:|---:|---|
|
||||
| **P0** harness | Lock baseline: 1103/1103, decode n=128 perf, nsys window, the 471 ms / 2.9% eff datum. Pin `mmq_x=128` one-read invariant. | 471 | 795 | **161** | 41% | low |
|
||||
| **P1** host-only tile/grid + re-read A/B | granularity + stream-k threshold sweep; the `mmq_x=64` re-read-vs-occupancy diagnostic. **Honest: small** — `mmq_x` is pinned, so this mostly de-risks P2. | ~400 | ~724 | **~177** | ~45% | low |
|
||||
| **P2** `mmq_y`↓ + occupancy/shallow-prefetch | The make-or-break: raise resident CTAs (`mmq_y` 128→64, granularity, shallow 2-stage weight prefetch, skew-pad), push GEMM toward the **66–81 ms BW floor (17–21% FP4 eff)**. **KILL-GATE: if eff plateaus <15% (GEMM >110 ms) → dense parity OFF, report partial.** | **66–81** | 390–405 | **316–328** | **81–84%** | **med-high** |
|
||||
| **P3** co-land track A | Consume A's prequantized `block_fp4_mmq` y-tile; the 65 ms act bucket folds away. | 66–81 | **325–340** | **376–394** | **96–101%** | low |
|
||||
|
||||
Dense climb: **161 → ~177 → 316–328 → 376–394** tok/s = **41% → 45% → 81–84% → 96–101% of vLLM 391.**
|
||||
Robust to the 273-vs-216 GB/s uncertainty (@216 GB/s P3 → ~359 tok/s = 92%). **Parity within error,
|
||||
contingent on P2 clearing the kill-gate and on A landing.**
|
||||
|
||||
### MoE (parity target 811)
|
||||
|
||||
| phase | work | GEMM ms | step ms | **decode_agg** | **% of vLLM 811** | risk |
|
||||
|---|---|---:|---:|---:|---:|---|
|
||||
| **P0** harness | Lock 1103/1103 + the monotonic `85→1771` batched-bench curve + 227 ms / 35%-BW datum. | 227 | 384 | **333** | 41% | low |
|
||||
| **P1/P4** MoE `mmq_x`↓ (patch 0015 → tighten to 8–16) | Free per-expert tile shrink (no re-read); reclaim the 3–6% fill waste, raise occupancy. | ~140 | ~297 | **~431** | ~53% | low |
|
||||
| **P2** block-pad align + occupancy | Remove `need_check` tails, tighten stream-k; push toward the 80 ms floor. | ~100 | ~257 | **~498** | ~61% | med |
|
||||
| **P3** co-land track A | act bucket (31 ms) folds away; GEMM at the ~80 ms floor. | 80 | **207** | **618** | **76% — CEILING** | low |
|
||||
|
||||
MoE climb: **333 → ~431 → ~498 → 618** tok/s = **41% → 53% → 61% → 76% of vLLM 811.** **The 76% is the
|
||||
hard ceiling from the GEMM track:** even a *perfect* weight-read-floor grouped GEMM leaves llama's
|
||||
non-GEMM (GDN 38 + attn 8 + rest 81 = 127 ms) at **1.6× vLLM's whole ~78 ms non-GEMM**, so the step
|
||||
cannot drop below ~207 ms. The remaining ~49 ms to vLLM's 158 ms step is elementwise + host-loop
|
||||
(GDN state I/O is intrinsic and vLLM pays it identically — `GDN_DECODE_VERIFY.md`), **outside track B.**
|
||||
|
||||
### Explicitly NOT in scope (and why)
|
||||
- A from-scratch W4A16 / CUTLASS collective — the FP4-MMA path already exists and is BW-optimal at
|
||||
batch 1; rewriting repeats the W4A16 occupancy dead-end (`W4A16_MARLIN_KERNEL_PLAN.md`: STOPPED).
|
||||
- Deep multi-stage `cp.async` / XOR-swizzle shared layouts — proven to collapse GB10 occupancy.
|
||||
- The non-GEMM MoE residual (elementwise, host CUDA-graph, GDN bf16 state) — needed for MoE parity
|
||||
but **separate tracks**; track B owns the GEMM only.
|
||||
|
||||
- A from-scratch W4A16 / CUTLASS SM120 collective — repeats the STOPPED occupancy dead-end and
|
||||
CUTLASS's grouped FP4 is broken on sm_121.
|
||||
- Deep multi-stage `cp.async` / XOR-swizzle — proven to collapse GB10 occupancy.
|
||||
- "Make activations 4-bit" — already W4A4; no work, no win there.
|
||||
- The non-GEMM MoE residual (elementwise, host CUDA-graph, GDN bf16 state) — needed for MoE parity but
|
||||
**separate tracks**; B owns the GEMM only.
|
||||
|
||||
---
|
||||
|
||||
## 5. Honest one-paragraph summary
|
||||
## 7. The honest ceiling — does B reach TRUE PARITY?
|
||||
|
||||
- **DENSE: TRUE PARITY is PLAUSIBLY REACHABLE, conditional, no margin.** The entire 2.42× gap is the
|
||||
GEMM bucket; its ideal floor (66 ms) is 7× below the current 471 ms and is **bandwidth-bound, not
|
||||
hardware-capped**. **B (GEMM → BW floor) + A (act-fuse) lands 376–394 tok/s = 90–103% of vLLM 391.**
|
||||
The catch: it needs **~17–21% FP4-MMA efficiency at decode M=128**, and GB10 has only demonstrated
|
||||
~17% — and that at the *easier* prefill M=512 tile. It is a **reach, not a lock**, gated by the P2
|
||||
occupancy kill-gate and contingent on track A. **GO (conditional).**
|
||||
|
||||
- **MoE: full parity is NOT reachable from track B.** Realistic ceiling **~76% of vLLM (618 vs 811)**
|
||||
even with a perfect weight-read-floor grouped GEMM, because (1) the MoE floor is the hardest
|
||||
grouped-GEMM regime (M≈4/expert, vLLM ships purpose-built Marlin-NvFp4) and (2) ~24% of the step is
|
||||
non-GEMM outside this track. Worth doing (333 → ~618, a 1.85× and a real win), but it **cannot
|
||||
deliver 811 alone.** **PARTIAL / NO-GO for parity-from-B.**
|
||||
|
||||
- **The 273 GB/s is not the ceiling — the GB10 FP4-MMA occupancy efficiency is.** Decode M=128 is a
|
||||
*different* regime from the dead W4A16 path: bandwidth/occupancy-bound (saturate LPDDR5x at a thin
|
||||
M-tile via resident CTAs), not compute-throughput-bound (pack MMAs). The existing path is already at
|
||||
the BW floor at batch 1 (88 ms), so the work is **keeping it bandwidth-bound as M grows to 128**
|
||||
(occupancy via `mmq_y`↓ + shallow prefetch), a **tune of a working path**, not the greenfield
|
||||
rewrite. The binding risk is whether that occupancy can be bought without tripping the GB10 wall —
|
||||
which is exactly what the P2 kill-gate measures.
|
||||
|
||||
**Bottom line for the "TRUE PARITY" ask:** GB10 **can** plausibly deliver **dense** decode parity with
|
||||
vLLM via a tuned FP4-MMA decode GEMM **+ track A**, at the top of the demonstrated efficiency envelope
|
||||
with no margin. GB10 **cannot** deliver **MoE** decode parity from the GEMM track alone (ceiling ~76%);
|
||||
MoE parity is a B-plus-non-GEMM program. **Verdict: GO for dense (conditional, B+A, kill-gated),
|
||||
PARTIAL for MoE.**
|
||||
|
||||
---
|
||||
|
||||
## 8. One-paragraph summary
|
||||
|
||||
The decode GEMM at M=128 is **bandwidth-bound on paper** (crossover M*≈611 ≫ 128) with weight-read
|
||||
floors 4–6× above vLLM, so **273 GB/s is not the wall** — but llama's FP4-MMA kernel runs at ~3% of
|
||||
FP4 peak, putting it in **self-inflicted compute-bound territory** (471 ms vs a 66 ms floor). Closing
|
||||
that is the entire dense gap: **track B (tune the FP4-MMA decode M-tile to the BW floor) + track A
|
||||
(fuse act-quant)** plausibly reaches **90–103% of vLLM dense (391)** — TRUE PARITY is on the table for
|
||||
dense, but only at the **top of the demonstrated GB10 FP4-efficiency envelope (~17–21%)** and with
|
||||
**no margin**, gated by the occupancy wall. **MoE parity is not reachable from the GEMM alone**
|
||||
(ceiling ~60–76% of 811), because its floor sits in the hardest grouped-GEMM regime and ~24% of its
|
||||
step is non-GEMM work outside this track. Verdict: **GO for dense (conditional, B+A), PARTIAL for MoE.**
|
||||
FP4 peak, in **self-inflicted compute-bound territory** (471 ms vs a 66 ms floor). The path is already
|
||||
**W4A4** and already **beats vLLM at batch-1 prefill**, so the fix is **tuning the existing
|
||||
`mul_mat_q<NVFP4>`**, not a cutlass rewrite (a proven GB10 dead-end, and broken on sm_121 anyway). The
|
||||
M-tile asymmetry sets the levers: **dense** is pinned at `mmq_x=128` (one weight read) so its occupancy
|
||||
win is **`mmq_y`↓ + shallow prefetch** (BW-neutral), while **MoE**'s win is the free per-expert
|
||||
**`mmq_x`↓** (patch 0015). **Track B (GEMM → BW floor) + track A (fuse act-quant)** plausibly reaches
|
||||
**90–103% of vLLM dense (391)** — TRUE PARITY on the table for dense, but only at the **top of the
|
||||
demonstrated GB10 FP4-efficiency envelope (~17–21%)**, with **no margin**, gated by the P2 occupancy
|
||||
kill-gate. **MoE parity is not reachable from the GEMM alone** (ceiling ~76% of 811), because its floor
|
||||
sits in the hardest grouped-GEMM regime and ~24% of its step is non-GEMM. **Verdict: GO for dense
|
||||
(conditional, B+A), PARTIAL for MoE.**
|
||||
|
||||
Reference in New Issue
Block a user