docs(paged): scope track B - FP4-MMA decode-GEMM roofline + parity go/no-go

Roofline at the decode batch shape (M=128, NVFP4 weights) on GB10 (sm_121): the dense weight-read floor (~1,940 tok/s) and MoE floor (~1,590 tok/s) sit 4-6x above vLLM's 391/811, so 273 GB/s is NOT the wall. At FP4 peak the GEMM is bandwidth-bound (crossover M*~611 >> 128); at the kernel's ~3% achieved FP4 efficiency it is compute-bound by its own inefficiency (471 ms vs a 66 ms floor). Verdict: dense decode parity is plausibly reachable via a tuned FP4-MMA decode M-tile (track B) + fused act-quant (track A), landing 376-394 tok/s = 90-103% of vLLM 391, but only at the top of the demonstrated GB10 FP4 envelope (~17-21%) and with no margin (occupancy wall is the binding constraint, not bandwidth). MoE parity is NOT reachable from the GEMM alone (ceiling ~60-76% of 811): its floor is the hardest grouped-GEMM regime and ~24% of its step is non-GEMM work outside track B. GO (conditional) for dense, PARTIAL for MoE. Build-ready phased plan included; tune the existing block_fp4_mmq path, not a W4A16 rewrite. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-25 00:59:28 -04:00 · 2026-06-24 14:09:41 +00:00
parent e4c63179e0
commit ea634ee958
1 changed files with 221 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/FP4_GEMM_SCOPE_B.md
+++ b/backend/cpp/llama-cpp/patches/paged/FP4_GEMM_SCOPE_B.md
@@ -0,0 +1,221 @@
+# Track B: the FP4-MMA weight-GEMM for GB10 decode parity with vLLM — roofline + go/no-go
+
+Scope only (build-ready plan + honest verdict). **Not implemented in this workflow.** This is the
+residual-kernel track after track A (fuse the standalone `quantize_mmq_nvfp4` activation-requant,
+the 8.2% bucket) is handled separately. Track B asks the load-bearing question and answers it
+quantitatively: at the decode batch shape (M≈128 tokens, NVFP4 weights), is the weight GEMM
+**compute-bound** (FP4-MMA throughput is the lever → parity reachable with a better kernel) or
+**bandwidth-bound** (273 GB/s weight-read is a hard floor → parity capped)? And given the prior
+GB10 occupancy history, can a better FP4-MMA decode GEMM actually reach vLLM's 391 (dense) / 811
+(MoE) tok/s, or only partway?
+
+Hardware: NVIDIA GB10 / DGX Spark, sm_121 (CC 1210 = `GGML_CUDA_CC_DGX_SPARK`), unified LPDDR5x.
+Dev tree `~/llama-paged-dev` (branch `paged`, build-cuda sm_121). All numbers are reasoned from the
+committed nsys decomposition + measured GB10 specs; **no new GPU benchmarks were run** (track A is on
+the box).
+
+## 0. The grounded inputs (measured, committed)
+
+| quantity | value | source |
+|---|---|---|
+| LPDDR5x bandwidth (spec) | **273 GB/s** | `BLACKWELL_KERNEL_GAPS.md`, `VLLM_DECODE_GROUNDING.md` |
+| LPDDR5x bandwidth (achieved, batch-1) | **~216 GB/s** (19 GB / ~88 ms irreducible) | prior batch-1 weight-read study |
+| FP4 (NVFP4/MXFP4) dense peak | **~427–500 TFLOP/s** (2× BF16; GB10 is 1:1:2 BF16:INT8:FP4) | `BLACKWELL_KERNEL_GAPS.md` §2 (measured) |
+| BF16 peak | ~213 TFLOP/s | same |
+| Demonstrated GB10 FP4-MMA efficiency | **~17%** of FP4 peak at prefill M=512 (MXFP4 dense 1153 t/s); ~3–7% at decode; ~5% MoE | `BLACKWELL_KERNEL_GAPS.md` §6, `GDN_DECODE_VERIFY.md` |
+| Demonstrated GB10 INT8-MMQ efficiency | ~21% of BF16 peak | `BLACKWELL_KERNEL_GAPS.md` §3 |
+| Dense Qwen3.6-27B NVFP4 weights | **18.8 GB** file (`q36-27b-nvfp4.gguf`); ~18 GB matmul tensors | `du` on DGX |
+| MoE Qwen3.6-35B-A3B NVFP4 weights | **23.85 GB** file; ~22 GB read/step at npl128 (≈98% experts hit) | `du` on DGX |
+| Decode step decomposition (dense npl128, nsys, GPU 92.7% busy) | GEMM_weight **59.2%**, act_quant 8.2%, GDN(recurrent+conv) 10.4%, full-attn 1.8%, elementwise/norm/rope 13.5%, embed 2.9%, copy 1.8% | `GDN_DECODE_VERIFY.md` §3a |
+| Measured per-step times @npl128 | dense **~795 ms** (llama) → **~328 ms** (vLLM); MoE **~384 ms** → **~158 ms** | `VLLM_DECODE_GROUNDING.md` |
+| Aggregate decode @npl128 | dense 161 (llama) vs **391** (vLLM); MoE 333 vs **811** | `QWEN36_NVFP4_BENCH.md` |
+
+Crossover formula used throughout (per-GEMM and whole-model are identical):
+`M* = b · peak / (2 · BW)` where `b` = bytes per weight element. Below `M*` the GEMM is
+bandwidth-bound; above it, compute-bound.
+
+---
+
+## 1. DENSE Qwen3.6-27B — the roofline at decode M=128
+
+`b = 18e9 B / 27e9 params = 0.667 B/param`. FLOPs/step `= 2·M·P = 2·128·27e9 = 6.91 TFLOP`.
+
+**(a) Weight-read floor** (weights read ONCE for all 128 tokens):
+- @273 GB/s: 18 GB / 273 = **65.9 ms/step → 1,942 tok/s ceiling**
+- @216 GB/s (achieved): 18 / 216 = **83 ms/step → 1,542 tok/s**
+
+**(b) Compute floor:**
+- @FP4 peak 500 TF/s: 6.91 / 500 = **13.8 ms → 9,275 tok/s**
+- @17% FP4 (85 TF/s, the demonstrated prefill ceiling): 81 ms → 1,580 tok/s
+- @5% FP4 (25 TF/s, measured decode regime): 276 ms → 464 tok/s
+
+**(c) Crossover:**
+- At FP4 **peak**: `M* = 0.667·500e12 / (2·273e9) = 611`. **M=128 ≪ 611 → an ideal FP4 GEMM at decode is BANDWIDTH-BOUND.**
+- At the kernel's **achieved** efficiency the effective peak collapses, dragging `M*` down: 17% → M*≈104; 5% → M*≈30. So **at its current ~3–7% efficiency the kernel is COMPUTE-BOUND at M=128** (limited by its own poor FP4-MMA throughput), even though the hardware says it should be bandwidth-bound.
+
+**Where llama actually sits:** GEMM = 59.2% × 795 ms = **471 ms**. Achieved = 6.91e12 / 0.471 =
+**14.7 TFLOP/s = 2.9% of FP4 peak**. That is **7.1× slower than the 66 ms weight-read floor** and
+matches the ~3–7% decode-efficiency band. The 471 ms is not a hardware bandwidth wall — it is the
+FP4-MMA kernel running deep in compute-bound territory at single-digit efficiency.
+
+**Where vLLM sits:** step 328 ms → if its native-FP4 cutlass GEMM is at the ~66 ms BW floor, the
+GEMM is only ~20% of vLLM's step; the rest (~262 ms) is GDN + full-attn + host. vLLM's **whole step
+(328 ms) ≈ llama's GEMM bucket alone (471 ms)** minus a bit. The entire 2.42× gap is the GEMM.
+
+**Dense parity arithmetic** (795 ms = GEMM 471 + act 65 + GDN 83 + attn 14 + rest 162):
+- B alone (GEMM → 66 ms BW floor, requires ~21% FP4 eff): step 728→… = 66+65+83+14+162 = **390 ms → 328 tok/s = 84% of vLLM**.
+- **B + A** (GEMM 66 ms floor **and** act-quant fused away): 66+83+14+162 = **325 ms → 394 tok/s = 101% of vLLM → PARITY/BEAT.**
+- B+A at the softer 17% FP4 (GEMM 81 ms, the *demonstrated* prefill ceiling, not the 21% floor): 340 ms → **376 tok/s = 96% of vLLM.**
+
+**Dense robust band: 90–103% of vLLM**, and it is insensitive to the 273-vs-216 GB/s uncertainty
+(at 216 GB/s the floor is 83 ms → step 357 ms → 359 tok/s = 92%). The conclusion holds.
+
+---
+
+## 2. MoE Qwen3.6-35B-A3B — the roofline at decode M=128
+
+At npl128, 128 tokens × top-8 over 256 experts ⇒ P(expert unused) = (1−8/256)^128 ≈ 1.7%, so
+**~98% of experts are read** → ~22 GB/step (essentially the full weight set), the same
+weight-read regime as dense. The grouped GEMM (`mmid.cu` / `mul_mat_q` id-branch) reads each
+routed expert's weight **once** for the ~128·8/256 = **4 tokens/expert** on average.
+
+**(a) Weight-read floor:**
+- @273 GB/s: 22 / 273 = **80.6 ms → 1,588 tok/s**
+- @216 GB/s: 102 ms → 1,255 tok/s
+
+**(b) Compute floor:** only ~3B active params/token → FLOPs = 2·128·3e9 = 0.77 TFLOP → 1.5 ms @peak.
+**Trivial.** MoE decode is **purely bandwidth/occupancy bound**, never compute-bound. The hard part
+is that per-expert M ≈ 4: the grouped GEMM must saturate ~273 GB/s while feeding tiny ragged M-tiles
+— the regime where ggml's dense-tuned `mmq_x=128` underfills (see `MOE_GROUPED_GEMM_SCOPE.md`).
+
+**Where llama sits:** GEMM = 59% × 384 = **227 ms** → effective BW 22 GB / 0.227 s =
+**97 GB/s = 35% of 273** (less compute-bound than dense, but only 1/3 of peak bandwidth — an
+occupancy/tile-fill loss, exactly the `MOE_GROUPED_GEMM_SCOPE.md` M-tile finding).
+
+**Where vLLM sits:** step 158 ms ≈ GEMM at the ~80 ms floor (grouped Marlin-NvFp4, 51% of its step)
+ ~78 ms non-GEMM. So vLLM is already pushing the MoE bandwidth floor.
+
+**MoE parity arithmetic** (384 ms = GEMM 227 + act 31 + GDN 38 + attn 8 + rest 81):
+- B + A, GEMM → 80 ms floor + act fused: 80+38+8+81 = **207 ms → 618 tok/s = 76% of vLLM.**
+- This is the **ceiling from the GEMM track**: even with a *perfect* MoE weight-read-floor GEMM,
+  llama's non-GEMM (GDN 38 + attn 8 + rest 81 = 127 ms) is **1.6× vLLM's whole non-GEMM (~78 ms)**,
+  so the step cannot drop below ~207 ms. To reach vLLM's 158 ms needs the non-GEMM buckets too
+  (GDN state I/O is intrinsic and vLLM pays it identically — `GDN_DECODE_VERIFY.md` — so the
+  remaining ~49 ms is elementwise + host loop, **outside track B**).
+
+**MoE band from B+A: ~60–76% of vLLM.** Full MoE parity is **not reachable from the GEMM alone.**
+
+---
+
+## 3. The load-bearing verdict
+
+**Q: compute-bound or bandwidth-bound at M=128?**
+At the **hardware** roofline the decode GEMM is **bandwidth-bound** (M=128 ≪ crossover 515–611).
+At the **current kernel's** ~3–7% FP4 efficiency it is **compute-bound by its own inefficiency**
+(effective M*≈30). The two weight-read floors — **dense ~1,940 tok/s, MoE ~1,590 tok/s** — both sit
+**4–6× ABOVE vLLM's 391/811.** So **the 273 GB/s bandwidth is NOT the wall at the parity target.**
+There is large bandwidth headroom; the gap is the FP4-MMA kernel achieving single-digit % of peak
+where the roofline permits ~20%+ before bandwidth even binds.
+
+**Q: can a better FP4-MMA GEMM reach vLLM — TRUE PARITY?**
+
+- **DENSE: parity is PLAUSIBLY REACHABLE, but at the edge of the demonstrated envelope.** The entire
+  2.42× gap is the GEMM bucket; its ideal floor (66 ms) is 7× below the current 471 ms and is
+  bandwidth-bound, not hardware-capped. **B (GEMM → BW floor) + A (act-fuse) lands 376–394 tok/s ≈
+  vLLM's 391 (90–103%).** The catch: hitting the floor needs **~21% FP4-MMA efficiency at decode
+  M=128**, and GB10 has only ever demonstrated ~17% (and that at prefill M=512, a *larger, easier*
+  tile). Decode M=128 is a smaller M than prefill, so the same kernel must hold efficiency at a
+  thinner tile. This is a **reach, not a lock**: parity is on the table but with **no comfortable
+  margin** and **contingent on track A landing too**.
+
+- **MoE: full parity is NOT reachable from track B.** Realistic ceiling **~60–76% of vLLM** (618 vs
+  811) even with a perfect weight-read-floor grouped GEMM, because (1) the MoE GEMM floor at M≈4/expert
+  demands near-**full** BW saturation in the hardest grouped-GEMM regime, where llama is at 35% of peak
+  BW and vLLM ships a purpose-built grouped Marlin-NvFp4, and (2) ~24% of the residual is non-GEMM
+  (elementwise + host loop) outside track B. MoE parity needs B **plus** the non-GEMM tracks.
+
+**Q: the GB10 occupancy wall — does it cap this?** Yes, it is the binding constraint, not bandwidth.
+History (`W4A16_MARLIN_KERNEL_PLAN.md`, `BLACKWELL_KERNEL_GAPS.md`): the from-scratch W4A16 BF16 GEMM
+hit only ~9–15 TFLOP/s (¼ of MMQ) because deep `cp.async` pipelines + XOR-swizzle **collapse GB10
+occupancy**; skew-pad + small-shared + high-occupancy won. **Crucially, decode M=128 is a different
+regime from that dead path:** it is bandwidth/occupancy-bound, not compute-throughput-bound, so the
+lever is **saturating LPDDR5x at a thin M-tile via occupancy**, not packing MMAs. The existing
+FP4-MMA path (`block_fp4_mmq` / `vec_dot_fp4_fp4_mma`) is **already at the BW floor at batch 1**
+(88 ms irreducible) — so the kernel *can* saturate bandwidth at M=1; the work is keeping it
+bandwidth-bound as M grows to 128 instead of degrading to compute-bound at 3% efficiency. That is a
+**tune/fix of a working path**, not the dead greenfield W4A16 rewrite.
+
+### Go / No-Go
+
+- **DENSE — GO (conditional).** Build track B as a **decode-M-tile tune of the existing
+  `mul_mat_q<NVFP4>` FP4-MMA kernel**, co-delivered with track A. Honest expectation: **90–103% of
+  vLLM (parity within error), not a guaranteed beat.** Go condition: it is contingent on reaching
+  ~17–21% FP4 efficiency at M=128 (top of the demonstrated GB10 envelope) — set a P2 kill-gate
+  (below).
+- **MoE — PARTIAL / NO-GO for parity-from-B.** Track B (the M-tile work already scoped in
+  `MOE_GROUPED_GEMM_SCOPE.md`) buys MoE → ~60–76% of vLLM and is worth doing, but **cannot deliver
+  MoE parity by itself**; do not promise 811. Full MoE parity requires B + the non-GEMM tracks
+  (elementwise/host CUDA-graph, GDN state I/O bf16) and is a multi-track effort.
+
+**Bottom line for the "TRUE PARITY" ask:** GB10 **can** plausibly deliver **dense** decode parity
+with vLLM via a tuned FP4-MMA decode GEMM **+ track A**, at the edge of the demonstrated efficiency
+envelope and with no margin. GB10 **cannot** deliver **MoE** decode parity from the GEMM track
+alone (ceiling ~76%); MoE parity is a B-plus-non-GEMM program. The hardware (273 GB/s) is **not** the
+ceiling — the GB10 FP4-MMA occupancy efficiency is, and it is a "reach" for dense and a "partial" for
+MoE.
+
+---
+
+## 4. Build-ready plan (do NOT implement here)
+
+The kernels already exist; track B is a **tune + fuse of the FP4-MMA `mul_mat_q` path at the decode
+M-tile**, not a new kernel. This respects every GB10 occupancy lesson (small shared, high occupancy,
+skew-pad, stay on `block_fp4_mmq`; never deep `cp.async` / XOR-swizzle).
+
+### Files (DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`)
+- `mmq.cuh` — `block_fp4_mmq` (L53), `load_tiles_nvfp4_nvfp4` (L948), `vec_dot_fp4_fp4_mma` (L997),
+  the stream-k `mul_mat_q` kernel + `mul_mat_q_case` / `launch_mul_mat_q` tile selection (~L3320–4055,
+  all under `BLACKWELL_MMA_AVAILABLE`).
+- `mmq.cu` — dense + id dispatch; `use_native_fp4` gate (L125), `quantize_mmq_fp4_cuda` act-quant
+  (L138/L200 — **track A's fuse target**).
+- `mmid.cu` — `mm_ids_helper` MoE token-sort (the MoE M-tile lever, scoped in `MOE_GROUPED_GEMM_SCOPE.md`).
+
+### Phases (each ends with: `test-backend-ops -o MUL_MAT[/_ID] -b CUDA0` bit-exact + a decode bench)
+
+| Phase | Work | Expected payoff | Risk |
+|---|---|---|---|
+| **P0** harness | Capture per-shape baseline at the **decode shape** (`test-backend-ops perf -o MUL_MAT`, type NVFP4, **n=128**, FFN K/N) + nsys decode window. Lock 1103/1103 parity + the 14.7 TFLOP/s baseline. Decode-M is the canonical target, not prefill n=512. | None (gate). | Low |
+| **P1** decode M-tile selection (dense) | In `mul_mat_q_case`/`launch_mul_mat_q`, pick `mmq_x`/`mmq_y` from the **decode M=128** shape rather than the prefill-tuned config. M=128 with FP4 N-frag 8 wants a small, occupancy-friendly tile; the prefill `mmq_x=128` likely underfills SM occupancy at decode. Host-side template selection, **zero new kernel**, mirrors `MOE_GROUPED_GEMM_SCOPE.md` [1]. | Lift dense FP4 eff from ~3% toward 10–17%; no extra weight read (one col-tile). | Low |
+| **P2** occupancy/pipeline tune | Sweep warps/tile/skew-pad on the FP4-MMA decode kernel to push toward the **66 ms BW floor (~21% FP4 eff)**. Honor GB10 rules: small shared, high occupancy, skew-pad +4, **no** deep cp.async / XOR-swizzle. **KILL-GATE:** if decode FP4 eff plateaus < ~15% (GEMM > ~110 ms) after the sweep, dense parity is off — stop and report partial. | The dense parity make-or-break. Target GEMM 471→66–81 ms. | **Med-high** (the occupancy wall is real; ncu unavailable on DGX → empirical sweep only) |
+| **P3** co-land track A | Verify the fused act-quant (track A) composes with the tuned GEMM (the requant folds into the FP4 GEMM prologue, removing the 8.2% bucket). | Dense 376–394 tok/s = 90–103% vLLM. | Low (track A owns the fuse) |
+| **P4** MoE M-tile | Land the `MOE_GROUPED_GEMM_SCOPE.md` expert-aware `mmq_x` ([1]) + block-pad align ([2]). | MoE → ~60–76% vLLM (not parity). | Med |
+
+### Parity gate (every phase)
+`GGML_CUDA_*` flag set and unset → `test-backend-ops test -o MUL_MAT -b CUDA0` = **1103/1103**,
+byte-identical when unset. Add **decode-shape (n=128) + ragged small-M** cases if absent. End-to-end:
+`llama-batched-bench -fa on -npp 512 -ntg 256 -npl 128` on `q36-27b-nvfp4.gguf`, confirm decode
+agg climbs toward ~376–394 and stays bit-stable vs the CPU oracle (within the GB10 greedy-decode
+non-determinism band). All bench/parity scripts **dev-tree-only**.
+
+### Explicitly NOT in scope (and why)
+- A from-scratch W4A16 / CUTLASS collective — the FP4-MMA path already exists and is BW-optimal at
+  batch 1; rewriting repeats the W4A16 occupancy dead-end (`W4A16_MARLIN_KERNEL_PLAN.md`: STOPPED).
+- Deep multi-stage `cp.async` / XOR-swizzle shared layouts — proven to collapse GB10 occupancy.
+- The non-GEMM MoE residual (elementwise, host CUDA-graph, GDN bf16 state) — needed for MoE parity
+  but **separate tracks**; track B owns the GEMM only.
+
+---
+
+## 5. Honest one-paragraph summary
+
+The decode GEMM at M=128 is **bandwidth-bound on paper** (crossover M*≈611 ≫ 128) with weight-read
+floors 4–6× above vLLM, so **273 GB/s is not the wall** — but llama's FP4-MMA kernel runs at ~3% of
+FP4 peak, putting it in **self-inflicted compute-bound territory** (471 ms vs a 66 ms floor). Closing
+that is the entire dense gap: **track B (tune the FP4-MMA decode M-tile to the BW floor) + track A
+(fuse act-quant)** plausibly reaches **90–103% of vLLM dense (391)** — TRUE PARITY is on the table for
+dense, but only at the **top of the demonstrated GB10 FP4-efficiency envelope (~17–21%)** and with
+**no margin**, gated by the occupancy wall. **MoE parity is not reachable from the GEMM alone**
+(ceiling ~60–76% of 811), because its floor sits in the hardest grouped-GEMM regime and ~24% of its
+step is non-GEMM work outside this track. Verdict: **GO for dense (conditional, B+A), PARTIAL for MoE.**
+</content>
+</invoke>