docs(paged): adversarial review of track-B FP4-GEMM parity go/no-go

Append section 9 (skeptical staff-CUDA-engineer review) to FP4_GEMM_SCOPE_B.md, stress-testing the dense/MoE parity verdict against the committed grounding. Key findings: - Not the W4A16 wall: the npl-sweep (dense 99/56/46/41% of vLLM at npl 8/32/64/128) shows llama's FP4-MMA kernel HITS the weight-read floor at M=8 and FALLS OFF it as M grows, while vLLM HOLDS it. Working-path tune, dual existence proof (M=8 + vLLM M=128), not a greenfield build. Same binding constraint as W4A16 though (hide LPDDR5x latency at the larger tile on an occupancy-dominated part). - The dense gap is ~82-87% GEMM, ~13-18% non-GEMM (467 ms total = 383-405 GEMM + 62-84 non-GEMM). B alone caps ~80%; track A is what tips dense over the parity line. - Sharpest omission: vLLM's M=128 floor is reached via cutlass TMA + deep pipeline - the technique the doc forbids on GB10. TMA != manual cp.async (lower occupancy cost); it must be an in-scope P2 fallback, not categorically banned. - Honest landing: dense ~80-90% (parity the optimistic tail, contingent on B+A+floor), MoE ~55-65% (parity not reachable from B). Low-regret: even a tripped P2 kill-gate lands B+A ~89%, doubling today's 41%. - Sequencing fix: land A first (defines B's interface + baseline + kill-gate), then run B's P2 against the post-A number. Verdict: DENSE conditional GO (scope as GEMM-gap-closing, not true parity; A-first, gate at P2, add TMA); MoE NO-GO for parity from B (do the cheap mmq_x-down win as a 1.7-1.85x, not parity). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-24 16:49:06 -04:00 · 2026-06-24 14:31:35 +00:00
parent 7434d64c75
commit 39e16cc2c4
1 changed files with 170 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/FP4_GEMM_SCOPE_B.md
+++ b/backend/cpp/llama-cpp/patches/paged/FP4_GEMM_SCOPE_B.md
@@ -360,3 +360,173 @@ demonstrated GB10 FP4-efficiency envelope (~17–21%)**, with **no margin**, gat
 kill-gate. **MoE parity is not reachable from the GEMM alone** (ceiling ~76% of 811), because its floor
 sits in the hardest grouped-GEMM regime and ~24% of its step is non-GEMM. **Verdict: GO for dense
 (conditional, B+A), PARTIAL for MoE.**
+
+---
+
+## 9. Adversarial review (skeptical staff CUDA engineer, post-W4A16): the parity go / no-go
+
+Reviewer stance: I lived through the W4A16 GB10 effort that plateaued at ~9-15 TFLOP/s (~21% of the
+BF16 ceiling) after multi-week work and was STOPPED at the occupancy wall. I read this scope and the
+grounding (`QWEN36_NVFP4_BENCH`, `VLLM_DECODE_GROUNDING`, `GDN_DECODE_VERIFY`, `DECODE_GAP_STUDY`,
+`BLACKWELL_KERNEL_GAPS`, `W4A16_MARLIN_KERNEL_PLAN`) and stress-tested the verdict against them. Net:
+the plan is **directionally right and tractably scoped**, the kernel-approach decision (tune, do not
+rewrite) is correct, but the **"GO for dense, TRUE PARITY 96-103%" headline outruns its own caveats**.
+The honest landing is **dense ~80-90% (parity is the optimistic tail), MoE ~55-65% (parity not
+reachable from B)**. The decision to commit to B is nonetheless sound, for a reason the doc under-sells
+(low regret), and there is **one technical gap (TMA) and one sequencing error (A last) that must be
+fixed**.
+
+### 9.1 Is this the W4A16 wall again? No - and the batch-scaling signature proves why
+
+The decisive evidence the doc has but does not fully exploit is the **npl-sweep** (`QWEN36_NVFP4_BENCH`):
+dense llama-as-%-of-vLLM = **99 / 56 / 46 / 41** at npl 8 / 32 / 64 / 128. At **npl8 the kernels are at
+parity** (99%); the gap **opens monotonically as M grows**. Decompose this:
+
+- At M=8 the dense GEMM is weight-read-bound at the floor (~88 ms, same as batch-1). llama == vLLM there,
+  so **llama's FP4-MMA kernel demonstrably HITS the weight-read floor at small M.** This is the existence
+  proof the W4A16 path never had: it is a *working, floor-reaching* FP4-MMA kernel, not a greenfield
+  build stuck at 1/4 of MMQ.
+- At M=128 vLLM's GEMM **stays at ~88 ms** (flat: it amortizes the one weight read over 128 tokens and
+  hides the MMA behind the load), while **llama's balloons to 471 ms** (5.4x). llama **falls off the
+  floor** as M grows; vLLM **holds it**.
+
+So the problem is **not** "build a fast 4-bit GEMM from scratch on an occupancy-hostile part" (the dead
+W4A16 problem). It is **"keep a working FP4-MMA kernel on the bandwidth floor as the M-tile grows from 8
+to 128"** - a tune of a working path. **Verdict: this is NOT the W4A16 wall** (different regime, working
+path, dual existence proof at M=8 and from vLLM at M=128). **But it shares W4A16's one binding
+constraint:** holding the floor as M grows requires hiding LPDDR5x weight-load latency at the larger
+tile, which is the same occupancy / latency-hiding game GB10 historically loses. The doc is right that
+it is a different and more tractable regime; it under-states that the *binding risk is identical*.
+
+### 9.2 Why is vLLM 2.4x faster if both share 273 GB/s? Compute-side scheduling, and the gap is ~82% (not 100%) GEMM
+
+The load-bearing question, settled by 9.1: at M=128 the gap is **not** that vLLM beats the shared
+bandwidth floor - it is that **llama falls off the floor into self-inflicted compute/occupancy-bound
+territory while vLLM stays on it.** The lever is therefore latency-hiding at the M=128 tile
+(compute-side scheduling: occupancy, prefetch, tile shape), with the 273 GB/s weight-read floor as the
+hard target both engines share. This confirms the doc's roofline and its central claim that the kernel,
+not the hardware, is the limiter.
+
+**But the doc's "the entire 2.42x dense gap is the GEMM" is an ~82% truth, not a 100% one.** Decompose
+the dense step (numbers from the doc's own inputs):
+
+```
+llama step @npl128            795 ms   (decode_agg 161)
+vLLM step  @npl128            328 ms   (decode_agg 391)
+total gap                     467 ms
+
+llama GEMM                    471 ms
+vLLM GEMM (at the floor)      ~66-88 ms   (66 @273 GB/s spec, 88 @216 GB/s achieved)
+=> GEMM gap                   383-405 ms  = 82-87% of the 467 ms total gap
+=> non-GEMM gap                62-84 ms   = 13-18% of the total gap
+```
+
+So **B alone (GEMM -> floor) caps near ~80-84%** (step 412-390 ms = 311-328 t/s), **not parity.** Parity
+needs the non-GEMM 62-84 ms too: ~65 ms of it is track A's act-quant bucket, the residual ~0-19 ms is
+elementwise + host outside both A and B. This is the crux of the sequencing answer (9.6): **B is
+necessary but on its own lands ~80%; it is track A that tips dense over the parity line, not B.** The
+parity story is *entirely* contingent on A, which the P3 framing buries.
+
+### 9.3 The sharpest risk the doc misses: vLLM's existence proof uses the technique the doc forbids (TMA)
+
+vLLM holds the M=128 floor with **cutlass SM120 = TMA + a warp-specialized deep async producer/consumer
+pipeline** (Research 1). That deep pipeline is **exactly what the doc forbids on GB10** (rule 4.5: "do
+not add deep cp.async stages ... they collapsed W4A16"). So **B's chosen GB10-friendly route (`mmq_y`-down
+occupancy + a shallow 2-stage prefetch) is a different bet from the one that produced the existence
+proof.** Reaching the same floor by a friendlier route is plausible but **unproven**, and if the
+occupancy-only route plateaus short of the floor, B underperforms its target with no fallback in scope.
+
+The doc conflates two different things under "deep pipeline":
+- **manual `cp.async` + XOR-swizzle** - register/shared-hungry, **collapsed W4A16 occupancy on GB10**
+  (correctly banned).
+- **TMA (tensor-memory-accelerator) bulk async copy** - a single descriptor drives the copy, **far lower
+  register/occupancy cost**, and it is precisely how cutlass gets pipeline depth **without** the
+  occupancy hit (Research 1 says this explicitly). TMA is available on sm_120/121.
+
+**Recommendation (binding):** B must put a **TMA-driven weight feed in scope as a first-class P2 option**,
+not categorically forbid pipeline depth. The occupancy-only route is the right *first* experiment
+(cheapest, respects the W4A16 lesson), but if P2 plateaus below the floor, **TMA is the demonstrated way
+to get depth without the occupancy collapse** and is what the vLLM existence proof actually uses.
+Declaring the floor "unreachable" without trying TMA would repeat the W4A16 mistake in reverse:
+abandoning the path that works because the *manual* version of it failed.
+
+### 9.4 Tractability: bounded tune, confirmed - with the TMA caveat
+
+The proposed changes are genuinely **bounded and build-ready**, not a greenfield kernel:
+- **MoE arm = DEMONSTRATED tractable.** Patch 0015 already auto-caps `mmq_x` per-expert and is committed
+  and measured. Tightening to 8-16 + block-pad is the same lever, lower risk. This is real, banked
+  evidence that the "tune `mul_mat_q`" approach works on this exact kernel family.
+- **Dense arm = plausibly bounded.** `mmq_y`-down is a warp/fragment remap that touches the
+  `nwarps x tile_C::I == mmq_y` static_assert coupling, so it is a contained *kernel* edit (not a pure
+  host switch, as the doc itself notes). The host-only P1 knobs are zero-risk. The **prefetch piece is
+  where the residual occupancy risk lives** - and per 9.3, TMA belongs here.
+- **Rejecting (B) cutlass-rewrite and (C) BF16-Marlin-descent is correct.** Cutlass grouped FP4 is broken
+  on sm_121 (the reason vLLM itself falls to Marlin for MoE); BF16 Marlin concedes GB10's 2x FP4 edge.
+
+**Verdict: tractable, not greenfield.** The MoE arm is proven; the dense arm is a contained edit with a
+real but bounded occupancy risk, gated by the P2 kill-gate. The one scope gap is TMA (9.3).
+
+### 9.5 Honest expected outcome (the numbers I would defend)
+
+| | B alone | B + A (median) | B + A (optimistic, spec BW) | parity? |
+|---|---:|---:|---:|---|
+| **DENSE** (target 391) | ~80-84% (311-328 t/s) | **~92-95% (360-372 t/s)** | ~101% (394 t/s) | **optimistic tail only** |
+| **MoE** (target 811) | ~53-61% (431-498 t/s) | **~70-76% (570-618 t/s)** | 76% (618 t/s, CEILING) | **no** |
+
+Reconciliation with the doc: the doc's B+A = "96-103%" uses the **spec-BW (66 ms floor)** end. At the
+**achieved 216 GB/s (88 ms floor)** the same arithmetic gives **~94%**, and that still assumes B hits the
+floor. So the honest dense median is **~92-95%, with TRUE PARITY as the upside, not the expectation**,
+contingent on a conjunction of three things: (a) P2 clears the occupancy kill-gate to the floor, (b) the
+GB10-friendly *or* TMA feed actually reaches the cutlass floor (9.3), and (c) track A lands. Three ANDs =
+tail, not median.
+
+**The low-regret point the doc under-sells (and the real reason to commit):** even the *kill-gate-tripped*
+outcome is a large win. At the doc's own 15%-FP4-eff kill threshold (GEMM ~110 ms), B+A still lands
+**~89%** (step 369 ms); at a merely-partial occupancy win (eff 3% -> 5%, GEMM ~276 ms) B+A still lands
+**~61%**. Since the M=8 parity proof guarantees the floor is reachable in principle and patch 0015 proves
+the tune works, **getting *some* improvement at M=128 is high-probability; the only open question is how
+close to the floor.** So the outcome distribution is heavily positive (very likely 60-90%, possibly
+parity) with a bounded downside - B is **low-regret**, which matters more for the go decision than whether
+the parity tail hits.
+
+### 9.6 Sequencing vs track A: land A FIRST (the doc has this backwards)
+
+The doc runs A as a parallel track merging at **P3 (last)**. That is backwards for de-risking, for three
+reasons:
+1. **A defines B's interface.** B's "prequantized-MMQ consumer" consumes A's fused `block_fp4_mmq`
+   producer (the frozen struct in 4.4). Building B against a not-yet-landed producer means B's consumer
+   seam is speculative until P3.
+2. **A defines B's baseline and the kill-gate threshold.** A alone (act-fuse, folding the 65 ms /8.2%
+   bucket, plus any of the elementwise/host it captures) plausibly moves dense **41% -> ~50-55%** before
+   B touches a kernel. B's *true residual is the GEMM after A removed the act round-trip*, not the raw
+   59%. Running B's P2 against the stock 41% baseline mis-sizes the required GEMM speedup and the
+   <15%-eff kill-gate.
+3. **A is lower-risk and independently shippable.** It is the safe win; it should not wait behind the
+   risky kernel tune.
+
+**Recommendation:** land A (tasks 38-41) first, **re-measure** the decode_agg and the GEMM share
+post-A, **then** run B's P2 and recompute the kill-gate against the post-A number. This makes the
+make-or-break decision cheaper, better-informed, and bankable-either-way.
+
+### 9.7 Verdict (go / no-go)
+
+- **DENSE: CONDITIONAL GO - commit to B, but scope and message it as "close most of the GEMM gap"
+  (expected ~80-90%, parity the upside), NOT "true parity."** Justified because: the approach is
+  bounded/tractable (9.4), it is a working-path tune with a dual existence proof (9.1), and the outcome
+  is low-regret (9.5) - even a tripped kill-gate roughly doubles today's 41%. Conditions: (i) **land A
+  first** (9.6); (ii) **gate hard at P2** (eff < 15% -> stop chasing parity, but keep the partial win);
+  (iii) **put TMA in scope** as the floor-reaching fallback before declaring the floor unreachable (9.3).
+
+- **MoE: NO-GO for parity from B (confirmed).** The doc's ~76% ceiling is honest, arguably optimistic
+  (it assumes the ragged M~4/expert grouped GEMM hits its 80 ms floor, the hardest regime, where vLLM
+  ships purpose-built Marlin). Realistic B+A landing **~70-76%**, B alone ~55-61%. Still worth doing -
+  the `mmq_x`-down / block-pad work is cheap and partly landed (patch 0015) - but it must be sold as a
+  **1.7-1.85x win, not parity**; MoE parity is a **B-plus-non-GEMM** program (elementwise fusion, host
+  CUDA-graph, GDN bf16 state).
+
+- **One line for the parent:** GB10 can plausibly reach **dense** decode parity with vLLM only at the
+  **top of its FP4 envelope and only as B + A together** (B alone caps ~80%; A is what tips it over),
+  and **cannot** reach **MoE** parity from the GEMM track alone (ceiling ~76%). **Commit to B** as a
+  high-value, low-regret, bounded GEMM-gap-closing tune (honest expected landing **dense ~80-90%, MoE
+  ~55-65%**), **sequence track A first**, **gate at P2**, and **add a TMA weight-feed option** so the
+  occupancy-only route is not the only shot at the floor that vLLM's TMA pipeline demonstrably reaches.