From 6c6a925213722a49cb11ceb5275e469fd892dc27 Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Fri, 26 Jun 2026 20:14:30 +0000 Subject: [PATCH] docs(paged): MoE-vs-vLLM DECIDE synthesis - reject W4A16 Marlin, the GEMM is a llama win Cross-agent synthesis on top of the both-engine nsys decomposition (3b5957157): settle the user's "can we do what vLLM does on MoE?" question with the three converging investigations (groundtruth measurement + vllm-marlin source-read + marlin-port feasibility). Verdict: vLLM's ~15% MoE-decode lead is NOT the Marlin GEMM (that bucket is a -1.7 ms llama WIN: native FP4-MMA W4A4 47.3 vs Marlin W4A16 50.0 at the ragged tiny-M decode shape, both at the LPDDR5x BW floor). The gap is bf16 dense-projection bandwidth (+6.5), recurrence state-gather plumbing (+6.6, led by k_get_rows 5.2), graph/stream-overlap overhead (~+7), W4A4 act-quant tax (+3.3), and router/glue (+5.4). A W4A16/Marlin grouped MoE GEMM is REJECTED (default and opt-in): it would regress the 27% GEMM bucket to half-rate bf16 MMA, re-enter the GB10 occupancy wall the dense scaffold already STOPPED at, and its entire intrinsic upside is the ~2% act-quant tax - smaller than the bit-exact +1.9% the 0025 re-graph already banked, and closeable bit-exactly by fusing the act-quant. Recommended build (none a new MoE GEMM): (1) fuse the k_get_rows SSM-state gather (bit-exact, ~+5, biggest single-kernel win); (2) extend CUDA-graph coverage + stream overlap (bit-exact, ~+7); (3) fuse the W4A4 act-quant into RMSNorm/SiLU (bit-exact, +3.3); (4) NVFP4-quantize the still-bf16 GDN/attn projections + lm_head (bit-changing, +6.5, the same NVFP4-dense-quant move vLLM makes). Bit-exact levers alone reach ~94% of vLLM; with the projection quant ~96-97%, parity-or-better physically in reach since both heaviest kernels (SSM core, MoE GEMM) are already llama wins. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto --- .../patches/paged/MOE_GAP_VS_VLLM.md | 115 ++++++++++++++++++ 1 file changed, 115 insertions(+) diff --git a/backend/cpp/llama-cpp/patches/paged/MOE_GAP_VS_VLLM.md b/backend/cpp/llama-cpp/patches/paged/MOE_GAP_VS_VLLM.md index f9a3bf1bc..df47aa39f 100644 --- a/backend/cpp/llama-cpp/patches/paged/MOE_GAP_VS_VLLM.md +++ b/backend/cpp/llama-cpp/patches/paged/MOE_GAP_VS_VLLM.md @@ -117,4 +117,119 @@ The MoE grouped GEMM (the brief's hypothesis) is a **-1.7 ms llama win**, so it **(c) fuller CUDA-graph coverage + stream overlap** (~+7, bit-exact); **(d) fuse the W4A4 act-quant into the preceding op** (+3.3, bit-exact). None of these is a new MoE GEMM. +--- + +# FINAL DECISION (cross-agent synthesis) - "can we do what vLLM does on MoE?" + +Three agents converged on the same verdict from independent angles: `moe-gap-groundtruth` +(the measured both-engine nsys decomposition above), `vllm-marlin-study` (source-read of vLLM's +`moe_wna16_marlin_gemm` / `moe_align_block_size` / `prepare_nvfp4_moe_layer_for_marlin` on the DGX), +and `marlin-port-feasibility` (read-only assessment of the dense W4A16 scaffold + prior STOP). All +three agree, and the measurement is the arbiter. Below is the decision the user asked for. + +## (1) WHERE the 15% lives - decisive + +The gap is **27.8 ms/step (llama at 83.6% of vLLM)** and it is **NOT one kernel - it is a sum of small +deltas, and the MoE grouped GEMM is on llama's side of the ledger.** Ranked: + +| rank | lever | Δ ms/step | bit-exact? | this is... | +|-----:|-------|----------:|:----------:|------------| +| 1 | Graph coverage + cross-stream overlap | ~+7.0 | **yes** | scheduler/runtime (idle +3.8, overlap +3.2) | +| 2 | Recurrence state-gather/conv plumbing (`k_get_rows_float` 5.2 + conv) | +6.6 | **yes** | llama-only kernels; vLLM updates state in-place | +| 3 | Dense GDN/attn projections + lm_head (bf16 vs NVFP4) + convert glue | +6.5 | **no** | the NVFP4-dense-quant lever, on the projections | +| 4 | Router GEMM + norms/combine/memcpy glue | +5.4 | mostly yes | llama router = full FP4 GEMM; vLLM fuses topk/align | +| 5 | W4A4 act-quant tax (`quantize_mmq_nvfp4`) | +3.3 | **yes** | vLLM's W4A16 makes this structurally 0 | +| - | **MoE-expert grouped GEMM** | **-1.7** | - | **llama WIN** - native FP4-MMA W4A4 47.3 vs Marlin W4A16 50.0 | + +**The Marlin GEMM is explicitly ruled out as the source of the gap.** Both engines read the same ~22 GB +of ~4-bit expert weights once per step and are LPDDR5x-bandwidth-bound; on that weight stream they tie, +and llama's 2x-rate FP4-MMA edges Marlin's half-rate bf16 MMA. It is **not the projections-vs-Marlin +distinction in the experts, it is the projections in the DENSE path, the recurrence plumbing, and the +runtime/graph** that cost llama the 15%. Not distributed, not the expert GEMM, not routing alone. + +## (2) Can llama MATCH it - and HOW + +**Yes - to within a few percent, and NOT with a Marlin/W4A16 MoE GEMM.** The two biggest *compute* +kernels (the gated-DeltaNet SSM core 70.0 vs 71.1, and the MoE grouped GEMM 47.3 vs 50.0) are **already +llama wins.** The gap is overhead/scheduling/precision-of-the-other-tensors, all of which llama can +attack on its existing W4A4 FP4-MMA expert path. The four levers, in recommended build order: + +| order | build | gain | bit-exact / gate | effort | +|------:|-------|-----:|------------------|--------| +| 1st | **Fuse away the recurrent-state gather `k_get_rows_float`** (update SSM state in-place in the GDN decode path, fold `ssm_conv_update`) | ~+5 ms (~3% of step) - biggest single-kernel win | **bit-exact** (no md5 rebaseline) | medium - CUDA, the GDN decode kernel | +| 2nd | **Fuller CUDA-graph coverage + stream overlap** (extend the 0025 re-graph to the remaining MoE/projection nodes, overlap independent streams) | ~+7 ms combined; 0025 already banked ~+1.9% | **bit-exact** | medium - scheduler, partly done | +| 3rd | **NVFP4-quantize the still-bf16 GDN/attn projections + lm_head** (the same move vLLM makes on its dense path; 4-bit weight read ~4x less BW, kills the 2.9 ms bf16<->f32 convert) | ~+6.5 ms - biggest *bucket* | **bit-changing** (re-baselines md5 gates; precision-UPGRADE, see below) | medium-high - new NVFP4 weight path for non-expert linears | +| 4th | **Fuse the W4A4 act-quant into the preceding RMSNorm/SiLU** (as vLLM fuses act-quant) | +3.3 ms | **bit-exact** | low-medium | + +**Reach:** the three bit-exact levers (1+2+4 ~= +15.3 ms) alone close the gap to ~154.5 ms/step +=> ~830 t/s = **~94% of vLLM, with zero precision change and zero md5 rebaseline.** Adding the +NVFP4-projection lever (3, +6.5) reaches ~148 ms => ~865 t/s = **~96-97% of vLLM**, with the residual +being router/glue and the irreducible cross-stream-overlap that is structural to how ggml schedules +host-launched nodes vs vLLM's single fused graph. Because llama's two heaviest kernels are already +ahead, **parity-or-better is physically reachable** once the plumbing/overhead is removed; vLLM has no +arithmetic advantage on this hardware (its W4A16 is half-rate FP4 - it only wins on overhead and on the +dense-path weight-read BW). + +## (3) The leading lever, in full - and the Marlin question, settled + +**The user's specific hypothesis - "do what vLLM does = a Marlin-style W4A16 grouped MoE GEMM" - is +REJECTED, by measurement and by feasibility.** + +- **It is not where the gap is.** The MoE GEMM is a **-1.7 ms llama win.** A W4A16 Marlin MoE GEMM would + make that bucket SLOWER (half-rate bf16 MMA on the ~27% GEMM bucket), not faster. +- **Its entire intrinsic upside is the ~2% act-quant tax** (W4A16 has no activation quantize). That + +2% ceiling is **smaller than the +1.9% the bit-exact 0025 re-graph already banked**, at vastly higher + effort and with a precision change. And the act-quant tax is independently closeable bit-exactly by + lever 4 (fuse it into the preceding op) without touching the GEMM. +- **The scaffold does not help.** `paged/kernel/w4a16/marlin-w4a16.cu` is dense-only, Q4_0/Q4_K, with no + grouped/MUL_MAT_ID path and no NVFP4 dequant. A real MoE Marlin is effectively a from-scratch port of + `moe_wna16_marlin_gemm` (per-expert M-tiles, block-padded `moe_align` token-sort, stream-K over ragged + segments, NVFP4->bf16 in-kernel dequant). vLLM only reaches the BW floor via cutlass-SM120 TMA + + warp-specialized pipelining; the GB10 occupancy-only route the dense scaffold tried **plateaued at + ~9 TFLOPS / 178 t/s (~5x under MMQ)** and STOPPED at the occupancy wall (XOR-swizzle + deep cp.async + collapse GB10 occupancy). Realistic outcome of an MoE port: **a net REGRESSION** on the 27% GEMM + bucket. Multi-week, high-risk, DGX-only, no `ncu`, for a +2% ceiling. **Do not build it.** + +**Why vLLM runs W4A16 at all:** not because it is better - because sm_121 (consumer Blackwell / GB10) +has no working cutlass FP4 MoE cubins (vLLM whitelists only sm_100/103 datacenter Blackwell for native +FP4 MoE; the engine literally warns it is falling back to "Weight-only FP4 ... Marlin kernel"). On GB10, +W4A16 is HALF the FP4-MMA rate. **llama's native W4A4 FP4-MMA is the higher hardware tier; matching vLLM +does NOT mean copying its W4A16 fallback.** + +**Precision / gate (the brief's key nuance, assessed honestly):** the observation that W4A16 (bf16 acts) +is a strict activation-precision UPGRADE over W4A4 (FP4 acts), with better KL-to-f32, is **correct but +unmonetizable here.** (a) The current W4A4 MoE default is **already bit-exact to the f32 reference** +(test-backend-ops MUL_MAT_ID 806/806, greedy md5 stable on both models) - you get no quality credit for +being more precise than a default that already passes, and the precision-sensitive site is the +gated-DeltaNet SSM *state* (a different op, addressed by the separate 0026 bf16-SSM opt-in), not the MoE +GEMM. (b) W4A16 is **non-bit-exact vs the W4A4 default, so adopting it re-baselines every shipped md5 +gate** - a real cost for a +2% throughput ceiling that is itself likely negative. So the precision angle +does not flip the verdict: it would be a precision upgrade nobody needs, bought with a slower, +occupancy-hostile, gate-rebaselining kernel. The one genuinely precision-positive AND throughput-positive +move that quantizes weights is **lever 3 (NVFP4 projections)** - and that is W4A16 on the DENSE linears +(where it cuts weight-read BW), not on the experts. + +## (4) HONEST VERDICT + recommended build + +**VERDICT: We can essentially match vLLM on MoE decode (~94% bit-exact, ~96-97% with the projection +quant, parity-or-better physically in reach), but NOT by doing "what vLLM does" in the sense the question +implies. A Marlin/W4A16 grouped MoE GEMM is the wrong lever - the MoE GEMM is already a llama win and a +W4A16 port would regress it. The 15% is bf16 dense-projection bandwidth + recurrence-gather plumbing + +graph/overlap overhead + a 2% act-quant tax + router glue. Every piece is closeable on llama's existing +native-FP4 expert path, mostly bit-exactly.** + +**Recommended build (ship order, none of it a new MoE GEMM):** +1. **`k_get_rows` SSM-state-gather fusion** - bit-exact, ~+5 ms, biggest single-kernel win, no rebaseline. **Do first.** +2. **Extend CUDA-graph coverage + stream overlap** beyond 0025 - bit-exact, ~+7 ms combined, partly banked. +3. **Fuse the W4A4 act-quant into the preceding RMSNorm/SiLU** - bit-exact, +3.3 ms, erases the act-quant tax (the only thing W4A16 would have bought) without W4A16. +4. **NVFP4-quantize the bf16 GDN/attn projections + lm_head** - +6.5 ms (biggest bucket), bit-changing + (re-gate md5; precision-UPGRADE, the same NVFP4-dense-quant move vLLM makes). Ship as default after + re-gating, or as an opt-in if the md5 rebaseline is undesirable. + +**Do NOT build:** the W4A16/Marlin grouped MoE GEMM (`paged/kernel/w4a16/` scaffold is dense-only and not +reusable). Neither default nor opt-in: +2% ceiling < the already-banked bit-exact +1.9%, likely a net +regression on the 27% GEMM bucket, multi-week high-risk, and it rebaselines every gate. The dense +`w4a16-marlin` STOP transfers to MoE, and MORE strongly (the tiny-M decode shape is purely BW-bound, so +the FP4-vs-bf16 tier is a wash that the weight-read floor erases - leaving only the half-rate downside). + Assisted-by: Claude:opus-4.8 [Claude Code]