docs(paged): MoE-vs-vLLM DECIDE synthesis - reject W4A16 Marlin, the GEMM is a llama win

Cross-agent synthesis on top of the both-engine nsys decomposition (3b5957157): settle the user's "can we do what vLLM does on MoE?" question with the three converging investigations (groundtruth measurement + vllm-marlin source-read + marlin-port feasibility). Verdict: vLLM's ~15% MoE-decode lead is NOT the Marlin GEMM (that bucket is a -1.7 ms llama WIN: native FP4-MMA W4A4 47.3 vs Marlin W4A16 50.0 at the ragged tiny-M decode shape, both at the LPDDR5x BW floor). The gap is bf16 dense-projection bandwidth (+6.5), recurrence state-gather plumbing (+6.6, led by k_get_rows 5.2), graph/stream-overlap overhead (~+7), W4A4 act-quant tax (+3.3), and router/glue (+5.4). A W4A16/Marlin grouped MoE GEMM is REJECTED (default and opt-in): it would regress the 27% GEMM bucket to half-rate bf16 MMA, re-enter the GB10 occupancy wall the dense scaffold already STOPPED at, and its entire intrinsic upside is the ~2% act-quant tax - smaller than the bit-exact +1.9% the 0025 re-graph already banked, and closeable bit-exactly by fusing the act-quant. Recommended build (none a new MoE GEMM): (1) fuse the k_get_rows SSM-state gather (bit-exact, ~+5, biggest single-kernel win); (2) extend CUDA-graph coverage + stream overlap (bit-exact, ~+7); (3) fuse the W4A4 act-quant into RMSNorm/SiLU (bit-exact, +3.3); (4) NVFP4-quantize the still-bf16 GDN/attn projections + lm_head (bit-changing, +6.5, the same NVFP4-dense-quant move vLLM makes). Bit-exact levers alone reach ~94% of vLLM; with the projection quant ~96-97%, parity-or-better physically in reach since both heaviest kernels (SSM core, MoE GEMM) are already llama wins. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 09:57:14 -04:00 · 2026-06-26 20:14:30 +00:00
parent 3b59571579
commit 6c6a925213
1 changed files with 115 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/MOE_GAP_VS_VLLM.md
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_GAP_VS_VLLM.md
@@ -117,4 +117,119 @@ The MoE grouped GEMM (the brief's hypothesis) is a **-1.7 ms llama win**, so it
   **(c) fuller CUDA-graph coverage + stream overlap** (~+7, bit-exact); **(d) fuse the W4A4 act-quant
   into the preceding op** (+3.3, bit-exact). None of these is a new MoE GEMM.

+---
+
+# FINAL DECISION (cross-agent synthesis) - "can we do what vLLM does on MoE?"
+
+Three agents converged on the same verdict from independent angles: `moe-gap-groundtruth`
+(the measured both-engine nsys decomposition above), `vllm-marlin-study` (source-read of vLLM's
+`moe_wna16_marlin_gemm` / `moe_align_block_size` / `prepare_nvfp4_moe_layer_for_marlin` on the DGX),
+and `marlin-port-feasibility` (read-only assessment of the dense W4A16 scaffold + prior STOP). All
+three agree, and the measurement is the arbiter. Below is the decision the user asked for.
+
+## (1) WHERE the 15% lives - decisive
+
+The gap is **27.8 ms/step (llama at 83.6% of vLLM)** and it is **NOT one kernel - it is a sum of small
+deltas, and the MoE grouped GEMM is on llama's side of the ledger.** Ranked:
+
+| rank | lever | Δ ms/step | bit-exact? | this is... |
+|-----:|-------|----------:|:----------:|------------|
+| 1 | Graph coverage + cross-stream overlap | ~+7.0 | **yes** | scheduler/runtime (idle +3.8, overlap +3.2) |
+| 2 | Recurrence state-gather/conv plumbing (`k_get_rows_float` 5.2 + conv) | +6.6 | **yes** | llama-only kernels; vLLM updates state in-place |
+| 3 | Dense GDN/attn projections + lm_head (bf16 vs NVFP4) + convert glue | +6.5 | **no** | the NVFP4-dense-quant lever, on the projections |
+| 4 | Router GEMM + norms/combine/memcpy glue | +5.4 | mostly yes | llama router = full FP4 GEMM; vLLM fuses topk/align |
+| 5 | W4A4 act-quant tax (`quantize_mmq_nvfp4`) | +3.3 | **yes** | vLLM's W4A16 makes this structurally 0 |
+| - | **MoE-expert grouped GEMM** | **-1.7** | - | **llama WIN** - native FP4-MMA W4A4 47.3 vs Marlin W4A16 50.0 |
+
+**The Marlin GEMM is explicitly ruled out as the source of the gap.** Both engines read the same ~22 GB
+of ~4-bit expert weights once per step and are LPDDR5x-bandwidth-bound; on that weight stream they tie,
+and llama's 2x-rate FP4-MMA edges Marlin's half-rate bf16 MMA. It is **not the projections-vs-Marlin
+distinction in the experts, it is the projections in the DENSE path, the recurrence plumbing, and the
+runtime/graph** that cost llama the 15%. Not distributed, not the expert GEMM, not routing alone.
+
+## (2) Can llama MATCH it - and HOW
+
+**Yes - to within a few percent, and NOT with a Marlin/W4A16 MoE GEMM.** The two biggest *compute*
+kernels (the gated-DeltaNet SSM core 70.0 vs 71.1, and the MoE grouped GEMM 47.3 vs 50.0) are **already
+llama wins.** The gap is overhead/scheduling/precision-of-the-other-tensors, all of which llama can
+attack on its existing W4A4 FP4-MMA expert path. The four levers, in recommended build order:
+
+| order | build | gain | bit-exact / gate | effort |
+|------:|-------|-----:|------------------|--------|
+| 1st | **Fuse away the recurrent-state gather `k_get_rows_float`** (update SSM state in-place in the GDN decode path, fold `ssm_conv_update`) | ~+5 ms (~3% of step) - biggest single-kernel win | **bit-exact** (no md5 rebaseline) | medium - CUDA, the GDN decode kernel |
+| 2nd | **Fuller CUDA-graph coverage + stream overlap** (extend the 0025 re-graph to the remaining MoE/projection nodes, overlap independent streams) | ~+7 ms combined; 0025 already banked ~+1.9% | **bit-exact** | medium - scheduler, partly done |
+| 3rd | **NVFP4-quantize the still-bf16 GDN/attn projections + lm_head** (the same move vLLM makes on its dense path; 4-bit weight read ~4x less BW, kills the 2.9 ms bf16<->f32 convert) | ~+6.5 ms - biggest *bucket* | **bit-changing** (re-baselines md5 gates; precision-UPGRADE, see below) | medium-high - new NVFP4 weight path for non-expert linears |
+| 4th | **Fuse the W4A4 act-quant into the preceding RMSNorm/SiLU** (as vLLM fuses act-quant) | +3.3 ms | **bit-exact** | low-medium |
+
+**Reach:** the three bit-exact levers (1+2+4 ~= +15.3 ms) alone close the gap to ~154.5 ms/step
+=> ~830 t/s = **~94% of vLLM, with zero precision change and zero md5 rebaseline.** Adding the
+NVFP4-projection lever (3, +6.5) reaches ~148 ms => ~865 t/s = **~96-97% of vLLM**, with the residual
+being router/glue and the irreducible cross-stream-overlap that is structural to how ggml schedules
+host-launched nodes vs vLLM's single fused graph. Because llama's two heaviest kernels are already
+ahead, **parity-or-better is physically reachable** once the plumbing/overhead is removed; vLLM has no
+arithmetic advantage on this hardware (its W4A16 is half-rate FP4 - it only wins on overhead and on the
+dense-path weight-read BW).
+
+## (3) The leading lever, in full - and the Marlin question, settled
+
+**The user's specific hypothesis - "do what vLLM does = a Marlin-style W4A16 grouped MoE GEMM" - is
+REJECTED, by measurement and by feasibility.**
+
+- **It is not where the gap is.** The MoE GEMM is a **-1.7 ms llama win.** A W4A16 Marlin MoE GEMM would
+  make that bucket SLOWER (half-rate bf16 MMA on the ~27% GEMM bucket), not faster.
+- **Its entire intrinsic upside is the ~2% act-quant tax** (W4A16 has no activation quantize). That
+  +2% ceiling is **smaller than the +1.9% the bit-exact 0025 re-graph already banked**, at vastly higher
+  effort and with a precision change. And the act-quant tax is independently closeable bit-exactly by
+  lever 4 (fuse it into the preceding op) without touching the GEMM.
+- **The scaffold does not help.** `paged/kernel/w4a16/marlin-w4a16.cu` is dense-only, Q4_0/Q4_K, with no
+  grouped/MUL_MAT_ID path and no NVFP4 dequant. A real MoE Marlin is effectively a from-scratch port of
+  `moe_wna16_marlin_gemm` (per-expert M-tiles, block-padded `moe_align` token-sort, stream-K over ragged
+  segments, NVFP4->bf16 in-kernel dequant). vLLM only reaches the BW floor via cutlass-SM120 TMA +
+  warp-specialized pipelining; the GB10 occupancy-only route the dense scaffold tried **plateaued at
+  ~9 TFLOPS / 178 t/s (~5x under MMQ)** and STOPPED at the occupancy wall (XOR-swizzle + deep cp.async
+  collapse GB10 occupancy). Realistic outcome of an MoE port: **a net REGRESSION** on the 27% GEMM
+  bucket. Multi-week, high-risk, DGX-only, no `ncu`, for a +2% ceiling. **Do not build it.**
+
+**Why vLLM runs W4A16 at all:** not because it is better - because sm_121 (consumer Blackwell / GB10)
+has no working cutlass FP4 MoE cubins (vLLM whitelists only sm_100/103 datacenter Blackwell for native
+FP4 MoE; the engine literally warns it is falling back to "Weight-only FP4 ... Marlin kernel"). On GB10,
+W4A16 is HALF the FP4-MMA rate. **llama's native W4A4 FP4-MMA is the higher hardware tier; matching vLLM
+does NOT mean copying its W4A16 fallback.**
+
+**Precision / gate (the brief's key nuance, assessed honestly):** the observation that W4A16 (bf16 acts)
+is a strict activation-precision UPGRADE over W4A4 (FP4 acts), with better KL-to-f32, is **correct but
+unmonetizable here.** (a) The current W4A4 MoE default is **already bit-exact to the f32 reference**
+(test-backend-ops MUL_MAT_ID 806/806, greedy md5 stable on both models) - you get no quality credit for
+being more precise than a default that already passes, and the precision-sensitive site is the
+gated-DeltaNet SSM *state* (a different op, addressed by the separate 0026 bf16-SSM opt-in), not the MoE
+GEMM. (b) W4A16 is **non-bit-exact vs the W4A4 default, so adopting it re-baselines every shipped md5
+gate** - a real cost for a +2% throughput ceiling that is itself likely negative. So the precision angle
+does not flip the verdict: it would be a precision upgrade nobody needs, bought with a slower,
+occupancy-hostile, gate-rebaselining kernel. The one genuinely precision-positive AND throughput-positive
+move that quantizes weights is **lever 3 (NVFP4 projections)** - and that is W4A16 on the DENSE linears
+(where it cuts weight-read BW), not on the experts.
+
+## (4) HONEST VERDICT + recommended build
+
+**VERDICT: We can essentially match vLLM on MoE decode (~94% bit-exact, ~96-97% with the projection
+quant, parity-or-better physically in reach), but NOT by doing "what vLLM does" in the sense the question
+implies. A Marlin/W4A16 grouped MoE GEMM is the wrong lever - the MoE GEMM is already a llama win and a
+W4A16 port would regress it. The 15% is bf16 dense-projection bandwidth + recurrence-gather plumbing +
+graph/overlap overhead + a 2% act-quant tax + router glue. Every piece is closeable on llama's existing
+native-FP4 expert path, mostly bit-exactly.**
+
+**Recommended build (ship order, none of it a new MoE GEMM):**
+1. **`k_get_rows` SSM-state-gather fusion** - bit-exact, ~+5 ms, biggest single-kernel win, no rebaseline. **Do first.**
+2. **Extend CUDA-graph coverage + stream overlap** beyond 0025 - bit-exact, ~+7 ms combined, partly banked.
+3. **Fuse the W4A4 act-quant into the preceding RMSNorm/SiLU** - bit-exact, +3.3 ms, erases the act-quant tax (the only thing W4A16 would have bought) without W4A16.
+4. **NVFP4-quantize the bf16 GDN/attn projections + lm_head** - +6.5 ms (biggest bucket), bit-changing
+   (re-gate md5; precision-UPGRADE, the same NVFP4-dense-quant move vLLM makes). Ship as default after
+   re-gating, or as an opt-in if the md5 rebaseline is undesirable.
+
+**Do NOT build:** the W4A16/Marlin grouped MoE GEMM (`paged/kernel/w4a16/` scaffold is dense-only and not
+reusable). Neither default nor opt-in: +2% ceiling < the already-banked bit-exact +1.9%, likely a net
+regression on the 27% GEMM bucket, multi-week high-risk, and it rebaselines every gate. The dense
+`w4a16-marlin` STOP transfers to MoE, and MORE strongly (the tiny-M decode shape is purely BW-bound, so
+the FP4-vs-bf16 tier is a wash that the weight-read floor erases - leaving only the half-rate downside).
+
 Assisted-by: Claude:opus-4.8 [Claude Code]