diff --git a/backend/cpp/llama-cpp/patches/paged/MOE_GAP_PROGRESS.md b/backend/cpp/llama-cpp/patches/paged/MOE_GAP_PROGRESS.md new file mode 100644 index 000000000..81ba9142a --- /dev/null +++ b/backend/cpp/llama-cpp/patches/paged/MOE_GAP_PROGRESS.md @@ -0,0 +1,21 @@ +# MOE_GAP_PROGRESS.md - moe-gap-groundtruth GPU agent checkpoint + +Status: **DONE.** Both-engine MoE decode decomposition complete. Findings in `MOE_GAP_VS_VLLM.md`. + +## Runs (DGX GB10 sm_121, GPU free, foreground) +- llama: `build-cuda` 2f4f5ab (0025), `llama-batched-bench -npp128 -ntg128 -npl128 -c32768 -fa on`, + `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`. S_TG=752.3 t/s, step 169.8 ms, busy 97.5%. + Artifacts on DGX: `~/llama-paged-dev/moe_gap_llama.{nsys-rep,trace.csv}`. +- vLLM 0.23.0 graphs-ON (FULL_AND_PIECEWISE, the 882-ref config): `~/bench/moe_gap_vllm.py` under + `nsys --capture-range=cudaProfilerApi`. step 142.0 ms, busy 99.7%. + Artifacts on DGX: `~/bench/moe_gap_vllm.{nsys-rep,trace.csv}`, script `~/bench/moe_gap_vllm.py`. +- Extractor: `~/bench/decode_decomp2.py` (dual-engine, steps = GDN-kernel-count / 30; cross-checked vs + flash/reshape_cache = 10x and vs throughput). Grouped-MoE GEMM isolated by per-call duration (LONG/SHORT). + +## Result (1 line) +Gap = 27.8 ms/step (llama 83.6% of vLLM). **MoE grouped GEMM is a llama WIN** (native FP4-MMA W4A4 47.3 ms +vs Marlin W4A16 50.0 ms). The 15% is bf16-projections+convert (+6.5), recurrence state-gather plumbing +(+6.6, led by k_get_rows 5.2 ms), graph/overlap (+7.0), W4A4 act-quant tax (+3.3), router/glue (+5.4). +Marlin is NOT the lever; do not build a W4A16 MoE GEMM. + +Assisted-by: Claude:opus-4.8 [Claude Code] diff --git a/backend/cpp/llama-cpp/patches/paged/MOE_GAP_VS_VLLM.md b/backend/cpp/llama-cpp/patches/paged/MOE_GAP_VS_VLLM.md new file mode 100644 index 000000000..f9a3bf1bc --- /dev/null +++ b/backend/cpp/llama-cpp/patches/paged/MOE_GAP_VS_VLLM.md @@ -0,0 +1,120 @@ +# MOE_GAP_VS_VLLM.md - ground-truth both-engine MoE decode decomposition (where vLLM's ~15% lives) + +THE GPU AGENT (label `moe-gap-groundtruth`), DGX GB10 (sm_121). First **side-by-side, both-engine, +per-kernel ms/step** decomposition of the MoE decode gap. All prior B work decomposed llama ONLY; this +profiles vLLM's decode step too and computes the per-bucket `llama - vLLM` delta to pinpoint the gap. + +Model `q36-35b-a3b-nvfp4` (40 layers: 30 GDN linear-attn + 10 full-attn, 256 experts top-8, vocab 248320). +Both engines profiled at **batch 128 decode** with `nsys --cuda-graph-trace=node`, steady-decode window, +per-step normalized by GDN-kernel-count / 30 (cross-checked vs flash/reshape_cache counts and throughput). + +- **llama**: `build-cuda` tip `2f4f5ab` (patch 0025), `llama-batched-bench -npp 128 -ntg 128 -npl 128 + -c 32768 -fa on`, `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1` (the re-graph ON = the 752 t/s ship point). + Measured **S_TG = 752.3 t/s** => **step = 169.8 ms**, GPU-busy 97.5% (idle 2.5% = 4.2 ms/step). +- **vLLM 0.23.0**: `q36-35b-a3b-nvfp4-vllm`, **CUDA graphs ON** (`cudagraph_mode=FULL_AND_PIECEWISE`, + the 882-reference config, NOT enforce_eager), MARLIN NvFp4 MoE, 128 seqs x 128-tok prompt x 128 gen. + Measured **step = 142.0 ms** (= 901 t/s-equiv), GPU-busy 99.7% (idle 0.3% = 0.4 ms/step). +- Gap reproduced: **169.8 - 142.0 = 27.8 ms/step** (llama 83.6% of vLLM here; matches the ~85% server number). + +## THE HEADLINE: the MoE grouped GEMM is NOT vLLM's advantage - it is a llama WIN + +Grouped MoE-expert GEMM, isolated by per-call duration (LONG calls = the per-expert grouped GEMM): + +| grouped MoE-expert GEMM | ms/step | what | +|-------------------------|--------:|------| +| **llama** `mul_mat_q` (+stream-k fixup + gather) | **48.3** | native Blackwell FP4-MMA **W4A4** | +| **vLLM** `marlin_moe_wna16::Marlin` | **50.0** | **W4A16** (FP4 weights dequant-in-kernel -> bf16 MMA) | + +**llama's native-FP4 grouped GEMM is ~1.7 ms/step FASTER than vLLM's Marlin W4A16 at the ragged +tiny-M (~4 rows/expert) decode shape** (pure GEMM core 47.3 vs 50.0). Both read the same ~4-bit weight +bytes and are bandwidth-bound, so they tie to within a few %, and llama's 2x-rate FP4-MMA edges it. +**=> Marlin is NOT faster here; a Marlin-style W4A16 MoE GEMM in llama would make the MoE GEMM SLOWER.** +This directly answers the brief's load-bearing question #1/#2 and extends the prior `w4a16-marlin` DENSE +conclusion ("the win was NVFP4-dense-quant, not the Marlin kernel") to MoE: **the MoE GEMM kernel is not +the lever; llama already beats Marlin there.** + +## Side-by-side per-step decomposition (ms/step, kernel-time attribution) + +| bucket | llama ms | vLLM ms | Δ llama-vLLM | note | +|--------|---------:|--------:|-------------:|------| +| **Recurrence / SSM** | **79.3** | **72.7** | **+6.6** | core kernel is a llama WIN (70.0 vs 71.1); the gap is llama's state-gather/conv plumbing | +| **MoE-expert grouped GEMM** | 48.3 | 50.0 | **-1.7** | **llama FASTER** (native FP4-MMA W4A4 vs Marlin W4A16) | +| **Dense projections (+glue)** | **20.3** | **13.8** | **+6.5** | llama runs GDN/attn projections in BF16 cublas; vLLM runs them as compact NVFP4-Marlin; +2.9 ms is llama's bf16<->f32 `convert_unary` glue vLLM never pays | +| **Norms / glue / memcpy** | 9.6 | 6.0 | +3.6 | llama `k_bin_bcast` (expert-combine+residual) 4.3 + memcpy 2.4 heavier | +| **Act-quant (W4A4 tax)** | 3.3 | 0.0 | **+3.3** | `quantize_mmq_nvfp4`; vLLM W4A16 keeps acts bf16 => structurally ZERO | +| **Router / align** | 2.4 | 0.5 | +1.9 | llama computes router via a full FP4 GEMM (1.6) + argsort/scatter; vLLM fuses topk/align | +| **Attention (full-attn)** | 2.8 | 2.6 | +0.2 | parity | +| kernel-time subtotal | 166.1 | 145.7 | +20.4 | | +| **GPU idle (host bubble)** | 4.2 | 0.4 | **+3.8** | graph coverage: llama partially-graphed (0025) vs vLLM FULL_AND_PIECEWISE | +| cross-stream overlap (unionf32 PROJ-GLUE <-- two 1.2 reduce_kernel GLUE + 2.8 flash_attn_tile ATTN (5.2+ 1.0 cutlass::device (fp8 lin) PROJ + 2.4 MEMCPY-Device (SSM state) GLUE 2.9 = 0.8 nvjet 32x64 PROJ + 1.6 mul_mat_q router (M=128) ROUTER 8 ms 0.4 act_and_mul (SwiGLU) GLUE + 1.5 rms_norm_f32 GLUE pure 0.2 topkGating / moe_align ROUTE + ... llama 0.1 reshape_and_cache_flash ATTN + tax) +``` + +## WHERE THE 27.8 ms ACTUALLY IS (ranked) - and it is NOT the Marlin GEMM + +1. **Dense projections + bf16<->f32 glue: +6.5 ms.** llama keeps the GDN/attn linear projections (and + the lm_head) in **BF16** (cublas `nvjet`/`cutlass`, full-precision weight reads) and pays a 2.9 ms + `convert_unary` bf16<->f32 tax around them; vLLM runs the same projections as **compact NVFP4-Marlin + W4A16** (4-bit weight read, ~4x less BW) and stays bf16 end-to-end (no convert). This is the + **`NVFP4-dense-quant` lever the prior `w4a16-marlin` project already identified - applied to the + still-bf16 projections**, not the MoE GEMM. +2. **Recurrence state-gather/conv plumbing: +6.6 ms.** The recurrence CORE kernel is a **llama win** + (gated_delta_net 70.0 vs vLLM fused_recurrent 71.1, confirming "past vLLM on BW efficiency"). The gap + is entirely the surrounding plumbing: **`k_get_rows_float` 5.2 ms (the recurrent-state gather)** + + `ssm_conv_update` 3.4 vs vLLM's single `causal_conv1d_update` 1.6. vLLM has **no gather** - its + recurrent state is updated in-place inside the fused decode kernel. `k_get_rows` is the single biggest + llama-specific kernel vLLM has no equivalent of. +3. **Graph coverage + stream overlap: ~+7.0 ms combined** (idle +3.8, cross-stream overlap ~+3.2). vLLM + FULL_AND_PIECEWISE is 99.7% busy with more concurrent kernels; llama (partially graphed post-0025) is + 97.5% busy with thinner overlap. +4. **W4A4 act-quant tax: +3.3 ms.** `quantize_mmq_nvfp4`; vLLM's W4A16 choice makes this structurally 0. + Fusing the quant into the preceding op (as vLLM fuses act_quant into RMSNorm/SiLU) would erase it. +5. **Router GEMM + norms/glue: +5.4 ms.** llama computes router logits via a full FP4 GEMM (1.6) and has + heavier `k_bin_bcast` combine/residual + memcpy; vLLM fuses routing into tiny topk/align kernels. + +## THE SINGLE BIGGEST vLLM-MoE ADVANTAGE + +**Not the Marlin GEMM.** It is a near-tie between two ~6.5 ms buckets, both bf16-precision-related: +- **Dense projections (+6.5 ms)** - vLLM runs the GDN/attn projections + lm_head as NVFP4-Marlin while + llama runs them BF16 + a 2.9 ms convert tax. Single biggest *bucket* delta. +- **Recurrent-state gather (+5.2 ms, kernel `k_get_rows_float`)** - the single biggest *kernel* vLLM + avoids entirely (in-place fused state vs llama's separate gather). Plus +1.8 ms more REC plumbing. + +The MoE grouped GEMM (the brief's hypothesis) is a **-1.7 ms llama win**, so it is explicitly ruled out. + +## ANSWERS TO THE BRIEF + +1. **WHERE is vLLM's 15%?** Spread across bf16-projection BW (+6.5) + recurrence state-gather plumbing + (+6.6) + graph/overlap (+7.0) + act-quant tax (+3.3) + router/glue (+5.4). **NOT the MoE GEMM.** +2. **Is Marlin faster at tiny-M decode?** **No.** llama native FP4-MMA W4A4 = 47.3 ms vs Marlin W4A16 = + 50.0 ms. Marlin is ~5% slower here; both are at the LPDDR5x BW floor. +3. **Should llama implement a Marlin-style W4A16 MoE GEMM?** **No** - it would slow the MoE GEMM and is + not where the gap lives. The `w4a16-marlin` DENSE verdict ("NVFP4-dense-quant, not the Marlin kernel") + carries to MoE. The real, ordered levers are: **(a) NVFP4-quantize the still-bf16 GDN/attn projections + + lm_head** (close ~+6.5, the largest, bit-changing but the same class of move vLLM makes); **(b) fuse + away the recurrent-state gather `k_get_rows`** (~+5, bit-exact, the biggest single-kernel win); + **(c) fuller CUDA-graph coverage + stream overlap** (~+7, bit-exact); **(d) fuse the W4A4 act-quant + into the preceding op** (+3.3, bit-exact). None of these is a new MoE GEMM. + +Assisted-by: Claude:opus-4.8 [Claude Code]