docs(paged): scope ragged MoE dispatch phase

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 00:26:01 +00:00
parent b6885aa446
commit ef14748f06
3 changed files with 390 additions and 0 deletions

@@ -690,3 +690,27 @@ Result:
but it does not improve the bounded serving workload. Keep patch `0052` as a
useful regression gate; do not retry this exact fan-in-only fusion unless a
fresh profile shows the weighted/add fan-in as a material bucket.
## Phase 8 Ragged MoE Dispatch Scope
Plan: `docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md`.
The next candidate is profile-gated, so profiling comes before any source work:
- Target a fused routed-expert `MUL_MAT_ID` dispatch path for ragged serving
decode, not another post-down fan-in fusion.
- First decompose live llama.cpp and vLLM MoE serving at `n=128`, `ptok=128`,
`gen=64` with Nsight and `/home/mudler/bench/bucket.py`.
- Promote only if `mm_ids_helper`, activation quant/gather, grouped MMQ, or
related MoE dispatch rows are material and not hidden by GDN or FA.
- Keep the backend-sampling/logit-bias upload cache as a non-default follow-up;
it requires `--backend-sampling` and request `backend_sampling: true` with
non-empty `logit_bias` or `ignore_eos`.
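The promotion rule above can be sketched as a small bucketing pass over per-kernel timings. This is a minimal illustration, not the contents of `bucket.py`: the substring-to-bucket table, the `min_share` threshold, and the reading of "not hidden by GDN or FA" as "MoE dispatch share is at least as large as either" are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical substring -> bucket table; real kernel names in an Nsight
# export will differ and must be adjusted.
BUCKETS = {
    "mm_ids_helper": "moe_dispatch",
    "quantize": "moe_dispatch",
    "mul_mat_q": "moe_dispatch",
    "gated_delta": "gdn",
    "flash_attn": "fa",
}


def bucket_rows(rows):
    """Sum kernel time (ms) per bucket; unmatched kernels go to 'other'."""
    totals = defaultdict(float)
    for name, ms in rows:
        bucket = next((b for key, b in BUCKETS.items() if key in name), "other")
        totals[bucket] += ms
    return dict(totals)


def shares(totals):
    """Convert per-bucket totals into fractions of the whole profile."""
    total = sum(totals.values())
    if total <= 0:
        return {}
    return {bucket: ms / total for bucket, ms in totals.items()}


def should_promote(rows, min_share=0.10):
    """Assumed rule: MoE dispatch is material and not smaller than GDN or FA."""
    s = shares(bucket_rows(rows))
    moe = s.get("moe_dispatch", 0.0)
    return moe >= min_share and moe >= max(s.get("gdn", 0.0), s.get("fa", 0.0))
```

Feeding it `(kernel_name, milliseconds)` pairs from both the llama.cpp and the vLLM profile gives comparable bucket shares; the threshold should come from the measured profiles, not from this sketch.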
Required promotion gates remain:
- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense md5 `5951a5b4d624ce891e22ab5fca9bc439`.
- `MUL_MAT_ID`: `806/806` on CUDA0.
- Any fused dispatch prototype must start default-off behind
`LLAMA_MOE_FUSED_DISPATCH=1`.
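A gate check along these lines could be scripted as follows. This is a sketch under stated assumptions: the source does not say which bytes the md5 values are computed over, so `moe_output` and `dense_output` stand in for whatever reference output the project hashes, and the helper names are hypothetical. Only the hashes, the `806/806` count, and the `LLAMA_MOE_FUSED_DISPATCH` variable come from the text above.

```python
import hashlib
import os

# Values copied from the promotion gates listed above.
EXPECTED_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}
EXPECTED_MUL_MAT_ID_CASES = 806


def md5_hex(data: bytes) -> str:
    """Hex md5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def fused_dispatch_enabled(env=None) -> bool:
    """Default-off: the prototype runs only when the variable is exactly '1'."""
    env = os.environ if env is None else env
    return env.get("LLAMA_MOE_FUSED_DISPATCH") == "1"


def gates_pass(moe_output: bytes, dense_output: bytes, passed: int, total: int) -> bool:
    """True only if both hashes match and every MUL_MAT_ID case passed."""
    return (
        md5_hex(moe_output) == EXPECTED_MD5["moe"]
        and md5_hex(dense_output) == EXPECTED_MD5["dense"]
        and passed == total == EXPECTED_MUL_MAT_ID_CASES
    )
```

Treating any value other than `1` as off keeps an unset or mistyped variable on the existing dispatch path.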

@@ -66,6 +66,12 @@ was rejected. The remaining plausible lever is a larger fused-MoE
prologue/epilogue that also removes gather/scatter or moves work into the GEMM
kernel, not another standalone fan-in fusion.
Phase 8 scopes that remaining lever as profile-gated ragged serving dispatch:
first measure llama.cpp and vLLM at `n=128`, `ptok=128`, `gen=64` and bucket
`mm_ids_helper`, activation quant/gather, grouped MMQ, and scatter/writeback. Do
not implement a fused routed-expert `MUL_MAT_ID` dispatch path unless those rows
are material in live serving and not dominated by GDN or FA.
### Newly identified lever 2 - the W4A4 activation-quant pass (a vLLM asymmetry, not just a kernel-speed gap)
Every NVFP4 GEMM (MMQ today, and the new 0034 FP4-MMA) **quantizes activations to FP4 with e4m3 block scales (amax/6 + code search) before the matmul**, in a distinct kernel whose cost is proportional to M. vLLM on **sm_121 falls back to W4A16-Marlin** (TENSORCORE_GDN_SCOPE confirms this: no tcgen05/cutlass-FP4 on GB10), i.e. **f16 activations and no activation-quant pass**. This pass (~3-6% of prefill) is therefore a structural cost that vLLM avoids, and it explains part of why even a peak FP4-MMA GEMM will not fully reach vLLM's prefill throughput. The README's "act-quant FLAT" and "W4A16 rejected" verdicts are **decode/BW-bound findings**; in compute-bound prefill the trade-off is different and unaudited. **Lever: measure this quant bucket as its own nsys row; consider fusing the activation quant into the GEMM prologue (cp.async + in-register quant) so it is not a separate global-memory pass.**
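To make the "amax/6 + code search" step concrete, here is a minimal per-block sketch in plain Python. It assumes the usual NVFP4 layout of 4-bit e2m1 codes (largest magnitude 6) with one scale per block. The real format stores that scale as e4m3, but here it is kept as an unrounded Python float, and the function names are hypothetical. It illustrates the arithmetic only, not the CUDA kernel.

```python
# Magnitudes representable by an e2m1 code (sign bit handled separately).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block):
    """Return (scale, codes) for one block of activations.

    scale = amax / 6 maps the largest magnitude onto the largest e2m1 value;
    each element then takes the nearest representable magnitude (code search).
    Bit 3 of a code is the sign, bits 0-2 index E2M1_VALUES.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0.0, [0] * len(block)
    scale = amax / 6.0
    codes = []
    for x in block:
        mag = abs(x) / scale
        idx = min(range(len(E2M1_VALUES)), key=lambda i: abs(E2M1_VALUES[i] - mag))
        codes.append(idx | (8 if x < 0 else 0))
    return scale, codes


def dequantize_block(scale, codes):
    """Inverse mapping, useful for checking quantization error."""
    return [(-1.0 if c & 8 else 1.0) * E2M1_VALUES[c & 7] * scale for c in codes]
```

Every activation element goes through this reduction and search before the GEMM can start, which is why the pass scales with M and why folding it into the GEMM prologue would remove a full read and write of the activation tensor.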