docs(paged): scope ragged MoE dispatch phase

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 00:26:01 +00:00
parent b6885aa446
commit ef14748f06
3 changed files with 390 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -690,3 +690,27 @@ Result:
  but it does not improve the bounded serving workload. Keep patch `0052` as a
  useful regression gate; do not retry this exact fan-in-only fusion unless a
  fresh profile shows the weighted/add fan-in as a material bucket.
+
+## Phase 8 Ragged MoE Dispatch Scope
+
+Plan: `docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md`.
+
+The next candidate is profile-gated before source work:
+
+- Target a fused routed-expert `MUL_MAT_ID` dispatch path for ragged serving
+  decode, not another post-down fan-in fusion.
+- First decompose live llama.cpp and vLLM MoE serving at `n=128`, `ptok=128`,
+  `gen=64` with Nsight and `/home/mudler/bench/bucket.py`.
+- Promote only if `mm_ids_helper`, activation quant/gather, grouped MMQ, or
+  related MoE dispatch rows are material and not hidden by GDN or FA.
+- Keep the backend-sampling/logit-bias upload cache as a non-default follow-up;
+  it requires the `--backend-sampling` flag and a request that sets
+  `backend_sampling: true` with non-empty `logit_bias` or `ignore_eos`.
+
+Required promotion gates remain:
+
+- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`.
+- Dense md5 `5951a5b4d624ce891e22ab5fca9bc439`.
+- `MUL_MAT_ID`: `806/806` on CUDA0.
+- Any fused dispatch prototype must start default-off behind
+  `LLAMA_MOE_FUSED_DISPATCH=1`.
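
The profile gate and the default-off flag scoped above can be sketched in Python. This is only an illustration, not the contents of `bucket.py`: the kernel-name substrings (apart from `mm_ids_helper`), the two share thresholds, and the function names are all hypothetical, and "material" / "not hidden by GDN or FA" are interpreted here as simple time-share cutoffs.

```python
import os

# Hypothetical name fragments; only `mm_ids_helper` appears in the notes.
MOE_DISPATCH_PATTERNS = ("mm_ids_helper", "quantize", "gather", "mmq")
OTHER_PATTERNS = ("gdn", "flash_attn")  # assumed fragments for GDN / FA kernels


def bucket_share(rows, patterns):
    """Fraction of total kernel time in kernels whose name matches a pattern.

    `rows` is a list of (kernel_name, time) pairs in any consistent unit.
    """
    total = sum(t for _, t in rows)
    if total == 0:
        return 0.0
    hit = sum(t for name, t in rows if any(p in name.lower() for p in patterns))
    return hit / total


def should_promote(rows, min_moe_share=0.10, max_other_share=0.80):
    """Assumed gate: MoE dispatch is material and GDN/FA does not dominate."""
    moe = bucket_share(rows, MOE_DISPATCH_PATTERNS)
    other = bucket_share(rows, OTHER_PATTERNS)
    return moe >= min_moe_share and other <= max_other_share


def fused_dispatch_enabled(env=None):
    """Default-off: the prototype runs only when the variable is exactly "1"."""
    env = os.environ if env is None else env
    return env.get("LLAMA_MOE_FUSED_DISPATCH") == "1"
```

Calling `should_promote` on the per-kernel rows of one profile yields a single go/no-go answer, and the cutoffs can be tightened once a real profile shows what a material bucket looks like.
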
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -66,6 +66,12 @@ was rejected. The remaining plausible lever is a larger fused-MoE
 prologue/epilogue that also removes gather/scatter or moves work into the GEMM
 kernel, not another standalone fan-in fusion.

+Phase 8 scopes that remaining lever as profile-gated ragged serving dispatch:
+first measure llama.cpp and vLLM at `n=128`, `ptok=128`, `gen=64` and bucket
+`mm_ids_helper`, activation quant/gather, grouped MMQ, and scatter/writeback. Do
+not implement a fused routed-expert `MUL_MAT_ID` dispatch path unless those rows
+are material in live serving and not dominated by GDN or FA.
+
 ### Newly-identified lever 2 - the W4A4 activation-quant pass (a vLLM-asymmetry, not just a kernel-speed gap)
 Every NVFP4 GEMM (MMQ today, and the new 0034 FP4-MMA) **quantizes activations to e4m3 (amax/6 + code search) before the matmul** - a distinct, M-proportional kernel. vLLM on **sm_121 falls back to W4A16-Marlin** (the TENSORCORE_GDN_SCOPE confirms this: no tcgen05/cutlass-FP4 on GB10), i.e. **f16 activations, zero activation-quant**. So this pass (~3-6% of prefill) is a structural cost vLLM avoids, and it explains part of why even a peak FP4-MMA GEMM will not fully reach vLLM's prefill. The README's "act-quant FLAT" and "W4A16 rejected" verdicts are **decode/BW-bound findings**; in compute-bound prefill the trade is different and unaudited. **Lever: measure this quant bucket as its own nsys row; consider fusing the activation-quant into the GEMM prologue (cp.async + in-register quant) so it is not a separate global-memory pass.**
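
To make the "amax/6 + code search" step concrete, here is a minimal pure-Python sketch of a block-wise activation quantizer. It is not the CUDA kernel. It assumes 4-bit E2M1 element codes (largest magnitude 6, hence the division by 6), a block size of 16, and a scale kept as a plain float where the real pass would round it to e4m3; the function names are hypothetical.

```python
# Representable E2M1 (FP4) magnitudes; the sign is stored separately.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block):
    """Quantize one block: scale = amax / 6, then nearest-code search."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0  # the real kernel would store this rounded to e4m3
    if scale == 0.0:
        return 0.0, [0.0] * len(block)
    codes = []
    for x in block:
        target = abs(x) / scale
        # Code search: pick the representable magnitude closest to target.
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(target - m))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def quantize_activations(rows, block_size=16):
    """Quantize an M x K activation matrix row by row, block by block.

    The work is proportional to M (the token count), which is why this
    pass shows up as its own bucket in prefill.
    """
    out = []
    for row in rows:
        blocks = [row[i:i + block_size] for i in range(0, len(row), block_size)]
        out.append([quantize_block(b) for b in blocks])
    return out


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a (scale, codes) pair."""
    return [scale * c for c in codes]
```

Each row costs one amax reduction plus one code search per element, and all of it happens before the GEMM starts. Fusing the pass into the GEMM prologue would remove the separate global-memory round trip, not the arithmetic.
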