docs(paged): record ragged MoE profile gate

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 00:35:21 +00:00
parent ef14748f06
commit 89ef3a4020
2 changed files with 76 additions and 4 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -714,3 +714,35 @@ Required promotion gates remain:
 - `MUL_MAT_ID`: `806/806` on CUDA0.
 - Any fused dispatch prototype must start default-off behind
  `LLAMA_MOE_FUSED_DISPATCH=1`.
+
+Profile-gate result:
+
+- Clean llama.cpp artifact:
+  `/home/mudler/bench/phase8_ragged_moe_dispatch/llama_n128_clean/`.
+- vLLM artifact:
+  `/home/mudler/bench/phase8_ragged_moe_dispatch/vllm_n128/`.
+- A stale first llama profile under `llama_n128/` is intentionally ignored:
+  its binary predates the clean-source rebuild and still contained the
+  rejected weighted-combine kernel.
+
+Throughput:
+
+| Engine | decode tok/s/seq | decode agg tok/s | prefill tok/s |
+|--------|------------------|------------------|---------------|
+| llama.cpp | 2.70 | 412.1 | 1368.3 |
+| vLLM | 7.02 | 1036.6 | 5277.7 |
+
+llama.cpp bucket highlights from the clean profile:
+
+- GDN: `4680.27 ms`, `38.12%`.
+- `mmq_nvfp4`: `2745.11 ms`, `22.36%`.
+- `act_quant`: `441.42 ms`, `3.60%`.
+- MoE dispatch: `183.67 ms`, `1.50%`.
+- `ew_add` fan-in: `280.15 ms`, `2.28%`.
+
+Decision:
+
+- Promote to a test-only ragged `MUL_MAT_ID` gate before production source.
+- Do not implement fused dispatch yet. Standalone `mm_ids`/`gather_mmq` helper
+  time is small (about `1.08%` combined); a source patch must reduce the larger
+  grouped-MMQ/activation movement bucket and beat the `+5%` serving A/B gate.
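
As a cross-check on the throughput table and bucket highlights in this hunk, the following is a minimal Python sketch that recomputes the vLLM/llama.cpp ratios and the combined share of the ragged-path buckets. The dictionary layout and function names are illustrative assumptions, not part of the repository; the grouping of the four ragged-path buckets follows the rationale recorded in the plan file of this same commit.

```python
# Figures copied from the throughput table and bucket highlights above.
# Names and structure are hypothetical, chosen only for this cross-check.
THROUGHPUT = {
    "llama.cpp": {"decode_perseq": 2.70, "decode_agg": 412.1, "prefill": 1368.3},
    "vLLM": {"decode_perseq": 7.02, "decode_agg": 1036.6, "prefill": 5277.7},
}

# Share of llama.cpp kernel time per bucket, in percent.
BUCKET_PCT = {
    "GDN": 38.12,
    "mmq_nvfp4": 22.36,
    "act_quant": 3.60,
    "moe_dispatch": 1.50,
    "ew_add": 2.28,
}

# Buckets the plan groups as the live ragged path.
RAGGED_PATH = ("mmq_nvfp4", "act_quant", "moe_dispatch", "ew_add")


def throughput_ratios(table=THROUGHPUT):
    """Return vLLM throughput divided by llama.cpp throughput per metric."""
    base = table["llama.cpp"]
    return {metric: table["vLLM"][metric] / base[metric] for metric in base}


def ragged_path_share(pct=BUCKET_PCT, keys=RAGGED_PATH):
    """Return the summed kernel-time share (percent) of the ragged-path buckets."""
    return sum(pct[key] for key in keys)


if __name__ == "__main__":
    for metric, ratio in throughput_ratios().items():
        print(f"{metric}: vLLM/llama.cpp = {ratio:.2f}x")
    print(f"ragged path share: {ragged_path_share():.2f}%")
```

The summed share lands close to the `29.7%` quoted in the plan, while GDN alone remains the larger single bucket.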
--- a/docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md
+++ b/docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md
@@ -85,7 +85,7 @@ Selected Phase 8 candidate:

  Write this plan and commit it before source work.

-- [ ] **Step 2: Reconfirm DGX idle state**
+- [x] **Step 2: Reconfirm DGX idle state**

  Run:

@@ -106,7 +106,7 @@ Selected Phase 8 candidate:
  FREE...
  ```

-- [ ] **Step 3: Run serving nsys for llama.cpp MoE**
+- [x] **Step 3: Run serving nsys for llama.cpp MoE**

  Run on DGX:

@@ -151,7 +151,24 @@ Selected Phase 8 candidate:
  - `buckets.txt` has fine rows for `mm_ids`, `gather_mmq`, `act_quant`,
    `mmq_nvfp4`, `set_rows`, `ew_add`, `gdn_core`, and `fa`.

-- [ ] **Step 4: Run serving nsys for vLLM MoE**
+  Result:
+
+  - Artifact: `/home/mudler/bench/phase8_ragged_moe_dispatch/llama_n128_clean/`.
+  - Throughput: `decode_agg_tps=412.1`, `decode_perseq_tps=2.70`,
+    `prefill_tps=1368.3`.
+  - Clean rebuild was required before this run; the first `llama_n128/` profile
+    still contained the rejected weighted-combine kernel in the binary and is
+    not used for the decision.
+  - Bucket highlights:
+    - GDN: `4680.27 ms`, `38.12%`.
+    - `mmq_nvfp4`: `2745.11 ms`, `22.36%`.
+    - `act_quant`: `441.42 ms`, `3.60%`.
+    - MoE dispatch: `183.67 ms`, `1.50%`.
+    - `mm_ids`: `80.92 ms`, `0.66%`.
+    - `gather_mmq`: `50.96 ms`, `0.42%`.
+    - `ew_add`: `280.15 ms`, `2.28%`.
+
+- [x] **Step 4: Run serving nsys for vLLM MoE**

  Run on DGX:

@@ -196,7 +213,17 @@ Selected Phase 8 candidate:
  - `buckets.txt` has vLLM rows for `vllm_dispatch`, `vllm_fp4_gemm`,
    `vllm_fa`, and `fla_gdn`.

-- [ ] **Step 5: Decide promotion**
+  Result:
+
+  - Artifact: `/home/mudler/bench/phase8_ragged_moe_dispatch/vllm_n128/`.
+  - Throughput: `decode_agg_tps=1036.6`, `decode_perseq_tps=7.02`,
+    `prefill_tps=5277.7`.
+  - Nsight includes startup/autotune and `delayStreamKernel`, so the aggregate
+    vLLM macro percentages are not directly comparable to llama.cpp. Direct
+    kernel extraction still shows Marlin-MoE rows around `1.0 s` and
+    `moe_align/topk/count` rows around `38.5 ms` in the full capture.
+
+- [x] **Step 5: Decide promotion**

  Promote to source only if all are true:

@@ -213,6 +240,19 @@ Selected Phase 8 candidate:
  - FA prefill dominates the profiled window.
  - MoE dispatch is too small to beat a `+5%` serving A/B gate.

+  Decision:
+
+  - Promote to Task 2 test-gate work, not production source work yet.
+  - Rationale: standalone `mm_ids` and `gather_mmq` are small, but the live
+    ragged path around `mmq_nvfp4 + act_quant + MoE-dispatch + fan-in` is
+    roughly `29.7%` of llama.cpp kernel time. vLLM throughput is still about
+    `2.5x` higher on aggregate decode and `3.9x` higher on prefill for the same
+    client shape. A production patch is only justified after a ragged
+    `MUL_MAT_ID` test gate exists and after the source prototype can reduce the
+    grouped-MMQ/activation movement bucket, not merely the helper kernels.
+  - GDN remains the single largest bucket, so any Phase 8 source patch still
+    must clear the `+5%` serving A/B gate before being kept.
+
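
The reasoning behind the decision above can be illustrated with an Amdahl-style bound. This is a rough sketch under one loud assumption that the plan does not state: serving throughput is taken to be inversely proportional to total kernel time, so startup, host overhead, and scheduling effects are ignored. The function names are hypothetical.

```python
import math


def max_gain_pct(bucket_pct, bucket_speedup=math.inf):
    """Upper bound on throughput gain (percent) when one bucket gets faster.

    ASSUMPTION: throughput scales as 1 / total kernel time.
    bucket_pct is the bucket's share of kernel time in percent;
    bucket_speedup is how many times faster the bucket runs (inf = removed).
    """
    frac = bucket_pct / 100.0
    remaining = (1.0 - frac) + frac / bucket_speedup
    return (1.0 / remaining - 1.0) * 100.0


def required_bucket_speedup(bucket_pct, gate_pct=5.0):
    """Smallest bucket speedup that reaches gate_pct, or None if unreachable."""
    frac = bucket_pct / 100.0
    target_remaining = 1.0 / (1.0 + gate_pct / 100.0)
    # Time budget left for the bucket once the untouched kernels are paid for.
    budget = target_remaining - (1.0 - frac)
    if budget <= 0.0:
        return None  # even removing the bucket entirely misses the gate
    return frac / budget


if __name__ == "__main__":
    print(max_gain_pct(1.50))              # MoE dispatch removed entirely
    print(required_bucket_speedup(1.50))   # helper bucket alone
    print(required_bucket_speedup(29.74))  # whole ragged path
```

Under that assumption, removing the `1.50%` MoE-dispatch bucket entirely yields only about a `1.5%` gain, so it cannot clear the `+5%` gate on its own, whereas the roughly `29.7%` ragged path would need to run about `1.19x` faster to reach it.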
 - [ ] **Step 6: Commit the profile decision**

  If promoted: