docs(paged): record ragged MoE profile gate

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 00:35:21 +00:00
parent ef14748f06
commit 89ef3a4020
2 changed files with 76 additions and 4 deletions


@@ -714,3 +714,35 @@ Required promotion gates remain:
- `MUL_MAT_ID`: `806/806` on CUDA0.
- Any fused dispatch prototype must start default-off behind
`LLAMA_MOE_FUSED_DISPATCH=1`.
Profile-gate result:
- Clean llama.cpp artifact:
`/home/mudler/bench/phase8_ragged_moe_dispatch/llama_n128_clean/`.
- vLLM artifact:
`/home/mudler/bench/phase8_ragged_moe_dispatch/vllm_n128/`.
- A stale first llama.cpp profile under `llama_n128/` is intentionally ignored:
its binary predates the clean-source rebuild and still contained the rejected
weighted-combine kernel.
Throughput:
| Engine | decode tok/s/seq | decode agg tok/s | prefill tok/s |
|--------|------------------|------------------|---------------|
| llama.cpp | 2.70 | 412.1 | 1368.3 |
| vLLM | 7.02 | 1036.6 | 5277.7 |
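
To make the gap in the table concrete, the ratios can be computed directly. This is a minimal sketch that only reuses the numbers above; the dictionary keys and the `throughput_ratios` helper are illustrative names, not part of any benchmark tooling.

```python
# Throughput figures copied from the table above (tokens per second).
llama = {"decode_perseq": 2.70, "decode_agg": 412.1, "prefill": 1368.3}
vllm = {"decode_perseq": 7.02, "decode_agg": 1036.6, "prefill": 5277.7}


def throughput_ratios(baseline, other):
    """Return other/baseline for every metric in the baseline."""
    return {name: other[name] / baseline[name] for name in baseline}


ratios = throughput_ratios(llama, vllm)
for name, ratio in ratios.items():
    print(f"{name}: vLLM is {ratio:.2f}x llama.cpp")
```

On these figures vLLM is roughly 2.5x to 2.6x ahead on decode and close to 3.9x ahead on prefill.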
llama.cpp bucket highlights from the clean profile:
- GDN: `4680.27 ms`, `38.12%`.
- `mmq_nvfp4`: `2745.11 ms`, `22.36%`.
- `act_quant`: `441.42 ms`, `3.60%`.
- MoE dispatch: `183.67 ms`, `1.50%`.
- `ew_add` fan-in: `280.15 ms`, `2.28%`.
Decision:
- Promote to a test-only ragged `MUL_MAT_ID` gate before any production source work.
- Do not implement fused dispatch yet. Standalone `mm_ids`/`gather_mmq` helper
time is small; a source patch must reduce the larger grouped-MMQ/activation
movement bucket and still beat the `+5%` serving A/B gate.
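
The reasoning behind this decision can be sketched as an Amdahl-style bound. The sketch assumes that the kernel-time shares from the clean profile carry over to serving time and that the `+5%` gate is measured on throughput; both are simplifications, and the helper names are hypothetical.

```python
def speedup_if_reduced(fraction, reduction):
    """Overall speedup when a bucket taking `fraction` of total time
    has `reduction` (0..1) of its time removed."""
    return 1.0 / (1.0 - fraction * reduction)


def required_reduction(fraction, target_speedup):
    """Share of the bucket that must be removed to reach `target_speedup`.
    A value above 1.0 means the target is unreachable from this bucket alone."""
    return (1.0 - 1.0 / target_speedup) / fraction


GATE = 1.05  # the +5% serving A/B gate

# Shares of llama.cpp kernel time from the clean profile.
moe_dispatch = 0.0150
# mmq_nvfp4 + act_quant + MoE dispatch + ew_add fan-in
ragged_path = 0.2236 + 0.0360 + 0.0150 + 0.0228

best_case_dispatch = speedup_if_reduced(moe_dispatch, 1.0)
needed_from_dispatch = required_reduction(moe_dispatch, GATE)
needed_from_ragged = required_reduction(ragged_path, GATE)

print(f"dispatch fully removed: {best_case_dispatch:.3f}x")
print(f"reduction needed from dispatch alone: {needed_from_dispatch:.2f}")
print(f"reduction needed from ragged path: {needed_from_ragged:.2f}")
```

Under these assumptions, removing MoE dispatch entirely yields only about 1.5%, while the gate needs roughly a 16% cut of the combined ragged path. That is why a patch has to target the grouped-MMQ/activation movement bucket rather than the helper kernels alone.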


@@ -85,7 +85,7 @@ Selected Phase 8 candidate:
Write this plan and commit it before source work.
- [ ] **Step 2: Reconfirm DGX idle state**
- [x] **Step 2: Reconfirm DGX idle state**
Run:
@@ -106,7 +106,7 @@ Selected Phase 8 candidate:
FREE...
```
- [ ] **Step 3: Run serving nsys for llama.cpp MoE**
- [x] **Step 3: Run serving nsys for llama.cpp MoE**
Run on DGX:
@@ -151,7 +151,24 @@ Selected Phase 8 candidate:
- `buckets.txt` has fine rows for `mm_ids`, `gather_mmq`, `act_quant`,
`mmq_nvfp4`, `set_rows`, `ew_add`, `gdn_core`, and `fa`.
- [ ] **Step 4: Run serving nsys for vLLM MoE**
Result:
- Artifact: `/home/mudler/bench/phase8_ragged_moe_dispatch/llama_n128_clean/`.
- Throughput: `decode_agg_tps=412.1`, `decode_perseq_tps=2.70`,
`prefill_tps=1368.3`.
- Clean rebuild was required before this run; the first `llama_n128/` profile
still contained the rejected weighted-combine kernel in the binary and is
not used for the decision.
- Bucket highlights:
- GDN: `4680.27 ms`, `38.12%`.
- `mmq_nvfp4`: `2745.11 ms`, `22.36%`.
- `act_quant`: `441.42 ms`, `3.60%`.
- MoE dispatch: `183.67 ms`, `1.50%`.
- `mm_ids`: `80.92 ms`, `0.66%`.
- `gather_mmq`: `50.96 ms`, `0.42%`.
- `ew_add`: `280.15 ms`, `2.28%`.
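
As a sanity check on the bucket rows above, each `(ms, %)` pair should imply about the same total kernel time. The sketch below is a hypothetical check, not part of the profiling scripts; it only reuses the numbers listed in this result.

```python
# (time in ms, share of kernel time in %) copied from the bucket highlights.
buckets = {
    "gdn": (4680.27, 38.12),
    "mmq_nvfp4": (2745.11, 22.36),
    "act_quant": (441.42, 3.60),
    "moe_dispatch": (183.67, 1.50),
    "mm_ids": (80.92, 0.66),
    "gather_mmq": (50.96, 0.42),
    "ew_add": (280.15, 2.28),
}


def implied_total_ms(ms, pct):
    """Total kernel time implied by one bucket row."""
    return ms / (pct / 100.0)


totals = {name: implied_total_ms(ms, pct) for name, (ms, pct) in buckets.items()}
low, high = min(totals.values()), max(totals.values())
spread = (high - low) / low

for name, total in totals.items():
    print(f"{name}: implied total {total:.0f} ms")
print(f"relative spread: {spread:.2%}")
```

The implied totals land between roughly 12.1 s and 12.3 s. Small buckets are noisier because the percentages are rounded to two decimals.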
- [x] **Step 4: Run serving nsys for vLLM MoE**
Run on DGX:
@@ -196,7 +213,17 @@ Selected Phase 8 candidate:
- `buckets.txt` has vLLM rows for `vllm_dispatch`, `vllm_fp4_gemm`,
`vllm_fa`, and `fla_gdn`.
- [ ] **Step 5: Decide promotion**
Result:
- Artifact: `/home/mudler/bench/phase8_ragged_moe_dispatch/vllm_n128/`.
- Throughput: `decode_agg_tps=1036.6`, `decode_perseq_tps=7.02`,
`prefill_tps=5277.7`.
- The Nsight capture includes startup/autotune and `delayStreamKernel`, so the
aggregate vLLM macro percentages are not directly comparable to llama.cpp.
Direct kernel extraction still shows Marlin-MoE rows around `1.0 s` and
`moe_align/topk/count` rows around `38.5 ms` in the full capture.
- [x] **Step 5: Decide promotion**
Promote to source only if all are true:
@@ -213,6 +240,19 @@ Selected Phase 8 candidate:
- FA prefill dominates the profiled window.
- MoE dispatch is too small to beat a `+5%` serving A/B gate.
Decision:
- Promote to Task 2 test-gate work, not production source work yet.
- Rationale: standalone `mm_ids` and `gather_mmq` are small, but the live
ragged path around `mmq_nvfp4 + act_quant + MoE-dispatch + fan-in` is
roughly `29.7%` of llama.cpp kernel time. vLLM throughput is still much
higher on the same client shape. A production patch is only justified after
a ragged `MUL_MAT_ID` test gate exists and after the source prototype can
reduce the grouped-MMQ/activation movement bucket, not merely the helper
kernels.
- GDN remains the single largest bucket, so any Phase 8 source patch still
must clear the `+5%` serving A/B gate before being kept.
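
The `29.7%` figure in the rationale can be reproduced from the Step 3 bucket rows. This is a minimal sketch using only those recorded values; the variable names are illustrative.

```python
# (time in ms, share of kernel time in %) for the live ragged path,
# copied from the Step 3 bucket highlights.
ragged_path = {
    "mmq_nvfp4": (2745.11, 22.36),
    "act_quant": (441.42, 3.60),
    "moe_dispatch": (183.67, 1.50),
    "ew_add": (280.15, 2.28),
}
gdn = (4680.27, 38.12)

ragged_ms = sum(ms for ms, _ in ragged_path.values())
ragged_pct = sum(pct for _, pct in ragged_path.values())

print(f"ragged path: {ragged_ms:.2f} ms, {ragged_pct:.2f}% of kernel time")
print(f"GDN:         {gdn[0]:.2f} ms, {gdn[1]:.2f}% of kernel time")
```

The ragged path sums to about 3.65 s of kernel time, which is still below the GDN bucket on its own.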
- [ ] **Step 6: Commit the profile decision**
If promoted: