docs(paged): record ragged MoE profile gate

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 00:35:21 +00:00
parent ef14748f06
commit 89ef3a4020
2 changed files with 76 additions and 4 deletions

@@ -714,3 +714,35 @@ Required promotion gates remain:
- `MUL_MAT_ID`: `806/806` on CUDA0.
- Any fused dispatch prototype must start default-off behind
  `LLAMA_MOE_FUSED_DISPATCH=1`.
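
A default-off gate of this kind can be sketched as follows. This is only an illustration of the opt-in semantics, not the llama.cpp implementation: the helper names `fused_dispatch_enabled` and `pick_dispatch_path` are hypothetical, and treating only the exact value `1` as "on" is an assumption.

```python
import os
from typing import Mapping

# Flag name taken from the gate above; everything else here is hypothetical.
FLAG = "LLAMA_MOE_FUSED_DISPATCH"


def fused_dispatch_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when the flag is explicitly set to "1".

    An unset, empty, or any other value keeps the prototype off,
    which is what "default-off" means here.
    """
    return env.get(FLAG, "") == "1"


def pick_dispatch_path(env: Mapping[str, str] = os.environ) -> str:
    """Choose between the existing path and the opt-in prototype."""
    return "fused" if fused_dispatch_enabled(env) else "baseline"


if __name__ == "__main__":
    print(pick_dispatch_path())
```

Passing the environment as a parameter keeps the gate easy to exercise in tests without mutating the real process environment.
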

Profile-gate result:
- Clean llama.cpp artifact:
  `/home/mudler/bench/phase8_ragged_moe_dispatch/llama_n128_clean/`.
- vLLM artifact:
  `/home/mudler/bench/phase8_ragged_moe_dispatch/vllm_n128/`.
- A stale first llama.cpp profile under `llama_n128/` is intentionally ignored:
  its binary predates the clean-source rebuild and still contained the
  rejected weighted-combine kernel.

Throughput:

| Engine    | Decode tok/s/seq | Decode aggregate tok/s | Prefill tok/s |
|-----------|-----------------:|-----------------------:|--------------:|
| llama.cpp |             2.70 |                  412.1 |        1368.3 |
| vLLM      |             7.02 |                 1036.6 |        5277.7 |
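
To make the gap in the table concrete, the per-metric ratios can be computed directly from the recorded numbers. The sketch below only restates the table; the dictionary keys and the `ratios` helper are names invented for this example.

```python
# Values copied from the throughput table; key names are illustrative only.
THROUGHPUT = {
    "llama.cpp": {"decode_per_seq": 2.70, "decode_agg": 412.1, "prefill": 1368.3},
    "vLLM": {"decode_per_seq": 7.02, "decode_agg": 1036.6, "prefill": 5277.7},
}


def ratios(baseline: str, other: str, table=THROUGHPUT) -> dict:
    """Return other/baseline for every metric in the table."""
    return {
        metric: table[other][metric] / table[baseline][metric]
        for metric in table[baseline]
    }


gap = ratios("llama.cpp", "vLLM")

if __name__ == "__main__":
    for metric, ratio in gap.items():
        print(f"{metric}: vLLM is {ratio:.2f}x llama.cpp")
```

On these numbers vLLM is roughly 2.6x faster on per-sequence decode, 2.5x on aggregate decode, and 3.9x on prefill.
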

llama.cpp bucket highlights from the clean profile (time and share of the profile):
- GDN: `4680.27 ms`, `38.12%`.
- `mmq_nvfp4`: `2745.11 ms`, `22.36%`.
- `act_quant`: `441.42 ms`, `3.60%`.
- MoE dispatch: `183.67 ms`, `1.50%`.
- `ew_add` fan-in: `280.15 ms`, `2.28%`.
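
As a quick consistency check on the bucket list above, each `(ms, %)` pair implies a total profile time, and the listed shares can be summed to see how much of the profile they cover. This is a sketch over the recorded numbers only; the `implied_total_ms` helper is a name introduced for this example.

```python
# (time in ms, share in percent) copied from the bucket highlights.
BUCKETS = {
    "GDN": (4680.27, 38.12),
    "mmq_nvfp4": (2745.11, 22.36),
    "act_quant": (441.42, 3.60),
    "MoE dispatch": (183.67, 1.50),
    "ew_add fan-in": (280.15, 2.28),
}


def implied_total_ms(ms: float, pct: float) -> float:
    """Total profile time implied by one bucket's time and percentage."""
    return ms / (pct / 100.0)


totals = {name: implied_total_ms(ms, pct) for name, (ms, pct) in BUCKETS.items()}
listed_ms = sum(ms for ms, _ in BUCKETS.values())
listed_pct = sum(pct for _, pct in BUCKETS.values())

if __name__ == "__main__":
    for name, total in totals.items():
        print(f"{name}: implied total {total:.0f} ms")
    print(f"listed buckets: {listed_ms:.2f} ms, {listed_pct:.2f}% of the profile")
```

The implied totals agree to within rounding of the percentages (about 12.2 to 12.3 s), and the five listed buckets cover about 67.9% of the profile, so roughly a third of the time sits in buckets not highlighted here.
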

Decision:
- Promote to a test-only ragged `MUL_MAT_ID` gate before changing production source.
- Do not implement fused dispatch yet. Standalone `mm_ids`/`gather_mmq` helper
  time is small, so a source patch must instead reduce the larger
  grouped-MMQ/activation-movement bucket and still beat the `+5%` serving A/B gate.
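
The reasoning behind this decision can be sketched with an Amdahl-style bound. The sketch assumes, as a simplification not stated in the profile, that a bucket's share of profile time translates one-to-one into serving throughput, and it uses the `mmq_nvfp4` share as a stand-in for the grouped-MMQ bucket. The function names are hypothetical.

```python
from typing import Optional


def speedup_if_reduced(share: float, reduction: float) -> float:
    """Overall speedup when `reduction` (0..1) of a bucket's time is removed.

    `share` is the bucket's fraction of total time (e.g. 0.015 for 1.50%).
    """
    return 1.0 / (1.0 - share * reduction)


def reduction_needed(share: float, target_speedup: float) -> Optional[float]:
    """Fraction of the bucket that must be removed to hit the target.

    Returns None when even removing the whole bucket is not enough.
    """
    needed = (1.0 - 1.0 / target_speedup) / share
    return needed if needed <= 1.0 else None


TARGET = 1.05          # the +5% serving A/B gate
MOE_DISPATCH = 0.0150  # 1.50% from the clean profile
MMQ_NVFP4 = 0.2236     # 22.36%, assumed proxy for the grouped-MMQ bucket

best_case_dispatch = speedup_if_reduced(MOE_DISPATCH, 1.0)
needed_mmq = reduction_needed(MMQ_NVFP4, TARGET)

if __name__ == "__main__":
    print(f"dispatch removed entirely: {best_case_dispatch:.4f}x")
    print(f"mmq_nvfp4 reduction needed for +5%: {needed_mmq:.1%}")
```

Under these assumptions, removing the MoE dispatch bucket entirely yields only about a 1.5% gain, while a patch that targets the `mmq_nvfp4` bucket would need to cut it by about 21% to clear the gate on its own.
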