diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index e834c4e75..09c2a8c25 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -714,3 +714,35 @@ Required promotion gates remain:
 - `MUL_MAT_ID`: `806/806` on CUDA0.
 - Any fused dispatch prototype must start default-off behind
   `LLAMA_MOE_FUSED_DISPATCH=1`.
+
+Profile-gate result:
+
+- Clean llama.cpp artifact:
+  `/home/mudler/bench/phase8_ragged_moe_dispatch/llama_n128_clean/`.
+- vLLM artifact:
+  `/home/mudler/bench/phase8_ragged_moe_dispatch/vllm_n128/`.
+- A stale first llama profile under `llama_n128/` is intentionally ignored
+  because the binary still contained the rejected weighted-combine kernel before
+  the clean-source rebuild.
+
+Throughput:
+
+| Engine | decode tok/s/seq | decode agg tok/s | prefill tok/s |
+|--------|------------------|------------------|---------------|
+| llama.cpp | 2.70 | 412.1 | 1368.3 |
+| vLLM | 7.02 | 1036.6 | 5277.7 |
+
+llama.cpp bucket highlights from the clean profile:
+
+- GDN: `4680.27 ms`, `38.12%`.
+- `mmq_nvfp4`: `2745.11 ms`, `22.36%`.
+- `act_quant`: `441.42 ms`, `3.60%`.
+- MoE dispatch: `183.67 ms`, `1.50%`.
+- `ew_add` fan-in: `280.15 ms`, `2.28%`.
+
+Decision:
+
+- Promote to a test-only ragged `MUL_MAT_ID` gate before production source.
+- Do not implement fused dispatch yet. Standalone `mm_ids`/`gather_mmq` helper
+  time is small; a source patch must reduce the larger grouped-MMQ/activation
+  movement bucket and still beat the `+5%` serving A/B gate.
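Review note: the throughput table and bucket highlights added in the hunk above can be cross-checked with a short sketch. The numbers are copied from the diff; the names `ragged_path_share`, `speedup`, and `clears_gate` are hypothetical helpers, not part of llama.cpp or the benchmark tooling. The gate check assumes the `+5%` serving A/B gate is evaluated on aggregate decode tok/s with an inclusive threshold, which the diff does not state.

```python
# Values copied from the throughput table and bucket highlights in the diff above.
THROUGHPUT = {
    "llama.cpp": {"decode_perseq_tps": 2.70, "decode_agg_tps": 412.1, "prefill_tps": 1368.3},
    "vLLM": {"decode_perseq_tps": 7.02, "decode_agg_tps": 1036.6, "prefill_tps": 5277.7},
}

# Share of llama.cpp kernel time per bucket, in percent.
BUCKET_PCT = {
    "gdn": 38.12,
    "mmq_nvfp4": 22.36,
    "act_quant": 3.60,
    "moe_dispatch": 1.50,
    "ew_add": 2.28,
}

# Buckets treated here as the live ragged MoE path (grouped MMQ, activation
# quantization, dispatch helpers, and fan-in adds).
RAGGED_PATH = ("mmq_nvfp4", "act_quant", "moe_dispatch", "ew_add")


def ragged_path_share(bucket_pct):
    """Sum the percentage shares of the buckets on the ragged MoE path."""
    return sum(bucket_pct[name] for name in RAGGED_PATH)


def speedup(metric):
    """Ratio of vLLM to llama.cpp throughput for one metric."""
    return THROUGHPUT["vLLM"][metric] / THROUGHPUT["llama.cpp"][metric]


def clears_gate(baseline_tps, candidate_tps, min_gain=0.05):
    """Assumed form of the serving A/B gate: candidate must be >= baseline * 1.05."""
    return candidate_tps >= baseline_tps * (1.0 + min_gain)


if __name__ == "__main__":
    print(f"ragged path share: {ragged_path_share(BUCKET_PCT):.2f}%")
    for metric in ("decode_perseq_tps", "decode_agg_tps", "prefill_tps"):
        print(f"{metric}: vLLM / llama.cpp = {speedup(metric):.2f}x")
```

Under these assumptions the ragged-path buckets sum to a far larger share than the `1.50%` MoE dispatch bucket alone, which is consistent with the decision to target the grouped-MMQ/activation movement bucket rather than the helper kernels.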
diff --git a/docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md b/docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md
index 579c6715b..aa8ef17ac 100644
--- a/docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md
+++ b/docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md
@@ -85,7 +85,7 @@ Selected Phase 8 candidate:
 
   Write this plan and commit it before source work.
 
-- [ ] **Step 2: Reconfirm DGX idle state**
+- [x] **Step 2: Reconfirm DGX idle state**
 
   Run:
 
@@ -106,7 +106,7 @@ Selected Phase 8 candidate:
   FREE...
   ```
 
-- [ ] **Step 3: Run serving nsys for llama.cpp MoE**
+- [x] **Step 3: Run serving nsys for llama.cpp MoE**
 
   Run on DGX:
 
@@ -151,7 +151,24 @@ Selected Phase 8 candidate:
   - `buckets.txt` has fine rows for `mm_ids`, `gather_mmq`, `act_quant`,
     `mmq_nvfp4`, `set_rows`, `ew_add`, `gdn_core`, and `fa`.
 
-- [ ] **Step 4: Run serving nsys for vLLM MoE**
+  Result:
+
+  - Artifact: `/home/mudler/bench/phase8_ragged_moe_dispatch/llama_n128_clean/`.
+  - Throughput: `decode_agg_tps=412.1`, `decode_perseq_tps=2.70`,
+    `prefill_tps=1368.3`.
+  - Clean rebuild was required before this run; the first `llama_n128/` profile
+    still contained the rejected weighted-combine kernel in the binary and is
+    not used for the decision.
+  - Bucket highlights:
+    - GDN: `4680.27 ms`, `38.12%`.
+    - `mmq_nvfp4`: `2745.11 ms`, `22.36%`.
+    - `act_quant`: `441.42 ms`, `3.60%`.
+    - MoE dispatch: `183.67 ms`, `1.50%`.
+    - `mm_ids`: `80.92 ms`, `0.66%`.
+    - `gather_mmq`: `50.96 ms`, `0.42%`.
+    - `ew_add`: `280.15 ms`, `2.28%`.
+
+- [x] **Step 4: Run serving nsys for vLLM MoE**
 
   Run on DGX:
 
@@ -196,7 +213,17 @@ Selected Phase 8 candidate:
   - `buckets.txt` has vLLM rows for `vllm_dispatch`, `vllm_fp4_gemm`,
     `vllm_fa`, and `fla_gdn`.
 
-- [ ] **Step 5: Decide promotion**
+  Result:
+
+  - Artifact: `/home/mudler/bench/phase8_ragged_moe_dispatch/vllm_n128/`.
+  - Throughput: `decode_agg_tps=1036.6`, `decode_perseq_tps=7.02`,
+    `prefill_tps=5277.7`.
+  - Nsight includes startup/autotune and `delayStreamKernel`, so the aggregate
+    vLLM macro percentages are not directly comparable to llama.cpp. Direct
+    kernel extraction still shows Marlin-MoE rows around `1.0 s` and
+    `moe_align/topk/count` rows around `38.5 ms` in the full capture.
+
+- [x] **Step 5: Decide promotion**
 
   Promote to source only if all are true:
 
@@ -213,6 +240,19 @@ Selected Phase 8 candidate:
   - FA prefill dominates the profiled window.
   - MoE dispatch is too small to beat a `+5%` serving A/B gate.
 
+  Decision:
+
+  - Promote to Task 2 test-gate work, not production source work yet.
+  - Rationale: standalone `mm_ids` and `gather_mmq` are small, but the live
+    ragged path around `mmq_nvfp4 + act_quant + MoE-dispatch + fan-in` is
+    roughly `29.7%` of llama.cpp kernel time. vLLM throughput is still much
+    higher on the same client shape. A production patch is only justified after
+    a ragged `MUL_MAT_ID` test gate exists and after the source prototype can
+    reduce the grouped-MMQ/activation movement bucket, not merely the helper
+    kernels.
+  - GDN remains the single largest bucket, so any Phase 8 source patch still
+    must clear the `+5%` serving A/B gate before being kept.
+
 - [ ] **Step 6: Commit the profile decision**
 
   If promoted: