diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index e834c4e75..09c2a8c25 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -714,3 +714,35 @@ Required promotion gates remain:
 - `MUL_MAT_ID`: `806/806` on CUDA0.
 - Any fused dispatch prototype must start default-off behind
   `LLAMA_MOE_FUSED_DISPATCH=1`.
+
+Profile-gate result:
+
+- Clean llama.cpp artifact:
+  `/home/mudler/bench/phase8_ragged_moe_dispatch/llama_n128_clean/`.
+- vLLM artifact:
+  `/home/mudler/bench/phase8_ragged_moe_dispatch/vllm_n128/`.
+- A stale first llama profile under `llama_n128/` is intentionally ignored
+  because it was captured before the clean-source rebuild, when the binary
+  still contained the rejected weighted-combine kernel.
+
+Throughput:
+
+| Engine | decode tok/s/seq | decode agg tok/s | prefill tok/s |
+|--------|------------------|------------------|---------------|
+| llama.cpp | 2.70 | 412.1 | 1368.3 |
+| vLLM | 7.02 | 1036.6 | 5277.7 |
+
+llama.cpp bucket highlights from the clean profile:
+
+- GDN: `4680.27 ms`, `38.12%`.
+- `mmq_nvfp4`: `2745.11 ms`, `22.36%`.
+- `act_quant`: `441.42 ms`, `3.60%`.
+- MoE dispatch: `183.67 ms`, `1.50%`.
+- `ew_add` fan-in: `280.15 ms`, `2.28%`.
+
+Decision:
+
+- Promote to a test-only ragged `MUL_MAT_ID` gate before production source work.
+- Do not implement fused dispatch yet. Standalone `mm_ids`/`gather_mmq` helper
+  time is small; a source patch must reduce the larger grouped-MMQ/activation
+  movement bucket and still beat the `+5%` serving A/B gate.
diff --git a/docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md b/docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md
index 579c6715b..aa8ef17ac 100644
--- a/docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md
+++ b/docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md
@@ -85,7 +85,7 @@ Selected Phase 8 candidate:
 
   Write this plan and commit it before source work.
 
-- [ ] **Step 2: Reconfirm DGX idle state**
+- [x] **Step 2: Reconfirm DGX idle state**
 
   Run:
 
@@ -106,7 +106,7 @@ Selected Phase 8 candidate:
   FREE...
   ```
 
-- [ ] **Step 3: Run serving nsys for llama.cpp MoE**
+- [x] **Step 3: Run serving nsys for llama.cpp MoE**
 
   Run on DGX:
 
@@ -151,7 +151,24 @@ Selected Phase 8 candidate:
   - `buckets.txt` has fine rows for `mm_ids`, `gather_mmq`, `act_quant`,
     `mmq_nvfp4`, `set_rows`, `ew_add`, `gdn_core`, and `fa`.
 
-- [ ] **Step 4: Run serving nsys for vLLM MoE**
+  Result:
+
+  - Artifact: `/home/mudler/bench/phase8_ragged_moe_dispatch/llama_n128_clean/`.
+  - Throughput: `decode_agg_tps=412.1`, `decode_perseq_tps=2.70`,
+    `prefill_tps=1368.3`.
+  - A clean rebuild was required before this run; the first `llama_n128/`
+    profile was captured from a binary that still contained the rejected
+    weighted-combine kernel and is not used for the decision.
+  - Bucket highlights:
+    - GDN: `4680.27 ms`, `38.12%`.
+    - `mmq_nvfp4`: `2745.11 ms`, `22.36%`.
+    - `act_quant`: `441.42 ms`, `3.60%`.
+    - MoE dispatch: `183.67 ms`, `1.50%`.
+    - `mm_ids`: `80.92 ms`, `0.66%`.
+    - `gather_mmq`: `50.96 ms`, `0.42%`.
+    - `ew_add`: `280.15 ms`, `2.28%`.
+
+- [x] **Step 4: Run serving nsys for vLLM MoE**
 
   Run on DGX:
 
@@ -196,7 +213,17 @@ Selected Phase 8 candidate:
   - `buckets.txt` has vLLM rows for `vllm_dispatch`, `vllm_fp4_gemm`,
     `vllm_fa`, and `fla_gdn`.
 
-- [ ] **Step 5: Decide promotion**
+  Result:
+
+  - Artifact: `/home/mudler/bench/phase8_ragged_moe_dispatch/vllm_n128/`.
+  - Throughput: `decode_agg_tps=1036.6`, `decode_perseq_tps=7.02`,
+    `prefill_tps=5277.7`.
+  - The Nsight capture includes startup/autotune and `delayStreamKernel`, so
+    the aggregate vLLM macro percentages are not directly comparable to those
+    of llama.cpp. Direct kernel extraction still shows Marlin-MoE rows around
+    `1.0 s` and `moe_align/topk/count` rows around `38.5 ms` in the full capture.
+
+- [x] **Step 5: Decide promotion**
 
   Promote to source only if all are true:
 
@@ -213,6 +240,19 @@ Selected Phase 8 candidate:
   - FA prefill dominates the profiled window.
   - MoE dispatch is too small to beat a `+5%` serving A/B gate.
 
+  Decision:
+
+  - Promote to Task 2 test-gate work, not production source work yet.
+  - Rationale: standalone `mm_ids` and `gather_mmq` are small, but the live
+    ragged path around `mmq_nvfp4 + act_quant + MoE-dispatch + fan-in` is
+    roughly `29.7%` of llama.cpp kernel time. vLLM is still about `2.5x` faster
+    on aggregate decode and `3.9x` faster on prefill for the same client shape.
+    A production patch is only justified after a ragged `MUL_MAT_ID` test gate
+    exists and the source prototype can reduce the grouped-MMQ/activation
+    movement bucket, not merely the helper kernels.
+  - GDN remains the single largest bucket (`38.12%`), so any Phase 8 source
+    patch must still clear the `+5%` serving A/B gate before being kept.
+
 - [ ] **Step 6: Commit the profile decision**
 
   If promoted: