docs(paged): close ragged MoE dispatch shortcut

Record the Phase 8 safety rerun, canonical transcript md5 gates, full and ragged MUL_MAT_ID op gates, and the no-production-patch decision for metadata-only fused dispatch work.

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 01:57:45 +00:00
parent 2074b4fb5b
commit abc70c209e
3 changed files with 91 additions and 7 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -1085,3 +1085,31 @@ Conclusion:
 - Per the Phase 12/13 decision rule, stop GDN kernel work on GB10. The remaining
  vLLM GDN advantage requires a fuller FLA-style blocked solve or hardware
  assumptions that do not fit this GB10 patch stack without a regression.
+
+## Phase 8 Ragged MoE Dispatch Safety Rerun
+
+Profiling had already closed the live ragged MoE helper path in Phase 8: the
+helper buckets are small (`mm_ids=0.66%`, `gather_mmq=0.42%`) next to
+`mmq_nvfp4=22.36%` and `act_quant=3.60%`. The only source patch kept from the
+phase is the test gate (`0053-test-paged-cover-ragged-MoE-dispatch.patch`); the
+metadata-only `LLAMA_MOE_FUSED_DISPATCH` shortcut is rejected.
+
+Rerun artifacts:
+
+- `/home/mudler/bench/phase8_ragged_moe_dispatch/ragged_gate_rerun_20260701_035529.txt`
+- `/home/mudler/bench/phase8_ragged_moe_dispatch/safety_rerun_20260701_035549/`
+
+Safety result:
+
+- `MUL_MAT_ID_RAGGED_MOE`: `6/6` on CUDA0.
+- Full `MUL_MAT_ID`: `806/806` on CUDA0.
+- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+
+Conclusion:
+
+- The md5 and op gates remain canonical on the unchanged production path.
+- Do not add a metadata/helper-only fused-dispatch hook. A future Phase 8
+  production candidate must reduce `mmq_nvfp4` or activation movement directly,
+  stay free of D2H id readback and new stream synchronizations, and then pass
+  the same md5/op gates before any serving A/B is considered.
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -579,6 +579,29 @@ Artifacts:
 - `/home/mudler/bench/phase13_gdn_global_ai32/ab/`
 - `/home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff`

+### Phase 8 ragged MoE dispatch closure
+
+The remaining Phase 8 source shortcut was closed without production CUDA edits.
+The live ragged serving profile showed helper metadata buckets too small to clear
+the `+5%` serving A/B gate (`mm_ids=0.66%`, `gather_mmq=0.42%`). Patch `0023`
+already handles the broadcast-activation NVFP4 path by quantizing unique tokens
+once and gathering FP4 blocks, so a metadata-only `LLAMA_MOE_FUSED_DISPATCH`
+hook would add conflict surface without attacking the dominant buckets.
+
+Safety rerun:
+
+- `MUL_MAT_ID_RAGGED_MOE`: `6/6` on CUDA0.
+- Full `MUL_MAT_ID`: `806/806` on CUDA0.
+- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+
+Decision:
+
+- Keep test patch `0053`.
+- Do not add a Phase 8 production patch unless it directly reduces
+  `mmq_nvfp4` or activation movement without D2H id readback, new
+  synchronizations, or md5 drift.
+
 ---

 # PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026)