From b862e2c5683e3f7b4ab751428fcc0e37ee737f6d Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Wed, 1 Jul 2026 00:42:36 +0000
Subject: [PATCH] docs(paged): stop ragged dispatch source shortcut

Assisted-by: Codex:gpt-5
---
 .../docs/GB10_PARITY_PHASE0_RESULTS.md      | 13 +++++++++++
 .../2026-07-01-serving-ragged-moe-phase8.md | 23 +++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index 6b1973f5e..099ea394b 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -775,3 +775,16 @@ Debug note:
 duplicate expert IDs within token 0. That is not a valid top-k routing shape
 and caused a CPU/CUDA mismatch followed by a CUDA fault. The committed gate
 preserves unique expert IDs per token while keeping cross-token load skew.
+
+Production-source decision:
+
+- Do not start a Phase 8 production CUDA patch yet.
+- Code inspection found that the existing native-FP4 MoE path already de-dups
+  broadcast activation quantization when `ne11 == 1`, then gathers FP4 blocks
+  before grouped MMQ.
+- The measured helper rows are small (`mm_ids=0.66%`, `gather_mmq=0.42%`).
+  A metadata-only fused-dispatch hook would not plausibly clear the `+5%`
+  serving A/B gate.
+- A future source candidate must reduce `mmq_nvfp4` (`22.36%`) or `act_quant`
+  (`3.60%`) directly, without D2H id readback, new stream synchronizations, or
+  md5 drift.
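
Reviewer note on the "would not plausibly clear the `+5%` gate" reasoning in the hunk above. The sketch below is a back-of-the-envelope check, not part of the patch. It assumes the quoted percentages are shares of total serving time and that removing a share translates into throughput the way Amdahl's law predicts; neither assumption is stated in the docs. The function names `max_gain` and `required_fraction` are hypothetical helpers introduced only for this illustration.

```python
# Shares of serving time quoted in the patch, as fractions of total time.
# ASSUMPTION: these are shares of end-to-end serving time on the critical path.
PROFILE = {
    "mm_ids": 0.0066,
    "gather_mmq": 0.0042,
    "mmq_nvfp4": 0.2236,
    "act_quant": 0.0360,
}
GATE = 0.05  # the +5% serving A/B gate named in the patch


def max_gain(removed_fraction: float) -> float:
    """Amdahl-style upper bound on throughput gain when a share of time vanishes."""
    if not 0.0 <= removed_fraction < 1.0:
        raise ValueError("removed_fraction must be in [0, 1)")
    return 1.0 / (1.0 - removed_fraction) - 1.0


def required_fraction(target_gain: float) -> float:
    """Share of total time that must disappear to reach target_gain."""
    if target_gain < 0.0:
        raise ValueError("target_gain must be non-negative")
    return 1.0 - 1.0 / (1.0 + target_gain)


# Best case for a metadata/helper-only hook: both helper rows cost nothing.
helper_share = PROFILE["mm_ids"] + PROFILE["gather_mmq"]
helper_best_case = max_gain(helper_share)

# Share of total time that has to go away to clear the gate.
needed_share = required_fraction(GATE)

# How much of the grouped-MMQ kernel alone would have to be saved.
mmq_cut_needed = needed_share / PROFILE["mmq_nvfp4"]

if __name__ == "__main__":
    print(f"helper-only best case: +{helper_best_case:.2%}")
    print(f"share of time to remove for +5%: {needed_share:.2%}")
    print(f"required cut inside mmq_nvfp4: {mmq_cut_needed:.1%}")
    print(f"act_quant fully removed: +{max_gain(PROFILE['act_quant']):.2%}")
```

Under these assumptions, eliminating both helper rows entirely yields about +1.1%, well short of the gate, which matches the decision recorded above. The same model suggests that removing `act_quant` outright would also fall short on its own (about +3.7%), so a candidate would likely need to trim roughly a fifth of `mmq_nvfp4` time, alone or combined with activation savings.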
diff --git a/docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md b/docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md
index 8a4f1f5f7..2c19be192 100644
--- a/docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md
+++ b/docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md
@@ -349,6 +349,29 @@ Selected Phase 8 candidate:
 - Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq.cuh`
 - Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu`
 
+**Status:** Not started. The profile and code inspection do not justify a
+metadata/helper-only prototype.
+
+Inspection result:
+
+- `ggml_cuda_mul_mat_q()` already runs the ids path as
+  `mm_ids_helper -> quantize/gather -> grouped MMQ`.
+- For native FP4 MoE with broadcast activations (`ne11 == 1`), patch `0023`
+  already quantizes unique tokens once and gathers FP4 blocks:
+  `quantize_mmq_fp4_cuda(... ids=nullptr ...)` followed by
+  `gather_mmq_fp4_cuda(...)`.
+- The live serving profile shows `mm_ids` at `0.66%` and `gather_mmq` at
+  `0.42%`, while `mmq_nvfp4` is `22.36%` and `act_quant` is `3.60%`.
+- Therefore a safe Phase 8 production patch must change grouped-MMQ execution
+  shape or activation movement. A default-off hook that only skips or repacks
+  metadata is not expected to clear the `+5%` serving A/B gate.
+
+Stop condition:
+
+- Do not edit production CUDA for Phase 8 until there is a concrete design for
+  reducing `mmq_nvfp4` or `act_quant` time without D2H id readback, new stream
+  synchronizations, or md5 drift.
+
 - [ ] **Step 1: Add env-gated entry point**
 
 Add a default-off env gate: