From b6885aa44691328eceba73500fd2826039db781b Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Wed, 1 Jul 2026 00:20:53 +0000
Subject: [PATCH] docs(paged): reject weighted combine fusion candidate

Assisted-by: Codex:gpt-5
---
 .../docs/GB10_PARITY_PHASE0_RESULTS.md        | 47 +++++++++++++
 .../docs/VLLM_PARITY_LEVER_MAP.md             | 16 ++++-
 .../plans/2026-06-30-serving-source-phase7.md | 67 ++++++++++++++++++-
 3 files changed, 124 insertions(+), 6 deletions(-)

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index 6cf174634..d3107f82a 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -643,3 +643,50 @@ This gate is the correctness target for the next candidate: a deterministic
 post-down MoE weighted-combine fusion that preserves current f32 product and
 rank-order add semantics while avoiding the rejected SWIGLU/FP4-quantization
 shortcut.
+
+## Phase 7 Weighted-Combine Fusion Candidate Rejected
+
+Attempted candidate: fuse the post-down MoE router-weight multiply and
+rank-ordered add fan-in:
+
+`ffn_moe_down -> ggml_mul(experts, weights) -> VIEW ranks -> ADD fan-in`.
+
+The candidate was fork-first, default-on during validation, and had a rollback
+env switch: `LLAMA_MOE_NO_WEIGHTED_COMBINE_FUSION=1`.
+
+DGX artifacts:
+
+- `/home/mudler/bench/phase7_source_scope/test_backend_ops_moe_weighted_combine_orderfix.txt`
+- `/home/mudler/bench/phase7_source_scope/test_backend_ops_mul_mat_id_weighted_combine_orderfix.txt`
+- `/home/mudler/bench/phase7_source_scope/weighted_combine_orderfix_gates_chat/`
+- `/home/mudler/bench/phase7_source_scope/weighted_combine_orderfix_nsys_completion/`
+- `/home/mudler/bench/phase7_source_scope/weighted_combine_orderfix_serving_ab/`
+- Rejected diff:
+  `/home/mudler/bench/phase7_source_scope/rejected-phase7-moe-weighted-combine-fusion.diff`
+
+Correctness and inference gates:
+
+- `MOE_WEIGHTED_COMBINE`: `7/7`.
+- Broad `MUL_MAT_ID`: `806/806`.
+- Canonical transcript md5:
+  - MoE `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Dense `5951a5b4d624ce891e22ab5fca9bc439`.
+
+Nsight proof:
+
+- Disabled run: no `k_moe_weighted_combine` kernels.
+- Fused run: `110` `k_moe_weighted_combine` launches.
+
+Serving A/B (`n=128`, `ptok=128`, `gen=64`, `/v1/completions`):
+
+| path | decode tok/s/seq | decode agg tok/s | prefill tok/s | verdict |
+|------|------------------|------------------|---------------|---------|
+| `LLAMA_MOE_NO_WEIGHTED_COMBINE_FUSION=1` | 2.63 | 417.5 | 1345.2 | baseline |
+| fused default | 2.63 | 417.0 | 1346.9 | reject; kernel fires but A/B is flat |
+
+Result:
+
+- Rejected as a production patch. The patch is md5-safe and the kernel fires,
+  but it does not improve the bounded serving workload. Keep patch `0052` as a
+  useful regression gate; do not retry this exact fan-in-only fusion unless a
+  fresh profile shows the weighted/add fan-in as a material bucket.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index b7fbc8eb0..151c0ca74 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -41,7 +41,7 @@ Estimates triangulated from the committed numbers (232/68 GEMM, 11%/5% from the
 | **FlashAttention prefill** (QK^T·softmax·PV, 10 layers) | **~3-6%**† | maybe - L²-growing; bounded at npp=128, larger at serving context | **NO** | **NO** |
 | **MoE router + combine/scatter** | **~5-8%** | **Yes** - vLLM fuses gather/weight/scatter into the grouped-GEMM epilogue | **NO** | **NO** |
 | ┝ `argsort_top_k`(256→8) + softmax + weight-norm | ~2-3% | yes | no | no |
-| ┝ combine: 7× fp32 `add` + weight `mul` (×40) | ~3-5% | yes | no | no |
+| ┝ combine: 7× fp32 `add` + weight `mul` (×40) | tested flat in Phase 7 | yes | no | no |
 | **Activation quantization** (W4A4 e4m3 pass per GEMM) | **~3-6%** | **Yes - structurally**: vLLM W4A16-Marlin on GB10 has **no** activation-quant step | **NO**‡ | partial |
 | Norm + residual tail (attn/post/q/k/ssm/l2/out + adds) | ~4% | small (0042 fused the main one) | - | - |
 | RoPE + sigmoid/silu gates + scale | ~2-3% | small | - | - |
@@ -54,7 +54,17 @@ Estimates triangulated from the committed numbers (232/68 GEMM, 11%/5% from the
 They cover ~71% of the prefill wall and the bulk of the gap. Three contributors are **materially uncovered** by either lever:
 
 ### Newly-identified lever 1 - MoE router + combine/scatter (the strongest miss on the decision model)
-llama runs the expert routing and recombination as **separate memory-bound ggml ops**: `argsort_top_k` over 256 experts, softmax/normalize, then a fan-in of **7 fp32 `ggml_add` + a weight `ggml_mul`** per MoE layer (`llama-graph.cpp` ~1797-1824), every one of 40 layers. vLLM's fused-MoE (and Marlin grouped) path folds the gather, the router-weight multiply, and the scatter-accumulate into the **GEMM epilogue/prologue** - so this is overhead vLLM essentially does not pay. Est. ~5-8% of the MoE prefill wall, entirely outside GEMM (the `mul_mat_id` is covered; the surrounding argsort/adds/mul are not) and outside GDN. **Lever: a fused top-k-weighted expert-output accumulation (or a fused-MoE epilogue), removing the 7-add fan-in and the separate weight mul.** Bit-exact-gateable (it is an fp32 reduction-order change, same precedent as the paged-MoE `8cb0ce23`).
+llama runs the expert routing and recombination as **separate memory-bound ggml ops**: `argsort_top_k` over 256 experts, softmax/normalize, then a fan-in of **7 fp32 `ggml_add` + a weight `ggml_mul`** per MoE layer (`llama-graph.cpp` ~1797-1824), every one of 40 layers. vLLM's fused-MoE (and Marlin grouped) path folds the gather, the router-weight multiply, and the scatter-accumulate into the **GEMM epilogue/prologue** - so this is overhead vLLM essentially does not pay. Est. ~5-8% of the MoE prefill wall, entirely outside GEMM (the `mul_mat_id` is covered; the surrounding argsort/adds/mul are not) and outside GDN.
+
+Phase 7 challenged the smallest version of this lever: a CUDA-only post-down
+weighted-combine fusion that removed the separate router-weight `mul` plus
+rank-order add fan-in while preserving md5. It passed `MOE_WEIGHTED_COMBINE`
+`7/7`, `MUL_MAT_ID` `806/806`, and canonical MoE/dense md5 gates; Nsight proved
+the fused kernel launched (`110` `k_moe_weighted_combine` calls). Serving A/B was
+flat (`decode_agg_tps 417.5 disabled -> 417.0 fused`), so the fan-in-only patch
+was rejected. The remaining plausible lever is a larger fused-MoE
+prologue/epilogue that also removes gather/scatter or moves work into the GEMM
+kernel, not another standalone fan-in fusion.
 
 ### Newly-identified lever 2 - the W4A4 activation-quant pass (a vLLM-asymmetry, not just a kernel-speed gap)
 Every NVFP4 GEMM (MMQ today, and the new 0034 FP4-MMA) **quantizes activations to e4m3 (amax/6 + code search) before the matmul** - a distinct, M-proportional kernel. vLLM on **sm_121 falls back to W4A16-Marlin** (the TENSORCORE_GDN_SCOPE confirms this: no tcgen05/cutlass-FP4 on GB10), i.e. **f16 activations, zero activation-quant**. So this pass (~3-6% of prefill) is a structural cost vLLM avoids, and it explains part of why even a peak FP4-MMA GEMM will not fully reach vLLM's prefill. The README's "act-quant FLAT" and "W4A16 rejected" verdicts are **decode/BW-bound findings**; in compute-bound prefill the trade is different and unaudited. **Lever: measure this quant bucket as its own nsys row; consider fusing the activation-quant into the GEMM prologue (cp.async + in-register quant) so it is not a separate global-memory pass.**
@@ -142,7 +152,7 @@ On **consumer Blackwell (sm_120/sm_121: DGX Spark/GB10, RTX 5090, RTX PRO 6000)*
 | 2 | **Dense weight GEMM - prefill** (large-M, compute-bound) | Marlin grouped/dense, async cp.async pipeline, big tiles, ~bf16 peak | MMQ small-tile, 1 CTA/SM. **New native FP4-MMA large-M kernel @103 TFLOP/s being integrated** (beats cuBLAS-bf16, bit-exact) | prefill | dequant→bf16-cuBLAS lever (0033) was **rejected** (MMQ beat it 29-49%); the native FP4-MMA kernel is the real fix and could **beat** vLLM's bf16-Marlin here |
 | 3 | **MoE expert GEMM - decode** | Marlin FP4→bf16 grouped, indirect addressing | Grouped MMQ (`mul_mat_id`), sorted expert layout, native FP4-MMA | decode | **Parity** - both BW-floor. Recurrence/GEMM are *our wins*; residual = bf16-projection BW + host loop |
 | 4 | **MoE expert GEMM - prefill** | Marlin grouped GEMM, fused, big tiles | MMQ small-tile grouped (1 CTA/SM) | prefill | **GAP (#1 prefill bottleneck per docs).** Native FP4-MMA grouped kernel is the planned fix; today MMQ is small-tile-bound |
-| 5 | **MoE routing / gather / scatter / epilogue** | Triton persistent fused-MoE: indirect token addressing, **fused gate+up + SwiGLU epilogue**, once-quantize, scatter+weighted-combine fused | Sorted per-expert layout; **NVFP4 act-quant de-dup (0023)** mirrors once-quantize; SwiGLU is **separate ops** (no fused epilogue) | both | Partial parity. **No fused gate+up+SwiGLU epilogue** (extra IO passes); matters at prefill, minor at decode |
+| 5 | **MoE routing / gather / scatter / epilogue** | Triton persistent fused-MoE: indirect token addressing, **fused gate+up + SwiGLU epilogue**, once-quantize, scatter+weighted-combine fused | Sorted per-expert layout; **NVFP4 act-quant de-dup (0023)** mirrors once-quantize; SwiGLU is **separate ops** (no fused epilogue) | both | Partial parity. **No fused gate+up+SwiGLU epilogue** (extra IO passes); fan-in-only weighted-combine fusion was Phase 7 tested-flat |
 | 6 | **GDN / linear-attn - decode** | FLA Triton `fused_recurrent_gated_delta_rule` + `fused_sigmoid_gating_delta_rule_update` (sequential, per-step state) | Fused sequential recurrence: in-place state write-back (0018), fused state gather (0019), o_proj MMVQ→MMQ (0020), occupancy retune (0022), conv-tap gather fusion (0028) | decode | **Parity-to-win** - recurrence runs at **102.6% of vLLM bandwidth**, 84.6% of GB10 peak BW. Our strongest area |
 | 7 | **GDN / linear-attn - prefill** | FLA `chunk_gated_delta_rule`: intra-chunk products on **tensor cores** (UT-transform), ~2.5× cheaper | Tuned **sequential** scan (default); chunked parallel-scan (0031) is **opt-in + ~22% slower** (serial f32 reductions, no TC, C=16 forced by 99KB smem) | prefill | **GAP (#2 prefill bottleneck).** No tensor-core chunked GDN. Scoped (TENSORCORE_GDN_SCOPE, mma.sync only); **Gram products de-risked at 6.7-9.3× over sequential**, kernel not yet built |
 | 8 | **Causal conv1d (short conv)** | FLA `causal_conv1d_fn`/`_update` Triton | `ggml_ssm_conv_update_inplace` (0021): 5-op chain → 1 op, in-place ring | both | Parity |
diff --git a/docs/superpowers/plans/2026-06-30-serving-source-phase7.md b/docs/superpowers/plans/2026-06-30-serving-source-phase7.md
index 977b6ecf2..6ca880cf9 100644
--- a/docs/superpowers/plans/2026-06-30-serving-source-phase7.md
+++ b/docs/superpowers/plans/2026-06-30-serving-source-phase7.md
@@ -1,6 +1,6 @@
 # Phase 7: Serving Source Candidate Scope
 
-**Status:** Test-gate patch landed. First production CUDA fusion candidate
+**Status:** Test-gate patches landed. Two production CUDA fusion candidates
 rejected after DGX gates and serving A/B.
 
 **Goal:** Select one maintainable source candidate for the remaining GB10 MoE
@@ -220,8 +220,19 @@ to implementation when all are true:
   - Fork commit: `3ef7eb9e4` (`test(paged): cover MoE weighted combine chain`).
   - LocalAI patch: `0052-test-paged-cover-MoE-weighted-combine-chain.patch`.
   - DGX gate: `MOE_WEIGHTED_COMBINE` `7/7` on CUDA0.
-- [ ] Implement weighted-combine fusion only if the test gate is stable.
-- [ ] Run op/md5 gates before serving A/B.
+- [x] Implement weighted-combine fusion only if the test gate is stable.
+  - Implemented as a fork-first candidate, then rejected after serving A/B.
+  - Rejected diff saved at
+    `/home/mudler/bench/phase7_source_scope/rejected-phase7-moe-weighted-combine-fusion.diff`.
+- [x] Run op/md5 gates before serving A/B.
+  - `MOE_WEIGHTED_COMBINE`: `7/7`.
+  - `MUL_MAT_ID`: `806/806`.
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+  - Nsight proof: enabled run showed `110` launches of
+    `k_moe_weighted_combine`; disabled run showed none.
+  - Serving A/B was flat: disabled `decode_agg_tps=417.5`,
+    fused `decode_agg_tps=417.0`.
 
 ## Required Tests Before Track A Source Patch
 
@@ -327,3 +338,53 @@ DGX result:
 - `test-backend-ops test -b CUDA0 -o MOE_WEIGHTED_COMBINE -j 1`: `7/7`.
 
 This is a test-only patch and does not change the production inference path.
+
+## Rejected Production Candidate: MoE Weighted-Combine Fusion
+
+Attempted a fork-first CUDA fusion for the post-down MoE combine:
+
+`ffn_moe_down -> ggml_mul(experts, weights) -> VIEW ranks -> ADD fan-in`.
+
+The candidate added a narrow graph recognizer and a CUDA kernel that computes
+each rank's f32 product and accumulates ranks in the same `0..n_used-1` order as
+the existing add chain. It was default-on with
+`LLAMA_MOE_NO_WEIGHTED_COMBINE_FUSION=1` as the rollback switch during
+validation.
+
+Important debugging result:
+
+- The first serving profile did not show the new kernel. Root cause: the
+  recognizer's `ggml_can_fuse_subgraph()` op vector was interleaved as
+  `MUL, VIEW, VIEW, ADD...`, while the real graph order is
+  `MUL, VIEW..., ADD...`.
+- After fixing the op vector, Nsight showed the enabled completion run launched
+  `k_moe_weighted_combine` `110` times and the disabled run launched it `0`
+  times.
+
+Final DGX artifacts live under `/home/mudler/bench/phase7_source_scope/`:
+
+- Focused gate:
+  `test_backend_ops_moe_weighted_combine_orderfix.txt` -> `7/7`.
+- Broad MoE routing gate:
+  `test_backend_ops_mul_mat_id_weighted_combine_orderfix.txt` -> `806/806`.
+- Canonical transcript md5 gates:
+  `weighted_combine_orderfix_gates_chat/`
+  - MoE: `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Dense: `5951a5b4d624ce891e22ab5fca9bc439`.
+- Nsight completion proof:
+  `weighted_combine_orderfix_nsys_completion/`
+  - disabled: no `k_moe_weighted_combine` kernels.
+  - fused: `110` `k_moe_weighted_combine` kernels.
+- Serving A/B:
+  `weighted_combine_orderfix_serving_ab/`
+  - disabled: `decode_agg_tps=417.5`, `decode_perseq_tps=2.63`,
+    `prefill_tps=1345.2`.
+  - fused: `decode_agg_tps=417.0`, `decode_perseq_tps=2.63`,
+    `prefill_tps=1346.9`.
+
+Verdict: reject the production patch. It is md5-safe and demonstrably fires, but
+the bounded serving result is flat, so the extra default CUDA path is not worth
+the upstream conflict and maintenance cost. Keep patch `0052` as coverage for
+future structural MoE work, but do not retry this exact post-down fan-in fusion
+unless a profile shows `ffn_moe_weighted`/add fan-in as a material bucket under a
+new workload.