docs(paged): reject weighted combine fusion candidate

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 00:20:53 +00:00
parent 4b6fc0fa1c
commit b6885aa446
3 changed files with 124 additions and 6 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -643,3 +643,50 @@ This gate is the correctness target for the next candidate: a deterministic
 post-down MoE weighted-combine fusion that preserves current f32 product and
 rank-order add semantics while avoiding the rejected SWIGLU/FP4-quantization
 shortcut.
+
+## Phase 7 Weighted-Combine Fusion Candidate Rejected
+
+Attempted candidate: fuse the post-down MoE router-weight multiply and
+rank-ordered add fan-in:
+
+`ffn_moe_down -> ggml_mul(experts, weights) -> VIEW ranks -> ADD fan-in`.
+
+The candidate was fork-first, default-on during validation, and had a rollback
+env switch: `LLAMA_MOE_NO_WEIGHTED_COMBINE_FUSION=1`.
+
+DGX artifacts:
+
+- `/home/mudler/bench/phase7_source_scope/test_backend_ops_moe_weighted_combine_orderfix.txt`
+- `/home/mudler/bench/phase7_source_scope/test_backend_ops_mul_mat_id_weighted_combine_orderfix.txt`
+- `/home/mudler/bench/phase7_source_scope/weighted_combine_orderfix_gates_chat/`
+- `/home/mudler/bench/phase7_source_scope/weighted_combine_orderfix_nsys_completion/`
+- `/home/mudler/bench/phase7_source_scope/weighted_combine_orderfix_serving_ab/`
+- Rejected diff:
+  `/home/mudler/bench/phase7_source_scope/rejected-phase7-moe-weighted-combine-fusion.diff`
+
+Correctness and inference gates:
+
+- `MOE_WEIGHTED_COMBINE`: `7/7`.
+- Broad `MUL_MAT_ID`: `806/806`.
+- Canonical transcript md5:
+  - MoE `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Dense `5951a5b4d624ce891e22ab5fca9bc439`.
+
+Nsight proof:
+
+- Disabled run: no `k_moe_weighted_combine` kernels.
+- Fused run: `110` `k_moe_weighted_combine` launches.
+
+Serving A/B (`n=128`, `ptok=128`, `gen=64`, `/v1/completions`):
+
+| path | decode tok/s/seq | decode agg tok/s | prefill tok/s | verdict |
+|------|------------------|------------------|---------------|---------|
+| `LLAMA_MOE_NO_WEIGHTED_COMBINE_FUSION=1` | 2.63 | 417.5 | 1345.2 | baseline |
+| fused default | 2.63 | 417.0 | 1346.9 | reject; kernel fires but A/B is flat |
+
+Result:
+
+- Rejected as a production patch. The patch is md5-safe and the kernel fires,
+  but it does not improve the bounded serving workload. Keep patch `0052` as a
+  useful regression gate; do not retry this exact fan-in-only fusion unless a
+  fresh profile shows the weighted/add fan-in as a material bucket.
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -41,7 +41,7 @@ Estimates triangulated from the committed numbers (232/68 GEMM, 11%/5% from the
 | **FlashAttention prefill** (QK^T·softmax·PV, 10 layers) | **~3-6%**† | maybe - L²-growing; bounded at npp=128, larger at serving context | **NO** | **NO** |
 | **MoE router + combine/scatter** | **~5-8%** | **Yes** - vLLM fuses gather/weight/scatter into the grouped-GEMM epilogue | **NO** | **NO** |
 | ┝ `argsort_top_k`(256→8) + softmax + weight-norm | ~2-3% | yes | no | no |
-| ┝ combine: 7× fp32 `add` + weight `mul` (×40) | ~3-5% | yes | no | no |
+| ┝ combine: 7× fp32 `add` + weight `mul` (×40) | tested flat in Phase 7 | yes | no | no |
 | **Activation quantization** (W4A4 e4m3 pass per GEMM) | **~3-6%** | **Yes - structurally**: vLLM W4A16-Marlin on GB10 has **no** activation-quant step | **NO**‡ | partial |
 | Norm + residual tail (attn/post/q/k/ssm/l2/out + adds) | ~4% | small (0042 fused the main one) | - | - |
 | RoPE + sigmoid/silu gates + scale | ~2-3% | small | - | - |
@@ -54,7 +54,17 @@ Estimates triangulated from the committed numbers (232/68 GEMM, 11%/5% from the
 They cover ~71% of the prefill wall and the bulk of the gap. Three contributors are **materially uncovered** by either lever:

 ### Newly-identified lever 1 - MoE router + combine/scatter (the strongest miss on the decision model)
-llama runs the expert routing and recombination as **separate memory-bound ggml ops**: `argsort_top_k` over 256 experts, softmax/normalize, then a fan-in of **7 fp32 `ggml_add` + a weight `ggml_mul`** per MoE layer (`llama-graph.cpp` ~1797-1824), every one of 40 layers. vLLM's fused-MoE (and Marlin grouped) path folds the gather, the router-weight multiply, and the scatter-accumulate into the **GEMM epilogue/prologue** - so this is overhead vLLM essentially does not pay. Est. ~5-8% of the MoE prefill wall, entirely outside GEMM (the `mul_mat_id` is covered; the surrounding argsort/adds/mul are not) and outside GDN. **Lever: a fused top-k-weighted expert-output accumulation (or a fused-MoE epilogue), removing the 7-add fan-in and the separate weight mul.** Bit-exact-gateable (it is an fp32 reduction-order change, same precedent as the paged-MoE `8cb0ce23`).
+llama runs the expert routing and recombination as **separate memory-bound ggml ops**: `argsort_top_k` over 256 experts, softmax/normalize, then a fan-in of **7 fp32 `ggml_add` + a weight `ggml_mul`** per MoE layer (`llama-graph.cpp` ~1797-1824), every one of 40 layers. vLLM's fused-MoE (and Marlin grouped) path folds the gather, the router-weight multiply, and the scatter-accumulate into the **GEMM epilogue/prologue** - so this is overhead vLLM essentially does not pay. Est. ~5-8% of the MoE prefill wall, entirely outside GEMM (the `mul_mat_id` is covered; the surrounding argsort/adds/mul are not) and outside GDN.
+
+Phase 7 challenged the smallest version of this lever: a CUDA-only post-down
+weighted-combine fusion that removed the separate router-weight `mul` plus
+rank-order add fan-in while preserving md5. It passed `MOE_WEIGHTED_COMBINE`
+`7/7`, `MUL_MAT_ID` `806/806`, and canonical MoE/dense md5 gates; Nsight proved
+the fused kernel launched (`110` `k_moe_weighted_combine` calls). Serving A/B was
+flat (`decode_agg_tps 417.5 disabled -> 417.0 fused`), so the fan-in-only patch
+was rejected. The remaining plausible lever is a larger fused-MoE
+prologue/epilogue that also removes gather/scatter or moves work into the GEMM
+kernel, not another standalone fan-in fusion.

 ### Newly-identified lever 2 - the W4A4 activation-quant pass (a vLLM-asymmetry, not just a kernel-speed gap)
 Every NVFP4 GEMM (MMQ today, and the new 0034 FP4-MMA) **quantizes activations to e4m3 (amax/6 + code search) before the matmul** - a distinct, M-proportional kernel. vLLM on **sm_121 falls back to W4A16-Marlin** (the TENSORCORE_GDN_SCOPE confirms this: no tcgen05/cutlass-FP4 on GB10), i.e. **f16 activations, zero activation-quant**. So this pass (~3-6% of prefill) is a structural cost vLLM avoids, and it explains part of why even a peak FP4-MMA GEMM will not fully reach vLLM's prefill. The README's "act-quant FLAT" and "W4A16 rejected" verdicts are **decode/BW-bound findings**; in compute-bound prefill the trade is different and unaudited. **Lever: measure this quant bucket as its own nsys row; consider fusing the activation-quant into the GEMM prologue (cp.async + in-register quant) so it is not a separate global-memory pass.**
@@ -142,7 +152,7 @@ On **consumer Blackwell (sm_120/sm_121: DGX Spark/GB10, RTX 5090, RTX PRO 6000)*
 | 2 | **Dense weight GEMM - prefill** (large-M, compute-bound) | Marlin grouped/dense, async cp.async pipeline, big tiles, ~bf16 peak | MMQ small-tile, 1 CTA/SM. **New native FP4-MMA large-M kernel @103 TFLOP/s being integrated** (beats cuBLAS-bf16, bit-exact) | prefill | dequant→bf16-cuBLAS lever (0033) was **rejected** (MMQ beat it 29-49%); the native FP4-MMA kernel is the real fix and could **beat** vLLM's bf16-Marlin here |
 | 3 | **MoE expert GEMM - decode** | Marlin FP4→bf16 grouped, indirect addressing | Grouped MMQ (`mul_mat_id`), sorted expert layout, native FP4-MMA | decode | **Parity** - both BW-floor. Recurrence/GEMM are *our wins*; residual = bf16-projection BW + host loop |
 | 4 | **MoE expert GEMM - prefill** | Marlin grouped GEMM, fused, big tiles | MMQ small-tile grouped (1 CTA/SM) | prefill | **GAP (#1 prefill bottleneck per docs).** Native FP4-MMA grouped kernel is the planned fix; today MMQ is small-tile-bound |
-| 5 | **MoE routing / gather / scatter / epilogue** | Triton persistent fused-MoE: indirect token addressing, **fused gate+up + SwiGLU epilogue**, once-quantize, scatter+weighted-combine fused | Sorted per-expert layout; **NVFP4 act-quant de-dup (0023)** mirrors once-quantize; SwiGLU is **separate ops** (no fused epilogue) | both | Partial parity. **No fused gate+up+SwiGLU epilogue** (extra IO passes); matters at prefill, minor at decode |
+| 5 | **MoE routing / gather / scatter / epilogue** | Triton persistent fused-MoE: indirect token addressing, **fused gate+up + SwiGLU epilogue**, once-quantize, scatter+weighted-combine fused | Sorted per-expert layout; **NVFP4 act-quant de-dup (0023)** mirrors once-quantize; SwiGLU is **separate ops** (no fused epilogue) | both | Partial parity. **No fused gate+up+SwiGLU epilogue** (extra IO passes); fan-in-only weighted-combine fusion was Phase 7 tested-flat |
 | 6 | **GDN / linear-attn - decode** | FLA Triton `fused_recurrent_gated_delta_rule` + `fused_sigmoid_gating_delta_rule_update` (sequential, per-step state) | Fused sequential recurrence: in-place state write-back (0018), fused state gather (0019), o_proj MMVQ→MMQ (0020), occupancy retune (0022), conv-tap gather fusion (0028) | decode | **Parity-to-win** - recurrence runs at **102.6% of vLLM bandwidth**, 84.6% of GB10 peak BW. Our strongest area |
 | 7 | **GDN / linear-attn - prefill** | FLA `chunk_gated_delta_rule`: intra-chunk products on **tensor cores** (UT-transform), ~2.5× cheaper | Tuned **sequential** scan (default); chunked parallel-scan (0031) is **opt-in + ~22% slower** (serial f32 reductions, no TC, C=16 forced by 99KB smem) | prefill | **GAP (#2 prefill bottleneck).** No tensor-core chunked GDN. Scoped (TENSORCORE_GDN_SCOPE, mma.sync only); **Gram products de-risked at 6.7-9.3× over sequential**, kernel not yet built |
 | 8 | **Causal conv1d (short conv)** | FLA `causal_conv1d_fn`/`_update` Triton | `ggml_ssm_conv_update_inplace` (0021): 5-op chain → 1 op, in-place ring | both | Parity |