From b6885aa44691328eceba73500fd2826039db781b Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Wed, 1 Jul 2026 00:20:53 +0000 Subject: [PATCH] docs(paged): reject weighted combine fusion candidate Assisted-by: Codex:gpt-5 --- .../docs/GB10_PARITY_PHASE0_RESULTS.md | 47 +++++++++++++ .../docs/VLLM_PARITY_LEVER_MAP.md | 16 ++++- .../plans/2026-06-30-serving-source-phase7.md | 67 ++++++++++++++++++- 3 files changed, 124 insertions(+), 6 deletions(-) diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md index 6cf174634..d3107f82a 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md @@ -643,3 +643,50 @@ This gate is the correctness target for the next candidate: a deterministic post-down MoE weighted-combine fusion that preserves current f32 product and rank-order add semantics while avoiding the rejected SWIGLU/FP4-quantization shortcut. + +## Phase 7 Weighted-Combine Fusion Candidate Rejected + +Attempted candidate: fuse the post-down MoE router-weight multiply and +rank-ordered add fan-in: + +`ffn_moe_down -> ggml_mul(experts, weights) -> VIEW ranks -> ADD fan-in`. + +The candidate was fork-first, default-on during validation, and had a rollback +env switch: `LLAMA_MOE_NO_WEIGHTED_COMBINE_FUSION=1`. + +DGX artifacts: + +- `/home/mudler/bench/phase7_source_scope/test_backend_ops_moe_weighted_combine_orderfix.txt` +- `/home/mudler/bench/phase7_source_scope/test_backend_ops_mul_mat_id_weighted_combine_orderfix.txt` +- `/home/mudler/bench/phase7_source_scope/weighted_combine_orderfix_gates_chat/` +- `/home/mudler/bench/phase7_source_scope/weighted_combine_orderfix_nsys_completion/` +- `/home/mudler/bench/phase7_source_scope/weighted_combine_orderfix_serving_ab/` +- Rejected diff: + `/home/mudler/bench/phase7_source_scope/rejected-phase7-moe-weighted-combine-fusion.diff` + +Correctness and inference gates: + +- `MOE_WEIGHTED_COMBINE`: `7/7`. +- Broad `MUL_MAT_ID`: `806/806`. +- Canonical transcript md5: + - MoE `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense `5951a5b4d624ce891e22ab5fca9bc439`. + +Nsight proof: + +- Disabled run: no `k_moe_weighted_combine` kernels. +- Fused run: `110` `k_moe_weighted_combine` launches. + +Serving A/B (`n=128`, `ptok=128`, `gen=64`, `/v1/completions`): + +| path | decode tok/s/seq | decode agg tok/s | prefill tok/s | verdict | +|------|------------------|------------------|---------------|---------| +| `LLAMA_MOE_NO_WEIGHTED_COMBINE_FUSION=1` | 2.63 | 417.5 | 1345.2 | baseline | +| fused default | 2.63 | 417.0 | 1346.9 | reject; kernel fires but A/B is flat | + +Result: + +- Rejected as a production patch. The patch is md5-safe and the kernel fires, + but it does not improve the bounded serving workload. Keep patch `0052` as a + useful regression gate; do not retry this exact fan-in-only fusion unless a + fresh profile shows the weighted/add fan-in as a material bucket. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md index b7fbc8eb0..151c0ca74 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md @@ -41,7 +41,7 @@ Estimates triangulated from the committed numbers (232/68 GEMM, 11%/5% from the | **FlashAttention prefill** (QK^T·softmax·PV, 10 layers) | **~3-6%**† | maybe - L²-growing; bounded at npp=128, larger at serving context | **NO** | **NO** | | **MoE router + combine/scatter** | **~5-8%** | **Yes** - vLLM fuses gather/weight/scatter into the grouped-GEMM epilogue | **NO** | **NO** | | ┝ `argsort_top_k`(256→8) + softmax + weight-norm | ~2-3% | yes | no | no | -| ┝ combine: 7× fp32 `add` + weight `mul` (×40) | ~3-5% | yes | no | no | +| ┝ combine: 7× fp32 `add` + weight `mul` (×40) | tested flat in Phase 7 | yes | no | no | | **Activation quantization** (W4A4 e4m3 pass per GEMM) | **~3-6%** | **Yes - structurally**: vLLM W4A16-Marlin on GB10 has **no** activation-quant step | **NO**‡ | partial | | Norm + residual tail (attn/post/q/k/ssm/l2/out + adds) | ~4% | small (0042 fused the main one) | - | - | | RoPE + sigmoid/silu gates + scale | ~2-3% | small | - | - | @@ -54,7 +54,17 @@ Estimates triangulated from the committed numbers (232/68 GEMM, 11%/5% from the They cover ~71% of the prefill wall and the bulk of the gap. Three contributors are **materially uncovered** by either lever: ### Newly-identified lever 1 - MoE router + combine/scatter (the strongest miss on the decision model) -llama runs the expert routing and recombination as **separate memory-bound ggml ops**: `argsort_top_k` over 256 experts, softmax/normalize, then a fan-in of **7 fp32 `ggml_add` + a weight `ggml_mul`** per MoE layer (`llama-graph.cpp` ~1797-1824), every one of 40 layers. vLLM's fused-MoE (and Marlin grouped) path folds the gather, the router-weight multiply, and the scatter-accumulate into the **GEMM epilogue/prologue** - so this is overhead vLLM essentially does not pay. Est. ~5-8% of the MoE prefill wall, entirely outside GEMM (the `mul_mat_id` is covered; the surrounding argsort/adds/mul are not) and outside GDN. **Lever: a fused top-k-weighted expert-output accumulation (or a fused-MoE epilogue), removing the 7-add fan-in and the separate weight mul.** Bit-exact-gateable (it is an fp32 reduction-order change, same precedent as the paged-MoE `8cb0ce23`). +llama runs the expert routing and recombination as **separate memory-bound ggml ops**: `argsort_top_k` over 256 experts, softmax/normalize, then a fan-in of **7 fp32 `ggml_add` + a weight `ggml_mul`** per MoE layer (`llama-graph.cpp` ~1797-1824), every one of 40 layers. vLLM's fused-MoE (and Marlin grouped) path folds the gather, the router-weight multiply, and the scatter-accumulate into the **GEMM epilogue/prologue** - so this is overhead vLLM essentially does not pay. Est. ~5-8% of the MoE prefill wall, entirely outside GEMM (the `mul_mat_id` is covered; the surrounding argsort/adds/mul are not) and outside GDN. + +Phase 7 challenged the smallest version of this lever: a CUDA-only post-down +weighted-combine fusion that removed the separate router-weight `mul` plus +rank-order add fan-in while preserving md5. It passed `MOE_WEIGHTED_COMBINE` +`7/7`, `MUL_MAT_ID` `806/806`, and canonical MoE/dense md5 gates; Nsight proved +the fused kernel launched (`110` `k_moe_weighted_combine` calls). Serving A/B was +flat (`decode_agg_tps 417.5 disabled -> 417.0 fused`), so the fan-in-only patch +was rejected. The remaining plausible lever is a larger fused-MoE +prologue/epilogue that also removes gather/scatter or moves work into the GEMM +kernel, not another standalone fan-in fusion. ### Newly-identified lever 2 - the W4A4 activation-quant pass (a vLLM-asymmetry, not just a kernel-speed gap) Every NVFP4 GEMM (MMQ today, and the new 0034 FP4-MMA) **quantizes activations to e4m3 (amax/6 + code search) before the matmul** - a distinct, M-proportional kernel. vLLM on **sm_121 falls back to W4A16-Marlin** (the TENSORCORE_GDN_SCOPE confirms this: no tcgen05/cutlass-FP4 on GB10), i.e. **f16 activations, zero activation-quant**. So this pass (~3-6% of prefill) is a structural cost vLLM avoids, and it explains part of why even a peak FP4-MMA GEMM will not fully reach vLLM's prefill. The README's "act-quant FLAT" and "W4A16 rejected" verdicts are **decode/BW-bound findings**; in compute-bound prefill the trade is different and unaudited. **Lever: measure this quant bucket as its own nsys row; consider fusing the activation-quant into the GEMM prologue (cp.async + in-register quant) so it is not a separate global-memory pass.** @@ -142,7 +152,7 @@ On **consumer Blackwell (sm_120/sm_121: DGX Spark/GB10, RTX 5090, RTX PRO 6000)* | 2 | **Dense weight GEMM - prefill** (large-M, compute-bound) | Marlin grouped/dense, async cp.async pipeline, big tiles, ~bf16 peak | MMQ small-tile, 1 CTA/SM. **New native FP4-MMA large-M kernel @103 TFLOP/s being integrated** (beats cuBLAS-bf16, bit-exact) | prefill | dequant→bf16-cuBLAS lever (0033) was **rejected** (MMQ beat it 29-49%); the native FP4-MMA kernel is the real fix and could **beat** vLLM's bf16-Marlin here | | 3 | **MoE expert GEMM - decode** | Marlin FP4→bf16 grouped, indirect addressing | Grouped MMQ (`mul_mat_id`), sorted expert layout, native FP4-MMA | decode | **Parity** - both BW-floor. Recurrence/GEMM are *our wins*; residual = bf16-projection BW + host loop | | 4 | **MoE expert GEMM - prefill** | Marlin grouped GEMM, fused, big tiles | MMQ small-tile grouped (1 CTA/SM) | prefill | **GAP (#1 prefill bottleneck per docs).** Native FP4-MMA grouped kernel is the planned fix; today MMQ is small-tile-bound | -| 5 | **MoE routing / gather / scatter / epilogue** | Triton persistent fused-MoE: indirect token addressing, **fused gate+up + SwiGLU epilogue**, once-quantize, scatter+weighted-combine fused | Sorted per-expert layout; **NVFP4 act-quant de-dup (0023)** mirrors once-quantize; SwiGLU is **separate ops** (no fused epilogue) | both | Partial parity. **No fused gate+up+SwiGLU epilogue** (extra IO passes); matters at prefill, minor at decode | +| 5 | **MoE routing / gather / scatter / epilogue** | Triton persistent fused-MoE: indirect token addressing, **fused gate+up + SwiGLU epilogue**, once-quantize, scatter+weighted-combine fused | Sorted per-expert layout; **NVFP4 act-quant de-dup (0023)** mirrors once-quantize; SwiGLU is **separate ops** (no fused epilogue) | both | Partial parity. **No fused gate+up+SwiGLU epilogue** (extra IO passes); fan-in-only weighted-combine fusion was Phase 7 tested-flat | | 6 | **GDN / linear-attn - decode** | FLA Triton `fused_recurrent_gated_delta_rule` + `fused_sigmoid_gating_delta_rule_update` (sequential, per-step state) | Fused sequential recurrence: in-place state write-back (0018), fused state gather (0019), o_proj MMVQ→MMQ (0020), occupancy retune (0022), conv-tap gather fusion (0028) | decode | **Parity-to-win** - recurrence runs at **102.6% of vLLM bandwidth**, 84.6% of GB10 peak BW. Our strongest area | | 7 | **GDN / linear-attn - prefill** | FLA `chunk_gated_delta_rule`: intra-chunk products on **tensor cores** (UT-transform), ~2.5× cheaper | Tuned **sequential** scan (default); chunked parallel-scan (0031) is **opt-in + ~22% slower** (serial f32 reductions, no TC, C=16 forced by 99KB smem) | prefill | **GAP (#2 prefill bottleneck).** No tensor-core chunked GDN. Scoped (TENSORCORE_GDN_SCOPE, mma.sync only); **Gram products de-risked at 6.7-9.3× over sequential**, kernel not yet built | | 8 | **Causal conv1d (short conv)** | FLA `causal_conv1d_fn`/`_update` Triton | `ggml_ssm_conv_update_inplace` (0021): 5-op chain → 1 op, in-place ring | both | Parity | diff --git a/docs/superpowers/plans/2026-06-30-serving-source-phase7.md b/docs/superpowers/plans/2026-06-30-serving-source-phase7.md index 977b6ecf2..6ca880cf9 100644 --- a/docs/superpowers/plans/2026-06-30-serving-source-phase7.md +++ b/docs/superpowers/plans/2026-06-30-serving-source-phase7.md @@ -1,6 +1,6 @@ # Phase 7: Serving Source Candidate Scope -**Status:** Test-gate patch landed. First production CUDA fusion candidate +**Status:** Test-gate patches landed. Two production CUDA fusion candidates rejected after DGX gates and serving A/B. **Goal:** Select one maintainable source candidate for the remaining GB10 MoE @@ -220,8 +220,19 @@ to implementation when all are true: - Fork commit: `3ef7eb9e4` (`test(paged): cover MoE weighted combine chain`). - LocalAI patch: `0052-test-paged-cover-MoE-weighted-combine-chain.patch`. - DGX gate: `MOE_WEIGHTED_COMBINE` `7/7` on CUDA0. -- [ ] Implement weighted-combine fusion only if the test gate is stable. -- [ ] Run op/md5 gates before serving A/B. +- [x] Implement weighted-combine fusion only if the test gate is stable. + - Implemented as a fork-first candidate, then rejected after serving A/B. + - Rejected diff saved at + `/home/mudler/bench/phase7_source_scope/rejected-phase7-moe-weighted-combine-fusion.diff`. +- [x] Run op/md5 gates before serving A/B. + - `MOE_WEIGHTED_COMBINE`: `7/7`. + - `MUL_MAT_ID`: `806/806`. + - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. + - Nsight proof: enabled run showed `110` launches of + `k_moe_weighted_combine`; disabled run showed none. + - Serving A/B was flat: disabled `decode_agg_tps=417.5`, + fused `decode_agg_tps=417.0`. ## Required Tests Before Track A Source Patch @@ -327,3 +338,53 @@ DGX result: - `test-backend-ops test -b CUDA0 -o MOE_WEIGHTED_COMBINE -j 1`: `7/7`. This is a test-only patch and does not change the production inference path. + +## Rejected Production Candidate: MoE Weighted-Combine Fusion + +Attempted a fork-first CUDA fusion for the post-down MoE combine: + +`ffn_moe_down -> ggml_mul(experts, weights) -> VIEW ranks -> ADD fan-in`. + +The candidate added a narrow graph recognizer and a CUDA kernel that computes +each rank's f32 product and accumulates ranks in the same `0..n_used-1` order as +the existing add chain. It was default-on with +`LLAMA_MOE_NO_WEIGHTED_COMBINE_FUSION=1` as the rollback switch during +validation. + +Important debugging result: + +- The first serving profile did not show the new kernel. Root cause: the + recognizer's `ggml_can_fuse_subgraph()` op vector was interleaved as + `MUL, VIEW, VIEW, ADD...`, while the real graph order is + `MUL, VIEW..., ADD...`. +- After fixing the op vector, Nsight showed the enabled completion run launched + `k_moe_weighted_combine` `110` times and the disabled run launched it `0` + times. + +Final DGX artifacts live under `/home/mudler/bench/phase7_source_scope/`: + +- Focused gate: + `test_backend_ops_moe_weighted_combine_orderfix.txt` -> `7/7`. +- Broad MoE routing gate: + `test_backend_ops_mul_mat_id_weighted_combine_orderfix.txt` -> `806/806`. +- Canonical transcript md5 gates: + `weighted_combine_orderfix_gates_chat/` + - MoE: `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense: `5951a5b4d624ce891e22ab5fca9bc439`. +- Nsight completion proof: + `weighted_combine_orderfix_nsys_completion/` + - disabled: no `k_moe_weighted_combine` kernels. + - fused: `110` `k_moe_weighted_combine` kernels. +- Serving A/B: + `weighted_combine_orderfix_serving_ab/` + - disabled: `decode_agg_tps=417.5`, `decode_perseq_tps=2.63`, + `prefill_tps=1345.2`. + - fused: `decode_agg_tps=417.0`, `decode_perseq_tps=2.63`, + `prefill_tps=1346.9`. + +Verdict: reject the production patch. It is md5-safe and demonstrably fires, but +the bounded serving result is flat, so the extra default CUDA path is not worth +the upstream conflict and maintenance cost. Keep patch `0052` as coverage for +future structural MoE work, but do not retry this exact post-down fan-in fusion +unless a profile shows `ffn_moe_weighted`/add fan-in as a material bucket under a +new workload.