docs(paged): reject graph-time gate fusion shortcut

Assisted-by: Codex:gpt-5
2026-07-02 20:37:03 -04:00 · 2026-07-01 06:55:48 +00:00
parent 5354adcffb
commit 52c11b1ce5
4 changed files with 219 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2263,3 +2263,35 @@ Decision:
  preserves F32 math and split semantics. Gate it with MoE/dense md5,
  `MUL_MAT`, `MUL_MAT_ID`, and KL validation if either md5 changes before any
  serving benchmark.
+
+## Phase 39 Gate Fusion Feasibility
+
+Phase 39 checked whether the Phase38 follow-up should be a quick graph-time
+fused gate projection.
+
+Artifacts:
+
+- `/home/mudler/bench/phase37_cublas_name_trace/20260701_083227`
+- `/home/mudler/bench/phase27_graph_node_serving/20260701_055519`
+- `/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis`
+
+Evidence:
+
+| source | result |
+|--------|--------|
+| Phase37 route trace | `sgemm=1212`, with per-layer `ffn_gate_inp.weight -> ffn_moe_logits` and `ffn_gate_inp_shexp.weight -> shared_expert_gate` entries |
+| Phase27 serving profile | total kernel time `20.0372s` |
+| Phase27 serving profile | `concat_layout=459.84ms` (`2.29%`, `2250` instances) |
+| Phase27 serving profile | `cublas_bf16_gemm=1892.81ms` (`9.45%`) and `cutlass_bf16_gemm=684.01ms` (`3.41%`) |
+
+Decision:
+
+- Reject the quick graph-time fused gate shortcut based on `ggml_concat()` of
+  the two gate weights. `concat_layout` is already a measurable serving bucket,
+  so adding graph-time weight concatenation risks growing that bucket by more
+  than the fused matmul would save in SGEMM overhead.
+- The only acceptable future fused-gate design is a persistent/load-time F32
+  combined gate weight, split by output views after one matmul. It must be
+  default-off, keep gate weights in F32, avoid graph-time weight concat, and
+  pass MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` gates before any serving
+  benchmark. If md5 changes, run KL first and reject on KL regression.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -502,6 +502,17 @@ weight concatenation, not BF16/NVFP4 routing. Future fused-gate work must be
 default-off, preserve F32 semantics, and pass md5/op gates before benchmarking;
 if md5 changes, run KL first.

+Phase 39 rules out the naive fused-gate shortcut. Artifact:
+`/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis`. Re-analysis
+of the Phase27 graph-node serving profile showed total kernel time `20.0372s`,
+`concat_layout=459.84ms` (`2.29%`, `2250` instances), `cublas_bf16_gemm=1892.81ms`
+(`9.45%`), and `cutlass_bf16_gemm=684.01ms` (`3.41%`). Do not implement
+graph-time `ggml_concat()` of `ffn_gate_inp.weight` plus
+`ffn_gate_inp_shexp.weight`; it risks increasing an existing layout-copy bucket.
+The only future fused-gate design worth scoping is a persistent/load-time F32
+combined gate weight with output views, default-off until MoE/dense md5,
+`MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates pass.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -582,6 +593,8 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase36_cublas_route_trace/20260701_081228` - default-off cuBLAS subroute trace patch `0062`; default/trace/post-serving md5 and op gates green; n128 route trace found `bf16_tc=5681`, `sgemm=2511`.
 - `~/bench/phase37_cublas_name_trace/20260701_083227` - cuBLAS tensor-name trace patch `0063`; default/trace/post-serving md5 and op gates green; n128 trace identified `sgemm` as MoE gate logits and shared-expert gate projections.
 - `~/bench/phase38_gate_baseline/20260701_084410` - current Phase37 build baseline before gate-projection policy work; docker/local-ai-worker/GPU idle preflight green; MoE/dense md5 green; `MUL_MAT` `1146/1146`; `MUL_MAT_ID` `806/806`.
+- `~/bench/phase39_gate_sgemm_profile/20260701_085211` - short completion profile, diagnostic only because `-n 32` is not a canonical md5 gate; useful for confirming graph-time concat is a real kernel path.
+- `~/bench/phase39_gate_sgemm_profile/phase27_reanalysis` - Phase27 serving profile re-analysis used to reject graph-time fused gate weight concat; `concat_layout=459.84ms` (`2.29%`) in the serving kernel window.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -996,6 +996,22 @@ that computes both logits in one matmul and splits the output, but it must pass
 MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates before benchmarking. If md5
 changes, run the KL gate first and reject on any KL regression.

+### Phase 39 gate fusion feasibility
+
+Phase 39 rejected the tempting low-conflict implementation of the Phase38 idea:
+do not build a graph-time `ggml_concat()` of `ffn_gate_inp.weight` and
+`ffn_gate_inp_shexp.weight` just to issue one combined gate matmul. Phase37
+proved the named `sgemm` bucket is the two gate projections, but Phase27's
+graph-node serving profile already has `concat_layout=459.84ms` (`2.29%`,
+`2250` instances) in a `20.0372s` kernel window. Adding another concat path for
+weights would likely trade one small SGEMM shortcut for more layout-copy work.
+
+The follow-up remains valid only in the persistent-weight form: create a
+load-time F32 combined gate tensor, run one matmul, and view/split the output
+into `ffn_moe_logits` and `shared_expert_gate`. That is a model-loader/weight
+layout feature, not a graph shortcut. It must stay default-off until MoE/dense
+md5, `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates pass.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update
--- a/docs/superpowers/plans/2026-07-01-gate-fusion-feasibility-phase39.md
+++ b/docs/superpowers/plans/2026-07-01-gate-fusion-feasibility-phase39.md
@@ -0,0 +1,158 @@
+# Gate Fusion Feasibility Phase39 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Decide whether to implement a quick fused F32 router/shared-expert gate projection after Phase38.
+
+**Architecture:** Phase39 is evidence-first and source-conservative. It compares the Phase37 tensor-name trace, the Phase27 graph-node serving profile, and llama.cpp graph/model-loader capabilities. It rejects graph-time weight concatenation because it would add layout-copy work in a bucket that is already measurable, and scopes the only acceptable follow-up as a persistent/load-time combined-weight design with md5/op/KL gates.
+
+**Tech Stack:** LocalAI paged llama.cpp backend, llama.cpp CUDA fork, DGX GB10, Nsight Systems, vLLM Qwen3-Next fused-MoE source comparison.
+
+---
+
+### Task 1: Inspect current graph/model support
+
+**Files:**
+- Read: `/home/mudler/_git/llama.cpp/ggml/include/ggml.h`
+- Read: `/home/mudler/_git/llama.cpp/src/models/qwen35moe.cpp`
+- Read: `/home/mudler/_git/llama.cpp/src/llama-graph.cpp`
+- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fused_moe/runner/moe_runner.py`
+
+- [x] **Step 1: Confirm llama.cpp gate tensors**
+
+`qwen35moe.cpp` creates:
+
+```cpp
+layer.ffn_gate_inp       = create_tensor(..., { n_embd, n_expert }, flags);
+layer.ffn_gate_inp_shexp = create_tensor(..., { n_embd }, flags);
+```
+
+and computes:
+
+```cpp
+build_moe_ffn(cur, model.layers[il].ffn_gate_inp, ...)
+build_lora_mm(model.layers[il].ffn_gate_inp_shexp, cur)
+```
+
+- [x] **Step 2: Confirm ggml graph-time concat is available but not free**
+
+`ggml.h` exposes `ggml_concat()` and `ggml_view_*()`, so a graph-time fused
+gate is syntactically possible. It would require building a temporary combined
+weight in the compute graph unless the model loader creates a persistent
+combined tensor.
+
+- [x] **Step 3: Confirm vLLM's relevant idea**
+
+vLLM's fused-MoE runner concatenates router and shared-expert gate weights into
+`_combined_gate_weight`. The useful design pattern is a persistent F32 combined
+gate weight, not BF16/NVFP4 routing.
+
+### Task 2: Reuse existing serving evidence
+
+**Files:**
+- Artifact: `dgx.casa:/home/mudler/bench/phase37_cublas_name_trace/20260701_083227`
+- Artifact: `dgx.casa:/home/mudler/bench/phase27_graph_node_serving/20260701_055519`
+- Artifact: `dgx.casa:/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis`
+
+- [x] **Step 1: Read Phase37 route-name evidence**
+
+Observed:
+
+```text
+2884 route=bf16_tc
+1212 route=sgemm
+16 route=sgemm type=0 src0=blk.N.ffn_gate_inp.weight src1=attn_post_norm-N dst=ffn_moe_logits-N
+16 route=sgemm type=0 src0=blk.N.ffn_gate_inp_shexp.weight src1=attn_post_norm-N dst=shared_expert_gate-N
+```
+
+- [x] **Step 2: Re-analyze Phase27 graph-node serving profile**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail; ART=/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis; SRC=/home/mudler/bench/phase27_graph_node_serving/20260701_055519/llama_graph_node.nsys-rep; mkdir -p "$ART"; nsys stats --report cuda_gpu_kern_sum,cuda_api_sum --format csv --output "$ART/phase27" "$SRC"'
+```
+
+Observed serving kernel buckets:
+
+```text
+TOTAL kernel time: 20.0372 s
+cublas_bf16_gemm       1892.81ms   9.45%
+cutlass_bf16_gemm       684.01ms   3.41%
+concat_layout           459.84ms   2.29%
+```
+
+Top raw kernel evidence includes:
+
+```text
+concat_non_cont         459.84ms   2.3%  2250 instances
+```
+
+### Task 3: Decision
+
+**Files:**
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+
+- [x] **Step 1: Reject graph-time fused gate via `ggml_concat`**
+
+Do not implement a quick graph-time combined gate that concatenates
+`ffn_gate_inp` and `ffn_gate_inp_shexp` inside the compute graph. It risks
+adding work to the existing `concat_layout` bucket (`459.84ms`, `2.29%`)
+without removing enough SGEMM overhead to compensate, and it would be a
+high-conflict graph/model edit without clear upside.
+
+- [x] **Step 2: Preserve the only acceptable follow-up shape**
+
+The only follow-up worth scoping is a persistent/load-time F32 combined gate
+weight:
+
+```text
+combined_gate_weight = concat_rows(ffn_gate_inp.weight,
+                                   ffn_gate_inp_shexp.weight)
+```
+
+Requirements:
+
+- default-off until gates pass;
+- no BF16/NVFP4 conversion for gate weights;
+- no graph-time weight concat;
+- split combined output into `ffn_moe_logits` and `shared_expert_gate` views;
+- MoE/dense md5 must match before serving benchmarks;
+- `MUL_MAT` and `MUL_MAT_ID` must pass;
+- if md5 changes, run KL first and reject on KL regression.
+
+### Task 4: Verify and commit docs
+
+**Files:**
+- Modify: `docs/superpowers/plans/2026-07-01-gate-fusion-feasibility-phase39.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+
+- [x] **Step 1: Check docs diff**
+
+Run:
+
+```bash
+git diff -- docs/superpowers/plans/2026-07-01-gate-fusion-feasibility-phase39.md \
+  backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+  backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
+  backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+```
+
+Expected: only Phase39 documentation changes.
+
+- [x] **Step 2: Commit**
+
+Run:
+
+```bash
+git add -f docs/superpowers/plans/2026-07-01-gate-fusion-feasibility-phase39.md
+git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+  backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
+  backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+git commit -m "docs(paged): reject graph-time gate fusion shortcut" \
+  -m "Assisted-by: Codex:gpt-5"
+```