docs(paged): reject graph-time gate fusion shortcut

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 06:55:48 +00:00
parent 5354adcffb
commit 52c11b1ce5
4 changed files with 219 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2263,3 +2263,35 @@ Decision:
  preserves F32 math and split semantics. Gate it with MoE/dense md5,
  `MUL_MAT`, `MUL_MAT_ID`, and KL validation if either md5 changes before any
  serving benchmark.
+
+## Phase 39 Gate Fusion Feasibility
+
+Phase 39 checked whether the Phase38 follow-up could be implemented as a quick
+graph-time fused gate projection.
+
+Artifacts:
+
+- `/home/mudler/bench/phase37_cublas_name_trace/20260701_083227`
+- `/home/mudler/bench/phase27_graph_node_serving/20260701_055519`
+- `/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis`
+
+Evidence:
+
+| source | result |
+|--------|--------|
+| Phase37 route trace | `sgemm=1212`, with per-layer `ffn_gate_inp.weight -> ffn_moe_logits` and `ffn_gate_inp_shexp.weight -> shared_expert_gate` entries |
+| Phase27 serving profile | total kernel time `20.0372s` |
+| Phase27 serving profile | `concat_layout=459.84ms` (`2.29%`, `2250` instances) |
+| Phase27 serving profile | `cublas_bf16_gemm=1892.81ms` (`9.45%`) and `cutlass_bf16_gemm=684.01ms` (`3.41%`) |
+
+Decision:
+
+- Reject the quick graph-time fused gate shortcut based on `ggml_concat()` of
+  the two gate weights. `concat_layout` already accounts for `2.29%` of serving
+  kernel time, so graph-time weight concatenation risks adding more layout-copy
+  work than the fused matmul would remove in SGEMM overhead.
+- The only acceptable future fused-gate design is a persistent/load-time F32
+  combined gate weight, split by output views after one matmul. It must be
+  default-off, keep gate weights in F32, avoid graph-time weight concat, and
+  pass MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` gates before any serving
+  benchmark. If md5 changes, run KL first and reject on KL regression.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -502,6 +502,17 @@ weight concatenation, not BF16/NVFP4 routing. Future fused-gate work must be
 default-off, preserve F32 semantics, and pass md5/op gates before benchmarking;
 if md5 changes, run KL first.

+Phase 39 rules out the naive fused-gate shortcut. Artifact:
+`/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis`. Re-analysis
+of the Phase27 graph-node serving profile showed total kernel time `20.0372s`,
+`concat_layout=459.84ms` (`2.29%`, `2250` instances), `cublas_bf16_gemm=1892.81ms`
+(`9.45%`), and `cutlass_bf16_gemm=684.01ms` (`3.41%`). Do not implement
+graph-time `ggml_concat()` of `ffn_gate_inp.weight` plus
+`ffn_gate_inp_shexp.weight`; it risks increasing an existing layout-copy bucket.
+The only future fused-gate design worth scoping is a persistent/load-time F32
+combined gate weight with output views, default-off until MoE/dense md5,
+`MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates pass.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -582,6 +593,8 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase36_cublas_route_trace/20260701_081228` - default-off cuBLAS subroute trace patch `0062`; default/trace/post-serving md5 and op gates green; n128 route trace found `bf16_tc=5681`, `sgemm=2511`.
 - `~/bench/phase37_cublas_name_trace/20260701_083227` - cuBLAS tensor-name trace patch `0063`; default/trace/post-serving md5 and op gates green; n128 trace identified `sgemm` as MoE gate logits and shared-expert gate projections.
 - `~/bench/phase38_gate_baseline/20260701_084410` - current Phase37 build baseline before gate-projection policy work; docker/local-ai-worker/GPU idle preflight green; MoE/dense md5 green; `MUL_MAT` `1146/1146`; `MUL_MAT_ID` `806/806`.
+- `~/bench/phase39_gate_sgemm_profile/20260701_085211` - short completion profile, diagnostic only because `-n 32` is not a canonical md5 gate; useful for confirming graph-time concat is a real kernel path.
+- `~/bench/phase39_gate_sgemm_profile/phase27_reanalysis` - Phase27 serving profile re-analysis used to reject graph-time fused gate weight concat; `concat_layout=459.84ms` (`2.29%`) in the serving kernel window.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -996,6 +996,22 @@ that computes both logits in one matmul and splits the output, but it must pass
 MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates before benchmarking. If md5
 changes, run the KL gate first and reject on any KL regression.

+### Phase 39 gate fusion feasibility
+
+Phase 39 rejected the tempting low-conflict implementation of the Phase38 idea:
+do not build a graph-time `ggml_concat()` of `ffn_gate_inp.weight` and
+`ffn_gate_inp_shexp.weight` just to issue one combined gate matmul. Phase37
+showed that the named `sgemm` bucket is the two gate projections, but Phase27's
+graph-node serving profile already has `concat_layout=459.84ms` (`2.29%`,
+`2250` instances) in a `20.0372s` kernel window. Adding another concat path for
+weights would likely trade one small SGEMM shortcut for more layout-copy work.
+
+The follow-up remains valid only in the persistent-weight form: create a
+load-time F32 combined gate tensor, run one matmul, and view/split the output
+into `ffn_moe_logits` and `shared_expert_gate`. That is a model-loader/weight
+layout feature, not a graph shortcut. It must stay default-off until MoE/dense
+md5, `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates pass.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update