docs(paged): reject graph-time gate fusion shortcut

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 06:55:48 +00:00
parent 5354adcffb
commit 52c11b1ce5
4 changed files with 219 additions and 0 deletions

@@ -2263,3 +2263,35 @@ Decision:
preserves F32 math and split semantics. Gate it with MoE/dense md5,
`MUL_MAT`, `MUL_MAT_ID`, and KL validation if either md5 changes before any
serving benchmark.
## Phase 39 Gate Fusion Feasibility

Phase 39 checked whether the Phase38 follow-up should be a quick graph-time
fused gate projection.

Artifacts:

- `/home/mudler/bench/phase37_cublas_name_trace/20260701_083227`
- `/home/mudler/bench/phase27_graph_node_serving/20260701_055519`
- `/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis`

Evidence:

| source | result |
|--------|--------|
| Phase37 route trace | `sgemm=1212`, with per-layer `ffn_gate_inp.weight -> ffn_moe_logits` and `ffn_gate_inp_shexp.weight -> shared_expert_gate` entries |
| Phase27 serving profile | total kernel time `20.0372s` |
| Phase27 serving profile | `concat_layout=459.84ms` (`2.29%`, `2250` instances) |
| Phase27 serving profile | `cublas_bf16_gemm=1892.81ms` (`9.45%`) and `cutlass_bf16_gemm=684.01ms` (`3.41%`) |
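The percentages in the table follow from dividing each bucket by the total kernel time. A minimal Python sketch of that check, using only the figures quoted in the rows above; the dictionary and function names are illustrative and not part of the profiling tooling:

```python
# Figures copied from the Phase27 serving profile rows above.
TOTAL_KERNEL_S = 20.0372

BUCKETS_MS = {
    "concat_layout": 459.84,
    "cublas_bf16_gemm": 1892.81,
    "cutlass_bf16_gemm": 684.01,
}


def share_percent(bucket_ms: float, total_s: float = TOTAL_KERNEL_S) -> float:
    """Return a bucket's share of total kernel time, in percent."""
    return 100.0 * bucket_ms / (total_s * 1000.0)


def per_instance_us(bucket_ms: float, instances: int) -> float:
    """Return the mean time per kernel instance, in microseconds."""
    return 1000.0 * bucket_ms / instances


# Shares rounded to two decimals, matching the table's precision.
SHARES = {name: round(share_percent(ms), 2) for name, ms in BUCKETS_MS.items()}

if __name__ == "__main__":
    for name, pct in SHARES.items():
        print(f"{name}: {pct}%")
    print(f"concat_layout per instance: {per_instance_us(459.84, 2250):.2f} us")
```

The per-instance helper is only a convenience for reading the `2250` instance count; it is derived arithmetic, not a separately measured value.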
Decision:

- Reject the quick graph-time fused gate shortcut based on `ggml_concat()` of
  the two gate weights. `concat_layout` is already a measurable serving bucket
  (`2.29%` of kernel time), so graph-time weight concatenation risks adding
  layout-copy work to it without removing enough SGEMM overhead to compensate.
- The only acceptable future fused-gate design is a persistent/load-time F32
  combined gate weight, split by output views after one matmul. It must be
  default-off, keep gate weights in F32, avoid graph-time weight concat, and
  pass MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` gates before any serving
  benchmark. If md5 changes, run KL first and reject on KL regression.
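The persistent-weight idea can be illustrated with plain Python lists: stack the two gate weight matrices once at load time, run a single matrix-vector product, and split the result by row range. This is only a sketch of the layout and the split. All names and shapes here are hypothetical, nothing below is ggml API, Python floats are 64-bit (so this says nothing about F32 bit-exactness), and Python slices copy rather than view.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Row-major matrix-vector product: one dot product per weight row."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def combine_gate_weights(w_moe: Matrix, w_shexp: Matrix) -> Matrix:
    """Load-time step (done once): stack the two gate weights row-wise."""
    if w_moe and w_shexp and len(w_moe[0]) != len(w_shexp[0]):
        raise ValueError("gate weights must share the input dimension")
    return [row[:] for row in w_moe] + [row[:] for row in w_shexp]


def fused_gate(w_combined: Matrix, n_moe_rows: int, x: Vector) -> Tuple[Vector, Vector]:
    """One matmul, then split the output into the two logical results."""
    y = matvec(w_combined, x)
    return y[:n_moe_rows], y[n_moe_rows:]


# Tiny hypothetical weights: 2 expert logits and 1 shared-expert gate, 3 inputs.
W_MOE = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]
W_SHEXP = [[0.25, 0.25, 0.25]]
X = [2.0, 4.0, 6.0]

W_COMBINED = combine_gate_weights(W_MOE, W_SHEXP)  # built once, reused per token
MOE_LOGITS, SHARED_GATE = fused_gate(W_COMBINED, len(W_MOE), X)
```

Because each output row is still an independent dot product, the fused result matches the two separate products in this toy setting. Whether a real SGEMM over the combined weight reproduces the same bits is exactly what the md5 and KL gates are for.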

@@ -502,6 +502,17 @@ weight concatenation, not BF16/NVFP4 routing. Future fused-gate work must be
default-off, preserve F32 semantics, and pass md5/op gates before benchmarking;
if md5 changes, run KL first.
Phase 39 rules out the naive fused-gate shortcut. Artifact:
`/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis`. Re-analysis
of the Phase27 graph-node serving profile showed total kernel time `20.0372s`,
`concat_layout=459.84ms` (`2.29%`, `2250` instances), `cublas_bf16_gemm=1892.81ms`
(`9.45%`), and `cutlass_bf16_gemm=684.01ms` (`3.41%`). Do not implement
graph-time `ggml_concat()` of `ffn_gate_inp.weight` plus
`ffn_gate_inp_shexp.weight`; it risks increasing an existing layout-copy bucket.
The only future fused-gate design worth scoping is a persistent/load-time F32
combined gate weight with output views, default-off until MoE/dense md5,
`MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates pass.
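The gate ordering described above can be read as a small decision function. The sketch below is one interpretation, not repository tooling: the class, field names, and return strings are hypothetical, and the booleans stand in for the results of real md5, op, and KL runs. It assumes that an md5 change with no KL regression still permits benchmarking, which is how the "run KL first and reject on regression" rule reads.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateResults:
    """Hypothetical container for validation outcomes of a fused-gate build."""
    moe_md5_matches: bool
    dense_md5_matches: bool
    mul_mat_ok: bool
    mul_mat_id_ok: bool
    kl_regressed: Optional[bool] = None  # None means the KL gate has not been run


def next_step(r: GateResults) -> str:
    """Return what to do next before any serving benchmark."""
    # Op gates are unconditional: both must pass.
    if not (r.mul_mat_ok and r.mul_mat_id_ok):
        return "blocked: op gate failed"
    # Unchanged md5 on both models means no KL run is required.
    if r.moe_md5_matches and r.dense_md5_matches:
        return "benchmark"
    # md5 changed: KL must be run before anything else.
    if r.kl_regressed is None:
        return "run KL"
    if r.kl_regressed:
        return "reject: KL regression"
    return "benchmark"
```

For example, `next_step(GateResults(False, True, True, True))` returns `"run KL"`, because a changed MoE md5 blocks benchmarking until the KL result is known.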
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -582,6 +593,8 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase36_cublas_route_trace/20260701_081228` - default-off cuBLAS subroute trace patch `0062`; default/trace/post-serving md5 and op gates green; n128 route trace found `bf16_tc=5681`, `sgemm=2511`.
- `~/bench/phase37_cublas_name_trace/20260701_083227` - cuBLAS tensor-name trace patch `0063`; default/trace/post-serving md5 and op gates green; n128 trace identified `sgemm` as MoE gate logits and shared-expert gate projections.
- `~/bench/phase38_gate_baseline/20260701_084410` - current Phase37 build baseline before gate-projection policy work; docker/local-ai-worker/GPU idle preflight green; MoE/dense md5 green; `MUL_MAT` `1146/1146`; `MUL_MAT_ID` `806/806`.
- `~/bench/phase39_gate_sgemm_profile/20260701_085211` - short completion profile, diagnostic only because `-n 32` is not a canonical md5 gate; useful for confirming graph-time concat is a real kernel path.
- `~/bench/phase39_gate_sgemm_profile/phase27_reanalysis` - Phase27 serving profile re-analysis used to reject graph-time fused gate weight concat; `concat_layout=459.84ms` (`2.29%`) in the serving kernel window.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.

@@ -996,6 +996,22 @@ that computes both logits in one matmul and splits the output, but it must pass
MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates before benchmarking. If md5
changes, run the KL gate first and reject on any KL regression.
### Phase 39 gate fusion feasibility

Phase 39 rejected the tempting low-conflict implementation of the Phase38 idea:
do not build a graph-time `ggml_concat()` of `ffn_gate_inp.weight` and
`ffn_gate_inp_shexp.weight` just to issue one combined gate matmul. Phase37
proved the named `sgemm` bucket is the two gate projections, but Phase27's
graph-node serving profile already has `concat_layout=459.84ms` (`2.29%`,
`2250` instances) in a `20.0372s` kernel window. Adding another concat path for
weights would likely trade one small SGEMM shortcut for more layout-copy work.

The follow-up remains valid only in the persistent-weight form: create a
load-time F32 combined gate tensor, run one matmul, and view/split the output
into `ffn_moe_logits` and `shared_expert_gate`. That is a model-loader/weight
layout feature, not a graph shortcut. It must stay default-off until MoE/dense
md5, `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates pass.

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update