docs(paged): reject graph-time gate fusion shortcut

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 06:55:48 +00:00
parent 5354adcffb
commit 52c11b1ce5
4 changed files with 219 additions and 0 deletions


@@ -2263,3 +2263,35 @@ Decision:
preserves F32 math and split semantics. Gate it with MoE/dense md5,
`MUL_MAT`, `MUL_MAT_ID`, and KL validation if either md5 changes before any
serving benchmark.
## Phase 39 Gate Fusion Feasibility
Phase 39 checked whether the Phase38 follow-up should be a quick graph-time
fused gate projection.
Artifacts:
- `/home/mudler/bench/phase37_cublas_name_trace/20260701_083227`
- `/home/mudler/bench/phase27_graph_node_serving/20260701_055519`
- `/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis`
Evidence:
| source | result |
|--------|--------|
| Phase37 route trace | `sgemm=1212`, with per-layer `ffn_gate_inp.weight -> ffn_moe_logits` and `ffn_gate_inp_shexp.weight -> shared_expert_gate` entries |
| Phase27 serving profile | total kernel time `20.0372s` |
| Phase27 serving profile | `concat_layout=459.84ms` (`2.29%`, `2250` instances) |
| Phase27 serving profile | `cublas_bf16_gemm=1892.81ms` (`9.45%`) and `cutlass_bf16_gemm=684.01ms` (`3.41%`) |
Decision:
- Reject the quick graph-time fused gate shortcut based on `ggml_concat()` of
the two gate weights. `concat_layout` already accounts for `459.84ms`
(`2.29%`) of serving kernel time, so concatenating weights at graph time risks
adding layout-copy work to that bucket without removing enough SGEMM overhead
to pay for it.
- The only acceptable future fused-gate design is a persistent/load-time F32
combined gate weight, split by output views after one matmul. It must be
default-off, keep gate weights in F32, avoid graph-time weight concat, and
pass MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` gates before any serving
benchmark. If md5 changes, run KL first and reject on KL regression.


@@ -502,6 +502,17 @@ weight concatenation, not BF16/NVFP4 routing. Future fused-gate work must be
default-off, preserve F32 semantics, and pass md5/op gates before benchmarking;
if md5 changes, run KL first.
Phase 39 closes the naive fused-gate shortcut. Artifact:
`/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis`. Re-analysis
of the Phase27 graph-node serving profile showed total kernel time `20.0372s`,
`concat_layout=459.84ms` (`2.29%`, `2250` instances), `cublas_bf16_gemm=1892.81ms`
(`9.45%`), and `cutlass_bf16_gemm=684.01ms` (`3.41%`). Do not implement
graph-time `ggml_concat()` of `ffn_gate_inp.weight` plus
`ffn_gate_inp_shexp.weight`; it risks increasing an existing layout-copy bucket.
The only future fused-gate design worth scoping is a persistent/load-time F32
combined gate weight with output views, default-off until MoE/dense md5,
`MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates pass.
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -582,6 +593,8 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase36_cublas_route_trace/20260701_081228` - default-off cuBLAS subroute trace patch `0062`; default/trace/post-serving md5 and op gates green; n128 route trace found `bf16_tc=5681`, `sgemm=2511`.
- `~/bench/phase37_cublas_name_trace/20260701_083227` - cuBLAS tensor-name trace patch `0063`; default/trace/post-serving md5 and op gates green; n128 trace identified `sgemm` as MoE gate logits and shared-expert gate projections.
- `~/bench/phase38_gate_baseline/20260701_084410` - current Phase37 build baseline before gate-projection policy work; docker/local-ai-worker/GPU idle preflight green; MoE/dense md5 green; `MUL_MAT` `1146/1146`; `MUL_MAT_ID` `806/806`.
- `~/bench/phase39_gate_sgemm_profile/20260701_085211` - short completion profile, diagnostic only because `-n 32` is not a canonical md5 gate; useful for confirming graph-time concat is a real kernel path.
- `~/bench/phase39_gate_sgemm_profile/phase27_reanalysis` - Phase27 serving profile re-analysis used to reject graph-time fused gate weight concat; `concat_layout=459.84ms` (`2.29%`) in the serving kernel window.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.


@@ -996,6 +996,22 @@ that computes both logits in one matmul and splits the output, but it must pass
MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates before benchmarking. If md5
changes, run the KL gate first and reject on any KL regression.
### Phase 39 gate fusion feasibility
Phase 39 rejected the tempting low-conflict implementation of the Phase38 idea:
do not build a graph-time `ggml_concat()` of `ffn_gate_inp.weight` and
`ffn_gate_inp_shexp.weight` just to issue one combined gate matmul. Phase37
proved the named `sgemm` bucket is the two gate projections, but Phase27's
graph-node serving profile already has `concat_layout=459.84ms` (`2.29%`,
`2250` instances) in a `20.0372s` kernel window. Adding another concat path for
weights would likely trade one small SGEMM shortcut for more layout-copy work.
The follow-up remains valid only in the persistent-weight form: create a
load-time F32 combined gate tensor, run one matmul, and view/split the output
into `ffn_moe_logits` and `shared_expert_gate`. That is a model-loader/weight
layout feature, not a graph shortcut. It must stay default-off until MoE/dense
md5, `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates pass.
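The gate order described above can be written down as a small decision function. This is a minimal sketch, not code from the repository: `GateResults`, `Verdict`, and `fused_gate_verdict` are hypothetical names. The text does not say what follows a clean KL run after an md5 change, so the sketch surfaces that case as its own verdict instead of guessing.

```cpp
// Hypothetical summary of one validation run; field names are illustrative.
struct GateResults {
    bool mul_mat_pass;     // MUL_MAT op gate
    bool mul_mat_id_pass;  // MUL_MAT_ID op gate
    bool moe_md5_match;    // MoE md5 equals the reference
    bool dense_md5_match;  // dense md5 equals the reference
    bool kl_ran;           // KL validation has been executed
    bool kl_regressed;     // KL validation showed a regression
};

enum class Verdict {
    Reject,                 // op gate failed, or KL regressed
    RunKlFirst,             // md5 changed and KL has not been run yet
    NeedsDecision,          // md5 changed, KL clean: not specified by the docs
    AllowServingBenchmark   // op gates green and both md5 values match
};

Verdict fused_gate_verdict(const GateResults & g) {
    // Op gates are unconditional.
    if (!g.mul_mat_pass || !g.mul_mat_id_pass) {
        return Verdict::Reject;
    }
    // Matching md5 on both models is the fast path to a serving benchmark.
    if (g.moe_md5_match && g.dense_md5_match) {
        return Verdict::AllowServingBenchmark;
    }
    // Either md5 changed: KL must run before anything else.
    if (!g.kl_ran) {
        return Verdict::RunKlFirst;
    }
    return g.kl_regressed ? Verdict::Reject : Verdict::NeedsDecision;
}
```

A caller would fill `GateResults` from the md5 and op-test logs and only start a serving benchmark on `Verdict::AllowServingBenchmark`.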
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update


@@ -0,0 +1,158 @@
# Gate Fusion Feasibility Phase39 Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** decide whether to implement a quick fused F32 router/shared-expert gate projection after Phase38.
**Architecture:** Phase39 is evidence-first and source-conservative. It compares the Phase37 tensor-name trace, the Phase27 graph-node serving profile, and llama.cpp graph/model-loader capabilities. It rejects graph-time weight concatenation because it would add layout-copy work in a bucket that is already measurable, and scopes the only acceptable follow-up as a persistent/load-time combined-weight design with md5/op/KL gates.
**Tech Stack:** LocalAI paged llama.cpp backend, llama.cpp CUDA fork, DGX GB10, Nsight Systems, vLLM Qwen3-Next fused-MoE source comparison.
---
### Task 1: Inspect current graph/model support
**Files:**
- Read: `/home/mudler/_git/llama.cpp/ggml/include/ggml.h`
- Read: `/home/mudler/_git/llama.cpp/src/models/qwen35moe.cpp`
- Read: `/home/mudler/_git/llama.cpp/src/llama-graph.cpp`
- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fused_moe/runner/moe_runner.py`
- [x] **Step 1: Confirm llama.cpp gate tensors**
`qwen35moe.cpp` creates:
```cpp
layer.ffn_gate_inp = create_tensor(..., { n_embd, n_expert }, flags);
layer.ffn_gate_inp_shexp = create_tensor(..., { n_embd }, flags);
```
and computes:
```cpp
build_moe_ffn(cur, model.layers[il].ffn_gate_inp, ...)
build_lora_mm(model.layers[il].ffn_gate_inp_shexp, cur)
```
- [x] **Step 2: Confirm ggml graph-time concat is available but not free**
`ggml.h` exposes `ggml_concat()` and `ggml_view_*()`, so a graph-time fused
gate can be expressed with the existing graph API. Unless the model loader
creates a persistent combined tensor, however, it would have to build a
temporary combined weight inside the compute graph.
- [x] **Step 3: Confirm vLLM's relevant idea**
vLLM's fused-MoE runner concatenates router and shared-expert gate weights into
`_combined_gate_weight`. The useful design pattern is a persistent F32 combined
gate weight, not BF16/NVFP4 routing.
### Task 2: Reuse existing serving evidence
**Files:**
- Artifact: `dgx.casa:/home/mudler/bench/phase37_cublas_name_trace/20260701_083227`
- Artifact: `dgx.casa:/home/mudler/bench/phase27_graph_node_serving/20260701_055519`
- Artifact: `dgx.casa:/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis`
- [x] **Step 1: Read Phase37 route-name evidence**
Observed:
```text
2884 route=bf16_tc
1212 route=sgemm
16 route=sgemm type=0 src0=blk.N.ffn_gate_inp.weight src1=attn_post_norm-N dst=ffn_moe_logits-N
16 route=sgemm type=0 src0=blk.N.ffn_gate_inp_shexp.weight src1=attn_post_norm-N dst=shared_expert_gate-N
```
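To show how these lines map onto the two gate projections, here is a minimal sketch of a parser for one trace line. The `<count> key=value ...` layout is assumed from the excerpt above only; the actual trace patch may log differently, and `RouteLine`, `parse_route_line`, and `is_gate_projection` are hypothetical helpers.

```cpp
#include <map>
#include <sstream>
#include <string>

// One aggregated trace line: a leading count followed by key=value fields.
struct RouteLine {
    long count = 0;
    std::map<std::string, std::string> fields;
};

// Parses "<count> key=value key=value ...". Returns false if the count is
// missing, a token has no '=', or there is no "route" field.
bool parse_route_line(const std::string & line, RouteLine & out) {
    out = RouteLine{};
    std::istringstream in(line);
    if (!(in >> out.count)) {
        return false;
    }
    std::string tok;
    while (in >> tok) {
        const std::string::size_type eq = tok.find('=');
        if (eq == std::string::npos) {
            return false;
        }
        out.fields[tok.substr(0, eq)] = tok.substr(eq + 1);
    }
    return out.fields.count("route") > 0;
}

// True when src0 names the router gate or the shared-expert gate weight.
bool is_gate_projection(const RouteLine & r) {
    const auto it = r.fields.find("src0");
    if (it == r.fields.end()) {
        return false;
    }
    const std::string & src0 = it->second;
    return src0.find("ffn_gate_inp.weight") != std::string::npos ||
           src0.find("ffn_gate_inp_shexp.weight") != std::string::npos;
}
```

Feeding each line of the trace summary through `parse_route_line` and keeping only those where `is_gate_projection` is true would isolate the per-layer `sgemm` entries listed above.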
- [x] **Step 2: Re-analyze Phase27 graph-node serving profile**
Run:
```bash
ssh dgx.casa 'set -euo pipefail; ART=/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis; SRC=/home/mudler/bench/phase27_graph_node_serving/20260701_055519/llama_graph_node.nsys-rep; mkdir -p "$ART"; nsys stats --report cuda_gpu_kern_sum,cuda_api_sum --format csv --output "$ART/phase27" "$SRC"'
```
Observed serving kernel buckets:
```text
TOTAL kernel time: 20.0372 s
cublas_bf16_gemm 1892.81ms 9.45%
cutlass_bf16_gemm 684.01ms 3.41%
concat_layout 459.84ms 2.29%
```
Top raw kernel evidence includes:
```text
concat_non_cont 459.84ms 2.3% 2250 instances
```
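The percentages follow directly from the bucket times and the total kernel window. The sketch below only re-derives that arithmetic; `bucket_share_pct` and `print_bucket` are hypothetical helpers, not part of any profiling tool.

```cpp
#include <cmath>
#include <cstdio>

// Share of the total kernel window in percent, rounded to two decimals.
// bucket_ms is the bucket time in milliseconds, total_s the window in seconds.
double bucket_share_pct(double bucket_ms, double total_s) {
    const double pct = bucket_ms / (total_s * 1000.0) * 100.0;
    return std::round(pct * 100.0) / 100.0;
}

// Prints one bucket in a layout similar to the summary above.
void print_bucket(const char * name, double bucket_ms, double total_s) {
    std::printf("%-20s %9.2fms %6.2f%%\n", name, bucket_ms,
                bucket_share_pct(bucket_ms, total_s));
}
```

For example, `bucket_share_pct(459.84, 20.0372)` reproduces the `2.29%` reported for `concat_layout`, and the two GEMM buckets reproduce `9.45%` and `3.41%` the same way.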
### Task 3: Decision
**Files:**
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
- [x] **Step 1: Reject graph-time fused gate via `ggml_concat`**
Do not implement a quick graph-time combined gate that concatenates
`ffn_gate_inp` and `ffn_gate_inp_shexp` inside the compute graph. It risks
adding work to the existing `concat_layout` bucket (`459.84ms`, `2.29%`)
without removing enough SGEMM overhead to offset it, and it would be a
high-conflict graph/model edit without clear upside.
- [x] **Step 2: Preserve the only acceptable follow-up shape**
The only follow-up worth scoping is a persistent/load-time F32 combined gate
weight:
```text
combined_gate_weight = concat_rows(ffn_gate_inp.weight,
                                   ffn_gate_inp_shexp.weight)
```
Requirements:
- default-off until gates pass;
- no BF16/NVFP4 conversion for gate weights;
- no graph-time weight concat;
- split combined output into `ffn_moe_logits` and `shared_expert_gate` views;
- MoE/dense md5 must match before serving benchmarks;
- `MUL_MAT` and `MUL_MAT_ID` must pass;
- if md5 changes, run KL first and reject on KL regression.
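The persistent-weight shape can be illustrated on the CPU. This is a minimal sketch under stated assumptions, not the llama.cpp loader or ggml API: `GateWeight`, `matvec`, `OutputView`, and `split_output` are hypothetical, weights are plain row-major F32, and a naive loop stands in for the GEMM. The concatenation happens once, the per-token path is a single product, and the two results are non-owning views into one output buffer.

```cpp
#include <cstddef>
#include <vector>

// Row-major F32 weight with `rows` rows of `n_embd` values each.
struct GateWeight {
    std::size_t rows;
    std::size_t n_embd;
    std::vector<float> data;
};

// Load-time step: stack the router rows on top of the shared-expert gate row.
// Assumes both weights share the same n_embd.
GateWeight concat_rows(const GateWeight & router, const GateWeight & shexp) {
    GateWeight out{router.rows + shexp.rows, router.n_embd, router.data};
    out.data.insert(out.data.end(), shexp.data.begin(), shexp.data.end());
    return out;
}

// Naive F32 product: y[r] = sum_k w[r][k] * x[k].
std::vector<float> matvec(const GateWeight & w, const std::vector<float> & x) {
    std::vector<float> y(w.rows, 0.0f);
    for (std::size_t r = 0; r < w.rows; ++r) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < w.n_embd; ++k) {
            acc += w.data[r * w.n_embd + k] * x[k];
        }
        y[r] = acc;
    }
    return y;
}

// Non-owning slice of the combined output; no copy is made.
struct OutputView {
    const float * ptr;
    std::size_t len;
};

struct GateOutputs {
    OutputView moe_logits;   // first n_expert values
    OutputView shared_gate;  // remaining value(s)
};

GateOutputs split_output(const std::vector<float> & y, std::size_t n_expert) {
    return GateOutputs{{y.data(), n_expert},
                       {y.data() + n_expert, y.size() - n_expert}};
}
```

In this sketch every output row is accumulated by the same loop, so the split results are identical to two separate products. A real GPU GEMM may pick a different kernel or accumulation order for the combined shape, which is presumably why the md5 gate, and KL if md5 changes, comes before any serving benchmark.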
### Task 4: Verify and commit docs
**Files:**
- Modify: `docs/superpowers/plans/2026-07-01-gate-fusion-feasibility-phase39.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
- [x] **Step 1: Check docs diff**
Run:
```bash
git diff -- docs/superpowers/plans/2026-07-01-gate-fusion-feasibility-phase39.md \
backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
```
Expected: only Phase39 documentation changes.
- [x] **Step 2: Commit**
Run:
```bash
git add -f docs/superpowers/plans/2026-07-01-gate-fusion-feasibility-phase39.md
git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
git commit -m "docs(paged): reject graph-time gate fusion shortcut" \
-m "Assisted-by: Codex:gpt-5"
```