diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index 89f03afea..8a0cf5c8e 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2263,3 +2263,35 @@ Decision:
   preserves F32 math and split semantics. Gate it with MoE/dense md5,
   `MUL_MAT`, `MUL_MAT_ID`, and KL validation if either md5 changes before any
   serving benchmark.
+
+## Phase 39 Gate Fusion Feasibility
+
+Phase 39 checked whether the Phase38 follow-up should be a quick graph-time
+fused gate projection.
+
+Artifacts:
+
+- `/home/mudler/bench/phase37_cublas_name_trace/20260701_083227`
+- `/home/mudler/bench/phase27_graph_node_serving/20260701_055519`
+- `/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis`
+
+Evidence:
+
+| source | result |
+|--------|--------|
+| Phase37 route trace | `sgemm=1212`, with per-layer `ffn_gate_inp.weight -> ffn_moe_logits` and `ffn_gate_inp_shexp.weight -> shared_expert_gate` entries |
+| Phase27 serving profile | total kernel time `20.0372s` |
+| Phase27 serving profile | `concat_layout=459.84ms` (`2.29%`, `2250` instances) |
+| Phase27 serving profile | `cublas_bf16_gemm=1892.81ms` (`9.45%`) and `cutlass_bf16_gemm=684.01ms` (`3.41%`) |
+
+Decision:
+
+- Reject the quick graph-time fused gate shortcut based on `ggml_concat()` of
+  the two gate weights. `concat_layout` is already a measurable serving bucket,
+  so adding graph-time weight concatenation risks moving work into an existing
+  bottleneck before removing enough SGEMM overhead.
+- The only acceptable future fused-gate design is a persistent/load-time F32
+  combined gate weight, split by output views after one matmul. It must be
+  default-off, keep gate weights in F32, avoid graph-time weight concat, and
+  pass MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` gates before any serving
+  benchmark. If md5 changes, run KL first and reject on KL regression.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 45c2a67bb..6c190c2fb 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -502,6 +502,17 @@ weight concatenation, not BF16/NVFP4 routing. Future fused-gate work must be
 default-off, preserve F32 semantics, and pass md5/op gates before benchmarking;
 if md5 changes, run KL first.
 
+Phase 39 closes the naive fused-gate shortcut. Artifact:
+`/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis`. Re-analysis
+of the Phase27 graph-node serving profile showed total kernel time `20.0372s`,
+`concat_layout=459.84ms` (`2.29%`, `2250` instances), `cublas_bf16_gemm=1892.81ms`
+(`9.45%`), and `cutlass_bf16_gemm=684.01ms` (`3.41%`). Do not implement
+graph-time `ggml_concat()` of `ffn_gate_inp.weight` plus
+`ffn_gate_inp_shexp.weight`; it risks increasing an existing layout-copy bucket.
+The only future fused-gate design worth scoping is a persistent/load-time F32
+combined gate weight with output views, default-off until MoE/dense md5,
+`MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates pass.
+
 ---
 
 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -582,6 +593,8 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase36_cublas_route_trace/20260701_081228` - default-off cuBLAS subroute trace patch `0062`; default/trace/post-serving md5 and op gates green; n128 route trace found `bf16_tc=5681`, `sgemm=2511`.
 - `~/bench/phase37_cublas_name_trace/20260701_083227` - cuBLAS tensor-name trace patch `0063`; default/trace/post-serving md5 and op gates green; n128 trace identified `sgemm` as MoE gate logits and shared-expert gate projections.
 - `~/bench/phase38_gate_baseline/20260701_084410` - current Phase37 build baseline before gate-projection policy work; docker/local-ai-worker/GPU idle preflight green; MoE/dense md5 green; `MUL_MAT` `1146/1146`; `MUL_MAT_ID` `806/806`.
+- `~/bench/phase39_gate_sgemm_profile/20260701_085211` - short completion profile, diagnostic only because `-n 32` is not a canonical md5 gate; useful for confirming graph-time concat is a real kernel path.
+- `~/bench/phase39_gate_sgemm_profile/phase27_reanalysis` - Phase27 serving profile re-analysis used to reject graph-time fused gate weight concat; `concat_layout=459.84ms` (`2.29%`) in the serving kernel window.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 8f9c1a915..680e19e3f 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -996,6 +996,22 @@ that computes both logits in one matmul and splits the output, but it must pass
 MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates before benchmarking. If md5
 changes, run the KL gate first and reject on any KL regression.
 
+### Phase 39 gate fusion feasibility
+
+Phase 39 rejected the tempting low-conflict implementation of the Phase38 idea:
+do not build a graph-time `ggml_concat()` of `ffn_gate_inp.weight` and
+`ffn_gate_inp_shexp.weight` just to issue one combined gate matmul. Phase37
+proved the named `sgemm` bucket is the two gate projections, but Phase27's
+graph-node serving profile already has `concat_layout=459.84ms` (`2.29%`,
+`2250` instances) in a `20.0372s` kernel window. Adding another concat path for
+weights would likely trade one small SGEMM shortcut for more layout-copy work.
+
+The follow-up remains valid only in the persistent-weight form: create a
+load-time F32 combined gate tensor, run one matmul, and view/split the output
+into `ffn_moe_logits` and `shared_expert_gate`. That is a model-loader/weight
+layout feature, not a graph shortcut. It must stay default-off until MoE/dense
+md5, `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates pass.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
diff --git a/docs/superpowers/plans/2026-07-01-gate-fusion-feasibility-phase39.md b/docs/superpowers/plans/2026-07-01-gate-fusion-feasibility-phase39.md
new file mode 100644
index 000000000..a9ec72dbf
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-gate-fusion-feasibility-phase39.md
@@ -0,0 +1,158 @@
+# Gate Fusion Feasibility Phase39 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** decide whether to implement a quick fused F32 router/shared-expert gate projection after Phase38.
+
+**Architecture:** Phase39 is evidence-first and source-conservative. It compares the Phase37 tensor-name trace, the Phase27 graph-node serving profile, and llama.cpp graph/model-loader capabilities. It rejects graph-time weight concatenation because it would add layout-copy work in a bucket that is already measurable, and scopes the only acceptable follow-up as a persistent/load-time combined-weight design with md5/op/KL gates.
+
+**Tech Stack:** LocalAI paged llama.cpp backend, llama.cpp CUDA fork, DGX GB10, Nsight Systems, vLLM Qwen3-Next fused-MoE source comparison.
+
+---
+
+### Task 1: Inspect current graph/model support
+
+**Files:**
+- Read: `/home/mudler/_git/llama.cpp/ggml/include/ggml.h`
+- Read: `/home/mudler/_git/llama.cpp/src/models/qwen35moe.cpp`
+- Read: `/home/mudler/_git/llama.cpp/src/llama-graph.cpp`
+- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fused_moe/runner/moe_runner.py`
+
+- [x] **Step 1: Confirm llama.cpp gate tensors**
+
+`qwen35moe.cpp` creates:
+
+```cpp
+layer.ffn_gate_inp = create_tensor(..., { n_embd, n_expert }, flags);
+layer.ffn_gate_inp_shexp = create_tensor(..., { n_embd }, flags);
+```
+
+and computes:
+
+```cpp
+build_moe_ffn(cur, model.layers[il].ffn_gate_inp, ...)
+build_lora_mm(model.layers[il].ffn_gate_inp_shexp, cur)
+```
+
+- [x] **Step 2: Confirm ggml graph-time concat is available but not free**
+
+`ggml.h` exposes `ggml_concat()` and `ggml_view_*()`, so a graph-time fused
+gate is syntactically possible. It would require building a temporary combined
+weight in the compute graph unless the model loader creates a persistent
+combined tensor.
+
+- [x] **Step 3: Confirm vLLM's relevant idea**
+
+vLLM's fused-MoE runner concatenates router and shared-expert gate weights into
+`_combined_gate_weight`. The useful design pattern is persistent F32 combined
+gate weight, not BF16/NVFP4 routing.
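The combined-weight idea from Step 3 can be illustrated with a minimal CPU sketch. It assumes a row-major F32 layout and a single activation vector, and every name in it (`matvec_f32`, `combine_gate_weights`, `fused_gate`, `GateOutputs`) is hypothetical: none of them is llama.cpp, ggml, or vLLM API. The router rows and the single shared-expert gate row are stacked once, one product is computed, and the result is split into the two outputs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration only; not llama.cpp, ggml, or vLLM API.

// y = W * x for a row-major W of shape [rows x n_embd], accumulated in F32.
static std::vector<float> matvec_f32(const std::vector<float>& w, std::size_t rows,
                                     std::size_t n_embd, const std::vector<float>& x) {
    assert(w.size() == rows * n_embd && x.size() == n_embd);
    std::vector<float> y(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t k = 0; k < n_embd; ++k) {
            y[r] += w[r * n_embd + k] * x[k];
        }
    }
    return y;
}

// Done once (load time in the proposed design): stack the n_expert router rows
// and the single shared-expert gate row into one [(n_expert + 1) x n_embd] weight.
static std::vector<float> combine_gate_weights(const std::vector<float>& router_w,
                                               const std::vector<float>& shexp_w) {
    std::vector<float> combined(router_w);
    combined.insert(combined.end(), shexp_w.begin(), shexp_w.end());
    return combined;
}

struct GateOutputs {
    std::vector<float> moe_logits;  // first n_expert outputs
    float shared_expert_gate;       // last output
};

// One product over the combined weight, then split the result into two outputs.
static GateOutputs fused_gate(const std::vector<float>& combined, std::size_t n_expert,
                              std::size_t n_embd, const std::vector<float>& x) {
    const std::vector<float> y = matvec_f32(combined, n_expert + 1, n_embd, x);
    const auto split = y.begin() + static_cast<std::ptrdiff_t>(n_expert);
    return { std::vector<float>(y.begin(), split), y[n_expert] };
}
```

In this naive loop each output row is accumulated in the same order whether or not the weights are stacked, so the fused and separate results agree exactly. A GPU GEMM may pick a different kernel or reduction order for the larger shape, which is presumably why the plan keeps the md5 and KL gates in front of any real implementation.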
+
+### Task 2: Reuse existing serving evidence
+
+**Files:**
+- Artifact: `dgx.casa:/home/mudler/bench/phase37_cublas_name_trace/20260701_083227`
+- Artifact: `dgx.casa:/home/mudler/bench/phase27_graph_node_serving/20260701_055519`
+- Artifact: `dgx.casa:/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis`
+
+- [x] **Step 1: Read Phase37 route-name evidence**
+
+Observed:
+
+```text
+2884 route=bf16_tc
+1212 route=sgemm
+16 route=sgemm type=0 src0=blk.N.ffn_gate_inp.weight src1=attn_post_norm-N dst=ffn_moe_logits-N
+16 route=sgemm type=0 src0=blk.N.ffn_gate_inp_shexp.weight src1=attn_post_norm-N dst=shared_expert_gate-N
+```
+
+- [x] **Step 2: Re-analyze Phase27 graph-node serving profile**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail; ART=/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis; SRC=/home/mudler/bench/phase27_graph_node_serving/20260701_055519/llama_graph_node.nsys-rep; mkdir -p "$ART"; nsys stats --report cuda_gpu_kern_sum,cuda_api_sum --format csv --output "$ART/phase27" "$SRC"'
+```
+
+Observed serving kernel buckets:
+
+```text
+TOTAL kernel time: 20.0372 s
+cublas_bf16_gemm 1892.81ms 9.45%
+cutlass_bf16_gemm 684.01ms 3.41%
+concat_layout 459.84ms 2.29%
+```
+
+Top raw kernel evidence includes:
+
+```text
+concat_non_cont 459.84ms 2.3% 2250 instances
+```
+
+### Task 3: Decision
+
+**Files:**
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+
+- [x] **Step 1: Reject graph-time fused gate via `ggml_concat`**
+
+Do not implement a quick graph-time combined gate that concatenates
+`ffn_gate_inp` and `ffn_gate_inp_shexp` inside the compute graph. It risks
+adding work to the existing `concat_layout` bucket (`459.84ms`, `2.29%`) before
+removing enough SGEMM overhead, and it would be a high-conflict graph/model edit
+without clear upside.
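The bucket shares quoted in the rejection can be rechecked from the reported totals. The following is a small sketch with hypothetical helper names (`bucket_share_percent`, `matches_reported`), assuming bucket times are reported in milliseconds and the kernel window in seconds, as in the profile summary above.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helpers for illustration; not part of any profiling tool.

// Share of the kernel window, in percent, for a bucket reported in milliseconds
// against a total reported in seconds.
static double bucket_share_percent(double bucket_ms, double total_s) {
    assert(total_s > 0.0);
    return bucket_ms / (total_s * 1000.0) * 100.0;
}

// True when the recomputed share is within half a unit of the last printed
// decimal (two decimals) of the reported percentage.
static bool matches_reported(double bucket_ms, double total_s, double reported_percent) {
    return std::fabs(bucket_share_percent(bucket_ms, total_s) - reported_percent) < 0.005;
}
```

With the figures above, `matches_reported(459.84, 20.0372, 2.29)` holds, as do the corresponding checks for `cublas_bf16_gemm` (`9.45%`) and `cutlass_bf16_gemm` (`3.41%`), so the three reported percentages are consistent with the `20.0372 s` window.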
+
+- [x] **Step 2: Preserve the only acceptable follow-up shape**
+
+The only follow-up worth scoping is a persistent/load-time F32 combined gate
+weight:
+
+```text
+combined_gate_weight = concat_rows(ffn_gate_inp.weight,
+                                   ffn_gate_inp_shexp.weight)
+```
+
+Requirements:
+
+- default-off until gates pass;
+- no BF16/NVFP4 conversion for gate weights;
+- no graph-time weight concat;
+- split combined output into `ffn_moe_logits` and `shared_expert_gate` views;
+- MoE/dense md5 must match before serving benchmarks;
+- `MUL_MAT` and `MUL_MAT_ID` must pass;
+- if md5 changes, run KL first and reject on KL regression.
+
+### Task 4: Verify and commit docs
+
+**Files:**
+- Modify: `docs/superpowers/plans/2026-07-01-gate-fusion-feasibility-phase39.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+
+- [x] **Step 1: Check docs diff**
+
+Run:
+
+```bash
+git diff -- docs/superpowers/plans/2026-07-01-gate-fusion-feasibility-phase39.md \
+  backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+  backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
+  backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+```
+
+Expected: only Phase39 documentation changes.
+
+- [x] **Step 2: Commit**
+
+Run:
+
+```bash
+git add -f docs/superpowers/plans/2026-07-01-gate-fusion-feasibility-phase39.md
+git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+  backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
+  backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+git commit -m "docs(paged): reject graph-time gate fusion shortcut" \
+  -m "Assisted-by: Codex:gpt-5"
+```