docs(paged): reject persistent gate fusion shortcut

Assisted-by: Codex:gpt-5
2026-07-02 20:37:03 -04:00 · 2026-07-01 07:34:14 +00:00
parent b9eff5bca3
commit ecaf406c0b
4 changed files with 169 additions and 12 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2469,3 +2469,35 @@ Next target:
 - It must be default-off, fork-first, and validated with MoE/dense md5,
  `MUL_MAT`, `MUL_MAT_ID`, and KL if either md5 changes before any serving
  benchmark.
+
+## Phase 43 Persistent Gate Fusion Feasibility
+
+Phase 43 checked whether the Phase 42 "small source candidate" can actually be
+implemented as a low-conflict persistent/load-time combined gate tensor.
+
+Source facts:
+
+| path | finding |
+|------|---------|
+| `src/models/qwen35moe.cpp` | `ffn_gate_inp.weight` is loaded as `[n_embd, n_expert]`; `ffn_gate_inp_shexp.weight` is loaded separately as `[n_embd]` |
+| `src/models/qwen35moe.cpp` | the routed gate is consumed inside `build_moe_ffn(...)`; the shared-expert gate is consumed later as a separate `build_lora_mm(ffn_gate_inp_shexp, cur)` |
+| `src/llama-model-loader.cpp` | `create_tensor(...)` duplicates tensors from GGUF metadata and allocates backend buffers before `load_all_data(...)`; it has `create_tensor_as_view(...)` for views of existing GGUF tensors, not for new persistent derived tensors |
+| `src/llama-model.cpp` | backend buffers are allocated from loader contexts before tensor data is loaded; adding a new persistent derived weight requires a new derived-weight allocation/materialization path, not a local Qwen graph change |
+
+Decision:
+
+- Reject persistent/load-time fused gate projection as a "small" GB10 shortcut.
+  It is only low-conflict if the combined weight already exists in the GGUF, or
+  if llama.cpp gains a general derived-weight facility. Neither is true in the
+  current fork.
+- Do not fall back to graph-time `ggml_concat()`; Phase39 already rejected it
+  because the `concat_layout` cost is measurable in serving.
+- Do not implement a Qwen-only loader hack that reads both tensors back to host,
+  allocates an extra backend weight buffer, and patches layer pointers after
+  load. That is a high conflict surface for a gate-only SGEMM bucket and would
+  need new lifetime/state-management tests across mmap, offload, split buffers,
+  and MTP blocks.
+- The remaining GB10 parity work is no longer a shortcut patch. It is either a
+  larger funded kernel/loader effort with its own design, or a hardware pivot
+  benchmark. Any future implementation still needs the canonical MoE/dense md5,
+  `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates before benchmarking.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -541,12 +541,21 @@ regressed TTFT/end-to-end throughput.
 Phase 42 reconciles the target list after parallel read-only review. D1 is
 closed on the current GB10 path; GDN low-conflict work is exhausted after
 `0046`/`0047` plus the rejected C32/QS-early/Global-Ai32 follow-ups; W4A16/GEMM
-micro-tweaks are exhausted after `0033`-`0035` and `0048`-`0050`. The next small
-GB10 source candidate is the Phase38/39 persistent/load-time F32 combined gate
-projection: combine `ffn_gate_inp.weight` and `ffn_gate_inp_shexp.weight` once,
-run one F32 gate matmul, split/view outputs, default-off, no graph-time
-`ggml_concat()`, and gate with MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` before
-benchmarking. If md5 changes, run KL first.
+micro-tweaks are exhausted after `0033`-`0035` and `0048`-`0050`. Phase 42
+nominated the Phase38/39 persistent/load-time F32 combined gate projection as
+the last small GB10 source candidate.
+
+Phase 43 rejects that gate-fusion candidate as a small shortcut after source
+inspection. `ffn_gate_inp.weight` and `ffn_gate_inp_shexp.weight` are separate
+GGUF tensors; the Qwen35MoE graph consumes them in separate matmuls; the loader
+can create tensors from GGUF metadata or views of existing tensors, but not a
+new persistent derived concatenated weight. A correct implementation would need
+a general derived-weight allocation/materialization path across mmap, offload,
+split buffers, and MTP blocks. Do not implement a Qwen-only loader hack, and do
+not fall back to graph-time `ggml_concat()`. After Phase43 there is no remaining
+low-conflict GB10 shortcut justified by current evidence; future work is either
+a larger kernel/loader design or a hardware-pivot benchmark, still gated by
+MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` and KL if md5 changes.

 ---

--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1081,12 +1081,20 @@ reviews:
  improved forced W4A16 only marginally and still did not beat default MMQ. Do
  not add another small W4A16 body/metadata tweak.

-The next small source candidate, if we stay on GB10, is the persistent/load-time
-F32 combined gate projection from Phase38/39: combine `ffn_gate_inp.weight` and
-`ffn_gate_inp_shexp.weight` once, issue one F32 gate matmul, and split/view the
-outputs. It must be default-off, avoid graph-time `ggml_concat()`, and pass
-MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID`; if md5 changes, run KL before any
-serving benchmark.
+Phase 43 checked the Phase38/39 combined gate projection against the Qwen35MoE
+model-loader path and rejected it as a small shortcut. `ffn_gate_inp.weight` and
+`ffn_gate_inp_shexp.weight` are separate GGUF tensors consumed by separate graph
+matmuls; `create_tensor(...)` only materializes tensors from GGUF metadata, and
+`create_tensor_as_view(...)` can view existing tensors but cannot create a new
+persistent concatenated derived weight. A correct load-time combined gate would
+need a general derived-weight allocation/materialization path across mmap,
+offload, split buffers, and MTP blocks. Do not implement a Qwen-only loader hack,
+and do not fall back to graph-time `ggml_concat()`.
+
+The resulting GB10 state after Phase43: no remaining low-conflict shortcut patch
+is justified by the current evidence. Future work needs either a larger funded
+kernel/loader design or a hardware-pivot benchmark, with the canonical
+MoE/dense md5, `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates.

 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

--- a/docs/superpowers/plans/2026-07-01-persistent-gate-fusion-phase43.md
+++ b/docs/superpowers/plans/2026-07-01-persistent-gate-fusion-phase43.md
@@ -0,0 +1,108 @@
+# Persistent Gate Fusion Phase43 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Determine whether the Phase42 persistent/load-time F32 combined gate projection can be implemented as a low-conflict GB10 shortcut.
+
+**Architecture:** Inspect the Qwen35MoE tensor load and graph consumption paths, then decide whether to implement, reject, or rescope before source changes. This phase is a feasibility gate, not a production patch.
+
+**Tech Stack:** llama.cpp model loader, Qwen35MoE graph builder, GGUF tensor metadata, LocalAI parity docs.
+
+---
+
+### Task 1: Inspect Gate Tensor Source Paths
+
+**Files:**
+- Read: `/home/mudler/_git/llama.cpp/src/models/qwen35moe.cpp`
+- Read: `/home/mudler/_git/llama.cpp/src/llama-model-loader.cpp`
+- Read: `/home/mudler/_git/llama.cpp/src/llama-model.cpp`
+- Read: `/home/mudler/_git/llama.cpp/src/llama-model.h`
+
+- [x] **Step 1: Locate tensor creation**
+
+Observed in `src/models/qwen35moe.cpp`:
+
+```cpp
+layer.ffn_gate_inp = create_tensor(tn(LLM_TENSOR_FFN_GATE_INP, "weight", il), { n_embd, n_expert }, flags);
+layer.ffn_gate_inp_shexp = create_tensor(tn(LLM_TENSOR_FFN_GATE_INP_SHEXP, "weight", il), { n_embd }, flags);
+```
+
+- [x] **Step 2: Locate tensor consumption**
+
+Observed:
+
+```cpp
+build_moe_ffn(cur, model.layers[il].ffn_gate_inp, ...);
+ggml_tensor * shared_gate = build_lora_mm(model.layers[il].ffn_gate_inp_shexp, cur);
+```
+
+- [x] **Step 3: Locate loader support for persistent derived tensors**
+
+Observed:
+
+```text
+create_tensor(...) duplicates tensors from GGUF metadata.
+create_tensor_as_view(...) can create views of existing GGUF tensors.
+Backend buffers are allocated from loader contexts before load_all_data(...).
+No existing helper creates a new persistent derived weight from two already-loaded tensors.
+```
+
+### Task 2: Make Feasibility Decision
+
+**Files:**
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+- Create: `docs/superpowers/plans/2026-07-01-persistent-gate-fusion-phase43.md`
+
+- [x] **Step 1: Reject graph-time fallback**
+
+Decision:
+
+```text
+Do not use ggml_concat() at graph time; Phase39 already rejected it because the concat_layout cost is measurable in serving.
+```
+
+- [x] **Step 2: Reject Qwen-only loader hack**
+
+Decision:
+
+```text
+Do not read both tensors back to host, allocate an extra backend weight buffer, and patch layer pointers after load.
+That would create a high conflict surface across mmap, offload, split buffers, MTP blocks, and state lifetime.
+```
+
+- [x] **Step 3: Record no-go**
+
+Decision:
+
+```text
+Persistent/load-time fused gate projection is not a small GB10 shortcut.
+It requires either a GGUF-exported combined weight or a general derived-weight facility in llama.cpp.
+```
+
+### Task 3: Verify and Commit
+
+**Files:**
+- Modify: `docs/superpowers/plans/2026-07-01-persistent-gate-fusion-phase43.md`
+
+- [x] **Step 1: Verify docs**
+
+Run:
+
+```bash
+git diff --check
+git status --short
+```
+
+- [x] **Step 2: Commit**
+
+Run:
+
+```bash
+git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+  backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
+  backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+git add -f docs/superpowers/plans/2026-07-01-persistent-gate-fusion-phase43.md
+git commit -m "docs(paged): reject persistent gate fusion shortcut" -m "Assisted-by: Codex:gpt-5"
+```