docs(paged): reject persistent gate fusion shortcut

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 07:34:14 +00:00
parent b9eff5bca3
commit ecaf406c0b
4 changed files with 169 additions and 12 deletions


@@ -2469,3 +2469,35 @@ Next target:
- It must be default-off, fork-first, and validated with MoE/dense md5,
`MUL_MAT`, `MUL_MAT_ID`, and KL if either md5 changes before any serving
benchmark.
## Phase 43 Persistent Gate Fusion Feasibility
Phase 43 checked whether the Phase 42 "small source candidate" can really be
implemented as a low-conflict persistent/load-time combined gate tensor.
Source facts:
| path | finding |
|------|---------|
| `src/models/qwen35moe.cpp` | `ffn_gate_inp.weight` is loaded as `[n_embd, n_expert]`; `ffn_gate_inp_shexp.weight` is loaded separately as `[n_embd]` |
| `src/models/qwen35moe.cpp` | the routed gate is consumed inside `build_moe_ffn(...)`; the shared-expert gate is consumed later as a separate `build_lora_mm(ffn_gate_inp_shexp, cur)` |
| `src/llama-model-loader.cpp` | `create_tensor(...)` duplicates tensors from GGUF metadata and allocates backend buffers before `load_all_data(...)`; it has `create_tensor_as_view(...)` for views of existing GGUF tensors, not for new persistent derived tensors |
| `src/llama-model.cpp` | backend buffers are allocated from loader contexts before tensor data is loaded; adding a new persistent derived weight requires a new derived-weight allocation/materialization path, not a local Qwen graph change |
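For orientation, the arithmetic behind the fusion candidate can be sketched in plain C++. This is an illustration only: it uses flat `float` buffers instead of ggml tensors, it assumes the `[n_embd, n_expert]` router weight is stored as `n_expert` contiguous rows of `n_embd` values, and the function names (`separate_gates`, `combine_gate_weights`, `fused_gates`) are hypothetical and do not exist in llama.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration: plain row-major buffers, not ggml tensors.
// router_w: n_expert rows of n_embd weights. shexp_w: one row of n_embd weights.

// Reference path: two separate projections, mirroring the two graph matmuls.
std::vector<float> separate_gates(const std::vector<float> & router_w,
                                  const std::vector<float> & shexp_w,
                                  const std::vector<float> & x,
                                  std::size_t n_expert) {
    const std::size_t n_embd = x.size();
    assert(router_w.size() == n_expert * n_embd && shexp_w.size() == n_embd);
    std::vector<float> out(n_expert + 1, 0.0f);
    for (std::size_t e = 0; e < n_expert; ++e) {
        for (std::size_t i = 0; i < n_embd; ++i) {
            out[e] += router_w[e * n_embd + i] * x[i];
        }
    }
    for (std::size_t i = 0; i < n_embd; ++i) {
        out[n_expert] += shexp_w[i] * x[i];
    }
    return out;
}

// Fused path, step 1: append the shared-expert row once (the "load-time" step).
std::vector<float> combine_gate_weights(const std::vector<float> & router_w,
                                        const std::vector<float> & shexp_w) {
    std::vector<float> combined(router_w);
    combined.insert(combined.end(), shexp_w.begin(), shexp_w.end());
    return combined;
}

// Fused path, step 2: one projection over all n_expert + 1 rows.
std::vector<float> fused_gates(const std::vector<float> & combined_w,
                               const std::vector<float> & x) {
    const std::size_t n_embd = x.size();
    assert(n_embd > 0 && combined_w.size() % n_embd == 0);
    const std::size_t n_rows = combined_w.size() / n_embd;
    std::vector<float> out(n_rows, 0.0f);
    for (std::size_t r = 0; r < n_rows; ++r) {
        for (std::size_t i = 0; i < n_embd; ++i) {
            out[r] += combined_w[r * n_embd + i] * x[i];
        }
    }
    return out;
}
```

In this sketch the first `n_expert` outputs are the router logits and the last output is the shared-expert gate. The arithmetic is the easy part; the findings in the table concern where the combined buffer would be allocated and kept alive inside the loader.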
Decision:
- Reject persistent/load-time fused gate projection as a "small" GB10 shortcut.
It is only low-conflict if the combined weight already exists in the GGUF, or
if llama.cpp gains a general derived-weight facility. Neither is true in the
current fork.
- Do not fall back to graph-time `ggml_concat()`; Phase 39 already rejected that
because the `concat_layout` cost is measurable in serving.
- Do not implement a Qwen-only loader hack that reads both tensors back to host,
allocates an extra backend weight buffer, and patches layer pointers after
load. That is a large conflict surface for a gate-only SGEMM bucket, and it
would need new lifetime/state-management tests across mmap, offload, split
buffers, and MTP blocks.
- The remaining GB10 parity work is no longer a shortcut patch. It is either a
larger funded kernel/loader effort with its own design, or a hardware pivot
benchmark. Any future implementation still needs the canonical MoE/dense md5,
`MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates before benchmarking.
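The ordering of the validation gates named above can be written down as a small decision function. This is a sketch under assumptions: the `GateResults` struct, its field names, and `may_benchmark` are hypothetical and do not correspond to any existing script or tool output; only the rule itself (md5 on MoE and dense, `MUL_MAT`, `MUL_MAT_ID`, and KL when either md5 changes) comes from the text.

```cpp
#include <cassert>

// Hypothetical record of one validation run (illustrative field names).
struct GateResults {
    bool moe_md5_unchanged;    // MoE model output md5 matches the baseline
    bool dense_md5_unchanged;  // dense model output md5 matches the baseline
    bool mul_mat_passed;       // MUL_MAT op test passed
    bool mul_mat_id_passed;    // MUL_MAT_ID op test passed
    bool kl_run;               // whether the KL comparison was executed
    bool kl_passed;            // only meaningful when kl_run is true
};

// KL becomes mandatory as soon as either md5 differs from the baseline.
bool kl_required(const GateResults & r) {
    return !(r.moe_md5_unchanged && r.dense_md5_unchanged);
}

// A serving benchmark is allowed only after every applicable gate has passed.
bool may_benchmark(const GateResults & r) {
    assert(r.kl_run || !r.kl_passed);  // a pass without a run is inconsistent input
    if (!r.mul_mat_passed || !r.mul_mat_id_passed) {
        return false;
    }
    if (kl_required(r)) {
        return r.kl_run && r.kl_passed;
    }
    return true;
}
```

With unchanged md5 values and passing op tests, `may_benchmark` returns true without a KL run. If either md5 changes, it returns false until the KL comparison has been run and has passed.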


@@ -541,12 +541,21 @@ regressed TTFT/end-to-end throughput.
Phase 42 reconciles the target list after parallel read-only review. D1 is
closed on the current GB10 path; GDN low-conflict work is exhausted after
`0046`/`0047` plus the rejected C32/QS-early/Global-Ai32 follow-ups; W4A16/GEMM
micro-tweaks are exhausted after `0033`-`0035` and `0048`-`0050`. The next small
GB10 source candidate is the Phase38/39 persistent/load-time F32 combined gate
projection: combine `ffn_gate_inp.weight` and `ffn_gate_inp_shexp.weight` once,
run one F32 gate matmul, split/view outputs, default-off, no graph-time
`ggml_concat()`, and gate with MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` before
benchmarking. If md5 changes, run KL first.
micro-tweaks are exhausted after `0033`-`0035` and `0048`-`0050`. It nominated
the Phase38/39 persistent/load-time F32 combined gate projection as the last
small GB10 source candidate.
Phase 43 rejects that gate-fusion candidate as a small shortcut after source
inspection. `ffn_gate_inp.weight` and `ffn_gate_inp_shexp.weight` are separate
GGUF tensors; the Qwen35MoE graph consumes them in separate matmuls; the loader
can create tensors from GGUF metadata or views of existing tensors, but not a
new persistent derived concatenated weight. A correct implementation would need
a general derived-weight allocation/materialization path across mmap, offload,
split buffers, and MTP blocks. Do not implement a Qwen-only loader hack, and do
not fall back to graph-time `ggml_concat()`. After Phase 43, there is no remaining
low-conflict GB10 shortcut justified by current evidence; future work is either
a larger kernel/loader design or a hardware-pivot benchmark, still gated by
MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` and KL if md5 changes.
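These documents do not spell out how the KL gate is computed. Assuming it is the usual KL divergence between the baseline and patched next-token distributions, the per-token quantity can be sketched as below. The function names are hypothetical, and a real check would aggregate over many token positions and compare against a threshold that is not stated here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable log-softmax over one token position's logits.
std::vector<double> log_softmax(const std::vector<float> & logits) {
    assert(!logits.empty());
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (float v : logits) {
        sum += std::exp(v - max_logit);
    }
    const double log_z = max_logit + std::log(sum);
    std::vector<double> out(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        out[i] = logits[i] - log_z;
    }
    return out;
}

// KL(P_base || P_patched) in nats for one token position.
double kl_divergence(const std::vector<float> & base_logits,
                     const std::vector<float> & patched_logits) {
    assert(base_logits.size() == patched_logits.size());
    const std::vector<double> lp = log_softmax(base_logits);
    const std::vector<double> lq = log_softmax(patched_logits);
    double kl = 0.0;
    for (std::size_t i = 0; i < lp.size(); ++i) {
        kl += std::exp(lp[i]) * (lp[i] - lq[i]);
    }
    return kl;
}
```

Identical logits give a divergence of zero, and any change in the resulting distribution gives a positive value. That is why KL can serve as the follow-up check once an md5 mismatch shows the outputs are no longer bit-identical.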
---


@@ -1081,12 +1081,20 @@ reviews:
improved forced W4A16 only marginally and still did not beat default MMQ. Do
not add another small W4A16 body/metadata tweak.
The next small source candidate, if we stay on GB10, is the persistent/load-time
F32 combined gate projection from Phase38/39: combine `ffn_gate_inp.weight` and
`ffn_gate_inp_shexp.weight` once, issue one F32 gate matmul, and split/view the
outputs. It must be default-off, avoid graph-time `ggml_concat()`, and pass
MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID`; if md5 changes, run KL before any
serving benchmark.
Phase 43 then checked that candidate against the actual Qwen35MoE model-loader
path and rejected it as a small shortcut. `ffn_gate_inp.weight` and
`ffn_gate_inp_shexp.weight` are separate GGUF tensors consumed by separate graph
matmuls; `create_tensor(...)` only materializes tensors from GGUF metadata, and
`create_tensor_as_view(...)` can view existing tensors but cannot create a new
persistent concatenated derived weight. A correct load-time combined gate would
need a general derived-weight allocation/materialization path across mmap,
offload, split buffers, and MTP blocks. Do not implement a Qwen-only loader hack,
and do not fall back to graph-time `ggml_concat()`.
After Phase 43, the GB10 state is that no remaining low-conflict shortcut patch
is justified by the current evidence. Future work needs either a larger funded
kernel/loader design or a hardware-pivot benchmark, and it stays behind the
canonical MoE/dense md5, `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.


@@ -0,0 +1,108 @@
# Persistent Gate Fusion Phase43 Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Determine whether the Phase42 persistent/load-time F32 combined gate projection can be implemented as a low-conflict GB10 shortcut.
**Architecture:** Inspect the Qwen35MoE tensor load and graph consumption paths, then decide whether to implement, reject, or rescope before source changes. This phase is a feasibility gate, not a production patch.
**Tech Stack:** llama.cpp model loader, Qwen35MoE graph builder, GGUF tensor metadata, LocalAI parity docs.
---
### Task 1: Inspect Gate Tensor Source Paths
**Files:**
- Read: `/home/mudler/_git/llama.cpp/src/models/qwen35moe.cpp`
- Read: `/home/mudler/_git/llama.cpp/src/llama-model-loader.cpp`
- Read: `/home/mudler/_git/llama.cpp/src/llama-model.cpp`
- Read: `/home/mudler/_git/llama.cpp/src/llama-model.h`
- [x] **Step 1: Locate tensor creation**
Observed in `src/models/qwen35moe.cpp`:
```cpp
// Routed-expert gate: loaded as [n_embd, n_expert].
layer.ffn_gate_inp = create_tensor(tn(LLM_TENSOR_FFN_GATE_INP, "weight", il), { n_embd, n_expert }, flags);
// Shared-expert gate: loaded separately as [n_embd].
layer.ffn_gate_inp_shexp = create_tensor(tn(LLM_TENSOR_FFN_GATE_INP_SHEXP, "weight", il), { n_embd }, flags);
```
- [x] **Step 2: Locate tensor consumption**
Observed:
```cpp
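// Routed gate: consumed inside build_moe_ffn(...); remaining arguments elided.
build_moe_ffn(cur, model.layers[il].ffn_gate_inp, ...);
// Shared-expert gate: consumed later as its own matmul.
ggml_tensor * shared_gate = build_lora_mm(model.layers[il].ffn_gate_inp_shexp, cur);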
```
- [x] **Step 3: Locate loader support for persistent derived tensors**
Observed:
```text
create_tensor(...) duplicates tensors from GGUF metadata.
create_tensor_as_view(...) can create views of existing GGUF tensors.
Backend buffers are allocated from loader contexts before load_all_data(...).
No existing helper creates a new persistent derived weight from two already-loaded tensors.
```
### Task 2: Make Feasibility Decision
**Files:**
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
- Create: `docs/superpowers/plans/2026-07-01-persistent-gate-fusion-phase43.md`
- [x] **Step 1: Reject graph-time fallback**
Decision:
```text
Do not use ggml_concat() at graph time; Phase 39 already rejected it because the concat_layout cost is measurable in serving.
```
- [x] **Step 2: Reject Qwen-only loader hack**
Decision:
```text
Do not read both tensors back to host, allocate an extra backend weight buffer, and patch layer pointers after load.
That would create high conflict surface across mmap, offload, split buffers, MTP blocks, and state lifetime.
```
- [x] **Step 3: Record no-go**
Decision:
```text
Persistent/load-time fused gate projection is not a small GB10 shortcut.
It requires either a GGUF-exported combined weight or a general derived-weight facility in llama.cpp.
```
### Task 3: Verify and Commit
**Files:**
- Modify: `docs/superpowers/plans/2026-07-01-persistent-gate-fusion-phase43.md`
- [x] **Step 1: Verify docs**
Run:
```bash
git diff --check
git status --short
```
- [x] **Step 2: Commit**
Run:
```bash
git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
git add -f docs/superpowers/plans/2026-07-01-persistent-gate-fusion-phase43.md
git commit -m "docs(paged): reject persistent gate fusion shortcut" -m "Assisted-by: Codex:gpt-5"
```