docs(paged): reject persistent gate fusion shortcut

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 07:34:14 +00:00
parent b9eff5bca3
commit ecaf406c0b
4 changed files with 169 additions and 12 deletions


@@ -2469,3 +2469,35 @@ Next target:
- It must be default-off, fork-first, and validated with MoE/dense md5,
`MUL_MAT`, `MUL_MAT_ID`, and KL if either md5 changes before any serving
benchmark.
## Phase 43 Persistent Gate Fusion Feasibility
Phase 43 checked whether the Phase 42 "small source candidate" can really be
implemented as a low-conflict persistent/load-time combined gate tensor.
Source facts:
| path | finding |
|------|---------|
| `src/models/qwen35moe.cpp` | `ffn_gate_inp.weight` is loaded as `[n_embd, n_expert]`; `ffn_gate_inp_shexp.weight` is loaded separately as `[n_embd]` |
| `src/models/qwen35moe.cpp` | the routed gate is consumed inside `build_moe_ffn(...)`; the shared-expert gate is consumed later as a separate `build_lora_mm(ffn_gate_inp_shexp, cur)` |
| `src/llama-model-loader.cpp` | `create_tensor(...)` duplicates tensors from GGUF metadata and allocates backend buffers before `load_all_data(...)`; it has `create_tensor_as_view(...)` for views of existing GGUF tensors, not for new persistent derived tensors |
| `src/llama-model.cpp` | backend buffers are allocated from loader contexts before tensor data is loaded; adding a new persistent derived weight requires a new derived-weight allocation/materialization path, not a local Qwen graph change |
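
To make the shapes in the table concrete, here is a minimal pure-Python sketch of the arithmetic a combined gate projection would perform. It is not llama.cpp code: `matvec`, `fuse_gate_weights`, `split_gate_outputs`, and the toy dimensions are hypothetical names chosen for illustration, and the weights are stored one row per output for readability rather than in ggml's layout. It only shows that stacking the shared-expert gate vector as one extra row under the router matrix gives the same outputs as the two separate matmuls. It says nothing about how such a weight would be allocated or persisted at load time.

```python
def matvec(rows, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def fuse_gate_weights(router_rows, shexp_row):
    """Stack the shared-expert gate row under the routed gate rows.

    router_rows: n_expert rows of length n_embd
                 (toy stand-in for ffn_gate_inp.weight)
    shexp_row:   one row of length n_embd
                 (toy stand-in for ffn_gate_inp_shexp.weight)
    """
    n_embd = len(shexp_row)
    if any(len(row) != n_embd for row in router_rows):
        raise ValueError("all gate rows must have length n_embd")
    return [list(row) for row in router_rows] + [list(shexp_row)]


def split_gate_outputs(fused_out, n_expert):
    """Split the fused result into (router logits, shared-expert gate logit)."""
    return fused_out[:n_expert], fused_out[n_expert]


# Toy dimensions: n_embd = 3, n_expert = 2 (hypothetical values).
router = [[1.0, 0.0, 2.0],
          [0.5, -1.0, 0.0]]
shexp = [0.25, 0.25, 0.25]
x = [2.0, 4.0, 6.0]

fused = fuse_gate_weights(router, shexp)
router_logits, shexp_logit = split_gate_outputs(matvec(fused, x), n_expert=len(router))

# One fused projection reproduces the two separate projections.
assert router_logits == matvec(router, x)
assert shexp_logit == matvec([shexp], x)[0]
```

In a real graph the split would be a view of the fused output rather than a copy; the sketch uses list slicing only to keep the example self-contained.
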

Decision:
- Reject persistent/load-time fused gate projection as a "small" GB10 shortcut.
It is only low-conflict if the combined weight already exists in the GGUF, or
if llama.cpp gains a general derived-weight facility. Neither is true in the
current fork.
- Do not fall back to graph-time `ggml_concat()`; Phase 39 already rejected that
because the `concat_layout` cost is measurable in serving.
- Do not implement a Qwen-only loader hack that reads both tensors back to host,
allocates an extra backend weight buffer, and patches layer pointers after
load. That creates a large conflict surface for a gate-only SGEMM bucket and
would need new lifetime/state-management tests across mmap, offload, split
buffers, and MTP blocks.
- The remaining GB10 parity work is no longer a shortcut patch. It is either a
larger funded kernel/loader effort with its own design, or a hardware pivot
benchmark. Any future implementation still needs the canonical MoE/dense md5,
`MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates before benchmarking.
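
The gate ordering restated in the last bullet can be sketched as a small decision function. This is an illustrative outline only: `GateReport`, its field names, and `may_benchmark` are hypothetical and do not correspond to any script in the fork. It assumes each check has already been run and summarised as a boolean, and it reads "KL if md5 changes" as meaning that a changed md5 does not fail the patch by itself but makes a passing KL check mandatory.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateReport:
    """Hypothetical summary of the pre-benchmark checks for one patch."""
    moe_md5_matches: bool          # MoE output md5 equals the baseline md5
    dense_md5_matches: bool        # dense output md5 equals the baseline md5
    mul_mat_ok: bool               # MUL_MAT op test passed
    mul_mat_id_ok: bool            # MUL_MAT_ID op test passed
    kl_ok: Optional[bool] = None   # KL check result, None if it was not run


def may_benchmark(report: GateReport) -> bool:
    """Return True only if every required gate passed."""
    if not (report.mul_mat_ok and report.mul_mat_id_ok):
        return False
    md5_changed = not (report.moe_md5_matches and report.dense_md5_matches)
    if md5_changed:
        # Assumption: a changed md5 requires KL to have run and passed.
        return report.kl_ok is True
    return True
```

For example, `may_benchmark(GateReport(True, False, True, True))` returns `False` because the dense md5 changed and no KL result was recorded.
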


@@ -541,12 +541,21 @@ regressed TTFT/end-to-end throughput.
Phase 42 reconciles the target list after parallel read-only review. D1 is
closed on the current GB10 path; GDN low-conflict work is exhausted after
`0046`/`0047` plus the rejected C32/QS-early/Global-Ai32 follow-ups; W4A16/GEMM
micro-tweaks are exhausted after `0033`-`0035` and `0048`-`0050`. The next small
GB10 source candidate is the Phase38/39 persistent/load-time F32 combined gate
projection: combine `ffn_gate_inp.weight` and `ffn_gate_inp_shexp.weight` once,
run one F32 gate matmul, split/view outputs, default-off, no graph-time
`ggml_concat()`, and gate with MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` before
benchmarking. If md5 changes, run KL first.
micro-tweaks are exhausted after `0033`-`0035` and `0048`-`0050`. It nominated
the Phase38/39 persistent/load-time F32 combined gate projection as the last
small GB10 source candidate.
Phase 43 rejects that gate-fusion candidate as a small shortcut after source
inspection. `ffn_gate_inp.weight` and `ffn_gate_inp_shexp.weight` are separate
GGUF tensors; the Qwen35MoE graph consumes them in separate matmuls; the loader
can create tensors from GGUF metadata or views of existing tensors, but not a
new persistent derived concatenated weight. A correct implementation would need
a general derived-weight allocation/materialization path across mmap, offload,
split buffers, and MTP blocks. Do not implement a Qwen-only loader hack, and do
not fall back to graph-time `ggml_concat()`. After Phase 43 there is no remaining
low-conflict GB10 shortcut justified by current evidence; future work is either
a larger kernel/loader design or a hardware-pivot benchmark, still gated by
MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` and KL if md5 changes.
---


@@ -1081,12 +1081,20 @@ reviews:
improved forced W4A16 only marginally and still did not beat default MMQ. Do
not add another small W4A16 body/metadata tweak.
The next small source candidate, if we stay on GB10, is the persistent/load-time
F32 combined gate projection from Phase38/39: combine `ffn_gate_inp.weight` and
`ffn_gate_inp_shexp.weight` once, issue one F32 gate matmul, and split/view the
outputs. It must be default-off, avoid graph-time `ggml_concat()`, and pass
MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID`; if md5 changes, run KL before any
serving benchmark.
Phase 43 then checked that candidate against the actual Qwen35MoE model-loader
path and rejected it as a small shortcut. `ffn_gate_inp.weight` and
`ffn_gate_inp_shexp.weight` are separate GGUF tensors consumed by separate graph
matmuls; `create_tensor(...)` only materializes tensors from GGUF metadata, and
`create_tensor_as_view(...)` can view existing tensors but cannot create a new
persistent concatenated derived weight. A correct load-time combined gate would
need a general derived-weight allocation/materialization path across mmap,
offload, split buffers, and MTP blocks. Do not implement a Qwen-only loader hack,
and do not fall back to graph-time `ggml_concat()`.
The resulting GB10 state after Phase 43: no remaining low-conflict shortcut patch
is justified by the current evidence. Future work needs either a larger funded
kernel/loader design or a hardware-pivot benchmark, with the canonical
MoE/dense md5, `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates.
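
For readers unfamiliar with the KL gate, the following is a minimal sketch of a per-position KL divergence between a baseline and a patched logit vector. It assumes "KL" here means the Kullback-Leibler divergence between the two next-token distributions; the documents above do not spell out the exact procedure or threshold, so the function names and the choice of nats are illustrative assumptions, not the fork's tooling.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def kl_divergence(base_logits, test_logits):
    """KL(P_base || P_test) in nats for a single token position."""
    if len(base_logits) != len(test_logits):
        raise ValueError("logit vectors must have the same vocabulary size")
    log_p = log_softmax(base_logits)
    log_q = log_softmax(test_logits)
    # Sum p * (log p - log q); working in log space avoids dividing by zero.
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

Identical logits give a divergence of exactly zero, so any nonzero value after an md5 change measures how far the patched output distribution moved from the baseline.
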
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.