docs(paged): reject persistent gate fusion shortcut

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 07:34:14 +00:00
parent b9eff5bca3
commit ecaf406c0b
4 changed files with 169 additions and 12 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2469,3 +2469,35 @@ Next target:
 - It must be default-off, fork-first, and validated with MoE/dense md5,
  `MUL_MAT`, `MUL_MAT_ID`, and KL if either md5 changes before any serving
  benchmark.
+
+## Phase 43 Persistent Gate Fusion Feasibility
+
+Phase 43 checked whether the Phase 42 "small source candidate" can really be
+implemented as a low-conflict persistent/load-time combined gate tensor.
+
+Source facts:
+
+| path | finding |
+|------|---------|
+| `src/models/qwen35moe.cpp` | `ffn_gate_inp.weight` is loaded as `[n_embd, n_expert]`; `ffn_gate_inp_shexp.weight` is loaded separately as `[n_embd]` |
+| `src/models/qwen35moe.cpp` | the routed gate is consumed inside `build_moe_ffn(...)`; the shared-expert gate is consumed later as a separate `build_lora_mm(ffn_gate_inp_shexp, cur)` |
+| `src/llama-model-loader.cpp` | `create_tensor(...)` duplicates tensors from GGUF metadata and allocates backend buffers before `load_all_data(...)`; it has `create_tensor_as_view(...)` for views of existing GGUF tensors, not for new persistent derived tensors |
+| `src/llama-model.cpp` | backend buffers are allocated from loader contexts before tensor data is loaded; adding a new persistent derived weight requires a new derived-weight allocation/materialization path, not a local Qwen graph change |
+
+Decision:
+
+- Reject persistent/load-time fused gate projection as a "small" GB10 shortcut.
+  It is only low-conflict if the combined weight already exists in the GGUF, or
+  if llama.cpp gains a general derived-weight facility. Neither is true in the
+  current fork.
+- Do not fall back to graph-time `ggml_concat()`; Phase 39 already rejected that
+  because `concat_layout` is measurable in serving.
+- Do not implement a Qwen-only loader hack that reads both tensors back to host,
+  allocates an extra backend weight buffer, and patches layer pointers after
+  load. That is a high conflict surface for a gate-only SGEMM bucket and would
+  need new lifetime/state-management tests across mmap, offload, split buffers,
+  and MTP blocks.
+- The remaining GB10 parity work is no longer a shortcut patch. It is either a
+  larger funded kernel/loader effort with its own design, or a hardware pivot
+  benchmark. Any future implementation still needs the canonical MoE/dense md5,
+  `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates before benchmarking.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -541,12 +541,21 @@ regressed TTFT/end-to-end throughput.
 Phase 42 reconciles the target list after parallel read-only review. D1 is
 closed on the current GB10 path; GDN low-conflict work is exhausted after
 `0046`/`0047` plus the rejected C32/QS-early/Global-Ai32 follow-ups; W4A16/GEMM
-micro-tweaks are exhausted after `0033`-`0035` and `0048`-`0050`. The next small
-GB10 source candidate is the Phase38/39 persistent/load-time F32 combined gate
-projection: combine `ffn_gate_inp.weight` and `ffn_gate_inp_shexp.weight` once,
-run one F32 gate matmul, split/view outputs, default-off, no graph-time
-`ggml_concat()`, and gate with MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` before
-benchmarking. If md5 changes, run KL first.
+micro-tweaks are exhausted after `0033`-`0035` and `0048`-`0050`. Phase 42
+nominated the Phase 38/39 persistent/load-time F32 combined gate projection as
+the last small GB10 source candidate.
+
+Phase 43 rejects that gate-fusion candidate as a small shortcut after source
+inspection. `ffn_gate_inp.weight` and `ffn_gate_inp_shexp.weight` are separate
+GGUF tensors; the Qwen35MoE graph consumes them in separate matmuls; the loader
+can create tensors from GGUF metadata or views of existing tensors, but not a
+new persistent derived concatenated weight. A correct implementation would need
+a general derived-weight allocation/materialization path across mmap, offload,
+split buffers, and MTP blocks. Do not implement a Qwen-only loader hack, and do
+not fall back to graph-time `ggml_concat()`. After Phase 43 there is no remaining
+low-conflict GB10 shortcut justified by current evidence; future work is either
+a larger kernel/loader design or a hardware-pivot benchmark, still gated by
+MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` and KL if md5 changes.

 ---

--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1081,12 +1081,20 @@ reviews:
  improved forced W4A16 only marginally and still did not beat default MMQ. Do
  not add another small W4A16 body/metadata tweak.

-The next small source candidate, if we stay on GB10, is the persistent/load-time
-F32 combined gate projection from Phase38/39: combine `ffn_gate_inp.weight` and
-`ffn_gate_inp_shexp.weight` once, issue one F32 gate matmul, and split/view the
-outputs. It must be default-off, avoid graph-time `ggml_concat()`, and pass
-MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID`; if md5 changes, run KL before any
-serving benchmark.
+Phase 43 checked the Phase 38/39 combined-gate candidate against the Qwen35MoE
+model-loader path and rejected it as a small shortcut. `ffn_gate_inp.weight` and
+`ffn_gate_inp_shexp.weight` are separate GGUF tensors consumed by separate graph
+matmuls; `create_tensor(...)` only materializes tensors from GGUF metadata, and
+`create_tensor_as_view(...)` can view existing tensors but cannot create a new
+persistent concatenated derived weight. A correct load-time combined gate would
+need a general derived-weight allocation/materialization path across mmap,
+offload, split buffers, and MTP blocks. Do not implement a Qwen-only loader hack,
+and do not fall back to graph-time `ggml_concat()`.
+
+The resulting GB10 state after Phase 43: no remaining low-conflict shortcut patch
+is justified by the current evidence. Future work needs either a larger funded
+kernel/loader design or a hardware-pivot benchmark, with the canonical
+MoE/dense md5, `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates.

 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.