diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index 3d74498d0..283075431 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2469,3 +2469,35 @@ Next target:
 - It must be default-off, fork-first, and validated with MoE/dense md5,
   `MUL_MAT`, `MUL_MAT_ID`, and KL if either md5 changes before any serving
   benchmark.
+
+## Phase 43 Persistent Gate Fusion Feasibility
+
+Phase 43 checked whether the Phase42 "small source candidate" can really be
+implemented as a low-conflict persistent/load-time combined gate tensor.
+
+Source facts:
+
+| path | finding |
+|------|---------|
+| `src/models/qwen35moe.cpp` | `ffn_gate_inp.weight` is loaded as `[n_embd, n_expert]`; `ffn_gate_inp_shexp.weight` is loaded separately as `[n_embd]` |
+| `src/models/qwen35moe.cpp` | the routed gate is consumed inside `build_moe_ffn(...)`; the shared-expert gate is consumed later as a separate `build_lora_mm(ffn_gate_inp_shexp, cur)` |
+| `src/llama-model-loader.cpp` | `create_tensor(...)` duplicates tensors from GGUF metadata and allocates backend buffers before `load_all_data(...)`; it has `create_tensor_as_view(...)` for views of existing GGUF tensors, not for new persistent derived tensors |
+| `src/llama-model.cpp` | backend buffers are allocated from loader contexts before tensor data is loaded; adding a new persistent derived weight requires a new derived-weight allocation/materialization path, not a local Qwen graph change |
+
+Decision:
+
+- Reject persistent/load-time fused gate projection as a "small" GB10 shortcut.
+  It is only low-conflict if the combined weight already exists in the GGUF, or
+  if llama.cpp gains a general derived-weight facility. Neither is true in the
+  current fork.
+- Do not fall back to graph-time `ggml_concat()`; Phase39 already rejected that
+  because `concat_layout` is measurable in serving.
+- Do not implement a Qwen-only loader hack that reads both tensors back to host,
+  allocates an extra backend weight buffer, and patches layer pointers after
+  load. That is high conflict surface for a gate-only SGEMM bucket and would need
+  new lifetime/state-management tests across mmap, offload, split buffers, and
+  MTP blocks.
+- The remaining GB10 parity work is no longer a shortcut patch. It is either a
+  larger funded kernel/loader effort with its own design, or a hardware pivot
+  benchmark. Any future implementation still needs the canonical MoE/dense md5,
+  `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates before benchmarking.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index f6d0cf92c..4a8a3b187 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -541,12 +541,21 @@ regressed TTFT/end-to-end throughput.
 Phase 42 reconciles the target list after parallel read-only review. D1 is
 closed on the current GB10 path; GDN low-conflict work is exhausted after
 `0046`/`0047` plus the rejected C32/QS-early/Global-Ai32 follow-ups; W4A16/GEMM
-micro-tweaks are exhausted after `0033`-`0035` and `0048`-`0050`. The next small
-GB10 source candidate is the Phase38/39 persistent/load-time F32 combined gate
-projection: combine `ffn_gate_inp.weight` and `ffn_gate_inp_shexp.weight` once,
-run one F32 gate matmul, split/view outputs, default-off, no graph-time
-`ggml_concat()`, and gate with MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` before
-benchmarking. If md5 changes, run KL first.
+micro-tweaks are exhausted after `0033`-`0035` and `0048`-`0050`. It nominated
+the Phase38/39 persistent/load-time F32 combined gate projection as the last
+small GB10 source candidate.
+
+Phase 43 rejects that gate-fusion candidate as a small shortcut after source
+inspection. `ffn_gate_inp.weight` and `ffn_gate_inp_shexp.weight` are separate
+GGUF tensors; the Qwen35MoE graph consumes them in separate matmuls; the loader
+can create tensors from GGUF metadata or views of existing tensors, but not a
+new persistent derived concatenated weight. A correct implementation would need
+a general derived-weight allocation/materialization path across mmap, offload,
+split buffers, and MTP blocks. Do not implement a Qwen-only loader hack, and do
+not fall back to graph-time `ggml_concat()`. After Phase43 there is no remaining
+low-conflict GB10 shortcut justified by current evidence; future work is either
+a larger kernel/loader design or a hardware-pivot benchmark, still gated by
+MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` and KL if md5 changes.
 
 ---
 
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 0f7a4113e..4183d89db 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1081,12 +1081,20 @@ reviews:
   improved forced W4A16 only marginally and still did not beat default MMQ. Do
   not add another small W4A16 body/metadata tweak.
 
-The next small source candidate, if we stay on GB10, is the persistent/load-time
-F32 combined gate projection from Phase38/39: combine `ffn_gate_inp.weight` and
-`ffn_gate_inp_shexp.weight` once, issue one F32 gate matmul, and split/view the
-outputs. It must be default-off, avoid graph-time `ggml_concat()`, and pass
-MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID`; if md5 changes, run KL before any
-serving benchmark.
+Phase 43 then checked that candidate against the actual Qwen35MoE model-loader
+path and rejected it as a small shortcut. `ffn_gate_inp.weight` and
+`ffn_gate_inp_shexp.weight` are separate GGUF tensors consumed by separate graph
+matmuls; `create_tensor(...)` only materializes tensors from GGUF metadata, and
+`create_tensor_as_view(...)` can view existing tensors but cannot create a new
+persistent concatenated derived weight. A correct load-time combined gate would
+need a general derived-weight allocation/materialization path across mmap,
+offload, split buffers, and MTP blocks. Do not implement a Qwen-only loader hack,
+and do not fall back to graph-time `ggml_concat()`.
+
+The resulting GB10 state after Phase43: no remaining low-conflict shortcut patch
+is justified by the current evidence. Future work needs either a larger funded
+kernel/loader design or a hardware-pivot benchmark, with the canonical
+MoE/dense md5, `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates.
 
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
diff --git a/docs/superpowers/plans/2026-07-01-persistent-gate-fusion-phase43.md b/docs/superpowers/plans/2026-07-01-persistent-gate-fusion-phase43.md
new file mode 100644
index 000000000..634ce2091
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-persistent-gate-fusion-phase43.md
@@ -0,0 +1,108 @@
+# Persistent Gate Fusion Phase43 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Determine whether the Phase42 persistent/load-time F32 combined gate projection can be implemented as a low-conflict GB10 shortcut.
+
+**Architecture:** Inspect the Qwen35MoE tensor load and graph consumption paths, then decide whether to implement, reject, or rescope before source changes. This phase is a feasibility gate, not a production patch.
+
+**Tech Stack:** llama.cpp model loader, Qwen35MoE graph builder, GGUF tensor metadata, LocalAI parity docs.
+
+---
+
+### Task 1: Inspect Gate Tensor Source Paths
+
+**Files:**
+- Read: `/home/mudler/_git/llama.cpp/src/models/qwen35moe.cpp`
+- Read: `/home/mudler/_git/llama.cpp/src/llama-model-loader.cpp`
+- Read: `/home/mudler/_git/llama.cpp/src/llama-model.cpp`
+- Read: `/home/mudler/_git/llama.cpp/src/llama-model.h`
+
+- [x] **Step 1: Locate tensor creation**
+
+Observed in `src/models/qwen35moe.cpp`:
+
+```cpp
+layer.ffn_gate_inp = create_tensor(tn(LLM_TENSOR_FFN_GATE_INP, "weight", il), { n_embd, n_expert }, flags);
+layer.ffn_gate_inp_shexp = create_tensor(tn(LLM_TENSOR_FFN_GATE_INP_SHEXP, "weight", il), { n_embd }, flags);
+```
+
+- [x] **Step 2: Locate tensor consumption**
+
+Observed:
+
+```cpp
+build_moe_ffn(cur, model.layers[il].ffn_gate_inp, ...);
+ggml_tensor * shared_gate = build_lora_mm(model.layers[il].ffn_gate_inp_shexp, cur);
+```
+
+- [x] **Step 3: Locate loader support for persistent derived tensors**
+
+Observed:
+
+```text
+create_tensor(...) duplicates tensors from GGUF metadata.
+create_tensor_as_view(...) can create views of existing GGUF tensors.
+Backend buffers are allocated from loader contexts before load_all_data(...).
+No existing helper creates a new persistent derived weight from two already-loaded tensors.
+```
+
+### Task 2: Make Feasibility Decision
+
+**Files:**
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+- Create: `docs/superpowers/plans/2026-07-01-persistent-gate-fusion-phase43.md`
+
+- [x] **Step 1: Reject graph-time fallback**
+
+Decision:
+
+```text
+Do not use ggml_concat() at graph time; Phase39 already rejected it because concat_layout is measurable in serving.
+```
+
+- [x] **Step 2: Reject Qwen-only loader hack**
+
+Decision:
+
+```text
+Do not read both tensors back to host, allocate an extra backend weight buffer, and patch layer pointers after load.
+That would create high conflict surface across mmap, offload, split buffers, MTP blocks, and state lifetime.
+```
+
+- [x] **Step 3: Record no-go**
+
+Decision:
+
+```text
+Persistent/load-time fused gate projection is not a small GB10 shortcut.
+It requires either a GGUF-exported combined weight or a general derived-weight facility in llama.cpp.
+```
+
+### Task 3: Verify and Commit
+
+**Files:**
+- Modify: `docs/superpowers/plans/2026-07-01-persistent-gate-fusion-phase43.md`
+
+- [x] **Step 1: Verify docs**
+
+Run:
+
+```bash
+git diff --check
+git status --short
+```
+
+- [x] **Step 2: Commit**
+
+Run:
+
+```bash
+git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+    backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
+    backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+git add -f docs/superpowers/plans/2026-07-01-persistent-gate-fusion-phase43.md
+git commit -m "docs(paged): reject persistent gate fusion shortcut" -m "Assisted-by: Codex:gpt-5"
+```
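
The rejected candidate rests on a simple identity: stacking the routed gate rows and the shared-expert gate row into one matrix and running a single matmul yields the same logits as the two separate matmuls. The sketch below illustrates only that arithmetic on plain host vectors. It assumes a row-major layout with `n_expert` rows of length `n_embd` for `ffn_gate_inp.weight` and a single row for `ffn_gate_inp_shexp.weight`. `Weights`, `GateOut`, `matvec`, `fuse_gates`, and `fused_gate_forward` are hypothetical names chosen for illustration, not llama.cpp or ggml APIs, and nothing here addresses the loader, mmap, offload, or split-buffer work that Phase 43 identified as the real cost.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side weight matrix: `rows` rows, each of length n_embd,
// stored row-major. This is NOT a ggml tensor.
struct Weights {
    std::size_t rows;
    std::size_t n_embd;
    std::vector<float> data;  // rows * n_embd values
};

// Result of the fused gate: routed-expert logits plus the shared-expert logit.
struct GateOut {
    std::vector<float> routed;
    float shared;
};

// y[r] = dot(W[r, :], x), accumulated left to right.
std::vector<float> matvec(const Weights & w, const std::vector<float> & x) {
    assert(x.size() == w.n_embd);
    assert(w.data.size() == w.rows * w.n_embd);
    std::vector<float> y(w.rows, 0.0f);
    for (std::size_t r = 0; r < w.rows; ++r) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < w.n_embd; ++k) {
            acc += w.data[r * w.n_embd + k] * x[k];
        }
        y[r] = acc;
    }
    return y;
}

// Illustrative "load-time" step: append the shared-expert row after the
// routed rows so a single matvec covers both gates.
Weights fuse_gates(const Weights & routed, const Weights & shared) {
    assert(routed.n_embd == shared.n_embd);
    Weights fused{routed.rows + shared.rows, routed.n_embd, routed.data};
    fused.data.insert(fused.data.end(), shared.data.begin(), shared.data.end());
    return fused;
}

// One matvec over the fused weight, then split the output:
// the first n_expert entries are routed logits, the last is the shared logit.
GateOut fused_gate_forward(const Weights & fused, std::size_t n_expert,
                           const std::vector<float> & x) {
    assert(fused.rows == n_expert + 1);
    const std::vector<float> y = matvec(fused, x);
    const auto split = y.begin() + static_cast<std::ptrdiff_t>(n_expert);
    return GateOut{std::vector<float>(y.begin(), split), y[n_expert]};
}
```

With two routed experts and `n_embd = 3`, `fused_gate_forward(fuse_gates(routed, shared), 2, x)` returns routed logits equal to `matvec(routed, x)` and a shared logit equal to `matvec(shared, x)[0]`. The match is exact in this sketch because each output row is accumulated in the same order either way; a real backend kernel may accumulate differently, which is why the md5 and KL-if-md5-changes gates would still apply to any future implementation.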