mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): reject persistent gate fusion shortcut
Assisted-by: Codex:gpt-5
@@ -2469,3 +2469,35 @@ Next target:
 - It must be default-off, fork-first, and validated with MoE/dense md5,
 `MUL_MAT`, `MUL_MAT_ID`, and KL if either md5 changes before any serving
 benchmark.
+
+## Phase 43 Persistent Gate Fusion Feasibility
+
+Phase 43 checked whether the Phase42 "small source candidate" can really be
+implemented as a low-conflict persistent/load-time combined gate tensor.
+
+Source facts:
+
+| path | finding |
+|------|---------|
+| `src/models/qwen35moe.cpp` | `ffn_gate_inp.weight` is loaded as `[n_embd, n_expert]`; `ffn_gate_inp_shexp.weight` is loaded separately as `[n_embd]` |
+| `src/models/qwen35moe.cpp` | the routed gate is consumed inside `build_moe_ffn(...)`; the shared-expert gate is consumed later as a separate `build_lora_mm(ffn_gate_inp_shexp, cur)` |
+| `src/llama-model-loader.cpp` | `create_tensor(...)` duplicates tensors from GGUF metadata and allocates backend buffers before `load_all_data(...)`; it has `create_tensor_as_view(...)` for views of existing GGUF tensors, not for new persistent derived tensors |
+| `src/llama-model.cpp` | backend buffers are allocated from loader contexts before tensor data is loaded; adding a new persistent derived weight requires a new derived-weight allocation/materialization path, not a local Qwen graph change |
+
+Decision:
+
+- Reject persistent/load-time fused gate projection as a "small" GB10 shortcut.
+It is only low-conflict if the combined weight already exists in the GGUF, or
+if llama.cpp gains a general derived-weight facility. Neither is true in the
+current fork.
+- Do not fall back to graph-time `ggml_concat()`; Phase39 already rejected that
+because `concat_layout` is measurable in serving.
+- Do not implement a Qwen-only loader hack that reads both tensors back to host,
+allocates an extra backend weight buffer, and patches layer pointers after
+load. That is high conflict surface for a gate-only SGEMM bucket and would need
+new lifetime/state-management tests across mmap, offload, split buffers, and
+MTP blocks.
+- The remaining GB10 parity work is no longer a shortcut patch. It is either a
+larger funded kernel/loader effort with its own design, or a hardware pivot
+benchmark. Any future implementation still needs the canonical MoE/dense md5,
+`MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates before benchmarking.
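For readers unfamiliar with the rejected candidate, the arithmetic of a combined gate projection can be sketched as follows. This is a toy illustration in plain Python, not llama.cpp code: the helper names (`matvec`, `fuse_gate_weights`, `fused_gate`) and the tiny shapes are assumptions made up for the example. It only shows that stacking the shared-expert gate vector under the routed gate matrix and running one projection yields the same numbers as two separate projections; it says nothing about how such a tensor would be allocated or kept alive in the loader, which is the part Phase 43 found to be expensive.

```python
# Toy sketch (hypothetical helpers, not the llama.cpp API).
# Each weight is stored as a list of output vectors of length n_embd,
# mirroring the `[n_embd, n_expert]` and `[n_embd]` shapes in the table above.

def matvec(weight_rows, x):
    """Return one dot product per output vector in weight_rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]

def fuse_gate_weights(w_route, w_shexp):
    """Build the combined gate once: n_expert routed vectors plus one shared-expert vector."""
    return [list(row) for row in w_route] + [list(w_shexp)]

def fused_gate(w_fused, x, n_expert):
    """Run a single projection, then split it into routed logits and the shared gate."""
    out = matvec(w_fused, x)
    return out[:n_expert], out[n_expert]

# Illustrative sizes only; real values come from the model's GGUF metadata.
n_embd, n_expert = 4, 3
w_route = [
    [0.5, -1.0, 0.25, 2.0],
    [1.0, 0.0, -0.5, 0.75],
    [-2.0, 1.5, 1.0, 0.0],
]
w_shexp = [0.25, 0.5, -1.0, 1.0]
x = [1.0, 2.0, -1.0, 0.5]

# Current behaviour described in the table: two separate projections.
separate_logits = matvec(w_route, x)
separate_shared = matvec([w_shexp], x)[0]

# Rejected candidate: one projection over the combined weight.
w_fused = fuse_gate_weights(w_route, w_shexp)
fused_logits, fused_shared = fused_gate(w_fused, x, n_expert)
```

In this pure-Python version each output is the same dot product in both paths, so the results match exactly. On a real backend the wider matmul may be served by a different kernel or reduction order, which is one plausible reason the notes insist on md5 and KL gates before any benchmark.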
@@ -541,12 +541,21 @@ regressed TTFT/end-to-end throughput.
 Phase 42 reconciles the target list after parallel read-only review. D1 is
 closed on the current GB10 path; GDN low-conflict work is exhausted after
 `0046`/`0047` plus the rejected C32/QS-early/Global-Ai32 follow-ups; W4A16/GEMM
-micro-tweaks are exhausted after `0033`-`0035` and `0048`-`0050`. The next small
-GB10 source candidate is the Phase38/39 persistent/load-time F32 combined gate
-projection: combine `ffn_gate_inp.weight` and `ffn_gate_inp_shexp.weight` once,
-run one F32 gate matmul, split/view outputs, default-off, no graph-time
-`ggml_concat()`, and gate with MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` before
-benchmarking. If md5 changes, run KL first.
+micro-tweaks are exhausted after `0033`-`0035` and `0048`-`0050`. It nominated
+the Phase38/39 persistent/load-time F32 combined gate projection as the last
+small GB10 source candidate.
+
+Phase 43 rejects that gate-fusion candidate as a small shortcut after source
+inspection. `ffn_gate_inp.weight` and `ffn_gate_inp_shexp.weight` are separate
+GGUF tensors; the Qwen35MoE graph consumes them in separate matmuls; the loader
+can create tensors from GGUF metadata or views of existing tensors, but not a
+new persistent derived concatenated weight. A correct implementation would need
+a general derived-weight allocation/materialization path across mmap, offload,
+split buffers, and MTP blocks. Do not implement a Qwen-only loader hack, and do
+not fall back to graph-time `ggml_concat()`. After Phase43 there is no remaining
+low-conflict GB10 shortcut justified by current evidence; future work is either
+a larger kernel/loader design or a hardware-pivot benchmark, still gated by
+MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` and KL if md5 changes.
 
 ---
 
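The ordering of the validation gates mentioned in the hunk above (md5 first, `MUL_MAT`/`MUL_MAT_ID`, KL only if an md5 changes, benchmark last) can be sketched as a small decision function. Everything here is an assumption for illustration: the function name `next_step`, the idea that the md5 is taken over deterministic generated text, and the shape of the inputs are not specified in this excerpt, and the real checks are run by whatever tooling the fork uses.

```python
import hashlib

def md5_of(text: str) -> str:
    """Hex md5 of a model's generated text (assumed to be deterministic output)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def next_step(baseline, candidate, op_results, kl_within_budget=None):
    """Decide what a candidate patch is allowed to do next.

    baseline / candidate: dicts with 'moe' and 'dense' generated text (hypothetical layout).
    op_results: dict with pass/fail booleans for 'MUL_MAT' and 'MUL_MAT_ID'.
    kl_within_budget: None if KL has not been run yet, else True/False.
    """
    # Operator-level correctness is required regardless of md5.
    if not (op_results.get("MUL_MAT", False) and op_results.get("MUL_MAT_ID", False)):
        return "reject: op tests failed"

    # Compare output fingerprints for both the MoE and the dense model.
    changed = [
        kind for kind in ("moe", "dense")
        if md5_of(baseline[kind]) != md5_of(candidate[kind])
    ]
    if not changed:
        return "benchmark"

    # An md5 change does not reject the patch by itself, but KL must run first.
    if kl_within_budget is None:
        return "run KL first"
    return "benchmark" if kl_within_budget else "reject: KL regression"
```

A caller would invoke `next_step` once after the md5 and operator checks, and again with `kl_within_budget` filled in if the first answer was `"run KL first"`.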
@@ -1081,12 +1081,20 @@ reviews:
 improved forced W4A16 only marginally and still did not beat default MMQ. Do
 not add another small W4A16 body/metadata tweak.
 
-The next small source candidate, if we stay on GB10, is the persistent/load-time
-F32 combined gate projection from Phase38/39: combine `ffn_gate_inp.weight` and
-`ffn_gate_inp_shexp.weight` once, issue one F32 gate matmul, and split/view the
-outputs. It must be default-off, avoid graph-time `ggml_concat()`, and pass
-MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID`; if md5 changes, run KL before any
-serving benchmark.
+Phase 43 then checked that candidate against the actual Qwen35MoE model-loader
+path and rejected it as a small shortcut. `ffn_gate_inp.weight` and
+`ffn_gate_inp_shexp.weight` are separate GGUF tensors consumed by separate graph
+matmuls; `create_tensor(...)` only materializes tensors from GGUF metadata, and
+`create_tensor_as_view(...)` can view existing tensors but cannot create a new
+persistent concatenated derived weight. A correct load-time combined gate would
+need a general derived-weight allocation/materialization path across mmap,
+offload, split buffers, and MTP blocks. Do not implement a Qwen-only loader hack,
+and do not fall back to graph-time `ggml_concat()`.
+
+The resulting GB10 state after Phase43: no remaining low-conflict shortcut patch
+is justified by the current evidence. Future work needs either a larger funded
+kernel/loader design or a hardware-pivot benchmark, with the canonical
+MoE/dense md5, `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates.
 
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
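The "KL-if-md5-changes" gate is named throughout these notes but not defined in this excerpt. As a rough illustration of the quantity involved, the sketch below computes the KL divergence between a baseline and a candidate next-token distribution from raw logits, and averages it over positions. The function names, the averaging, and the tiny floor used to avoid division by zero are assumptions for the example; the actual tool, metric aggregation, and acceptance threshold used by the fork are not stated here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(baseline_logits, candidate_logits):
    """KL(baseline || candidate) in nats for a single token position."""
    p = softmax(baseline_logits)
    q = softmax(candidate_logits)
    tiny = 1e-300  # assumed floor so a zero candidate probability cannot divide by zero
    return sum(
        pi * math.log(pi / max(qi, tiny))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )

def mean_kl(baseline_rows, candidate_rows):
    """Average per-position KL over a sequence of logit rows."""
    if len(baseline_rows) != len(candidate_rows) or not baseline_rows:
        raise ValueError("need the same non-zero number of positions in both runs")
    values = [kl_divergence(b, c) for b, c in zip(baseline_rows, candidate_rows)]
    return sum(values) / len(values)
```

A candidate whose logits match the baseline yields zero divergence; any reordering of probability mass yields a positive value that would then be compared against a project-chosen budget.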