mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): scope gate projection policy
Assisted-by: Codex:gpt-5
@@ -2214,3 +2214,52 @@ Decision:
- Do not blindly force these calls to BF16. First inspect the model-load tensor
  types for `ffn_gate_inp*`; if changing weight dtype or graph routing is
  considered, require md5/op gates and KL validation.

## Phase 38 Gate Projection Policy

Phase 38 is a safety and scope checkpoint before any `ffn_gate_inp*` route
change. It makes the reusable inference gate stricter by default and records why
the Phase 37 SGEMM bucket should not be treated as a missed BF16 route.

Artifact:

- `/home/mudler/bench/phase38_gate_baseline/20260701_084410`

Preflight:

| check | actual |
|-------|--------|
| GPU | `NVIDIA GB10, 580.159.03` |
| docker containers | `0` |
| `local-ai-worker` containers | `0` |
| GPU compute apps | `0` |
| GPU lock owner | `FREE phase33-small-m-tile-policy-done 1782883234` |

Fresh baseline gates against the current Phase 37 build:

| check | status | actual |
|-------|--------|--------|
| MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
| dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
| `MUL_MAT` | ok | `1146/1146` |
| `MUL_MAT_ID` | ok | `806/806` |
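
The pass criteria in the table above can be written as two small checks. This is a minimal sketch and not the logic of `paged-inference-gates.sh`: it assumes the md5 is taken over the raw bytes of a deterministic completion output (the source does not spell out what is hashed) and that an op gate passes only when every test passes. The names `md5_matches` and `ops_all_passed` are hypothetical.

```python
import hashlib


def md5_matches(output: bytes, expected_hex: str) -> bool:
    """Return True when the md5 of `output` equals the pinned digest."""
    return hashlib.md5(output).hexdigest() == expected_hex.lower()


def ops_all_passed(summary: str) -> bool:
    """Parse a 'passed/total' summary such as '1146/1146'.

    Assumption: the gate is green only when passed == total and total > 0.
    """
    passed, total = (int(part) for part in summary.split("/"))
    return total > 0 and passed == total
```

Under these assumptions, `ops_all_passed("1146/1146")` and `ops_all_passed("806/806")` both hold, while a single failing op such as `"805/806"` fails the gate.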

Source comparison:

- `qwen35moe.cpp` creates `ffn_gate_inp.weight` as `[n_embd, n_expert]` and
  `ffn_gate_inp_shexp.weight` as `[n_embd]`.
- `llama-graph.cpp` computes router logits with `build_lora_mm(gate_inp, cur)`
  and labels the result `ffn_moe_logits`.
- vLLM Qwen3-Next constructs both gates as `ReplicatedLinear(...,
  quant_config=None)`, and its fused-MoE runner can concatenate router and
  shared-expert gate weights for one fused-gate forward path.

Decision:

- The `sgemm` bucket is router/shared-expert gate math kept unquantized by both
  engines. It is expected F32 policy, not an accidental cuBLAS fallback.
- Do not force BF16 or NVFP4 for `ffn_gate_inp*`.
- A future optimization can test a default-off fused gate projection that
  preserves F32 math and split semantics. Gate it with MoE/dense md5,
  `MUL_MAT`, and `MUL_MAT_ID` before any serving benchmark; if either md5
  changes, run KL validation first.
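
The fused gate projection named in the last bullet can be illustrated as follows. This is a sketch of the idea only, under stated assumptions: it uses plain Python lists and Python floats (F64, not F32), treats the router weight as `n_expert` rows of length `n_embd` and the shared-expert gate as one row of length `n_embd`, and the function names are hypothetical. It is not code from llama.cpp or vLLM.

```python
def matvec(rows, x):
    """Multiply a matrix stored as a list of rows by the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]


def split_gates(router_w, shexp_w, x):
    """Reference path: two separate projections."""
    router_logits = matvec(router_w, x)
    shared_gate = matvec([shexp_w], x)[0]
    return router_logits, shared_gate


def fused_gates(router_w, shexp_w, x):
    """Fused path: stack the shared-expert row under the router rows,
    run one projection, then split the output back apart."""
    n_expert = len(router_w)
    out = matvec(router_w + [shexp_w], x)
    return out[:n_expert], out[n_expert]


# Toy shapes: n_expert = 2, n_embd = 3 (illustrative values only).
router_w = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
shexp_w = [0.1, 0.2, 0.3]
x = [1.0, 2.0, 3.0]
assert fused_gates(router_w, shexp_w, x) == split_gates(router_w, shexp_w, x)
```

In this toy version the two paths agree exactly because each output element is the same dot product evaluated in the same order. A real GPU kernel may tile or accumulate differently once the output dimension changes, which is why the md5 and op gates still have to run on any fused implementation.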
@@ -64,7 +64,7 @@ A lever compiled into the binary is **NOT** isolated by a runtime flag alone. It
- **Always update the fork FIRST, in this exact order:** (1) commit the change on the `localai-paged` branch and **push it**, then (2) regenerate the LocalAI series (`backend/cpp/llama-cpp-localai-paged/patches/paged/`) from the fork via `git format-patch` (one patch per fork commit, source-only, never touching a `*.md`/dev-doc), so the series stays a **1:1, drift-free mirror** of the branch. No hand-export.
- **NEVER edit the LocalAI `patches/paged/*.patch` files directly**, and **NEVER add a patch to the series with no corresponding fork-branch commit.** They are generated output, not source.
- The fork branch is also **where the build and the per-path bit-exact md5 gate actually run**, so it is the **only** place a change is truly validated. A patch that lives only in the LocalAI series has never been built or gated.
- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. Current verified state: fork HEAD `c78e537b5` is mirrored by worktree patch `0057-feat-cuda-trace-moe-mmq-launch-shapes.patch`; applying all `48` patch files on `0ed235ea2c17a19fc8238668653946721ed136fd` produces tree `4eae628e4ba6f2defa14a19d19f7e4abef9a2647`, exactly matching the fork.
- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. Current verified state: fork HEAD `2d590d770` is mirrored by worktree patch `0063-feat-cuda-trace-cublas-tensor-names.patch`; applying all `54` patch files on `0ed235ea2c17a19fc8238668653946721ed136fd` produces tree `dedb1182910eafe9f6875588dc8285bfb544cce5`, exactly matching the fork.

### 2.6 Bench hygiene gates
- **NEVER set `LLAMA_MAX_BATCH_TOKENS` in benches** (the harness explicitly logs "NO LLAMA_MAX_BATCH_TOKENS").
@@ -490,6 +490,18 @@ bucket is `blk.N.ffn_gate_inp.weight -> ffn_moe_logits-N` and
`blk.N.ffn_gate_inp_shexp.weight -> shared_expert_gate-N`; do not force BF16
without first inspecting model-load tensor types and running KL validation.

Phase 38 is the current gate-projection policy checkpoint. Artifact:
`/home/mudler/bench/phase38_gate_baseline/20260701_084410`. Preflight showed
docker `0`, `local-ai-worker` `0`, compute apps `0`, and GB10 driver
`580.159.03`. Fresh baseline gates against the Phase 37 build passed: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
`806/806`. Source comparison found llama.cpp and vLLM both keep router and
shared-expert gate weights unquantized; vLLM's relevant idea is fused F32 gate
weight concatenation, not BF16/NVFP4 routing. Future fused-gate work must be
default-off, preserve F32 semantics, and pass md5/op gates before benchmarking;
if md5 changes, run KL first.

---

## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -520,7 +532,9 @@ are now done:
- Phase 22 re-verified the patch-series mirror invariant after `0055`.

For future release checks, run `paged-inference-gates.sh` and
`paged-current-serving-snapshot.sh` from the LocalAI backend tree.
`paged-current-serving-snapshot.sh` from the LocalAI backend tree. The inference
gate now defaults to both `MUL_MAT` and `MUL_MAT_ID`; set `OPS=` only for a
focused diagnostic run.

### (b) Datacenter-Blackwell pivot (THE real parity path)
The thesis: every vLLM advantage that wins on GB10 is a kernel that is **broken or capped on consumer Blackwell** and **inverts on datacenter Blackwell** (B200): FLA blocked-solve GDN, Marlin/CUTLASS grouped FP4, HBM-tuned full-cudagraph decode, native tcgen05/TMEM. ~8 TB/s HBM lifts the LPDDR5x GDN bandwidth floor ~30x. Concrete first steps:
@@ -567,6 +581,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase35_mul_mat_route_trace/20260701_074359` - default-off regular MUL_MAT route trace patch `0061`; default/trace/post-serving md5 gates green; n128 route trace found BF16 `mat_f=2485`, `op_cublas=1330`.
- `~/bench/phase36_cublas_route_trace/20260701_081228` - default-off cuBLAS subroute trace patch `0062`; default/trace/post-serving md5 and op gates green; n128 route trace found `bf16_tc=5681`, `sgemm=2511`.
- `~/bench/phase37_cublas_name_trace/20260701_083227` - cuBLAS tensor-name trace patch `0063`; default/trace/post-serving md5 and op gates green; n128 trace identified `sgemm` as MoE gate logits and shared-expert gate projections.
- `~/bench/phase38_gate_baseline/20260701_084410` - baseline of the current Phase 37 build before gate-projection policy work; docker/local-ai-worker/GPU idle preflight green; MoE/dense md5 green; `MUL_MAT` `1146/1146`; `MUL_MAT_ID` `806/806`.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
@@ -976,6 +976,26 @@ Lever implication: do not blindly force the `sgemm` bucket to BF16. First inspec
why `ffn_gate_inp*` loads as F32 and whether a dtype or graph-route change is
precision-safe. If attempted, use md5/op gates plus KL validation.

### Phase 38 gate projection policy

Phase 38 re-ran the safety gate on the current Phase 37 build before changing
policy: artifact `/home/mudler/bench/phase38_gate_baseline/20260701_084410`, MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
`806/806`.

Source check: llama.cpp's Qwen35MoE graph uses `ffn_gate_inp.weight` for
`ffn_moe_logits` and `ffn_gate_inp_shexp.weight` for `shared_expert_gate`. vLLM
Qwen3-Next also constructs those gates with `quant_config=None`; the relevant
vLLM idea is not reduced precision, but concatenating router and shared-expert
gate weights in the fused-MoE runner when shared-expert fusion is active.

Lever implication: keep `ffn_gate_inp*` as inference-critical F32 policy. A
future low-conflict experiment may test a default-off fused F32 gate projection
that computes both logits in one matmul and splits the output, but it must pass
MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates before benchmarking. If md5
changes, run the KL gate first and reject on any KL regression.
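
For orientation, a KL check of the kind referenced above can be sketched as below. The source does not define how its KL gate is computed, so everything here is an assumption: KL is taken per token between a reference and a candidate logit vector, in nats, and "reject on any KL regression" is read as "the candidate's KL must not exceed the baseline's". The names `kl_divergence` and `kl_gate` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) for one token's logits, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def kl_gate(baseline_kl, candidate_kl):
    """Assumed policy: pass only if the candidate KL is no worse than baseline."""
    return candidate_kl <= baseline_kl
```

Working in log space avoids dividing by a probability that has underflowed to zero, which matters when the two logit vectors disagree strongly on low-probability tokens.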

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update
@@ -12,7 +12,7 @@ Environment:
MOE        MoE GGUF path (default: ~/bench/q36-35b-a3b-nvfp4.gguf)
DENSE      Dense GGUF path (default: ~/bench/q36-27b-nvfp4.gguf)
ART        artifact dir (default: ~/bench/paged_inference_gates/<timestamp>)
OPS        comma-separated test-backend-ops filters (default: MUL_MAT_ID)
OPS        comma-separated test-backend-ops filters (default: MUL_MAT,MUL_MAT_ID)
EXTRA_ENV  extra env assignments for completion gates, e.g. "GDN_TC=5"

Expected md5:
@@ -28,7 +28,7 @@ DENSE_MD5_EXPECTED=5951a5b4d624ce891e22ab5fca9bc439
BIN=${BIN:-"$HOME/llama-phase6-source/build-cuda/bin"}
MOE=${MOE:-"$HOME/bench/q36-35b-a3b-nvfp4.gguf"}
DENSE=${DENSE:-"$HOME/bench/q36-27b-nvfp4.gguf"}
OPS=${OPS:-MUL_MAT_ID}
OPS=${OPS:-MUL_MAT,MUL_MAT_ID}
ART=${ART:-"$HOME/bench/paged_inference_gates/$(date +%Y%m%d_%H%M%S)"}
EXTRA_ENV=${EXTRA_ENV:-}
