docs(paged): scope gate projection policy

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 06:49:20 +00:00
parent 9f75da01f9
commit 5354adcffb
5 changed files with 252 additions and 4 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2214,3 +2214,52 @@ Decision:
 - Do not blindly force these calls to BF16. First inspect the model-load tensor
  types for `ffn_gate_inp*`; if changing weight dtype or graph routing is
  considered, require md5/op gates and KL validation.
+
+## Phase 38 Gate Projection Policy
+
+Phase 38 is a safety and scope checkpoint before any `ffn_gate_inp*` route
+change. It makes the reusable inference gate stricter by default and records why
+the Phase 37 SGEMM bucket should not be treated as a missed BF16 route.
+
+Artifact:
+
+- `/home/mudler/bench/phase38_gate_baseline/20260701_084410`
+
+Preflight:
+
+| check | actual |
+|-------|--------|
+| GPU | `NVIDIA GB10, 580.159.03` |
+| docker containers | `0` |
+| `local-ai-worker` containers | `0` |
+| GPU compute apps | `0` |
+| GPU lock owner | `FREE phase33-small-m-tile-policy-done 1782883234` |
+
+Fresh baseline gates against the current Phase 37 build:
+
+| check | status | actual |
+|-------|--------|--------|
+| MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| `MUL_MAT` | ok | `1146/1146` |
+| `MUL_MAT_ID` | ok | `806/806` |
+
+Source comparison:
+
+- `qwen35moe.cpp` creates `ffn_gate_inp.weight` as `[n_embd, n_expert]` and
+  `ffn_gate_inp_shexp.weight` as `[n_embd]`.
+- `llama-graph.cpp` computes router logits with `build_lora_mm(gate_inp, cur)`
+  and labels the result `ffn_moe_logits`.
+- vLLM Qwen3-Next constructs both gates as `ReplicatedLinear(...,
+  quant_config=None)`, and its fused-MoE runner can concatenate router and
+  shared-expert gate weights for one fused-gate forward path.
+
+Decision:
+
+- The `sgemm` bucket is router/shared-expert gate math kept unquantized by both
+  engines. It is expected F32 policy, not an accidental cuBLAS fallback.
+- Do not force BF16 or NVFP4 for `ffn_gate_inp*`.
+- A future optimization can test a default-off fused gate projection that
+  preserves F32 math and split semantics. Before any serving benchmark, gate
+  it with MoE/dense md5, `MUL_MAT`, and `MUL_MAT_ID`; if either md5 changes,
+  also require KL validation.
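As a rough illustration of the Decision above, the gating order for a future fused-gate experiment could be sketched in shell as follows. The helper `gate_decision` and its result strings are hypothetical and not part of `paged-inference-gates.sh`; `MOE_MD5_EXPECTED` is an assumed variable name, and the md5 values are the Phase 38 baselines recorded above.

```shell
# Hypothetical sketch: decide what a candidate build may do next.
# The expected values are the Phase 38 baseline md5s from this document.
MOE_MD5_EXPECTED=8cb0ce23777bf55f92f63d0292c756b0   # assumed variable name
DENSE_MD5_EXPECTED=5951a5b4d624ce891e22ab5fca9bc439

# gate_decision MOE_MD5 DENSE_MD5 OPS_STATUS
# Prints one of: reject-ops, run-kl-first, benchmark-ok.
gate_decision() {
  moe_md5=$1
  dense_md5=$2
  ops_status=$3   # "ok" only if both MUL_MAT and MUL_MAT_ID passed
  if [ "$ops_status" != "ok" ]; then
    echo "reject-ops"
  elif [ "$moe_md5" != "$MOE_MD5_EXPECTED" ] ||
       [ "$dense_md5" != "$DENSE_MD5_EXPECTED" ]; then
    # Any md5 drift means KL validation must run before benchmarking.
    echo "run-kl-first"
  else
    echo "benchmark-ok"
  fi
}
```

A caller would pass the md5s and op status observed for the candidate build, and would start a serving benchmark only on `benchmark-ok`.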
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -64,7 +64,7 @@ A lever compiled into the binary is **NOT** isolated by a runtime flag alone. It
 - **Always update the fork FIRST, in this exact order:** (1) commit the change on the `localai-paged` branch and **push it**, then (2) regenerate the LocalAI series (`backend/cpp/llama-cpp-localai-paged/patches/paged/`) from the fork via `git format-patch` (one patch per fork commit, source-only, never touching a `*.md`/dev-doc), so the series stays a **1:1, drift-free mirror** of the branch. No hand-export.
 - **NEVER edit the LocalAI `patches/paged/*.patch` files directly**, and **NEVER add a patch to the series with no corresponding fork-branch commit.** They are generated output, not source.
 - The fork branch is also **where the build and the per-path bit-exact md5 gate actually run**, so it is the **only** place a change is truly validated. A patch that lives only in the LocalAI series has never been built or gated.
-- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. Current verified state: fork HEAD `c78e537b5` is mirrored by worktree patch `0057-feat-cuda-trace-moe-mmq-launch-shapes.patch`; applying all `48` patch files on `0ed235ea2c17a19fc8238668653946721ed136fd` produces tree `4eae628e4ba6f2defa14a19d19f7e4abef9a2647`, exactly matching the fork.
+- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. Current verified state: fork HEAD `2d590d770` is mirrored by worktree patch `0063-feat-cuda-trace-cublas-tensor-names.patch`; applying all `54` patch files on `0ed235ea2c17a19fc8238668653946721ed136fd` produces tree `dedb1182910eafe9f6875588dc8285bfb544cce5`, exactly matching the fork.

 ### 2.6 Bench hygiene gates
 - **NEVER set `LLAMA_MAX_BATCH_TOKENS` in benches** (the harness explicitly logs "NO LLAMA_MAX_BATCH_TOKENS").
@@ -490,6 +490,18 @@ bucket is `blk.N.ffn_gate_inp.weight -> ffn_moe_logits-N` and
 `blk.N.ffn_gate_inp_shexp.weight -> shared_expert_gate-N`; do not force BF16
 without first inspecting model-load tensor types and running KL validation.

+Phase 38 is the current gate-projection policy checkpoint. Artifact:
+`/home/mudler/bench/phase38_gate_baseline/20260701_084410`. Preflight showed
+docker `0`, `local-ai-worker` `0`, compute apps `0`, and GB10 driver
+`580.159.03`. Fresh baseline gates against the Phase 37 build passed: MoE
+`8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
+`806/806`. Source comparison found llama.cpp and vLLM both keep router and
+shared-expert gate weights unquantized; vLLM's relevant idea is fused F32 gate
+weight concatenation, not BF16/NVFP4 routing. Future fused-gate work must be
+default-off, preserve F32 semantics, and pass md5/op gates before benchmarking;
+if md5 changes, run KL first.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -520,7 +532,9 @@ are now done:
 - Phase 22 re-verified the patch-series mirror invariant after `0055`.

 For future release checks, run `paged-inference-gates.sh` and
-`paged-current-serving-snapshot.sh` from the LocalAI backend tree.
+`paged-current-serving-snapshot.sh` from the LocalAI backend tree. The inference
+gate now defaults to both `MUL_MAT` and `MUL_MAT_ID`; override `OPS` (for
+example `OPS=MUL_MAT_ID`) only for a focused diagnostic run.

 ### (b) Datacenter-Blackwell pivot (THE real parity path)
 The thesis: every vLLM advantage that wins on GB10 is a kernel that is **broken or capped on consumer Blackwell** and **inverts on datacenter Blackwell** (B200): FLA blocked-solve GDN, Marlin/CUTLASS grouped FP4, HBM-tuned full-cudagraph decode, native tcgen05/TMEM. ~8 TB/s HBM lifts the LPDDR5x GDN bandwidth floor ~30x. Concrete first steps:
@@ -567,6 +581,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase35_mul_mat_route_trace/20260701_074359` - default-off regular MUL_MAT route trace patch `0061`; default/trace/post-serving md5 gates green; n128 route trace found BF16 `mat_f=2485`, `op_cublas=1330`.
 - `~/bench/phase36_cublas_route_trace/20260701_081228` - default-off cuBLAS subroute trace patch `0062`; default/trace/post-serving md5 and op gates green; n128 route trace found `bf16_tc=5681`, `sgemm=2511`.
 - `~/bench/phase37_cublas_name_trace/20260701_083227` - cuBLAS tensor-name trace patch `0063`; default/trace/post-serving md5 and op gates green; n128 trace identified `sgemm` as MoE gate logits and shared-expert gate projections.
+- `~/bench/phase38_gate_baseline/20260701_084410` - current Phase 37 build baseline before gate-projection policy work; docker/local-ai-worker/GPU idle preflight green; MoE/dense md5 green; `MUL_MAT` `1146/1146`; `MUL_MAT_ID` `806/806`.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -976,6 +976,26 @@ Lever implication: do not blindly force the `sgemm` bucket to BF16. First inspec
 why `ffn_gate_inp*` loads as F32 and whether a dtype or graph-route change is
 precision-safe. If attempted, use md5/op gates plus KL validation.

+### Phase 38 gate projection policy
+
+Phase 38 re-ran the safety gate on the current Phase 37 build before changing policy:
+artifact `/home/mudler/bench/phase38_gate_baseline/20260701_084410`, MoE
+`8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
+`806/806`.
+
+Source check: llama.cpp's Qwen35MoE graph uses `ffn_gate_inp.weight` for
+`ffn_moe_logits` and `ffn_gate_inp_shexp.weight` for `shared_expert_gate`. vLLM
+Qwen3-Next also constructs those gates with `quant_config=None`; the relevant
+vLLM idea is not reduced precision, but concatenating router and shared-expert
+gate weights in the fused-MoE runner when shared-expert fusion is active.
+
+Lever implication: keep `ffn_gate_inp*` as inference-critical F32 policy. A
+future low-conflict experiment may test a default-off fused F32 gate projection
+that computes both logits in one matmul and splits the output, but it must pass
+MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates before benchmarking. If md5
+changes, run the KL gate first and reject on any KL regression.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update
--- a/backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh
+++ b/backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh
@@ -12,7 +12,7 @@ Environment:
  MOE        MoE GGUF path (default: ~/bench/q36-35b-a3b-nvfp4.gguf)
  DENSE      Dense GGUF path (default: ~/bench/q36-27b-nvfp4.gguf)
  ART        artifact dir (default: ~/bench/paged_inference_gates/<timestamp>)
-  OPS        comma-separated test-backend-ops filters (default: MUL_MAT_ID)
+  OPS        comma-separated test-backend-ops filters (default: MUL_MAT,MUL_MAT_ID)
  EXTRA_ENV  extra env assignments for completion gates, e.g. "GDN_TC=5"

 Expected md5:
@@ -28,7 +28,7 @@ DENSE_MD5_EXPECTED=5951a5b4d624ce891e22ab5fca9bc439
 BIN=${BIN:-"$HOME/llama-phase6-source/build-cuda/bin"}
 MOE=${MOE:-"$HOME/bench/q36-35b-a3b-nvfp4.gguf"}
 DENSE=${DENSE:-"$HOME/bench/q36-27b-nvfp4.gguf"}
-OPS=${OPS:-MUL_MAT_ID}
+OPS=${OPS:-MUL_MAT,MUL_MAT_ID}
 ART=${ART:-"$HOME/bench/paged_inference_gates/$(date +%Y%m%d_%H%M%S)"}
 EXTRA_ENV=${EXTRA_ENV:-}
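The effect of the new `OPS` default can be sketched on its own. This is a minimal illustration, not the script's actual loop: `resolve_ops` is a hypothetical helper, and how `paged-inference-gates.sh` iterates over the filters is not shown in this diff. Because the script uses `${OPS:-...}`, an empty `OPS` falls back to the default just like an unset one.

```shell
# Hypothetical helper: print one test-backend-ops filter per line.
# Uses the same ${OPS:-...} expansion as the script, so both an unset
# and an empty OPS resolve to the two-filter default.
resolve_ops() {
  ops=${OPS:-MUL_MAT,MUL_MAT_ID}
  printf '%s\n' "$ops" | tr ',' '\n'
}
```

With `OPS` unset or empty this yields `MUL_MAT` and `MUL_MAT_ID`; with `OPS=MUL_MAT_ID` it yields only the narrower filter that was the previous default.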