docs(paged): scope gate projection policy

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 06:49:20 +00:00
parent 9f75da01f9
commit 5354adcffb
5 changed files with 252 additions and 4 deletions


@@ -2214,3 +2214,52 @@ Decision:
- Do not blindly force these calls to BF16. First inspect the model-load tensor
types for `ffn_gate_inp*`; if changing weight dtype or graph routing is
considered, require md5/op gates and KL validation.
## Phase 38 Gate Projection Policy
Phase 38 is a safety and scope checkpoint before any `ffn_gate_inp*` route
change. It makes the reusable inference gate stricter by default and records why
the Phase 37 SGEMM bucket should not be treated as a missed BF16 route.
Artifact:
- `/home/mudler/bench/phase38_gate_baseline/20260701_084410`
Preflight:
| check | actual |
|-------|--------|
| GPU | `NVIDIA GB10, 580.159.03` |
| docker containers | `0` |
| `local-ai-worker` containers | `0` |
| GPU compute apps | `0` |
| GPU lock owner | `FREE phase33-small-m-tile-policy-done 1782883234` |
Fresh baseline gates against the current Phase37 build:
| check | status | actual |
|-------|--------|--------|
| MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
| dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
| `MUL_MAT` | ok | `1146/1146` |
| `MUL_MAT_ID` | ok | `806/806` |
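The md5 rows above amount to a byte-exact comparison against a pinned digest. A minimal sketch of that comparison follows, assuming the gate hashes a captured output file with `md5sum`. The helper name `md5_gate` and its file argument are hypothetical and not taken from `paged-inference-gates.sh`.

```bash
# md5_gate LABEL EXPECTED FILE
# Hypothetical helper: hash FILE and compare it with the pinned digest.
md5_gate() {
    # md5sum prints "<digest>  <name>"; keep only the digest field.
    actual=$(md5sum "$3" | cut -d' ' -f1)
    if [ "$actual" = "$2" ]; then
        echo "$1 md5 OK: $actual"
    else
        echo "$1 md5 MISMATCH: expected $2, got $actual" >&2
        return 1
    fi
}
```

A call such as `md5_gate moe 8cb0ce23777bf55f92f63d0292c756b0 "$ART/moe.out"` would then fail on any byte-level drift; the `moe.out` file name is only a placeholder.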
Source comparison:
- `qwen35moe.cpp` creates `ffn_gate_inp.weight` as `[n_embd, n_expert]` and
`ffn_gate_inp_shexp.weight` as `[n_embd]`.
- `llama-graph.cpp` computes router logits with `build_lora_mm(gate_inp, cur)`
and labels the result `ffn_moe_logits`.
- vLLM Qwen3-Next constructs both gates as `ReplicatedLinear(...,
quant_config=None)`, and its fused-MoE runner can concatenate router and
shared-expert gate weights for one fused-gate forward path.
Decision:
- The `sgemm` bucket is router/shared-expert gate math kept unquantized by both
engines. It is expected F32 policy, not an accidental cuBLAS fallback.
- Do not force BF16 or NVFP4 for `ffn_gate_inp*`.
- A future optimization can test a default-off fused gate projection that
preserves F32 math and split semantics. Before any serving benchmark, gate it
with MoE/dense md5, `MUL_MAT`, and `MUL_MAT_ID`; if either md5 changes, also
require KL validation.


@@ -64,7 +64,7 @@ A lever compiled into the binary is **NOT** isolated by a runtime flag alone. It
- **Always update the fork FIRST, in this exact order:** (1) commit the change on the `localai-paged` branch and **push it**, then (2) regenerate the LocalAI series (`backend/cpp/llama-cpp-localai-paged/patches/paged/`) from the fork via `git format-patch` (one patch per fork commit, source-only, never touching a `*.md`/dev-doc), so the series stays a **1:1, drift-free mirror** of the branch. No hand-export.
- **NEVER edit the LocalAI `patches/paged/*.patch` files directly**, and **NEVER add a patch to the series with no corresponding fork-branch commit.** They are generated output, not source.
- The fork branch is also **where the build and the per-path bit-exact md5 gate actually run**, so it is the **only** place a change is truly validated. A patch that lives only in the LocalAI series has never been built or gated.
- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. Current verified state: fork HEAD `c78e537b5` is mirrored by worktree patch `0057-feat-cuda-trace-moe-mmq-launch-shapes.patch`; applying all `48` patch files on `0ed235ea2c17a19fc8238668653946721ed136fd` produces tree `4eae628e4ba6f2defa14a19d19f7e4abef9a2647`, exactly matching the fork.
- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. Current verified state: fork HEAD `2d590d770` is mirrored by worktree patch `0063-feat-cuda-trace-cublas-tensor-names.patch`; applying all `54` patch files on `0ed235ea2c17a19fc8238668653946721ed136fd` produces tree `dedb1182910eafe9f6875588dc8285bfb544cce5`, exactly matching the fork.
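The tree-hash check behind the mirror invariant can be sketched with plain git plumbing. This assumes the series has already been applied on the pin in one checkout (for example with `git am`) and that the fork branch is available in another. The function names and arguments are hypothetical.

```bash
# tree_of REPO REV: print the tree object id that REV resolves to.
tree_of() {
    git -C "$1" rev-parse --verify --quiet "$2^{tree}"
}

# verify_mirror SERIES_REPO FORK_REPO [FORK_REV]
# Succeeds only when the applied series and the fork revision share a tree.
verify_mirror() {
    series_tree=$(tree_of "$1" HEAD) || return 2
    fork_tree=$(tree_of "$2" "${3:-HEAD}") || return 2
    if [ "$series_tree" = "$fork_tree" ]; then
        echo "mirror OK: $series_tree"
    else
        echo "mirror DRIFT: series $series_tree != fork $fork_tree" >&2
        return 1
    fi
}
```

Comparing trees rather than commits is what makes the check independent of patch count, commit messages, and timestamps: two histories match only if their file contents are byte-identical.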
### 2.6 Bench hygiene gates
- **NEVER set `LLAMA_MAX_BATCH_TOKENS` in benches** (the harness explicitly logs "NO LLAMA_MAX_BATCH_TOKENS").
@@ -490,6 +490,18 @@ bucket is `blk.N.ffn_gate_inp.weight -> ffn_moe_logits-N` and
`blk.N.ffn_gate_inp_shexp.weight -> shared_expert_gate-N`; do not force BF16
without first inspecting model-load tensor types and running KL validation.
Phase 38 is the current gate-projection policy checkpoint. Artifact:
`/home/mudler/bench/phase38_gate_baseline/20260701_084410`. Preflight showed
docker `0`, `local-ai-worker` `0`, compute apps `0`, and GB10 driver
`580.159.03`. Fresh baseline gates against the Phase37 build passed: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
`806/806`. Source comparison found llama.cpp and vLLM both keep router and
shared-expert gate weights unquantized; vLLM's relevant idea is fused F32 gate
weight concatenation, not BF16/NVFP4 routing. Future fused-gate work must be
default-off, preserve F32 semantics, and pass md5/op gates before benchmarking;
if md5 changes, run KL first.
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -520,7 +532,9 @@ are now done:
- Phase 22 re-verified the patch-series mirror invariant after `0055`.
For future release checks, run `paged-inference-gates.sh` and
`paged-current-serving-snapshot.sh` from the LocalAI backend tree.
`paged-current-serving-snapshot.sh` from the LocalAI backend tree. The inference
gate now defaults to both `MUL_MAT` and `MUL_MAT_ID`; override `OPS` (for example
`OPS=MUL_MAT_ID`) only for a focused diagnostic run.
### (b) Datacenter-Blackwell pivot (THE real parity path)
The thesis: every vLLM advantage that wins on GB10 is a kernel that is **broken or capped on consumer Blackwell** and **inverts on datacenter Blackwell** (B200): FLA blocked-solve GDN, Marlin/CUTLASS grouped FP4, HBM-tuned full-cudagraph decode, native tcgen05/TMEM. ~8 TB/s HBM lifts the LPDDR5x GDN bandwidth floor ~30x. Concrete first steps:
@@ -567,6 +581,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase35_mul_mat_route_trace/20260701_074359` - default-off regular MUL_MAT route trace patch `0061`; default/trace/post-serving md5 gates green; n128 route trace found BF16 `mat_f=2485`, `op_cublas=1330`.
- `~/bench/phase36_cublas_route_trace/20260701_081228` - default-off cuBLAS subroute trace patch `0062`; default/trace/post-serving md5 and op gates green; n128 route trace found `bf16_tc=5681`, `sgemm=2511`.
- `~/bench/phase37_cublas_name_trace/20260701_083227` - cuBLAS tensor-name trace patch `0063`; default/trace/post-serving md5 and op gates green; n128 trace identified `sgemm` as MoE gate logits and shared-expert gate projections.
- `~/bench/phase38_gate_baseline/20260701_084410` - current Phase37 build baseline before gate-projection policy work; docker/local-ai-worker/GPU idle preflight green; MoE/dense md5 green; `MUL_MAT` `1146/1146`; `MUL_MAT_ID` `806/806`.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.


@@ -976,6 +976,26 @@ Lever implication: do not blindly force the `sgemm` bucket to BF16. First inspec
why `ffn_gate_inp*` loads as F32 and whether a dtype or graph-route change is
precision-safe. If attempted, use md5/op gates plus KL validation.
### Phase 38 gate projection policy
Phase 38 re-ran the current Phase37 build safety gate before changing policy:
artifact `/home/mudler/bench/phase38_gate_baseline/20260701_084410`, MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
`806/806`.
Source check: llama.cpp's Qwen35MoE graph uses `ffn_gate_inp.weight` for
`ffn_moe_logits` and `ffn_gate_inp_shexp.weight` for `shared_expert_gate`. vLLM
Qwen3-Next also constructs those gates with `quant_config=None`; the relevant
vLLM idea is not reduced precision, but concatenating router and shared-expert
gate weights in the fused-MoE runner when shared-expert fusion is active.
Lever implication: keep `ffn_gate_inp*` as inference-critical F32 policy. A
future low-conflict experiment may test a default-off fused F32 gate projection
that computes both logits in one matmul and splits the output, but it must pass
MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates before benchmarking. If md5
changes, run the KL gate first and reject on any KL regression.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update


@@ -12,7 +12,7 @@ Environment:
MOE MoE GGUF path (default: ~/bench/q36-35b-a3b-nvfp4.gguf)
DENSE Dense GGUF path (default: ~/bench/q36-27b-nvfp4.gguf)
ART artifact dir (default: ~/bench/paged_inference_gates/<timestamp>)
OPS comma-separated test-backend-ops filters (default: MUL_MAT_ID)
OPS comma-separated test-backend-ops filters (default: MUL_MAT,MUL_MAT_ID)
EXTRA_ENV extra env assignments for completion gates, e.g. "GDN_TC=5"
Expected md5:
@@ -28,7 +28,7 @@ DENSE_MD5_EXPECTED=5951a5b4d624ce891e22ab5fca9bc439
BIN=${BIN:-"$HOME/llama-phase6-source/build-cuda/bin"}
MOE=${MOE:-"$HOME/bench/q36-35b-a3b-nvfp4.gguf"}
DENSE=${DENSE:-"$HOME/bench/q36-27b-nvfp4.gguf"}
OPS=${OPS:-MUL_MAT_ID}
OPS=${OPS:-MUL_MAT,MUL_MAT_ID}
ART=${ART:-"$HOME/bench/paged_inference_gates/$(date +%Y%m%d_%H%M%S)"}
EXTRA_ENV=${EXTRA_ENV:-}


@@ -0,0 +1,164 @@
# Gate Projection Policy Phase38 Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** decide whether the Phase37 `ffn_gate_inp*` SGEMM bucket is a safe vLLM-parity lever without breaking inference.
**Architecture:** Treat router logits and shared-expert gate projections as inference-critical F32 policy until proven otherwise. Phase38 is analysis-first: record the source/vLLM comparison, strengthen the default inference gate, and allow later route changes only behind md5/op gates, plus KL validation if byte output changes.
**Tech Stack:** LocalAI paged llama.cpp backend, llama.cpp CUDA fork, DGX GB10, vLLM Qwen3-Next fused-MoE code, `paged-inference-gates.sh`.
---
### Task 1: Establish a fresh inference baseline
**Files:**
- Read: `backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh`
- Artifact: `dgx.casa:/home/mudler/bench/phase38_gate_baseline/20260701_084410`
- [x] **Step 1: Verify DGX is idle**
Run:
```bash
ssh dgx.casa 'set -euo pipefail; echo owner=$(cat ~/gpu_bench_lock/owner 2>/dev/null || true); echo docker=$(docker ps -q | wc -l); echo local_ai_worker=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true); echo compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l); nvidia-smi --query-gpu=name,driver_version --format=csv,noheader'
```
Observed:
```text
owner=FREE phase33-small-m-tile-policy-done 1782883234
docker=0
local_ai_worker=0
compute=0
NVIDIA GB10, 580.159.03
```
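The observed output can also be checked mechanically instead of by eye. The sketch below reads the `key=value` lines printed by the Step 1 command and fails unless the box looks idle. The function name `preflight_idle` is hypothetical, and treating an owner value that starts with `FREE` as "lock not held" is an assumption drawn from the observed output.

```bash
# preflight_idle: read "key=value" lines on stdin and succeed only if
# docker, local_ai_worker and compute are all 0 and the lock owner is FREE.
# Lines without a known key (such as the GPU name/driver line) are ignored.
# Missing keys are not detected in this sketch.
preflight_idle() {
    status=0
    while IFS='=' read -r key value || [ -n "$key" ]; do
        case $key in
            owner)
                case $value in
                    FREE*) ;;
                    *) echo "GPU lock held: $value" >&2; status=1 ;;
                esac
                ;;
            docker|local_ai_worker|compute)
                [ "$value" = 0 ] || { echo "$key is $value, expected 0" >&2; status=1; }
                ;;
        esac
    done
    return $status
}
```

Piping the Step 1 command into `preflight_idle` would let a wrapper abort before the gates start on a busy machine.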
- [x] **Step 2: Run canonical md5 and op gates**
Run:
```bash
ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase38_gate_baseline/$(date +%Y%m%d_%H%M%S); mkdir -p "$ART"; BIN=$HOME/llama-phase6-source/build-phase36/bin ART="$ART" OPS=MUL_MAT,MUL_MAT_ID $HOME/paged-inference-gates.sh | tee "$ART/gate.log"'
```
Observed:
```text
moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0
dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439
1146/1146 tests passed
Backend CUDA0: OK
806/806 tests passed
Backend CUDA0: OK
paged inference gates OK
artifacts: /home/mudler/bench/phase38_gate_baseline/20260701_084410
```
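The `N/M tests passed` lines in the log above are the op-gate result. A small sketch of a checker that reads a gate log and insists every such line has `N` equal to `M` is shown below. The function name `ops_all_passed` is hypothetical, and the line format is assumed from the observed output only.

```bash
# ops_all_passed: read a gate log on stdin and succeed only if at least one
# "N/M tests passed" line exists and every such line has N equal to M.
ops_all_passed() {
    awk '
        / tests passed$/ {
            seen++
            split($1, counts, "/")      # "1146/1146" -> counts[1], counts[2]
            if (counts[1] != counts[2]) bad++
        }
        END { exit (seen > 0 && bad == 0) ? 0 : 1 }
    '
}
```

For example, `ops_all_passed < "$ART/gate.log"` would return non-zero for a log that contains no pass-count line at all, which guards against a gate that silently ran zero tests.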
### Task 2: Strengthen the reusable inference gate
**Files:**
- Modify: `backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh`
- [x] **Step 1: Make both matmul op gates default**
Change:
```bash
OPS=${OPS:-MUL_MAT_ID}
```
to:
```bash
OPS=${OPS:-MUL_MAT,MUL_MAT_ID}
```
Also update `--help` text so the default is visible.
- [x] **Step 2: Verify shell syntax and help output**
Run:
```bash
bash -n backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh
backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh --help | grep 'default: MUL_MAT,MUL_MAT_ID'
```
Expected: exit 0 and the updated default line is printed.
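One detail of the `${OPS:-...}` form is worth illustrating: `:-` falls back to the default when `OPS` is unset and also when it is set but empty, so an empty `OPS=` does not disable the op gates. The sketch below applies the same default and splits the list into one filter per line. The function name `resolve_ops` is hypothetical, and how the real script consumes the comma-separated list is not shown in this change.

```bash
# resolve_ops: print one op filter per line, applying the same
# ${OPS:-MUL_MAT,MUL_MAT_ID} default as the gate script.
# ":-" treats an empty OPS exactly like an unset one.
resolve_ops() {
    ops=${OPS:-MUL_MAT,MUL_MAT_ID}
    printf '%s\n' "$ops" | tr ',' '\n'
}
```

With `OPS` unset or empty this prints `MUL_MAT` and `MUL_MAT_ID` on separate lines; with `OPS=MUL_MAT_ID` it prints only that filter.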
### Task 3: Record the Phase37 to Phase38 policy decision
**Files:**
- Read: `/home/mudler/_git/llama.cpp/src/models/qwen35moe.cpp`
- Read: `/home/mudler/_git/llama.cpp/src/llama-graph.cpp`
- Read: `/home/mudler/_git/vllm/vllm/model_executor/models/qwen3_next.py`
- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fused_moe/runner/moe_runner.py`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
- [x] **Step 1: Source inspection result**
`qwen35moe.cpp` creates `ffn_gate_inp.weight` as `[n_embd, n_expert]` and `ffn_gate_inp_shexp.weight` as `[n_embd]`. The graph uses:
```cpp
build_moe_ffn(cur, model.layers[il].ffn_gate_inp, ...)
build_lora_mm(model.layers[il].ffn_gate_inp_shexp, cur)
```
`llama-graph.cpp` computes router logits through `build_lora_mm(gate_inp, cur)` and labels the result `ffn_moe_logits`.
- [x] **Step 2: vLLM comparison result**
`qwen3_next.py` constructs both gates as `ReplicatedLinear(..., quant_config=None)`. `moe_runner.py` can concatenate `gate.weight` and `shared_expert_gate.weight` into `_combined_gate_weight` for fused shared-expert routing.
- [x] **Step 3: Decision**
The SGEMM bucket is not an accidental slow path. It is router/shared-expert gate math kept unquantized by both llama.cpp and vLLM. Do not force BF16 or NVFP4 for `ffn_gate_inp*`. The safe follow-up lever is a default-off fused gate projection experiment that preserves F32 math and split semantics, or a diagnostic proof that the two current SGEMMs are too small to matter.
- [ ] **Step 4: Gate any later fused-gate experiment**
Before benchmarking any code change:
```bash
BIN=$HOME/llama-phase6-source/build-phase36/bin \
ART=$HOME/bench/phase38_gate_fused_candidate \
OPS=MUL_MAT,MUL_MAT_ID \
$HOME/paged-inference-gates.sh
```
If either md5 differs, stop and run the KL gate before serving benchmarks. If either op gate fails, reject the candidate.
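The rule above can be written down as a tiny decision helper. This is a sketch only: the function name and the `ok`/`fail` status strings are hypothetical, and giving an op-gate failure precedence over an md5 change is an assumption about how the two sentences combine.

```bash
# fused_gate_verdict MD5_STATUS OPS_STATUS
# Each argument is "ok" or "fail". Prints the next action for a candidate.
fused_gate_verdict() {
    if [ "$2" != ok ]; then
        echo reject        # any op gate failure rejects the candidate
    elif [ "$1" != ok ]; then
        echo run-kl        # md5 drift: KL gate before any serving benchmark
    else
        echo benchmark     # byte-identical output and green op gates
    fi
}
```

Only the `benchmark` outcome allows a serving benchmark directly; `run-kl` defers it until the KL gate has passed.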
### Task 4: Commit the docs and gate-script update
**Files:**
- Modify: `backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh`
- Modify: `docs/superpowers/plans/2026-07-01-gate-projection-policy-phase38.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
- [x] **Step 1: Run local syntax checks**
Run:
```bash
bash -n backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh
```
Expected: exit 0.
- [x] **Step 2: Commit**
Run:
```bash
git add backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh \
docs/superpowers/plans/2026-07-01-gate-projection-policy-phase38.md \
backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
git commit -m "docs(paged): scope gate projection policy" \
-m "Assisted-by: Codex:gpt-5"
```