From 5354adcffb1b66be80d7da74f90e2499fbb25374 Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Wed, 1 Jul 2026 06:49:20 +0000 Subject: [PATCH] docs(paged): scope gate projection policy Assisted-by: Codex:gpt-5 --- .../docs/GB10_PARITY_PHASE0_RESULTS.md | 49 ++++++ .../docs/PARITY_HANDOFF.md | 19 +- .../docs/VLLM_PARITY_LEVER_MAP.md | 20 +++ .../paged-inference-gates.sh | 4 +- ...26-07-01-gate-projection-policy-phase38.md | 164 ++++++++++++++++++ 5 files changed, 252 insertions(+), 4 deletions(-) create mode 100644 docs/superpowers/plans/2026-07-01-gate-projection-policy-phase38.md diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md index 5606eef81..89f03afea 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md @@ -2214,3 +2214,52 @@ Decision: - Do not blindly force these calls to BF16. First inspect the model-load tensor types for `ffn_gate_inp*`; if changing weight dtype or graph routing is considered, require md5/op gates and KL validation. + +## Phase 38 Gate Projection Policy + +Phase 38 is a safety and scope checkpoint before any `ffn_gate_inp*` route +change. It makes the reusable inference gate stricter by default and records why +the Phase 37 SGEMM bucket should not be treated as a missed BF16 route. 
+ +Artifact: + +- `/home/mudler/bench/phase38_gate_baseline/20260701_084410` + +Preflight: + +| check | actual | +|-------|--------| +| GPU | `NVIDIA GB10, 580.159.03` | +| docker containers | `0` | +| `local-ai-worker` containers | `0` | +| GPU compute apps | `0` | +| GPU lock owner | `FREE phase33-small-m-tile-policy-done 1782883234` | + +Fresh baseline gates against the current Phase37 build: + +| check | status | actual | +|-------|--------|--------| +| MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| `MUL_MAT` | ok | `1146/1146` | +| `MUL_MAT_ID` | ok | `806/806` | + +Source comparison: + +- `qwen35moe.cpp` creates `ffn_gate_inp.weight` as `[n_embd, n_expert]` and + `ffn_gate_inp_shexp.weight` as `[n_embd]`. +- `llama-graph.cpp` computes router logits with `build_lora_mm(gate_inp, cur)` + and labels the result `ffn_moe_logits`. +- vLLM Qwen3-Next constructs both gates as `ReplicatedLinear(..., + quant_config=None)`, and its fused-MoE runner can concatenate router and + shared-expert gate weights for one fused-gate forward path. + +Decision: + +- The `sgemm` bucket is router/shared-expert gate math kept unquantized by both + engines. It is expected F32 policy, not an accidental cuBLAS fallback. +- Do not force BF16 or NVFP4 for `ffn_gate_inp*`. +- A future optimization can test a default-off fused gate projection that + preserves F32 math and split semantics. Gate it with MoE/dense md5, + `MUL_MAT`, and `MUL_MAT_ID` before any serving benchmark; if either md5 + changes, run KL validation first. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index a211ea4b3..45c2a67bb 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -64,7 +64,7 @@ A lever compiled into the binary is **NOT** isolated by a runtime flag alone.
It - **Always update the fork FIRST, in this exact order:** (1) commit the change on the `localai-paged` branch and **push it**, then (2) regenerate the LocalAI series (`backend/cpp/llama-cpp-localai-paged/patches/paged/`) from the fork via `git format-patch` (one patch per fork commit, source-only, never touching a `*.md`/dev-doc), so the series stays a **1:1, drift-free mirror** of the branch. No hand-export. - **NEVER edit the LocalAI `patches/paged/*.patch` files directly**, and **NEVER add a patch to the series with no corresponding fork-branch commit.** They are generated output, not source. - The fork branch is also **where the build and the per-path bit-exact md5 gate actually run**, so it is the **only** place a change is truly validated. A patch that lives only in the LocalAI series has never been built or gated. -- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. Current verified state: fork HEAD `c78e537b5` is mirrored by worktree patch `0057-feat-cuda-trace-moe-mmq-launch-shapes.patch`; applying all `48` patch files on `0ed235ea2c17a19fc8238668653946721ed136fd` produces tree `4eae628e4ba6f2defa14a19d19f7e4abef9a2647`, exactly matching the fork. +- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. 
Current verified state: fork HEAD `2d590d770` is mirrored by worktree patch `0063-feat-cuda-trace-cublas-tensor-names.patch`; applying all `54` patch files on `0ed235ea2c17a19fc8238668653946721ed136fd` produces tree `dedb1182910eafe9f6875588dc8285bfb544cce5`, exactly matching the fork. ### 2.6 Bench hygiene gates - **NEVER set `LLAMA_MAX_BATCH_TOKENS` in benches** (the harness explicitly logs "NO LLAMA_MAX_BATCH_TOKENS"). @@ -490,6 +490,18 @@ bucket is `blk.N.ffn_gate_inp.weight -> ffn_moe_logits-N` and `blk.N.ffn_gate_inp_shexp.weight -> shared_expert_gate-N`; do not force BF16 without first inspecting model-load tensor types and running KL validation. +Phase 38 is the current gate-projection policy checkpoint. Artifact: +`/home/mudler/bench/phase38_gate_baseline/20260701_084410`. Preflight showed +docker `0`, `local-ai-worker` `0`, compute apps `0`, and GB10 driver +`580.159.03`. Fresh baseline gates against the Phase37 build passed: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` +`806/806`. Source comparison found llama.cpp and vLLM both keep router and +shared-expert gate weights unquantized; vLLM's relevant idea is fused F32 gate +weight concatenation, not BF16/NVFP4 routing. Future fused-gate work must be +default-off, preserve F32 semantics, and pass md5/op gates before benchmarking; +if md5 changes, run KL first. + --- ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes) @@ -520,7 +532,9 @@ are now done: - Phase 22 re-verified the patch-series mirror invariant after `0055`. For future release checks, run `paged-inference-gates.sh` and -`paged-current-serving-snapshot.sh` from the LocalAI backend tree. +`paged-current-serving-snapshot.sh` from the LocalAI backend tree. The inference +gate now defaults to both `MUL_MAT` and `MUL_MAT_ID`; set `OPS=` only for a +focused diagnostic run. 
### (b) Datacenter-Blackwell pivot (THE real parity path) The thesis: every vLLM advantage that wins on GB10 is a kernel that is **broken or capped on consumer Blackwell** and **inverts on datacenter Blackwell** (B200): FLA blocked-solve GDN, Marlin/CUTLASS grouped FP4, HBM-tuned full-cudagraph decode, native tcgen05/TMEM. ~8 TB/s HBM lifts the LPDDR5x GDN bandwidth floor ~30x. Concrete first steps: @@ -567,6 +581,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual - `~/bench/phase35_mul_mat_route_trace/20260701_074359` - default-off regular MUL_MAT route trace patch `0061`; default/trace/post-serving md5 gates green; n128 route trace found BF16 `mat_f=2485`, `op_cublas=1330`. - `~/bench/phase36_cublas_route_trace/20260701_081228` - default-off cuBLAS subroute trace patch `0062`; default/trace/post-serving md5 and op gates green; n128 route trace found `bf16_tc=5681`, `sgemm=2511`. - `~/bench/phase37_cublas_name_trace/20260701_083227` - cuBLAS tensor-name trace patch `0063`; default/trace/post-serving md5 and op gates green; n128 trace identified `sgemm` as MoE gate logits and shared-expert gate projections. +- `~/bench/phase38_gate_baseline/20260701_084410` - current Phase37 build baseline before gate-projection policy work; docker/local-ai-worker/GPU idle preflight green; MoE/dense md5 green; `MUL_MAT` `1146/1146`; `MUL_MAT_ID` `806/806`. - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`. - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30. - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`. 
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md index a0955085b..8f9c1a915 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md @@ -976,6 +976,26 @@ Lever implication: do not blindly force the `sgemm` bucket to BF16. First inspec why `ffn_gate_inp*` loads as F32 and whether a dtype or graph-route change is precision-safe. If attempted, use md5/op gates plus KL validation. +### Phase 38 gate projection policy + +Phase 38 re-ran the current Phase37 build safety gate before changing policy: +artifact `/home/mudler/bench/phase38_gate_baseline/20260701_084410`, MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` +`806/806`. + +Source check: llama.cpp's Qwen35MoE graph uses `ffn_gate_inp.weight` for +`ffn_moe_logits` and `ffn_gate_inp_shexp.weight` for `shared_expert_gate`. vLLM +Qwen3-Next also constructs those gates with `quant_config=None`; the relevant +vLLM idea is not reduced precision, but concatenating router and shared-expert +gate weights in the fused-MoE runner when shared-expert fusion is active. + +Lever implication: keep `ffn_gate_inp*` as inference-critical F32 policy. A +future low-conflict experiment may test a default-off fused F32 gate projection +that computes both logits in one matmul and splits the output, but it must pass +MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates before benchmarking. If md5 +changes, run the KL gate first and reject on any KL regression. 
+ Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`. ### Phase 10 GDN C32 slab update diff --git a/backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh b/backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh index ccff49c3c..bbe4149e3 100755 --- a/backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh +++ b/backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh @@ -12,7 +12,7 @@ Environment: MOE MoE GGUF path (default: ~/bench/q36-35b-a3b-nvfp4.gguf) DENSE Dense GGUF path (default: ~/bench/q36-27b-nvfp4.gguf) ART artifact dir (default: ~/bench/paged_inference_gates/) - OPS comma-separated test-backend-ops filters (default: MUL_MAT_ID) + OPS comma-separated test-backend-ops filters (default: MUL_MAT,MUL_MAT_ID) EXTRA_ENV extra env assignments for completion gates, e.g. 
"GDN_TC=5" Expected md5: @@ -28,7 +28,7 @@ DENSE_MD5_EXPECTED=5951a5b4d624ce891e22ab5fca9bc439 BIN=${BIN:-"$HOME/llama-phase6-source/build-cuda/bin"} MOE=${MOE:-"$HOME/bench/q36-35b-a3b-nvfp4.gguf"} DENSE=${DENSE:-"$HOME/bench/q36-27b-nvfp4.gguf"} -OPS=${OPS:-MUL_MAT_ID} +OPS=${OPS:-MUL_MAT,MUL_MAT_ID} ART=${ART:-"$HOME/bench/paged_inference_gates/$(date +%Y%m%d_%H%M%S)"} EXTRA_ENV=${EXTRA_ENV:-} diff --git a/docs/superpowers/plans/2026-07-01-gate-projection-policy-phase38.md b/docs/superpowers/plans/2026-07-01-gate-projection-policy-phase38.md new file mode 100644 index 000000000..2b3d0a275 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-gate-projection-policy-phase38.md @@ -0,0 +1,164 @@ +# Gate Projection Policy Phase38 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** decide whether the Phase37 `ffn_gate_inp*` SGEMM bucket is a safe vLLM-parity lever without breaking inference. + +**Architecture:** Treat router logits and shared-expert gate projections as inference-critical F32 policy until proven otherwise. Phase38 is analysis-first: record the source/vLLM comparison, strengthen the default inference gate, and only allow later route changes behind md5/op gates plus KL if byte output changes. + +**Tech Stack:** LocalAI paged llama.cpp backend, llama.cpp CUDA fork, DGX GB10, vLLM Qwen3-Next fused-MoE code, `paged-inference-gates.sh`. 
+ +--- + +### Task 1: Establish a fresh inference baseline + +**Files:** +- Read: `backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh` +- Artifact: `dgx.casa:/home/mudler/bench/phase38_gate_baseline/20260701_084410` + +- [x] **Step 1: Verify DGX is idle** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail; echo owner=$(cat ~/gpu_bench_lock/owner 2>/dev/null || true); echo docker=$(docker ps -q | wc -l); echo local_ai_worker=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true); echo compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l); nvidia-smi --query-gpu=name,driver_version --format=csv,noheader' +``` + +Observed: + +```text +owner=FREE phase33-small-m-tile-policy-done 1782883234 +docker=0 +local_ai_worker=0 +compute=0 +NVIDIA GB10, 580.159.03 +``` + +- [x] **Step 2: Run canonical md5 and op gates** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase38_gate_baseline/$(date +%Y%m%d_%H%M%S); mkdir -p "$ART"; BIN=$HOME/llama-phase6-source/build-phase36/bin ART="$ART" OPS=MUL_MAT,MUL_MAT_ID $HOME/paged-inference-gates.sh | tee "$ART/gate.log"' +``` + +Observed: + +```text +moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0 +dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439 +1146/1146 tests passed +Backend CUDA0: OK +806/806 tests passed +Backend CUDA0: OK +paged inference gates OK +artifacts: /home/mudler/bench/phase38_gate_baseline/20260701_084410 +``` + +### Task 2: Strengthen the reusable inference gate + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh` + +- [x] **Step 1: Make both matmul op gates default** + +Change: + +```bash +OPS=${OPS:-MUL_MAT_ID} +``` + +to: + +```bash +OPS=${OPS:-MUL_MAT,MUL_MAT_ID} +``` + +Also update `--help` text so the default is visible. 
+ +- [x] **Step 2: Verify shell syntax and help output** + +Run: + +```bash +bash -n backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh +backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh --help | grep 'default: MUL_MAT,MUL_MAT_ID' +``` + +Expected: exit 0 and the updated default line is printed. + +### Task 3: Record the Phase37 to Phase38 policy decision + +**Files:** +- Read: `/home/mudler/_git/llama.cpp/src/models/qwen35moe.cpp` +- Read: `/home/mudler/_git/llama.cpp/src/llama-graph.cpp` +- Read: `/home/mudler/_git/vllm/vllm/model_executor/models/qwen3_next.py` +- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fused_moe/runner/moe_runner.py` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + +- [x] **Step 1: Source inspection result** + +`qwen35moe.cpp` creates `ffn_gate_inp.weight` as `[n_embd, n_expert]` and `ffn_gate_inp_shexp.weight` as `[n_embd]`. The graph uses: + +```cpp +build_moe_ffn(cur, model.layers[il].ffn_gate_inp, ...) +build_lora_mm(model.layers[il].ffn_gate_inp_shexp, cur) +``` + +`llama-graph.cpp` computes router logits through `build_lora_mm(gate_inp, cur)` and labels the result `ffn_moe_logits`. + +- [x] **Step 2: vLLM comparison result** + +`qwen3_next.py` constructs both gates as `ReplicatedLinear(..., quant_config=None)`. `moe_runner.py` can concatenate `gate.weight` and `shared_expert_gate.weight` into `_combined_gate_weight` for fused shared-expert routing. + +- [x] **Step 3: Decision** + +The SGEMM bucket is not an accidental slow path. It is router/shared-expert gate math kept unquantized by both llama.cpp and vLLM. Do not force BF16 or NVFP4 for `ffn_gate_inp*`. 
The safe follow-up lever is a default-off fused gate projection experiment that preserves F32 math and split semantics, or a diagnostic proof that the two current SGEMMs are too small to matter. + +- [ ] **Step 4: Gate any later fused-gate experiment** + +Before benchmarking any code change: + +```bash +BIN=$HOME/llama-phase6-source/build-phase36/bin \ +ART=$HOME/bench/phase38_gate_fused_candidate \ +OPS=MUL_MAT,MUL_MAT_ID \ +$HOME/paged-inference-gates.sh +``` + +If either md5 differs, stop and run the KL gate before serving benchmarks. If either op gate fails, reject the candidate. + +### Task 4: Commit the docs and gate-script update + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh` +- Modify: `docs/superpowers/plans/2026-07-01-gate-projection-policy-phase38.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + +- [x] **Step 1: Run local syntax checks** + +Run: + +```bash +bash -n backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh +``` + +Expected: exit 0. + +- [x] **Step 2: Commit** + +Run: + +```bash +git add backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh \ + docs/superpowers/plans/2026-07-01-gate-projection-policy-phase38.md \ + backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +git commit -m "docs(paged): scope gate projection policy" \ + -m "Assisted-by: Codex:gpt-5" +```
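
Reviewer note on Task 3, Step 4: the acceptance rule for a later fused-gate candidate (reject on any op-gate failure, run KL first if either md5 differs, otherwise proceed to benchmarks) could be made mechanical. The following is a minimal sketch only. `classify_candidate` is a hypothetical helper and is not part of `paged-inference-gates.sh`; it assumes the four gate results have already been extracted from the gate log as two md5 strings and two `passed/total` counts.

```bash
#!/usr/bin/env bash
# Hypothetical helper (assumption, not in the gate script): classify a
# fused-gate candidate from the four gate results recorded in the plan.
MOE_MD5_EXPECTED=8cb0ce23777bf55f92f63d0292c756b0
DENSE_MD5_EXPECTED=5951a5b4d624ce891e22ab5fca9bc439

# Usage: classify_candidate <moe_md5> <dense_md5> <mul_mat P/T> <mul_mat_id P/T>
# Prints one of: reject | kl-first | benchmark
classify_candidate() {
    local moe_md5=$1 dense_md5=$2 mul_mat=$3 mul_mat_id=$4
    local ratio
    # An op gate passes only when the count is well-formed and passed == total,
    # e.g. "1146/1146". Anything else rejects the candidate outright.
    for ratio in "$mul_mat" "$mul_mat_id"; do
        if [[ ! "$ratio" =~ ^([0-9]+)/([0-9]+)$ || "${BASH_REMATCH[1]}" != "${BASH_REMATCH[2]}" ]]; then
            echo reject
            return 0
        fi
    done
    # Byte-identical output on both models: benchmarks may proceed.
    if [[ "$moe_md5" == "$MOE_MD5_EXPECTED" && "$dense_md5" == "$DENSE_MD5_EXPECTED" ]]; then
        echo benchmark
    else
        # Output bytes changed: KL validation must run before any serving benchmark.
        echo kl-first
    fi
}
```

Checking the op gates before the md5 comparison mirrors the plan's ordering of severity: an op-gate failure is a hard reject, while an md5 change only defers the decision to the KL gate.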