From a0194125f5a9c19714418267ea6dc64877286247 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Wed, 1 Jul 2026 03:35:54 +0000
Subject: [PATCH] chore(paged): summarize snapshot inference gates

Emit a compact gate_summary.tsv from current serving snapshots so each artifact records the pre/post MoE md5, dense md5, and backend op checks. Add a summary-only mode for auditing existing artifacts and document the Phase 25 backfill on the Phase 20 snapshot.

Assisted-by: Codex:gpt-5
---
 backend/cpp/llama-cpp-localai-paged/README.md |   5 +-
 .../docs/GB10_PARITY_PHASE0_RESULTS.md        |  57 +++++++++
 .../docs/PARITY_HANDOFF.md                    |  11 ++
 .../docs/VLLM_PARITY_LEVER_MAP.md             |  15 +++
 .../paged-current-serving-snapshot.sh         |  95 +++++++++++++-
 ...026-07-01-snapshot-gate-summary-phase25.md | 121 ++++++++++++++++++
 6 files changed, 300 insertions(+), 4 deletions(-)
 create mode 100644 docs/superpowers/plans/2026-07-01-snapshot-gate-summary-phase25.md
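
Reviewer note (not part of the commit): a minimal sketch, in Python, of how one might scan a `gate_summary.tsv` for rows that are not green. It assumes the six-column header that `write_gate_summary` prints in this patch (`phase`, `check`, `status`, `actual`, `expected`, `details`). The function name `non_ok_rows`, the `allow_skipped` flag, the relative paths in the sample, and the all-zero md5 row are hypothetical and exist only for illustration.

```python
import csv
import io


def non_ok_rows(tsv_text, allow_skipped=False):
    """Return (phase, check, status) for each row whose status is not 'ok'.

    Hypothetical helper: assumes the header written by write_gate_summary.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    flagged = []
    for row in reader:
        status = row["status"]
        if status == "ok" or (allow_skipped and status == "skipped"):
            continue
        flagged.append((row["phase"], row["check"], status))
    return flagged


# Illustrative input only: the mismatch and skipped rows are invented here
# and do not come from any recorded artifact.
SAMPLE = "\n".join([
    "phase\tcheck\tstatus\tactual\texpected\tdetails",
    "pre\tmoe_md5\tok\t8cb0ce23777bf55f92f63d0292c756b0"
    "\t8cb0ce23777bf55f92f63d0292c756b0\tgate_pre/moe.md5",
    "pre\tdense_md5\tmismatch\t00000000000000000000000000000000"
    "\t5951a5b4d624ce891e22ab5fca9bc439\tgate_pre/dense.md5",
    "post\tall\tskipped\t\t\tgate_post missing",
])

for phase, check, status in non_ok_rows(SAMPLE):
    print(f"{phase} {check}: {status}")
```

Passing `allow_skipped=True` mirrors the script's own behavior, which reports a missing gate directory as `skipped` without failing the run.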

diff --git a/backend/cpp/llama-cpp-localai-paged/README.md b/backend/cpp/llama-cpp-localai-paged/README.md
index 838ac17a2..6755f001f 100644
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -618,9 +618,12 @@ DGX mirror `f2521ab12`, artifact
 Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving
 snapshots. It targets the clean `~/llama-phase6-source` mirror, checks
 docker/`local-ai-worker`/GPU-idle state, uses the owner-file lock, runs pre/post
-inference gates, writes `hardware.txt`, and emits paged/vLLM ratios.
+inference gates, writes `hardware.txt`, emits `gate_summary.tsv`, and emits
+paged/vLLM ratios.
 `hardware.txt` records the GPU identity and hardware class so GB10/workstation
 Blackwell evidence is not confused with a future datacenter-Blackwell rerun.
+`gate_summary.tsv` records pre/post MoE md5, dense md5, and backend-op checks
+so an artifact shows the inference gates passed without reading full logs.
 Do not use the stale DGX
 `~/bench/combined_definitive.sh` without first porting it to the current mirror
 and lock discipline.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index 58761a7cd..8a777ad92 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -1514,3 +1514,60 @@ Decision:
   GB10-to-datacenter generalization.
 - The Phase 20 GB10 closure still applies to `gb10_or_workstation_blackwell`;
   datacenter Blackwell needs a fresh run of the same methodology.
+
+## Phase 25 Snapshot Gate Summary
+
+Phase 25 made current-stack serving artifacts self-auditing for the inference
+gates that protect the paged path.
+
+Script change:
+
+- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` now
+  writes `gate_summary.tsv` after the post gate in a full run.
+- The script also supports `--summarize-gates ART` to generate the same summary
+  from existing `gate_pre/` and `gate_post/` artifacts without launching
+  servers.
+
+Recorded rows:
+
+- pre/post MoE transcript md5 versus
+  `8cb0ce23777bf55f92f63d0292c756b0`;
+- pre/post dense transcript md5 versus
+  `5951a5b4d624ce891e22ab5fca9bc439`;
+- pre/post backend op rows, currently `MUL_MAT_ID`, with the parsed passed/total
+  count.
+
+Verification:
+
+- red check: Phase 20 initially had gate artifacts but no `gate_summary.tsv`;
+- local `bash -n` passed;
+- local `--help` passed;
+- DGX `--summarize-gates` against Phase 20 wrote six green rows;
+- DGX `DRY_RUN=1` validated that the normal path still preflights and writes
+  `hardware.txt` without launching servers or writing a gate summary before
+  gates exist.
+
+Artifacts:
+
+- Backfilled summary:
+  `/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv`
+- Dry run:
+  `/home/mudler/bench/phase25_gate_summary_dryrun/20260701_053353`
+
+Backfilled Phase 20 gate summary (phase, check, status, actual columns):
+
+```text
+pre   moe_md5        ok  8cb0ce23777bf55f92f63d0292c756b0
+pre   dense_md5      ok  5951a5b4d624ce891e22ab5fca9bc439
+pre   op_MUL_MAT_ID  ok  806/806
+post  moe_md5        ok  8cb0ce23777bf55f92f63d0292c756b0
+post  dense_md5      ok  5951a5b4d624ce891e22ab5fca9bc439
+post  op_MUL_MAT_ID  ok  806/806
+```
+
+Decision:
+
+- Future full serving snapshots carry compact proof that inference md5/op gates
+  stayed green before and after the paged-vs-vLLM run.
+- Treat `gate_summary.tsv` plus `hardware.txt` as the quick audit surface before
+  accepting a parity snapshot.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 5c756a99f..cda3918bd 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -132,6 +132,10 @@ python3 /home/mudler/bench/h2h_cli3.py   # OpenAI /v1/completions, ignore_eos, f
 The harness also writes `hardware.txt` before any server starts, including
 `DRY_RUN=1`, so every new snapshot records the GPU model, driver, compute
 capability when exposed by `nvidia-smi`, and a conservative hardware class.
+Full runs also write `gate_summary.tsv` after the post gate, summarizing pre/post
+MoE md5, dense md5, and backend-op checks; use
+`paged-current-serving-snapshot.sh --summarize-gates ART` to backfill or audit an
+existing snapshot without starting servers.
 
 ### 3.4 THE DECODE-PROFILING RULE (this trap caused 4 wrong analyses)
 Decode runs as a **replayed CUDA graph**. `nsys` **without** `--cuda-graph-trace=node` collapses each graph replay into ONE opaque launch, so every per-kernel attribution becomes an artifact. This is exactly what made the old "paged 159 us/tok, GPU ~16% busy, host-bound, 5.4x more GPU-efficient" story wrong, and produced the wrong ~56% headline.
@@ -332,6 +336,12 @@ hardware report. DGX dry run passed at
 artifacts self-describing: GB10/workstation Blackwell results must not be used
 as datacenter-Blackwell parity evidence.
 
+Phase 25 extended the same harness to write `gate_summary.tsv`. The summary was
+backfilled on the Phase 20 artifact at
+`/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv`;
+it records pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806` as `ok`.
+
 ---
 
 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -396,6 +406,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase20_current_snapshot/20260701_050621` - current clean-stack paged-vs-vLLM MoE serving snapshot.
 - `~/bench/phase21_harness_dryrun/20260701_051757` - current snapshot harness dry-run artifact.
 - `~/bench/phase24_hardware_report_dryrun/20260701_052741` - current snapshot harness dry run proving `hardware.txt` captures the DGX as `hardware_class=gb10_or_workstation_blackwell`.
+- `~/bench/phase25_gate_summary_dryrun/20260701_053353` - dry run after adding `gate_summary.tsv` support; normal dry-run still writes `hardware.txt` and does not emit a gate summary before gates exist.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 2b3a0e176..552ef6c0b 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -676,6 +676,21 @@ with `GPU 0: NVIDIA GB10`, driver `580.159.03`, and compute capability `12.1`.
 Use `hardware.txt` when comparing future snapshots. GB10/workstation Blackwell
 results do not establish datacenter-Blackwell parity.
 
+### Phase 25 snapshot gate summary
+
+Phase 25 extended `paged-current-serving-snapshot.sh` to write
+`gate_summary.tsv` after the post gate in full runs. It also added
+`--summarize-gates ART` for auditing existing artifacts without launching
+servers.
+
+The Phase 20 artifact was backfilled at
+`/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv`.
+It records pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806` as `ok`.
+
+Use `hardware.txt` plus `gate_summary.tsv` as the quick audit surface before
+accepting any new parity snapshot.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
diff --git a/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
index 9ed6277c1..af1a7aac1 100755
--- a/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+++ b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
@@ -3,7 +3,7 @@ set -euo pipefail
 
 usage() {
   cat <<'EOF'
-Usage: paged-current-serving-snapshot.sh
+Usage: paged-current-serving-snapshot.sh [--summarize-gates ART]
 
 Run a current-stack paged llama.cpp vs vLLM MoE serving snapshot on DGX.
 
@@ -30,13 +30,32 @@ Environment overrides:
   VLLM_BIN     vLLM executable (default: ~/vllm-bench/bin/vllm)
   SKIP_GATES=1 to skip pre/post paged inference gates
   DRY_RUN=1    validate inputs/preflight, write hardware.txt, and print commands without running servers
+
+Options:
+  --summarize-gates ART  write ART/gate_summary.tsv from existing gate_pre/gate_post artifacts
 EOF
 }
 
-if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
+SUMMARY_GATES_ART=""
+case "${1:-}" in
+  -h|--help)
   usage
   exit 0
-fi
+  ;;
+  --summarize-gates)
+    if [[ -z "${2:-}" ]]; then
+      usage >&2
+      exit 2
+    fi
+    SUMMARY_GATES_ART="$2"
+  ;;
+  "")
+  ;;
+  *)
+    usage >&2
+    exit 2
+  ;;
+esac
 
 SRC=${SRC:-"$HOME/llama-phase6-source"}
 BIN=${BIN:-"$SRC/build-cuda/bin"}
@@ -56,6 +75,8 @@ VLLM_PORT=${VLLM_PORT:-8000}
 VLLM_BIN=${VLLM_BIN:-"$HOME/vllm-bench/bin/vllm"}
 SKIP_GATES=${SKIP_GATES:-0}
 DRY_RUN=${DRY_RUN:-0}
+MOE_MD5_EXPECTED=8cb0ce23777bf55f92f63d0292c756b0
+DENSE_MD5_EXPECTED=5951a5b4d624ce891e22ab5fca9bc439
 
 LOCK_DIR="$HOME/gpu_bench_lock"
 OWNER="$LOCK_DIR/owner"
@@ -271,6 +292,73 @@ for n in sorted({row[1] for row in rows}):
 PY
 }
 
+write_gate_summary() {
+  python3 - "$ART" "$MOE_MD5_EXPECTED" "$DENSE_MD5_EXPECTED" <<'PY' | tee "$ART/gate_summary.tsv"
+import re
+import sys
+from pathlib import Path
+
+art = Path(sys.argv[1])
+expected = {
+    "moe": sys.argv[2],
+    "dense": sys.argv[3],
+}
+ansi = re.compile(r"\x1b\[[0-9;]*m")
+bad = False
+
+print("phase\tcheck\tstatus\tactual\texpected\tdetails")
+
+for phase in ("pre", "post"):
+    gate_dir = art / f"gate_{phase}"
+    if not gate_dir.exists():
+        print(f"{phase}\tall\tskipped\t\t\t{gate_dir} missing")
+        continue
+
+    for name, want in expected.items():
+        md5_path = gate_dir / f"{name}.md5"
+        if not md5_path.exists():
+            print(f"{phase}\t{name}_md5\tmissing\t\t{want}\t{md5_path} missing")
+            bad = True
+            continue
+        got = (md5_path.read_text().split() or [""])[0]  # empty file -> mismatch, not IndexError
+        status = "ok" if got == want else "mismatch"
+        if status != "ok":
+            bad = True
+        print(f"{phase}\t{name}_md5\t{status}\t{got}\t{want}\t{md5_path}")
+
+    op_paths = sorted(gate_dir.glob("op_*.txt"))
+    if not op_paths:
+        print(f"{phase}\top\tmissing\t\t\tno op_*.txt files")
+        bad = True
+        continue
+
+    for path in op_paths:
+        op = path.stem.removeprefix("op_")
+        text = ansi.sub("", path.read_text(errors="replace"))
+        passed = re.search(r"(\d+)/(\d+) tests passed", text)
+        backend_ok = re.search(r"Backend CUDA0:\s+OK", text)
+        if passed:
+            actual = f"{passed.group(1)}/{passed.group(2)}"
+            status = "ok" if passed.group(1) == passed.group(2) and backend_ok else "fail"
+        else:
+            actual = ""
+            status = "missing"
+        if status != "ok":
+            bad = True
+        print(f"{phase}\top_{op}\t{status}\t{actual}\tall\t{path}")
+
+if bad:
+    sys.exit(6)
+PY
+}
+
+if [[ -n "$SUMMARY_GATES_ART" ]]; then
+  ART="$SUMMARY_GATES_ART"
+  require_path "$ART"
+  write_gate_summary
+  exit 0
+fi
+
 require_path "$SRC"
 require_path "$BIN/llama-server"
 require_path "$BIN/llama-completion"
@@ -306,5 +394,6 @@ run_vllm
 release_lock
 trap - EXIT
 run_gate post
+write_gate_summary
 write_summary
 log "artifacts: $ART"
diff --git a/docs/superpowers/plans/2026-07-01-snapshot-gate-summary-phase25.md b/docs/superpowers/plans/2026-07-01-snapshot-gate-summary-phase25.md
new file mode 100644
index 000000000..86b937071
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-snapshot-gate-summary-phase25.md
@@ -0,0 +1,121 @@
+# Snapshot Gate Summary Phase 25 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use
+> superpowers:verification-before-completion before recording the phase result.
+> Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** make current-stack paged-vs-vLLM serving artifacts prove that
+inference md5/op gates stayed green without requiring a full log read.
+
+**Architecture:** extend the existing current serving snapshot harness with a
+compact gate-summary writer. Keep it additive and outside llama.cpp source: no
+patch-series change and no inference behavior change.
+
+**Tech Stack:** Bash, Python stdlib, existing `paged-inference-gates.sh`
+artifacts.
+
+---
+
+## Task 1: Red Check
+
+- [x] **Step 1: Prove Phase 20 lacks compact gate proof**
+
+  Command:
+
+  ```bash
+  ssh dgx.casa 'test -e ~/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv'
+  ```
+
+  Result:
+
+  - exited `1` before the patch, while `gate_pre/`, `gate_post/`, and full gate
+    logs existed.
+
+## Task 2: Add Gate Summary
+
+- [x] **Step 1: Extend `paged-current-serving-snapshot.sh`**
+
+  File:
+
+  - `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
+
+  Behavior:
+
+  - writes `$ART/gate_summary.tsv` after the post gate in a full serving run;
+  - records pre/post MoE md5, dense md5, and backend op status;
+  - compares MoE against `8cb0ce23777bf55f92f63d0292c756b0`;
+  - compares dense against `5951a5b4d624ce891e22ab5fca9bc439`;
+  - parses op pass counts such as `806/806 tests passed`;
+  - exits non-zero if a file in an existing gate directory is missing,
+    mismatched, or not fully passing (a missing gate directory is `skipped`);
+  - supports `--summarize-gates ART` to audit existing artifacts without running
+    servers.
+
+## Task 3: Verify
+
+- [x] **Step 1: Local syntax/help checks**
+
+  Commands:
+
+  ```bash
+  bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+  backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help
+  ```
+
+  Result:
+
+  - both passed.
+
+- [x] **Step 2: Backfill Phase 20 gate summary**
+
+  Command:
+
+  ```bash
+  /tmp/paged-current-serving-snapshot.sh \
+    --summarize-gates ~/bench/phase20_current_snapshot/20260701_050621
+  ```
+
+  Result:
+
+  - wrote `/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv`;
+  - pre/post MoE md5 rows were `ok`;
+  - pre/post dense md5 rows were `ok`;
+  - pre/post `MUL_MAT_ID` rows were `ok` with `806/806`.
+
+- [x] **Step 3: DGX dry run**
+
+  Command:
+
+  ```bash
+  DRY_RUN=1 ART=~/bench/phase25_gate_summary_dryrun/20260701_053353 \
+    /tmp/paged-current-serving-snapshot.sh
+  ```
+
+  Result:
+
+  - preflight verified `docker=0`, `local_ai_worker=0`, `compute=0`;
+  - `hardware.txt` was still written;
+  - no paged or vLLM server launched;
+  - no `gate_summary.tsv` was written before gates existed.
+
+  Artifact:
+
+  - `/home/mudler/bench/phase25_gate_summary_dryrun/20260701_053353`
+
+## Task 4: Record Result
+
+- [x] **Step 1: Update parity docs**
+
+  Updated files:
+
+  - `backend/cpp/llama-cpp-localai-paged/README.md`
+  - `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+  - `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+  - `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+
+## Self-Review
+
+- No llama.cpp source behavior changed.
+- Future full snapshots now contain compact proof of pre/post md5 and op gates.
+- The summary-only mode lets old artifacts be audited without consuming GPU
+  benchmark time.