chore(paged): summarize snapshot inference gates

Emit a compact gate_summary.tsv from current serving snapshots so each artifact records the pre/post MoE md5, dense md5, and backend op checks. Add a summary-only mode for auditing existing artifacts and document the Phase 25 backfill on the Phase 20 snapshot.

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 03:35:54 +00:00
parent 7108b68a70
commit a0194125f5
6 changed files with 300 additions and 4 deletions

View File

@@ -618,9 +618,12 @@ DGX mirror `f2521ab12`, artifact
Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving
snapshots. It targets the clean `~/llama-phase6-source` mirror, checks
docker/`local-ai-worker`/GPU-idle state, uses the owner-file lock, runs pre/post
inference gates, writes `hardware.txt`, and emits paged/vLLM ratios.
inference gates, writes `hardware.txt`, emits `gate_summary.tsv`, and emits
paged/vLLM ratios.
`hardware.txt` records the GPU identity and hardware class so GB10/workstation
Blackwell evidence is not confused with a future datacenter-Blackwell rerun.
`gate_summary.tsv` records pre/post MoE md5, dense md5, and backend-op checks
so an artifact proves the inference gates passed without a full log read.
Do not use the stale DGX
`~/bench/combined_definitive.sh` without first porting it to the current mirror
and lock discipline.

View File

@@ -1514,3 +1514,60 @@ Decision:
GB10-to-datacenter generalization.
- The Phase 20 GB10 closure still applies to `gb10_or_workstation_blackwell`;
datacenter Blackwell needs a fresh run of the same methodology.
## Phase 25 Snapshot Gate Summary
Phase 25 made current-stack serving artifacts self-auditing for the inference
gates that protect the paged path.
Script change:
- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` now
writes `gate_summary.tsv` after the post gate in a full run.
- The script also supports `--summarize-gates ART` to generate the same summary
from existing `gate_pre/` and `gate_post/` artifacts without launching
servers.
Recorded rows:
- pre/post MoE transcript md5 versus
`8cb0ce23777bf55f92f63d0292c756b0`;
- pre/post dense transcript md5 versus
`5951a5b4d624ce891e22ab5fca9bc439`;
- pre/post backend op rows, currently `MUL_MAT_ID`, with the parsed passed/total
count.
Verification:
- red check: Phase 20 initially had gate artifacts but no `gate_summary.tsv`;
- local `bash -n` passed;
- local `--help` passed;
- DGX `--summarize-gates` against Phase 20 wrote six `ok` rows;
- DGX `DRY_RUN=1` confirmed that the normal path still runs preflight and
writes `hardware.txt` without launching servers, and writes no gate summary
before gates exist.
Artifacts:
- Backfilled summary:
`/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv`
- Dry run:
`/home/mudler/bench/phase25_gate_summary_dryrun/20260701_053353`
Backfilled Phase 20 gate summary:
```text
pre moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0
pre dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439
pre op_MUL_MAT_ID ok 806/806
post moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0
post dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439
post op_MUL_MAT_ID ok 806/806
```
Decision:
- Future full serving snapshots carry compact proof that inference md5/op gates
stayed green before and after the paged-vs-vLLM run.
- Treat `gate_summary.tsv` plus `hardware.txt` as the quick audit surface before
accepting a parity snapshot.
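
The quick audit described above could be scripted roughly as follows. This is a minimal sketch, not part of the harness: it assumes the six-column header that `write_gate_summary` prints (`phase`, `check`, `status`, `actual`, `expected`, `details`), and the helper names `failing_gate_rows` and `snapshot_gates_green` are hypothetical. It is also stricter than the script, because it treats a `skipped` row as a failure.

```python
import csv
from pathlib import Path


def failing_gate_rows(summary_path):
    """Return every row of a gate_summary.tsv whose status is not "ok"."""
    with Path(summary_path).open(newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row for row in reader if row["status"] != "ok"]


def snapshot_gates_green(summary_path, expected_rows=6):
    """True when the summary has the expected row count and every row is ok.

    expected_rows=6 is an assumption taken from the Phase 20 backfill:
    pre/post x (moe_md5, dense_md5, op_MUL_MAT_ID).
    """
    with Path(summary_path).open(newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))
    return len(rows) == expected_rows and not failing_gate_rows(summary_path)
```

Checking the row count as well as the statuses guards against a summary that is green only because a gate directory was absent.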

View File

@@ -132,6 +132,10 @@ python3 /home/mudler/bench/h2h_cli3.py # OpenAI /v1/completions, ignore_eos, f
The harness also writes `hardware.txt` before any server starts, including
`DRY_RUN=1`, so every new snapshot records the GPU model, driver, compute
capability when exposed by `nvidia-smi`, and a conservative hardware class.
Full runs also write `gate_summary.tsv` after the post gate, summarizing pre/post
MoE md5, dense md5, and backend-op checks; use
`paged-current-serving-snapshot.sh --summarize-gates ART` to backfill or audit an
existing snapshot without starting servers.
### 3.4 THE DECODE-PROFILING RULE (this trap caused 4 wrong analyses)
Decode runs as a **replayed CUDA graph**. `nsys` **without** `--cuda-graph-trace=node` collapses each graph replay into ONE opaque launch, so every per-kernel attribution becomes an artifact. This is exactly what made the old "paged 159 us/tok, GPU ~16% busy, host-bound, 5.4x more GPU-efficient" story wrong, and produced the wrong ~56% headline.
@@ -332,6 +336,12 @@ hardware report. DGX dry run passed at
artifacts self-describing: GB10/workstation Blackwell results must not be used
as datacenter-Blackwell parity evidence.
Phase 25 extended the same harness to write `gate_summary.tsv`. The summary was
backfilled on the Phase 20 artifact at
`/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv`;
it records pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806` as `ok`.
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -396,6 +406,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase20_current_snapshot/20260701_050621` - current clean-stack paged-vs-vLLM MoE serving snapshot.
- `~/bench/phase21_harness_dryrun/20260701_051757` - current snapshot harness dry-run artifact.
- `~/bench/phase24_hardware_report_dryrun/20260701_052741` - current snapshot harness dry run proving `hardware.txt` captures the DGX as `hardware_class=gb10_or_workstation_blackwell`.
- `~/bench/phase25_gate_summary_dryrun/20260701_053353` - dry run after adding `gate_summary.tsv` support; normal dry-run still writes `hardware.txt` and does not emit a gate summary before gates exist.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.

View File

@@ -676,6 +676,21 @@ with `GPU 0: NVIDIA GB10`, driver `580.159.03`, and compute capability `12.1`.
Use `hardware.txt` when comparing future snapshots. GB10/workstation Blackwell
results do not establish datacenter-Blackwell parity.
### Phase 25 snapshot gate summary
Phase 25 extended `paged-current-serving-snapshot.sh` to write
`gate_summary.tsv` after the post gate in full runs. It also added
`--summarize-gates ART` for auditing existing artifacts without launching
servers.
The Phase 20 artifact was backfilled at
`/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv`.
It records pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806` as `ok`.
Use `hardware.txt` plus `gate_summary.tsv` as the quick audit surface before
accepting any new parity snapshot.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update

View File

@@ -3,7 +3,7 @@ set -euo pipefail
usage() {
cat <<'EOF'
Usage: paged-current-serving-snapshot.sh
Usage: paged-current-serving-snapshot.sh [--summarize-gates ART]
Run a current-stack paged llama.cpp vs vLLM MoE serving snapshot on DGX.
@@ -30,13 +30,32 @@ Environment overrides:
VLLM_BIN vLLM executable (default: ~/vllm-bench/bin/vllm)
SKIP_GATES=1 to skip pre/post paged inference gates
DRY_RUN=1 validate inputs/preflight, write hardware.txt, and print commands without running servers
Options:
--summarize-gates ART write ART/gate_summary.tsv from existing gate_pre/gate_post artifacts
EOF
}
if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
SUMMARY_GATES_ART=""
case "${1:-}" in
  -h|--help)
    usage
    exit 0
fi
    ;;
  --summarize-gates)
    if [[ -z "${2:-}" ]]; then
      usage >&2
      exit 2
    fi
    SUMMARY_GATES_ART="$2"
    ;;
  "")
    ;;
  *)
    usage >&2
    exit 2
    ;;
esac
SRC=${SRC:-"$HOME/llama-phase6-source"}
BIN=${BIN:-"$SRC/build-cuda/bin"}
@@ -56,6 +75,8 @@ VLLM_PORT=${VLLM_PORT:-8000}
VLLM_BIN=${VLLM_BIN:-"$HOME/vllm-bench/bin/vllm"}
SKIP_GATES=${SKIP_GATES:-0}
DRY_RUN=${DRY_RUN:-0}
MOE_MD5_EXPECTED=8cb0ce23777bf55f92f63d0292c756b0
DENSE_MD5_EXPECTED=5951a5b4d624ce891e22ab5fca9bc439
LOCK_DIR="$HOME/gpu_bench_lock"
OWNER="$LOCK_DIR/owner"
@@ -271,6 +292,73 @@ for n in sorted({row[1] for row in rows}):
PY
}
write_gate_summary() {
python3 - "$ART" "$MOE_MD5_EXPECTED" "$DENSE_MD5_EXPECTED" <<'PY' | tee "$ART/gate_summary.tsv"
import re
import sys
from pathlib import Path
art = Path(sys.argv[1])
expected = {
    "moe": sys.argv[2],
    "dense": sys.argv[3],
}
ansi = re.compile(r"\x1b\[[0-9;]*m")
bad = False
print("phase\tcheck\tstatus\tactual\texpected\tdetails")
for phase in ("pre", "post"):
    gate_dir = art / f"gate_{phase}"
    if not gate_dir.exists():
        print(f"{phase}\tall\tskipped\t\t\t{gate_dir} missing")
        continue
    for name, want in expected.items():
        md5_path = gate_dir / f"{name}.md5"
        if not md5_path.exists():
            print(f"{phase}\t{name}_md5\tmissing\t\t{want}\t{md5_path} missing")
            bad = True
            continue
        got = md5_path.read_text().split()[0]
        status = "ok" if got == want else "mismatch"
        if status != "ok":
            bad = True
        print(f"{phase}\t{name}_md5\t{status}\t{got}\t{want}\t{md5_path}")
    op_paths = sorted(gate_dir.glob("op_*.txt"))
    if not op_paths:
        print(f"{phase}\top\tmissing\t\t\tno op_*.txt files")
        bad = True
        continue
    for path in op_paths:
        op = path.stem.removeprefix("op_")
        text = ansi.sub("", path.read_text(errors="replace"))
        passed = re.search(r"(\d+)/(\d+) tests passed", text)
        backend_ok = re.search(r"Backend CUDA0:\s+OK", text)
        if passed:
            actual = f"{passed.group(1)}/{passed.group(2)}"
            status = "ok" if passed.group(1) == passed.group(2) and backend_ok else "fail"
        else:
            actual = ""
            status = "missing"
        if status != "ok":
            bad = True
        print(f"{phase}\top_{op}\t{status}\t{actual}\tall\t{path}")
if bad:
    sys.exit(6)
PY
}
if [[ -n "$SUMMARY_GATES_ART" ]]; then
  ART="$SUMMARY_GATES_ART"
  require_path "$ART"
  write_gate_summary
  exit 0
fi
require_path "$SRC"
require_path "$BIN/llama-server"
require_path "$BIN/llama-completion"
@@ -306,5 +394,6 @@ run_vllm
release_lock
trap - EXIT
run_gate post
write_gate_summary
write_summary
log "artifacts: $ART"

View File

@@ -0,0 +1,121 @@
# Snapshot Gate Summary Phase 25 Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use
> superpowers:verification-before-completion before recording the phase result.
> Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** make current-stack paged-vs-vLLM serving artifacts prove that
inference md5/op gates stayed green without requiring a full log read.
**Architecture:** extend the existing current serving snapshot harness with a
compact gate-summary writer. Keep it additive and outside llama.cpp source: no
patch-series change and no inference behavior change.
**Tech Stack:** Bash, Python stdlib, existing `paged-inference-gates.sh`
artifacts.
---
## Task 1: Red Check
- [x] **Step 1: Prove Phase 20 lacks compact gate proof**
Command:
```bash
ssh dgx.casa 'test -e ~/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv'
```
Result:
- exited `1` before the patch, while `gate_pre/`, `gate_post/`, and full gate
logs existed.
## Task 2: Add Gate Summary
- [x] **Step 1: Extend `paged-current-serving-snapshot.sh`**
File:
- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
Behavior:
- writes `$ART/gate_summary.tsv` after the post gate in a full serving run;
- records pre/post MoE md5, dense md5, and backend op status;
- compares MoE against `8cb0ce23777bf55f92f63d0292c756b0`;
- compares dense against `5951a5b4d624ce891e22ab5fca9bc439`;
- parses op pass counts such as `806/806 tests passed`;
- exits non-zero if an existing gate artifact is missing, mismatched, or not
fully passing;
- supports `--summarize-gates ART` to audit existing artifacts without running
servers.
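
To make the op-status rule concrete, here is a small standalone sketch of how one op log maps to a status. It reuses the same regular expressions as the summary writer, but the function name `classify_op_log` and the sample log text in the usage note are illustrative assumptions, not captured backend output.

```python
import re

# Same patterns the summary writer uses: strip ANSI color codes, then look for
# the pass count and the backend verdict.
ANSI = re.compile(r"\x1b\[[0-9;]*m")
PASSED = re.compile(r"(\d+)/(\d+) tests passed")
BACKEND_OK = re.compile(r"Backend CUDA0:\s+OK")


def classify_op_log(text):
    """Return (status, actual) for the text of one op_*.txt gate log."""
    clean = ANSI.sub("", text)
    passed = PASSED.search(clean)
    if not passed:
        # No pass count at all: the log is treated as missing evidence.
        return "missing", ""
    actual = f"{passed.group(1)}/{passed.group(2)}"
    fully_passed = passed.group(1) == passed.group(2)
    backend_ok = BACKEND_OK.search(clean) is not None
    # Both conditions are required: every test passed AND the backend said OK.
    return ("ok" if fully_passed and backend_ok else "fail"), actual
```

For example, a hypothetical log containing `806/806 tests passed` and a colorized `Backend CUDA0: OK` line classifies as `("ok", "806/806")`, while the same pass count without the backend line classifies as `fail`.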
## Task 3: Verify
- [x] **Step 1: Local syntax/help checks**
Commands:
```bash
bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help
```
Result:
- both passed.
- [x] **Step 2: Backfill Phase 20 gate summary**
Command:
```bash
/tmp/paged-current-serving-snapshot.sh \
--summarize-gates ~/bench/phase20_current_snapshot/20260701_050621
```
Result:
- wrote `/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv`;
- pre/post MoE md5 rows were `ok`;
- pre/post dense md5 rows were `ok`;
- pre/post `MUL_MAT_ID` rows were `ok` with `806/806`.
- [x] **Step 3: DGX dry run**
Command:
```bash
DRY_RUN=1 ART=~/bench/phase25_gate_summary_dryrun/20260701_053353 \
/tmp/paged-current-serving-snapshot.sh
```
Result:
- preflight verified `docker=0`, `local_ai_worker=0`, `compute=0`;
- `hardware.txt` was still written;
- no paged or vLLM server launched;
- no `gate_summary.tsv` was written before gates existed.
Artifact:
- `/home/mudler/bench/phase25_gate_summary_dryrun/20260701_053353`
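
The dry-run expectations above (a `hardware.txt` is present, no `gate_summary.tsv` yet) can be checked mechanically. The following is a minimal sketch under that assumption; `dry_run_artifact_ok` is a hypothetical helper, not something the harness provides.

```python
from pathlib import Path


def dry_run_artifact_ok(art_dir):
    """Check the two file-level invariants of a DRY_RUN=1 artifact directory.

    A dry run should have written hardware.txt and must not have written
    gate_summary.tsv, because no gates ran.
    """
    art = Path(art_dir)
    has_hardware = (art / "hardware.txt").is_file()
    has_summary = (art / "gate_summary.tsv").exists()
    return has_hardware and not has_summary
```

This only inspects files on disk; it does not verify that no server was launched, which still has to come from the harness output.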
## Task 4: Record Result
- [x] **Step 1: Update parity docs**
Updated files:
- `backend/cpp/llama-cpp-localai-paged/README.md`
- `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
- `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
## Self-Review
- No llama.cpp source behavior changed.
- Future full snapshots now contain compact proof of pre/post md5 and op gates.
- The summary-only mode lets old artifacts be audited without consuming GPU
benchmark time.