From a0194125f5a9c19714418267ea6dc64877286247 Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Wed, 1 Jul 2026 03:35:54 +0000 Subject: [PATCH] chore(paged): summarize snapshot inference gates Emit a compact gate_summary.tsv from current serving snapshots so each artifact records the pre/post MoE md5, dense md5, and backend op checks. Add a summary-only mode for auditing existing artifacts and document the Phase 25 backfill on the Phase 20 snapshot. Assisted-by: Codex:gpt-5 --- backend/cpp/llama-cpp-localai-paged/README.md | 5 +- .../docs/GB10_PARITY_PHASE0_RESULTS.md | 57 +++++++++ .../docs/PARITY_HANDOFF.md | 11 ++ .../docs/VLLM_PARITY_LEVER_MAP.md | 15 +++ .../paged-current-serving-snapshot.sh | 95 +++++++++++++- ...026-07-01-snapshot-gate-summary-phase25.md | 121 ++++++++++++++++++ 6 files changed, 300 insertions(+), 4 deletions(-) create mode 100644 docs/superpowers/plans/2026-07-01-snapshot-gate-summary-phase25.md diff --git a/backend/cpp/llama-cpp-localai-paged/README.md b/backend/cpp/llama-cpp-localai-paged/README.md index 838ac17a2..6755f001f 100644 --- a/backend/cpp/llama-cpp-localai-paged/README.md +++ b/backend/cpp/llama-cpp-localai-paged/README.md @@ -618,9 +618,12 @@ DGX mirror `f2521ab12`, artifact Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving snapshots. It targets the clean `~/llama-phase6-source` mirror, checks docker/`local-ai-worker`/GPU-idle state, uses the owner-file lock, runs pre/post -inference gates, writes `hardware.txt`, and emits paged/vLLM ratios. +inference gates, writes `hardware.txt`, emits `gate_summary.tsv`, and emits +paged/vLLM ratios. `hardware.txt` records the GPU identity and hardware class so GB10/workstation Blackwell evidence is not confused with a future datacenter-Blackwell rerun. +`gate_summary.tsv` records pre/post MoE md5, dense md5, and backend-op checks +so an artifact proves inferencing gates without reading full logs. 
Do not use the stale DGX `~/bench/combined_definitive.sh` without first porting it to the current mirror and lock discipline. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md index 58761a7cd..8a777ad92 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md @@ -1514,3 +1514,60 @@ Decision: GB10-to-datacenter generalization. - The Phase 20 GB10 closure still applies to `gb10_or_workstation_blackwell`; datacenter Blackwell needs a fresh run of the same methodology. + +## Phase 25 Snapshot Gate Summary + +Phase 25 made current-stack serving artifacts self-auditing for the inference +gates that protect the paged path. + +Script change: + +- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` now + writes `gate_summary.tsv` after the post gate in a full run. +- The script also supports `--summarize-gates ART` to generate the same summary + from existing `gate_pre/` and `gate_post/` artifacts without launching + servers. + +Recorded rows: + +- pre/post MoE transcript md5 versus + `8cb0ce23777bf55f92f63d0292c756b0`; +- pre/post dense transcript md5 versus + `5951a5b4d624ce891e22ab5fca9bc439`; +- pre/post backend op rows, currently `MUL_MAT_ID`, with the parsed passed/total + count. + +Verification: + +- Red check: Phase 20 initially had gate artifacts but no `gate_summary.tsv`. +- local `bash -n` passed; +- local `--help` passed; +- DGX `--summarize-gates` against Phase 20 wrote six green rows; +- DGX `DRY_RUN=1` validated the normal path still preflights and writes + `hardware.txt` without launching servers or writing a gate summary before + gates exist. 
+ +Artifacts: + +- Backfilled summary: + `/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv` +- Dry run: + `/home/mudler/bench/phase25_gate_summary_dryrun/20260701_053353` + +Backfilled Phase 20 gate summary: + +```text +pre moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0 +pre dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439 +pre op_MUL_MAT_ID ok 806/806 +post moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0 +post dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439 +post op_MUL_MAT_ID ok 806/806 +``` + +Decision: + +- Future full serving snapshots carry compact proof that inference md5/op gates + stayed green before and after the paged-vs-vLLM run. +- Treat `gate_summary.tsv` plus `hardware.txt` as the quick audit surface before + accepting a parity snapshot. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index 5c756a99f..cda3918bd 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -132,6 +132,10 @@ python3 /home/mudler/bench/h2h_cli3.py # OpenAI /v1/completions, ignore_eos, f The harness also writes `hardware.txt` before any server starts, including `DRY_RUN=1`, so every new snapshot records the GPU model, driver, compute capability when exposed by `nvidia-smi`, and a conservative hardware class. +Full runs also write `gate_summary.tsv` after the post gate, summarizing pre/post +MoE md5, dense md5, and backend-op checks; use +`paged-current-serving-snapshot.sh --summarize-gates ART` to backfill or audit an +existing snapshot without starting servers. ### 3.4 THE DECODE-PROFILING RULE (this trap caused 4 wrong analyses) Decode runs as a **replayed CUDA graph**. `nsys` **without** `--cuda-graph-trace=node` collapses each graph replay into ONE opaque launch, so every per-kernel attribution becomes an artifact. 
This is exactly what made the old "paged 159 us/tok, GPU ~16% busy, host-bound, 5.4x more GPU-efficient" story wrong, and produced the wrong ~56% headline. @@ -332,6 +336,12 @@ hardware report. DGX dry run passed at artifacts self-describing: GB10/workstation Blackwell results must not be used as datacenter-Blackwell parity evidence. +Phase 25 extended the same harness to write `gate_summary.tsv`. The summary was +backfilled on the Phase 20 artifact at +`/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv`; +it records pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806` as `ok`. + --- ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes) @@ -396,6 +406,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual - `~/bench/phase20_current_snapshot/20260701_050621` - current clean-stack paged-vs-vLLM MoE serving snapshot. - `~/bench/phase21_harness_dryrun/20260701_051757` - current snapshot harness dry-run artifact. - `~/bench/phase24_hardware_report_dryrun/20260701_052741` - current snapshot harness dry run proving `hardware.txt` captures the DGX as `hardware_class=gb10_or_workstation_blackwell`. +- `~/bench/phase25_gate_summary_dryrun/20260701_053353` - dry run after adding `gate_summary.tsv` support; normal dry-run still writes `hardware.txt` and does not emit a gate summary before gates exist. - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`. - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30. - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`. 
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md index 2b3a0e176..552ef6c0b 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md @@ -676,6 +676,21 @@ with `GPU 0: NVIDIA GB10`, driver `580.159.03`, and compute capability `12.1`. Use `hardware.txt` when comparing future snapshots. GB10/workstation Blackwell results do not establish datacenter-Blackwell parity. +### Phase 25 snapshot gate summary + +Phase 25 extended `paged-current-serving-snapshot.sh` to write +`gate_summary.tsv` after the post gate in full runs. It also added +`--summarize-gates ART` for auditing existing artifacts without launching +servers. + +The Phase 20 artifact was backfilled at +`/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv`. +It records pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806` as `ok`. + +Use `hardware.txt` plus `gate_summary.tsv` as the quick audit surface before +accepting any new parity snapshot. + Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`. 
### Phase 10 GDN C32 slab update diff --git a/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh index 9ed6277c1..af1a7aac1 100755 --- a/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +++ b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh @@ -3,7 +3,7 @@ set -euo pipefail usage() { cat <<'EOF' -Usage: paged-current-serving-snapshot.sh +Usage: paged-current-serving-snapshot.sh [--summarize-gates ART] Run a current-stack paged llama.cpp vs vLLM MoE serving snapshot on DGX. @@ -30,13 +30,32 @@ Environment overrides: VLLM_BIN vLLM executable (default: ~/vllm-bench/bin/vllm) SKIP_GATES=1 to skip pre/post paged inference gates DRY_RUN=1 validate inputs/preflight, write hardware.txt, and print commands without running servers + +Options: + --summarize-gates ART write ART/gate_summary.tsv from existing gate_pre/gate_post artifacts EOF } -if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then +SUMMARY_GATES_ART="" +case "${1:-}" in + -h|--help) usage exit 0 -fi + ;; + --summarize-gates) + if [[ -z "${2:-}" ]]; then + usage >&2 + exit 2 + fi + SUMMARY_GATES_ART="$2" + ;; + "") + ;; + *) + usage >&2 + exit 2 + ;; +esac SRC=${SRC:-"$HOME/llama-phase6-source"} BIN=${BIN:-"$SRC/build-cuda/bin"} @@ -56,6 +75,8 @@ VLLM_PORT=${VLLM_PORT:-8000} VLLM_BIN=${VLLM_BIN:-"$HOME/vllm-bench/bin/vllm"} SKIP_GATES=${SKIP_GATES:-0} DRY_RUN=${DRY_RUN:-0} +MOE_MD5_EXPECTED=8cb0ce23777bf55f92f63d0292c756b0 +DENSE_MD5_EXPECTED=5951a5b4d624ce891e22ab5fca9bc439 LOCK_DIR="$HOME/gpu_bench_lock" OWNER="$LOCK_DIR/owner" @@ -271,6 +292,73 @@ for n in sorted({row[1] for row in rows}): PY } +write_gate_summary() { + python3 - "$ART" "$MOE_MD5_EXPECTED" "$DENSE_MD5_EXPECTED" <<'PY' | tee "$ART/gate_summary.tsv" +import re +import sys +from pathlib import Path + +art = Path(sys.argv[1]) +expected = { + "moe": sys.argv[2], + "dense": sys.argv[3], +} +ansi = 
re.compile(r"\x1b\[[0-9;]*m") +bad = False + +print("phase\tcheck\tstatus\tactual\texpected\tdetails") + +for phase in ("pre", "post"): + gate_dir = art / f"gate_{phase}" + if not gate_dir.exists(): + print(f"{phase}\tall\tskipped\t\t\t{gate_dir} missing") + continue + + for name, want in expected.items(): + md5_path = gate_dir / f"{name}.md5" + if not md5_path.exists(): + print(f"{phase}\t{name}_md5\tmissing\t\t{want}\t{md5_path} missing") + bad = True + continue + got = md5_path.read_text().split()[0] + status = "ok" if got == want else "mismatch" + if status != "ok": + bad = True + print(f"{phase}\t{name}_md5\t{status}\t{got}\t{want}\t{md5_path}") + + op_paths = sorted(gate_dir.glob("op_*.txt")) + if not op_paths: + print(f"{phase}\top\tmissing\t\t\tno op_*.txt files") + bad = True + continue + + for path in op_paths: + op = path.stem.removeprefix("op_") + text = ansi.sub("", path.read_text(errors="replace")) + passed = re.search(r"(\d+)/(\d+) tests passed", text) + backend_ok = re.search(r"Backend CUDA0:\s+OK", text) + if passed: + actual = f"{passed.group(1)}/{passed.group(2)}" + status = "ok" if passed.group(1) == passed.group(2) and backend_ok else "fail" + else: + actual = "" + status = "missing" + if status != "ok": + bad = True + print(f"{phase}\top_{op}\t{status}\t{actual}\tall\t{path}") + +if bad: + sys.exit(6) +PY +} + +if [[ -n "$SUMMARY_GATES_ART" ]]; then + ART="$SUMMARY_GATES_ART" + require_path "$ART" + write_gate_summary + exit 0 +fi + require_path "$SRC" require_path "$BIN/llama-server" require_path "$BIN/llama-completion" @@ -306,5 +394,6 @@ run_vllm release_lock trap - EXIT run_gate post +write_gate_summary write_summary log "artifacts: $ART" diff --git a/docs/superpowers/plans/2026-07-01-snapshot-gate-summary-phase25.md b/docs/superpowers/plans/2026-07-01-snapshot-gate-summary-phase25.md new file mode 100644 index 000000000..86b937071 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-snapshot-gate-summary-phase25.md @@ -0,0 +1,121 @@ +# 
Snapshot Gate Summary Phase 25 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use +> superpowers:verification-before-completion before recording the phase result. +> Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** make current-stack paged-vs-vLLM serving artifacts prove that +inference md5/op gates stayed green without requiring a full log read. + +**Architecture:** extend the existing current serving snapshot harness with a +compact gate-summary writer. Keep it additive and outside llama.cpp source: no +patch-series change and no inference behavior change. + +**Tech Stack:** Bash, Python stdlib, existing `paged-inference-gates.sh` +artifacts. + +--- + +## Task 1: Red Check + +- [x] **Step 1: Prove Phase 20 lacks compact gate proof** + + Command: + + ```bash + ssh dgx.casa 'test -e ~/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv' + ``` + + Result: + + - exited `1` before the patch, while `gate_pre/`, `gate_post/`, and full gate + logs existed. + +## Task 2: Add Gate Summary + +- [x] **Step 1: Extend `paged-current-serving-snapshot.sh`** + + File: + + - `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + + Behavior: + + - writes `$ART/gate_summary.tsv` after the post gate in a full serving run; + - records pre/post MoE md5, dense md5, and backend op status; + - compares MoE against `8cb0ce23777bf55f92f63d0292c756b0`; + - compares dense against `5951a5b4d624ce891e22ab5fca9bc439`; + - parses op pass counts such as `806/806 tests passed`; + - exits non-zero if an existing gate artifact is missing, mismatched, or not + fully passing; + - supports `--summarize-gates ART` to audit existing artifacts without running + servers. + +## Task 3: Verify + +- [x] **Step 1: Local syntax/help checks** + + Commands: + + ```bash + bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh + backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help + ``` + + Result: + + - both passed. 
+ +- [x] **Step 2: Backfill Phase 20 gate summary** + + Command: + + ```bash + /tmp/paged-current-serving-snapshot.sh \ + --summarize-gates ~/bench/phase20_current_snapshot/20260701_050621 + ``` + + Result: + + - wrote `/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv`; + - pre/post MoE md5 rows were `ok`; + - pre/post dense md5 rows were `ok`; + - pre/post `MUL_MAT_ID` rows were `ok` with `806/806`. + +- [x] **Step 3: DGX dry run** + + Command: + + ```bash + DRY_RUN=1 ART=~/bench/phase25_gate_summary_dryrun/20260701_053353 \ + /tmp/paged-current-serving-snapshot.sh + ``` + + Result: + + - preflight verified `docker=0`, `local_ai_worker=0`, `compute=0`; + - `hardware.txt` was still written; + - no paged or vLLM server launched; + - no `gate_summary.tsv` was written before gates existed. + + Artifact: + + - `/home/mudler/bench/phase25_gate_summary_dryrun/20260701_053353` + +## Task 4: Record Result + +- [x] **Step 1: Update parity docs** + + Updated files: + + - `backend/cpp/llama-cpp-localai-paged/README.md` + - `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` + - `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + - `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` + +## Self-Review + +- No llama.cpp source behavior changed. +- Future full snapshots now contain compact proof of pre/post md5 and op gates. +- The summary-only mode lets old artifacts be audited without consuming GPU + benchmark time.
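A reviewer who wants to consume `gate_summary.tsv` programmatically, rather than eyeballing it, could start from something like the sketch below. It assumes the six-column header that `write_gate_summary` prints (`phase`, `check`, `status`, `actual`, `expected`, `details`). The helper names `failing_rows` and `audit`, the sample `details` paths, and the rule that both `pre` and `post` rows must be present are illustrative assumptions, not part of the patch. Note that this check is stricter than the script itself, which reports a missing gate directory as `skipped` without failing.

```python
import csv
import io

# Column layout taken from the header row that write_gate_summary prints.
COLUMNS = ["phase", "check", "status", "actual", "expected", "details"]


def failing_rows(tsv_text):
    """Return every summary row whose status is not "ok"."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return [row for row in reader if row["status"] != "ok"]


def audit(tsv_text):
    """Hypothetical acceptance rule: all rows ok and both phases present."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    phases = {row["phase"] for row in rows}
    return not failing_rows(tsv_text) and {"pre", "post"} <= phases


# Sample rows mirroring the backfilled Phase 20 summary; the details
# column values here are illustrative relative paths.
MOE = "8cb0ce23777bf55f92f63d0292c756b0"
DENSE = "5951a5b4d624ce891e22ab5fca9bc439"
SAMPLE = "\n".join(
    ["\t".join(COLUMNS)]
    + [
        line
        for phase in ("pre", "post")
        for line in (
            f"{phase}\tmoe_md5\tok\t{MOE}\t{MOE}\tgate_{phase}/moe.md5",
            f"{phase}\tdense_md5\tok\t{DENSE}\t{DENSE}\tgate_{phase}/dense.md5",
            f"{phase}\top_MUL_MAT_ID\tok\t806/806\tall\tgate_{phase}/op_MUL_MAT_ID.txt",
        )
    ]
) + "\n"

if __name__ == "__main__":
    print("snapshot gates green:", audit(SAMPLE))
    for row in failing_rows(SAMPLE):
        print("failing:", row["phase"], row["check"], row["status"])
```

In practice the text would come from reading `ART/gate_summary.tsv`. Treating `skipped` as a failure is a deliberate choice for this sketch: a parity snapshot that ran with `SKIP_GATES=1` should not be mistaken for one whose gates stayed green.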