chore(paged): add current serving snapshot harness

Add a reusable current-stack paged-vs-vLLM serving snapshot harness that
targets the clean DGX mirror, enforces idle/lock preflight, runs pre/post
inference gates, and records ratio summaries.

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 03:19:36 +00:00
parent c99678da42
commit ff3f0620de
6 changed files with 446 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -1405,3 +1405,39 @@ Decision:
 - Keep MTP scheduler work closed. The next credible parity path is either a
  datacenter-Blackwell rerun or a larger fused-kernel project outside the
  low-conflict GB10 patch stack.
+
+## Phase 21 Current-Stack Serving Harness
+
+Phase 21 made the Phase 20 current-stack serving snapshot repeatable from the
+LocalAI backend tree.
+
+New script:
+
+- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
+
+Purpose:
+
+- targets the clean `~/llama-phase6-source` mirror by default;
+- rejects busy docker, `local-ai-worker`, GPU compute, or owned GPU-lock state;
+- builds the current llama.cpp targets;
+- runs pre/post `paged-inference-gates.sh`;
+- runs paged and vLLM serving arms with the same h2h client;
+- writes paged/vLLM ratio summaries.
+
+Verification:
+
+- local `bash -n` passed;
+- local `--help` passed;
+- DGX `DRY_RUN=1` validated required paths and preflight without launching
+  servers.
+
+Dry-run artifact:
+
+- `/home/mudler/bench/phase21_harness_dryrun/20260701_051757`
+
+Decision:
+
+- Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving
+  snapshots.
+- Do not use the stale DGX `~/bench/combined_definitive.sh` without porting it
+  to `~/llama-phase6-source` and the owner-file lock discipline.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -304,6 +304,17 @@ This keeps the GB10 shortcut closure intact: do not reopen MTP or small
 scheduler work. The credible next parity path is a datacenter-Blackwell rerun or
 a larger fused-kernel project outside this low-conflict patch stack.

+Phase 21 added a reusable current-stack serving harness:
+`backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`.
+It defaults to `~/llama-phase6-source`, validates docker/`local-ai-worker`/GPU
+idle state, uses the owner-file lock, runs pre/post inference gates, compares
+paged and vLLM with h2h, and writes ratio summaries. The DGX dry run passed and
+wrote `/home/mudler/bench/phase21_harness_dryrun/20260701_051757`.
+
+Use this harness for future current-stack GB10 snapshots. Do not reuse
+`~/bench/combined_definitive.sh` unless it is first ported away from stale
+`~/llama-paged-dev` paths and old lock assumptions.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -644,6 +644,23 @@ credible parity path is not another MTP/scheduler shortcut; it is either the
 documented datacenter-Blackwell rerun or a larger fused-kernel project outside
 the low-conflict GB10 patch stack.

+### Phase 21 current-stack harness
+
+Phase 21 added `paged-current-serving-snapshot.sh` so Phase 20 can be repeated
+without the stale DGX `combined_definitive.sh` assumptions. The script defaults
+to `~/llama-phase6-source`, enforces docker/`local-ai-worker`/GPU-idle preflight,
+uses the owner-file lock, runs the pre/post md5/op inference gates, runs paged
+and vLLM in the same session, and emits ratio rows in `summary.tsv`.
+
+Verification:
+
+- local `bash -n` and `--help` passed;
+- DGX `DRY_RUN=1` passed and wrote
+  `/home/mudler/bench/phase21_harness_dryrun/20260701_051757`.
+
+Use this harness for future current-stack GB10 snapshots before making parity
+claims.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update