chore(paged): add current serving snapshot harness

Add a reusable current-stack paged-vs-vLLM serving snapshot harness that targets the clean DGX mirror, enforces idle/lock preflight, runs pre/post inference gates, and records ratio summaries.

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 03:19:36 +00:00
parent c99678da42
commit ff3f0620de
6 changed files with 446 additions and 0 deletions

@@ -1405,3 +1405,39 @@ Decision:
- Keep MTP scheduler work closed. The next credible parity path is either a
datacenter-Blackwell rerun or a larger fused-kernel project outside the
low-conflict GB10 patch stack.
## Phase 21 Current-Stack Serving Harness
Phase 21 made the Phase 20 current-stack serving snapshot repeatable from the
LocalAI backend tree.
New script:
- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
Purpose:
- targets the clean `~/llama-phase6-source` mirror by default;
- refuses to run if docker, `local-ai-worker`, or GPU compute is busy, or if the GPU lock is already owned;
- builds the current llama.cpp targets;
- runs pre/post `paged-inference-gates.sh`;
- runs paged and vLLM serving arms with the same h2h client;
- writes paged/vLLM ratio summaries.
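
The idle/lock preflight described above could be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: `LOCK_FILE`, `acquire_lock`, `release_lock`, and `preflight_idle` are hypothetical names, and treating `local-ai-worker` as a process name is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an owner-file lock plus idle preflight.
# None of these names are taken from paged-current-serving-snapshot.sh.

LOCK_FILE="${LOCK_FILE:-/tmp/gpu-bench.lock}"

# Create the owner file atomically: with noclobber set, the redirect
# fails if the file already exists, so only one owner can win.
acquire_lock() {
    local owner="$1"
    ( set -o noclobber; printf '%s\n' "$owner" > "$LOCK_FILE" ) 2>/dev/null
}

# Remove the lock only when the recorded owner matches the caller.
release_lock() {
    local owner="$1"
    [ -f "$LOCK_FILE" ] && [ "$(cat "$LOCK_FILE")" = "$owner" ] && rm -f "$LOCK_FILE"
}

# Return non-zero when the machine looks busy. Each probe is skipped
# if its tool is not installed.
preflight_idle() {
    if command -v docker >/dev/null 2>&1 && [ -n "$(docker ps -q 2>/dev/null)" ]; then
        echo "busy: docker containers are running" >&2
        return 1
    fi
    # Assumption: local-ai-worker shows up as a process with this exact name.
    if command -v pgrep >/dev/null 2>&1 && pgrep -x local-ai-worker >/dev/null 2>&1; then
        echo "busy: local-ai-worker is running" >&2
        return 1
    fi
    if command -v nvidia-smi >/dev/null 2>&1 &&
       [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader 2>/dev/null)" ]; then
        echo "busy: GPU compute processes present" >&2
        return 1
    fi
    return 0
}
```

A caller would typically run `preflight_idle && acquire_lock "$$"` before launching any server and `release_lock "$$"` from an exit trap.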
Verification:
- local `bash -n` passed;
- local `--help` passed;
- DGX `DRY_RUN=1` validated required paths and preflight without launching
servers.
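
A dry-run mode of this kind is commonly implemented by validating inputs first and then printing, rather than executing, any launch command. The sketch below assumes that pattern. `require_path` and `run_or_print` are hypothetical helpers, and only the `DRY_RUN` variable name comes from the notes above.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helpers; the real harness may be structured differently.

DRY_RUN="${DRY_RUN:-0}"

# Fail with a message when a required file or directory is missing.
require_path() {
    [ -e "$1" ] || { echo "missing required path: $1" >&2; return 1; }
}

# Execute the command normally, or only print it when DRY_RUN=1.
run_or_print() {
    if [ "$DRY_RUN" = "1" ]; then
        printf 'DRY_RUN: %s\n' "$*"
    else
        "$@"
    fi
}
```

With `DRY_RUN=1`, path validation still runs for real, while anything wrapped in `run_or_print` is only echoed.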
Dry-run artifact:
- `/home/mudler/bench/phase21_harness_dryrun/20260701_051757`
Decision:
- Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving
snapshots.
- Do not use the stale DGX `~/bench/combined_definitive.sh` unless it is first
ported to `~/llama-phase6-source` and the owner-file lock discipline.

@@ -304,6 +304,17 @@ This keeps the GB10 shortcut closure intact: do not reopen MTP or small
scheduler work. The credible next parity path is a datacenter-Blackwell rerun or
a larger fused-kernel project outside this low-conflict patch stack.
Phase 21 added a reusable current-stack serving harness:
`backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`.
It defaults to `~/llama-phase6-source`, validates docker/`local-ai-worker`/GPU
idle state, uses the owner-file lock, runs pre/post inference gates, compares
paged and vLLM with h2h, and writes ratio summaries. DGX dry run passed at
`/home/mudler/bench/phase21_harness_dryrun/20260701_051757`.
Use this harness for future current-stack GB10 snapshots. Do not reuse
`~/bench/combined_definitive.sh` unless it is first ported away from stale
`~/llama-paged-dev` paths and old lock assumptions.
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)

@@ -644,6 +644,23 @@ credible parity path is not another MTP/scheduler shortcut; it is either the
documented datacenter-Blackwell rerun or a larger fused-kernel project outside
the low-conflict GB10 patch stack.
### Phase 21 current-stack harness
Phase 21 added `paged-current-serving-snapshot.sh` so Phase 20 can be repeated
without the stale DGX `combined_definitive.sh` assumptions. The script defaults
to `~/llama-phase6-source`, enforces docker/`local-ai-worker`/GPU-idle preflight,
uses the owner-file lock, runs pre/post md5/op gates, runs paged and vLLM in the
same session, and emits ratio rows in `summary.tsv`.
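
As an illustration of what a ratio row might look like, the sketch below computes paged/vLLM for a single metric and emits one tab-separated line. The column layout (metric, paged, vLLM, ratio) and the `ratio_row` helper are assumptions, not the actual `summary.tsv` schema.

```shell
#!/usr/bin/env bash
# Hypothetical ratio-row helper; the column layout is assumed for illustration.

# Print "metric<TAB>paged<TAB>vllm<TAB>ratio", with ratio = paged / vllm.
# A zero vLLM value yields "NA" instead of a division error.
ratio_row() {
    local metric="$1" paged="$2" vllm="$3"
    awk -v m="$metric" -v p="$paged" -v v="$vllm" 'BEGIN {
        if (v + 0 == 0) { printf "%s\t%s\t%s\tNA\n", m, p, v; exit }
        printf "%s\t%s\t%s\t%.3f\n", m, p, v, p / v
    }'
}
```

For example, `ratio_row decode_tps 50 100 >> summary.tsv` would append a row with a ratio of `0.500`. The metric name here is made up.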
Verification:
- local `bash -n` and `--help` passed;
- DGX `DRY_RUN=1` passed and wrote
`/home/mudler/bench/phase21_harness_dryrun/20260701_051757`.
Use this harness for future current-stack GB10 snapshots before making parity
claims.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update