mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
chore(paged): add current serving snapshot harness
Add a reusable current-stack paged-vs-vLLM serving snapshot harness that targets the clean DGX mirror, enforces idle/lock preflight, runs pre/post inference gates, and records ratio summaries.

Assisted-by: Codex:gpt-5
@@ -1405,3 +1405,39 @@ Decision:
- Keep MTP scheduler work closed. The next credible parity path is either a
  datacenter-Blackwell rerun or a larger fused-kernel project outside the
  low-conflict GB10 patch stack.

## Phase 21 Current-Stack Serving Harness

Phase 21 made the Phase 20 current-stack serving snapshot repeatable from the
LocalAI backend tree.

New script:

- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`

Purpose:

- targets the clean `~/llama-phase6-source` mirror by default;
- rejects busy docker, `local-ai-worker`, GPU compute, or owned GPU-lock state;
- builds the current llama.cpp targets;
- runs pre/post `paged-inference-gates.sh`;
- runs paged and vLLM serving arms with the same h2h client;
- writes paged/vLLM ratio summaries.
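
The idle preflight in the list above could look roughly like the following. This is a minimal sketch, not the code in `paged-current-serving-snapshot.sh`: the function names `require_empty` and `preflight_idle` are hypothetical, and it assumes that "busy docker" means any running container and that `local-ai-worker` is a process name.

```shell
# Illustrative preflight sketch (bash). Function names are hypothetical and
# do not come from paged-current-serving-snapshot.sh.

# Return 0 when a captured listing is empty (nothing running). Otherwise
# report the busy resource on stderr and return 1.
require_empty() {
    local label="$1" listing="$2"
    if [ -n "$listing" ]; then
        printf 'preflight: %s is busy:\n%s\n' "$label" "$listing" >&2
        return 1
    fi
    return 0
}

# Run every check before deciding, so one pass reports all busy resources.
# A missing tool yields an empty listing and counts as idle here; a stricter
# harness might reject that case instead.
preflight_idle() {
    local failed=0 out
    # Assumption: any running container counts as "busy docker".
    out="$(docker ps -q 2>/dev/null || true)"
    require_empty "docker" "$out" || failed=1
    # Assumption: local-ai-worker is a process name, matched exactly.
    out="$(pgrep -x local-ai-worker 2>/dev/null || true)"
    require_empty "local-ai-worker" "$out" || failed=1
    # Lists one line per process that currently holds a GPU compute context.
    out="$(nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader 2>/dev/null || true)"
    require_empty "GPU compute" "$out" || failed=1
    return "$failed"
}
```

A caller would typically gate the whole run on it, for example `preflight_idle || exit 1`, before building or launching any server.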

Verification:

- local `bash -n` passed;
- local `--help` passed;
- DGX `DRY_RUN=1` validated required paths and preflight without launching
  servers.

Dry-run artifact:

- `/home/mudler/bench/phase21_harness_dryrun/20260701_051757`

Decision:

- Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving
  snapshots.
- Do not use stale DGX `~/bench/combined_definitive.sh` without porting it to
  `~/llama-phase6-source` and the owner-file lock discipline.
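
The notes do not spell out the owner-file lock discipline. One common way to implement such a lock is sketched below, under the assumption that the lock is a file whose content names its owner and that only that owner may remove it. The function names and the lock path in the usage note are hypothetical.

```shell
# Illustrative owner-file lock sketch (bash). Names are hypothetical; the
# real harness may implement its lock differently.

# acquire_lock LOCKFILE OWNER
# Create LOCKFILE containing OWNER. With noclobber set, '>' refuses to
# overwrite an existing file, so two runs cannot both take the lock.
acquire_lock() {
    local lockfile="$1" owner="$2"
    if ( set -o noclobber; printf '%s\n' "$owner" > "$lockfile" ) 2>/dev/null; then
        return 0
    fi
    printf 'lock %s is held by: %s\n' "$lockfile" "$(cat "$lockfile" 2>/dev/null)" >&2
    return 1
}

# release_lock LOCKFILE OWNER
# Remove LOCKFILE only when it still names OWNER, so one run can never
# delete a lock that belongs to another run.
release_lock() {
    local lockfile="$1" owner="$2"
    [ "$(cat "$lockfile" 2>/dev/null)" = "$owner" ] && rm -f "$lockfile"
}
```

A run might call `acquire_lock /tmp/gpu.lock "snapshot-$$" || exit 1` at start and register `release_lock` with the same arguments in an `EXIT` trap, so the lock is dropped even when a gate fails.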
@@ -304,6 +304,17 @@ This keeps the GB10 shortcut closure intact: do not reopen MTP or small
scheduler work. The credible next parity path is a datacenter-Blackwell rerun or
a larger fused-kernel project outside this low-conflict patch stack.

Phase 21 added a reusable current-stack serving harness:
`backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`.
It defaults to `~/llama-phase6-source`, validates docker/`local-ai-worker`/GPU
idle state, uses the owner-file lock, runs pre/post inference gates, compares
paged and vLLM with h2h, and writes ratio summaries. DGX dry run passed at
`/home/mudler/bench/phase21_harness_dryrun/20260701_051757`.

Use this harness for future current-stack GB10 snapshots. Do not reuse
`~/bench/combined_definitive.sh` unless it is first ported away from stale
`~/llama-paged-dev` paths and old lock assumptions.

---

## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -644,6 +644,23 @@ credible parity path is not another MTP/scheduler shortcut; it is either the
documented datacenter-Blackwell rerun or a larger fused-kernel project outside
the low-conflict GB10 patch stack.

### Phase 21 current-stack harness

Phase 21 added `paged-current-serving-snapshot.sh` so Phase 20 can be repeated
without the stale DGX `combined_definitive.sh` assumptions. The script defaults
to `~/llama-phase6-source`, enforces docker/`local-ai-worker`/GPU-idle preflight,
uses the owner-file lock, runs pre/post md5/op gates, runs paged and vLLM in the
same session, and emits ratio rows in `summary.tsv`.
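
The layout of `summary.tsv` is not shown here. As a rough sketch of how a ratio row can be derived, assume a hypothetical three-column input (`metric`, `paged`, `vllm`) with a header line; the column names, the `ratio_rows` helper, and the sample numbers in the usage note are illustrative placeholders, not measured results.

```shell
# Illustrative ratio sketch (bash + awk). The column layout is an assumption
# and may not match the real summary.tsv.

# ratio_rows: read "metric<TAB>paged<TAB>vllm" rows on stdin and append a
# paged/vLLM ratio column to each row.
ratio_rows() {
    awk -F'\t' -v OFS='\t' '
        # Header row: pass through and name the new column.
        NR == 1 { print $1, $2, $3, "paged_over_vllm"; next }
        # Guard against division by zero when the vLLM value is missing or 0.
        $3 + 0 == 0 { print $1, $2, $3, "n/a"; next }
        # Data row: ratio printed with three decimals.
        { printf "%s\t%s\t%s\t%.3f\n", $1, $2, $3, $2 / $3 }
    '
}
```

For example, `printf 'metric\tpaged\tvllm\ndecode_tok_s\t50\t100\n' | ratio_rows` emits the header plus a `decode_tok_s` row ending in `0.500`. Keeping both raw values next to the ratio makes it easy to see which arm moved when a ratio changes between snapshots.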

Verification:

- local `bash -n` and `--help` passed;
- DGX `DRY_RUN=1` passed and wrote
  `/home/mudler/bench/phase21_harness_dryrun/20260701_051757`.

Use this harness for future current-stack GB10 snapshots before making parity
claims.

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update