mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 20:37:03 -04:00
chore(paged): summarize snapshot inference gates
Emit a compact gate_summary.tsv from current serving snapshots so each artifact records the pre/post MoE md5, dense md5, and backend op checks. Add a summary-only mode for auditing existing artifacts and document the Phase 25 backfill on the Phase 20 snapshot.

Assisted-by: Codex:gpt-5
@@ -618,9 +618,12 @@ DGX mirror `f2521ab12`, artifact
 Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving
 snapshots. It targets the clean `~/llama-phase6-source` mirror, checks
 docker/`local-ai-worker`/GPU-idle state, uses the owner-file lock, runs pre/post
-inference gates, writes `hardware.txt`, and emits paged/vLLM ratios.
+inference gates, writes `hardware.txt`, emits `gate_summary.tsv`, and emits
+paged/vLLM ratios.
 `hardware.txt` records the GPU identity and hardware class so GB10/workstation
 Blackwell evidence is not confused with a future datacenter-Blackwell rerun.
+`gate_summary.tsv` records pre/post MoE md5, dense md5, and backend-op checks
+so an artifact proves inferencing gates without reading full logs.
 Do not use the stale DGX
 `~/bench/combined_definitive.sh` without first porting it to the current mirror
 and lock discipline.

@@ -1514,3 +1514,60 @@ Decision:
   GB10-to-datacenter generalization.
 - The Phase 20 GB10 closure still applies to `gb10_or_workstation_blackwell`;
   datacenter Blackwell needs a fresh run of the same methodology.
+
+## Phase 25 Snapshot Gate Summary
+
+Phase 25 made current-stack serving artifacts self-auditing for the inference
+gates that protect the paged path.
+
+Script change:
+
+- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` now
+  writes `gate_summary.tsv` after the post gate in a full run.
+- The script also supports `--summarize-gates ART` to generate the same summary
+  from existing `gate_pre/` and `gate_post/` artifacts without launching
+  servers.
+
+Recorded rows:
+
+- pre/post MoE transcript md5 versus
+  `8cb0ce23777bf55f92f63d0292c756b0`;
+- pre/post dense transcript md5 versus
+  `5951a5b4d624ce891e22ab5fca9bc439`;
+- pre/post backend op rows, currently `MUL_MAT_ID`, with the parsed passed/total
+  count.
+
+Verification:
+
+- Red check: Phase 20 initially had gate artifacts but no `gate_summary.tsv`.
+- local `bash -n` passed;
+- local `--help` passed;
+- DGX `--summarize-gates` against Phase 20 wrote six green rows;
+- DGX `DRY_RUN=1` validated the normal path still preflights and writes
+  `hardware.txt` without launching servers or writing a gate summary before
+  gates exist.
+
+Artifacts:
+
+- Backfilled summary:
+  `/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv`
+- Dry run:
+  `/home/mudler/bench/phase25_gate_summary_dryrun/20260701_053353`
+
+Backfilled Phase 20 gate summary:
+
+```text
+pre moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0
+pre dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439
+pre op_MUL_MAT_ID ok 806/806
+post moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0
+post dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439
+post op_MUL_MAT_ID ok 806/806
+```
+
+Decision:
+
+- Future full serving snapshots carry compact proof that inference md5/op gates
+  stayed green before and after the paged-vs-vLLM run.
+- Treat `gate_summary.tsv` plus `hardware.txt` as the quick audit surface before
+  accepting a parity snapshot.

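Reviewer note: the "quick audit surface" in the Decision above can be checked mechanically. The following is a minimal sketch, not part of this commit. It assumes the six-column header that `write_gate_summary` prints (`phase`, `check`, `status`, `actual`, `expected`, `details`); the function name `non_ok_rows` and the `details` paths in the sample rows are hypothetical.

```python
import csv
import io


def non_ok_rows(tsv_text):
    """Return every row of a gate_summary.tsv whose status is not 'ok'."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row["status"] != "ok"]


HEADER = "phase\tcheck\tstatus\tactual\texpected\tdetails\n"
MOE = "8cb0ce23777bf55f92f63d0292c756b0"  # expected MoE md5 from this commit

# Hypothetical sample rows; the details paths are placeholders.
green = HEADER + f"pre\tmoe_md5\tok\t{MOE}\t{MOE}\tgate_pre/moe.md5\n"
red = green + f"post\tmoe_md5\tmismatch\t{'0' * 32}\t{MOE}\tgate_post/moe.md5\n"

print(non_ok_rows(green))                        # []
print([r["phase"] for r in non_ok_rows(red)])    # ['post']
```

Note that this sketch also flags `skipped` rows, which the script itself prints without failing; whether a skipped gate should block acceptance of a snapshot is a policy choice left to the reviewer.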
@@ -132,6 +132,10 @@ python3 /home/mudler/bench/h2h_cli3.py # OpenAI /v1/completions, ignore_eos, f
 The harness also writes `hardware.txt` before any server starts, including
 `DRY_RUN=1`, so every new snapshot records the GPU model, driver, compute
 capability when exposed by `nvidia-smi`, and a conservative hardware class.
+Full runs also write `gate_summary.tsv` after the post gate, summarizing pre/post
+MoE md5, dense md5, and backend-op checks; use
+`paged-current-serving-snapshot.sh --summarize-gates ART` to backfill or audit an
+existing snapshot without starting servers.
 
 ### 3.4 THE DECODE-PROFILING RULE (this trap caused 4 wrong analyses)
 Decode runs as a **replayed CUDA graph**. `nsys` **without** `--cuda-graph-trace=node` collapses each graph replay into ONE opaque launch, so every per-kernel attribution becomes an artifact. This is exactly what made the old "paged 159 us/tok, GPU ~16% busy, host-bound, 5.4x more GPU-efficient" story wrong, and produced the wrong ~56% headline.
@@ -332,6 +336,12 @@ hardware report. DGX dry run passed at
 artifacts self-describing: GB10/workstation Blackwell results must not be used
 as datacenter-Blackwell parity evidence.
 
+Phase 25 extended the same harness to write `gate_summary.tsv`. The summary was
+backfilled on the Phase 20 artifact at
+`/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv`;
+it records pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806` as `ok`.
+
 ---
 
 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -396,6 +406,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase20_current_snapshot/20260701_050621` - current clean-stack paged-vs-vLLM MoE serving snapshot.
 - `~/bench/phase21_harness_dryrun/20260701_051757` - current snapshot harness dry-run artifact.
 - `~/bench/phase24_hardware_report_dryrun/20260701_052741` - current snapshot harness dry run proving `hardware.txt` captures the DGX as `hardware_class=gb10_or_workstation_blackwell`.
+- `~/bench/phase25_gate_summary_dryrun/20260701_053353` - dry run after adding `gate_summary.tsv` support; normal dry-run still writes `hardware.txt` and does not emit a gate summary before gates exist.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.

@@ -676,6 +676,21 @@ with `GPU 0: NVIDIA GB10`, driver `580.159.03`, and compute capability `12.1`.
 Use `hardware.txt` when comparing future snapshots. GB10/workstation Blackwell
 results do not establish datacenter-Blackwell parity.
 
+### Phase 25 snapshot gate summary
+
+Phase 25 extended `paged-current-serving-snapshot.sh` to write
+`gate_summary.tsv` after the post gate in full runs. It also added
+`--summarize-gates ART` for auditing existing artifacts without launching
+servers.
+
+The Phase 20 artifact was backfilled at
+`/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv`.
+It records pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806` as `ok`.
+
+Use `hardware.txt` plus `gate_summary.tsv` as the quick audit surface before
+accepting any new parity snapshot.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update

@@ -3,7 +3,7 @@ set -euo pipefail
 
 usage() {
   cat <<'EOF'
-Usage: paged-current-serving-snapshot.sh
+Usage: paged-current-serving-snapshot.sh [--summarize-gates ART]
 
 Run a current-stack paged llama.cpp vs vLLM MoE serving snapshot on DGX.
 
@@ -30,13 +30,32 @@ Environment overrides:
   VLLM_BIN  vLLM executable (default: ~/vllm-bench/bin/vllm)
   SKIP_GATES=1  to skip pre/post paged inference gates
   DRY_RUN=1  validate inputs/preflight, write hardware.txt, and print commands without running servers
+
+Options:
+  --summarize-gates ART  write ART/gate_summary.tsv from existing gate_pre/gate_post artifacts
 EOF
 }
 
-if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
+SUMMARY_GATES_ART=""
+case "${1:-}" in
+  -h|--help)
     usage
     exit 0
-fi
+    ;;
+  --summarize-gates)
+    if [[ -z "${2:-}" ]]; then
+      usage >&2
+      exit 2
+    fi
+    SUMMARY_GATES_ART="$2"
+    ;;
+  "")
+    ;;
+  *)
+    usage >&2
+    exit 2
+    ;;
+esac
 
 SRC=${SRC:-"$HOME/llama-phase6-source"}
 BIN=${BIN:-"$SRC/build-cuda/bin"}
@@ -56,6 +75,8 @@ VLLM_PORT=${VLLM_PORT:-8000}
 VLLM_BIN=${VLLM_BIN:-"$HOME/vllm-bench/bin/vllm"}
 SKIP_GATES=${SKIP_GATES:-0}
 DRY_RUN=${DRY_RUN:-0}
+MOE_MD5_EXPECTED=8cb0ce23777bf55f92f63d0292c756b0
+DENSE_MD5_EXPECTED=5951a5b4d624ce891e22ab5fca9bc439
 
 LOCK_DIR="$HOME/gpu_bench_lock"
 OWNER="$LOCK_DIR/owner"
@@ -271,6 +292,73 @@ for n in sorted({row[1] for row in rows}):
 PY
 }
 
+write_gate_summary() {
+  python3 - "$ART" "$MOE_MD5_EXPECTED" "$DENSE_MD5_EXPECTED" <<'PY' | tee "$ART/gate_summary.tsv"
+import re
+import sys
+from pathlib import Path
+
+art = Path(sys.argv[1])
+expected = {
+    "moe": sys.argv[2],
+    "dense": sys.argv[3],
+}
+ansi = re.compile(r"\x1b\[[0-9;]*m")
+bad = False
+
+print("phase\tcheck\tstatus\tactual\texpected\tdetails")
+
+for phase in ("pre", "post"):
+    gate_dir = art / f"gate_{phase}"
+    if not gate_dir.exists():
+        print(f"{phase}\tall\tskipped\t\t\t{gate_dir} missing")
+        continue
+
+    for name, want in expected.items():
+        md5_path = gate_dir / f"{name}.md5"
+        if not md5_path.exists():
+            print(f"{phase}\t{name}_md5\tmissing\t\t{want}\t{md5_path} missing")
+            bad = True
+            continue
+        got = md5_path.read_text().split()[0]
+        status = "ok" if got == want else "mismatch"
+        if status != "ok":
+            bad = True
+        print(f"{phase}\t{name}_md5\t{status}\t{got}\t{want}\t{md5_path}")
+
+    op_paths = sorted(gate_dir.glob("op_*.txt"))
+    if not op_paths:
+        print(f"{phase}\top\tmissing\t\t\tno op_*.txt files")
+        bad = True
+        continue
+
+    for path in op_paths:
+        op = path.stem.removeprefix("op_")
+        text = ansi.sub("", path.read_text(errors="replace"))
+        passed = re.search(r"(\d+)/(\d+) tests passed", text)
+        backend_ok = re.search(r"Backend CUDA0:\s+OK", text)
+        if passed:
+            actual = f"{passed.group(1)}/{passed.group(2)}"
+            status = "ok" if passed.group(1) == passed.group(2) and backend_ok else "fail"
+        else:
+            actual = ""
+            status = "missing"
+        if status != "ok":
+            bad = True
+        print(f"{phase}\top_{op}\t{status}\t{actual}\tall\t{path}")
+
+if bad:
+    sys.exit(6)
+PY
+}
+
+if [[ -n "$SUMMARY_GATES_ART" ]]; then
+  ART="$SUMMARY_GATES_ART"
+  require_path "$ART"
+  write_gate_summary
+  exit 0
+fi
+
 require_path "$SRC"
 require_path "$BIN/llama-server"
 require_path "$BIN/llama-completion"
@@ -306,5 +394,6 @@ run_vllm
 release_lock
 trap - EXIT
 run_gate post
+write_gate_summary
 write_summary
 log "artifacts: $ART"

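Reviewer note: the backend-op row logic in `write_gate_summary` can be exercised in isolation. The sketch below reuses the three regular expressions from the script unchanged; the wrapper function `op_status` and the sample log strings are hypothetical, shaped only to match those patterns, and are not copied from a real op-test log.

```python
import re

# Same patterns as in write_gate_summary.
ANSI = re.compile(r"\x1b\[[0-9;]*m")
PASSED = re.compile(r"(\d+)/(\d+) tests passed")
BACKEND_OK = re.compile(r"Backend CUDA0:\s+OK")


def op_status(log_text):
    """Return (status, actual) for one op_*.txt log, mirroring the script."""
    text = ANSI.sub("", log_text)  # strip color codes before matching
    passed = PASSED.search(text)
    if not passed:
        return "missing", ""
    actual = f"{passed.group(1)}/{passed.group(2)}"
    all_passed = passed.group(1) == passed.group(2)
    ok = all_passed and BACKEND_OK.search(text) is not None
    return ("ok" if ok else "fail"), actual


# Hypothetical log fragments (assumed shape, including ANSI color codes).
green_log = "  806/806 tests passed\n  Backend CUDA0: \x1b[1;32mOK\x1b[0m\n"
red_log = "  805/806 tests passed\n  Backend CUDA0: \x1b[1;31mFAIL\x1b[0m\n"

print(op_status(green_log))  # ('ok', '806/806')
print(op_status(red_log))    # ('fail', '805/806')
print(op_status("no summary line here"))  # ('missing', '')
```

A row is `ok` only when the passed and total counts match and the backend line reports `OK`, so a full pass count alone is not sufficient.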
@@ -0,0 +1,121 @@
+# Snapshot Gate Summary Phase 25 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use
+> superpowers:verification-before-completion before recording the phase result.
+> Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** make current-stack paged-vs-vLLM serving artifacts prove that
+inference md5/op gates stayed green without requiring a full log read.
+
+**Architecture:** extend the existing current serving snapshot harness with a
+compact gate-summary writer. Keep it additive and outside llama.cpp source: no
+patch-series change and no inference behavior change.
+
+**Tech Stack:** Bash, Python stdlib, existing `paged-inference-gates.sh`
+artifacts.
+
+---
+
+## Task 1: Red Check
+
+- [x] **Step 1: Prove Phase 20 lacks compact gate proof**
+
+Command:
+
+```bash
+ssh dgx.casa 'test -e ~/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv'
+```
+
+Result:
+
+- exited `1` before the patch, while `gate_pre/`, `gate_post/`, and full gate
+  logs existed.
+
+## Task 2: Add Gate Summary
+
+- [x] **Step 1: Extend `paged-current-serving-snapshot.sh`**
+
+File:
+
+- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
+
+Behavior:
+
+- writes `$ART/gate_summary.tsv` after the post gate in a full serving run;
+- records pre/post MoE md5, dense md5, and backend op status;
+- compares MoE against `8cb0ce23777bf55f92f63d0292c756b0`;
+- compares dense against `5951a5b4d624ce891e22ab5fca9bc439`;
+- parses op pass counts such as `806/806 tests passed`;
+- exits non-zero if an existing gate artifact is missing, mismatched, or not
+  fully passing;
+- supports `--summarize-gates ART` to audit existing artifacts without running
+  servers.
+
+## Task 3: Verify
+
+- [x] **Step 1: Local syntax/help checks**
+
+Commands:
+
+```bash
+bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help
+```
+
+Result:
+
+- both passed.
+
+- [x] **Step 2: Backfill Phase 20 gate summary**
+
+Command:
+
+```bash
+/tmp/paged-current-serving-snapshot.sh \
+  --summarize-gates ~/bench/phase20_current_snapshot/20260701_050621
+```
+
+Result:
+
+- wrote `/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv`;
+- pre/post MoE md5 rows were `ok`;
+- pre/post dense md5 rows were `ok`;
+- pre/post `MUL_MAT_ID` rows were `ok` with `806/806`.
+
+- [x] **Step 3: DGX dry run**
+
+Command:
+
+```bash
+DRY_RUN=1 ART=~/bench/phase25_gate_summary_dryrun/20260701_053353 \
+  /tmp/paged-current-serving-snapshot.sh
+```
+
+Result:
+
+- preflight verified `docker=0`, `local_ai_worker=0`, `compute=0`;
+- `hardware.txt` was still written;
+- no paged or vLLM server launched;
+- no `gate_summary.tsv` was written before gates existed.
+
+Artifact:
+
+- `/home/mudler/bench/phase25_gate_summary_dryrun/20260701_053353`
+
+## Task 4: Record Result
+
+- [x] **Step 1: Update parity docs**
+
+Updated files:
+
+- `backend/cpp/llama-cpp-localai-paged/README.md`
+- `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+- `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+
+## Self-Review
+
+- No llama.cpp source behavior changed.
+- Future full snapshots now contain compact proof of pre/post md5 and op gates.
+- The summary-only mode lets old artifacts be audited without consuming GPU
+  benchmark time.
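Reviewer note: the md5 rows described in Task 2 come down to comparing the first whitespace-separated token of each `*.md5` file against a pinned digest. The sketch below illustrates that comparison under stated assumptions: the `.md5` files are taken to hold `md5sum`-style output (digest, then file name), and the helper names `md5_status` and `transcript_md5` as well as the file name `moe.txt` are hypothetical.

```python
import hashlib

MOE_MD5_EXPECTED = "8cb0ce23777bf55f92f63d0292c756b0"  # pinned in this commit


def md5_status(md5_file_text, expected):
    """Compare the first token of an md5sum-style file against the pinned digest."""
    tokens = md5_file_text.split()
    if not tokens:
        # Deviation from the script, which would raise on an empty file.
        return "missing"
    return "ok" if tokens[0] == expected else "mismatch"


def transcript_md5(data):
    """Hex md5 digest of raw transcript bytes, as md5sum would print it."""
    return hashlib.md5(data).hexdigest()


# Hypothetical md5sum-style contents; "moe.txt" is a placeholder file name.
print(md5_status(f"{MOE_MD5_EXPECTED}  moe.txt\n", MOE_MD5_EXPECTED))  # ok
print(md5_status(f"{'0' * 32}  moe.txt\n", MOE_MD5_EXPECTED))          # mismatch
```

The md5 here serves as a determinism fingerprint for the transcript rather than a security check, which is why a fixed digest can be pinned in the script.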