chore(paged): add current serving snapshot harness

Add a reusable current-stack paged-vs-vLLM serving snapshot harness that
targets the clean DGX mirror, enforces idle/lock preflight, runs pre/post
inference gates, and records ratio summaries.

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 03:19:36 +00:00
parent c99678da42
commit ff3f0620de
6 changed files with 446 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -614,3 +614,10 @@ DGX mirror `f2521ab12`, artifact
 | 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
 | 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
 | 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |
+
+Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving
+snapshots. It targets the clean `~/llama-phase6-source` mirror, checks
+docker/`local-ai-worker`/GPU-idle state, uses the owner-file lock, runs pre/post
+inference gates, and emits paged/vLLM ratios. Do not use the stale DGX
+`~/bench/combined_definitive.sh` without first porting it to the current mirror
+and lock discipline.
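Review note: the percentage columns in the table rows above are consistent with the first value of each pair divided by the second. The header row sits outside this hunk, so reading those pairs as paged throughput over vLLM throughput is an assumption. A minimal sketch of that arithmetic follows; `pct_of` is a hypothetical helper and not part of the harness, which computes its ratios in embedded Python.

```shell
#!/usr/bin/env bash
set -euo pipefail

# pct_of NUM DEN: print NUM as a percentage of DEN with one decimal place.
# Hypothetical helper for illustration only.
pct_of() {
  awk -v num="$1" -v den="$2" 'BEGIN {
    if (den + 0 == 0) { print "n/a"; exit }   # avoid dividing by zero
    printf "%.1f%%\n", 100 * num / den
  }'
}

# Values copied from the first table row shown in the hunk above.
pct_of 220.8 290.5   # prints 76.0%
pct_of 164.8 245.5   # prints 67.1%
```

The zero-denominator guard is an addition in this sketch. The `summary.tsv` writer in this commit divides directly, so an arm that reports a zero rate would raise there instead.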
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -1405,3 +1405,39 @@ Decision:
 - Keep MTP scheduler work closed. The next credible parity path is either a
  datacenter-Blackwell rerun or a larger fused-kernel project outside the
  low-conflict GB10 patch stack.
+
+## Phase 21 Current-Stack Serving Harness
+
+Phase 21 made the Phase 20 current-stack serving snapshot repeatable from the
+LocalAI backend tree.
+
+New script:
+
+- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
+
+Purpose:
+
+- targets the clean `~/llama-phase6-source` mirror by default;
+- rejects busy docker, `local-ai-worker`, GPU compute, or owned GPU-lock state;
+- builds the current llama.cpp targets;
+- runs pre/post `paged-inference-gates.sh`;
+- runs paged and vLLM serving arms with the same h2h client;
+- writes paged/vLLM ratio summaries.
+
+Verification:
+
+- local `bash -n` passed;
+- local `--help` passed;
+- DGX `DRY_RUN=1` validated required paths and preflight without launching
+  servers.
+
+Dry-run artifact:
+
+- `/home/mudler/bench/phase21_harness_dryrun/20260701_051757`
+
+Decision:
+
+- Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving
+  snapshots.
+- Do not use stale DGX `~/bench/combined_definitive.sh` without porting it to
+  `~/llama-phase6-source` and the owner-file lock discipline.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -304,6 +304,17 @@ This keeps the GB10 shortcut closure intact: do not reopen MTP or small
 scheduler work. The credible next parity path is a datacenter-Blackwell rerun or
 a larger fused-kernel project outside this low-conflict patch stack.

+Phase 21 added a reusable current-stack serving harness:
+`backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`.
+It defaults to `~/llama-phase6-source`, validates docker/`local-ai-worker`/GPU
+idle state, uses the owner-file lock, runs pre/post inference gates, compares
+paged and vLLM with h2h, and writes ratio summaries. DGX dry run passed at
+`/home/mudler/bench/phase21_harness_dryrun/20260701_051757`.
+
+Use this harness for future current-stack GB10 snapshots. Do not reuse
+`~/bench/combined_definitive.sh` unless it is first ported away from stale
+`~/llama-paged-dev` paths and old lock assumptions.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -644,6 +644,23 @@ credible parity path is not another MTP/scheduler shortcut; it is either the
 documented datacenter-Blackwell rerun or a larger fused-kernel project outside
 the low-conflict GB10 patch stack.

+### Phase 21 current-stack harness
+
+Phase 21 added `paged-current-serving-snapshot.sh` so the Phase 20 snapshot can
+be repeated without the stale DGX `combined_definitive.sh` assumptions. It
+defaults to `~/llama-phase6-source`, enforces docker/`local-ai-worker`/GPU-idle
+preflight, uses the owner-file lock, runs pre/post md5/op gates, runs paged and
+vLLM in the same session, and emits ratio rows in `summary.tsv`.
+
+Verification:
+
+- local `bash -n` and `--help` passed;
+- DGX `DRY_RUN=1` passed and wrote
+  `/home/mudler/bench/phase21_harness_dryrun/20260701_051757`.
+
+Use this harness for future current-stack GB10 snapshots before making parity
+claims.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update
--- a/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+++ b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
@@ -0,0 +1,268 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+usage() {
+  cat <<'EOF'
+Usage: paged-current-serving-snapshot.sh
+
+Run a current-stack paged llama.cpp vs vLLM MoE serving snapshot on DGX.
+
+This harness uses the clean llama.cpp mirror by default, not stale development
+trees. It runs pre/post paged inference gates, then a same-session serving
+comparison with the h2h client.
+
+Environment overrides:
+  SRC          llama.cpp source dir (default: ~/llama-phase6-source)
+  BIN          llama.cpp build bin dir (default: $SRC/build-cuda/bin)
+  MODEL        paged GGUF path (default: ~/bench/q36-35b-a3b-nvfp4.gguf)
+  VLLM_MODEL   vLLM model dir (default: ~/bench/q36-35b-a3b-nvfp4-vllm)
+  H2H          h2h client (default: ~/bench/h2h_cli3.py)
+  ART          artifact dir (default: ~/bench/phase_current_serving_snapshot/<timestamp>)
+  NPL          concurrency list (default: "8 32 128")
+  PTOK         prompt filler words (default: 128)
+  GEN          generated tokens (default: 64)
+  CTX          llama-server context (default: 131072)
+  PARALLEL     llama-server parallel slots (default: 128)
+  BATCH        llama-server logical batch (default: 2048)
+  UBATCH       llama-server physical batch (default: 512)
+  LLAMA_PORT   llama-server port (default: 8098)
+  VLLM_PORT    vLLM port (default: 8000)
+  VLLM_BIN     vLLM executable (default: ~/vllm-bench/bin/vllm)
+  SKIP_GATES=1 skip pre/post paged inference gates
+  DRY_RUN=1    validate inputs/preflight and print commands without running servers
+EOF
+}
+
+if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
+  usage
+  exit 0
+fi
+
+SRC=${SRC:-"$HOME/llama-phase6-source"}
+BIN=${BIN:-"$SRC/build-cuda/bin"}
+MODEL=${MODEL:-"$HOME/bench/q36-35b-a3b-nvfp4.gguf"}
+VLLM_MODEL=${VLLM_MODEL:-"$HOME/bench/q36-35b-a3b-nvfp4-vllm"}
+H2H=${H2H:-"$HOME/bench/h2h_cli3.py"}
+ART=${ART:-"$HOME/bench/phase_current_serving_snapshot/$(date +%Y%m%d_%H%M%S)"}
+NPL=${NPL:-"8 32 128"}
+PTOK=${PTOK:-128}
+GEN=${GEN:-64}
+CTX=${CTX:-131072}
+PARALLEL=${PARALLEL:-128}
+BATCH=${BATCH:-2048}
+UBATCH=${UBATCH:-512}
+LLAMA_PORT=${LLAMA_PORT:-8098}
+VLLM_PORT=${VLLM_PORT:-8000}
+VLLM_BIN=${VLLM_BIN:-"$HOME/vllm-bench/bin/vllm"}
+SKIP_GATES=${SKIP_GATES:-0}
+DRY_RUN=${DRY_RUN:-0}
+
+LOCK_DIR="$HOME/gpu_bench_lock"
+OWNER="$LOCK_DIR/owner"
+SERVER_PID=""
+
+log() {
+  printf '[%s] %s\n' "$(date -Is)" "$*" | tee -a "$ART/run.log"
+}
+
+require_path() {
+  if [[ ! -e "$1" ]]; then
+    echo "missing required path: $1" >&2
+    exit 2
+  fi
+}
+
+preflight() {
+  mkdir -p "$ART"
+  local docker_count local_ai compute owner
+  docker_count=$(docker ps -q | wc -l)
+  local_ai=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true)
+  compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed '/^$/d' | wc -l)
+  owner="FREE-no-lock-file"
+  if [[ -f "$OWNER" ]]; then
+    owner=$(cat "$OWNER")
+  fi
+  {
+    echo "docker=$docker_count"
+    echo "local_ai_worker=$local_ai"
+    echo "compute=$compute"
+    echo "$owner"
+  } | tee "$ART/preflight.txt"
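+  [[ "$docker_count" == "0" ]] || { echo "docker containers are running: $docker_count" >&2; exit 3; }
+  [[ "$local_ai" == "0" ]] || { echo "local-ai-worker is running" >&2; exit 3; }
+  [[ "$compute" == "0" ]] || { echo "GPU compute processes are active: $compute" >&2; exit 3; }
+  case "$owner" in
+    FREE*) ;;
+    *) echo "GPU lock is busy: $owner" >&2; exit 3 ;;
+  esac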
+}
+
+acquire_lock() {
+  mkdir -p "$LOCK_DIR"
+  echo "codex-current-serving-snapshot $(date +%s)" > "$OWNER"
+}
+
+release_lock() {
+  if [[ -n "$SERVER_PID" ]]; then
+    kill "$SERVER_PID" >/dev/null 2>&1 || true
+    wait "$SERVER_PID" >/dev/null 2>&1 || true
+    SERVER_PID=""
+  fi
+  pkill -9 -f "[l]lama-server.*--port $LLAMA_PORT" >/dev/null 2>&1 || true
+  pkill -9 -u "$(id -u)" -f "[v]llm serve" >/dev/null 2>&1 || true
+  mkdir -p "$LOCK_DIR"
+  echo "FREE released-by-codex-current-serving-snapshot $(date +%s)" > "$OWNER"
+}
+
+wait_http() {
+  local url="$1"
+  local pattern="$2"
+  local log_file="$3"
+  local health="$4"
+  for _ in $(seq 1 240); do
+    if curl -fsS "$url" > "$health" 2>"$health.err" && grep -q "$pattern" "$health"; then
+      return 0
+    fi
+    if [[ -n "$SERVER_PID" ]] && ! kill -0 "$SERVER_PID" >/dev/null 2>&1; then
+      tail -120 "$log_file" >&2 || true
+      return 1
+    fi
+    sleep 1
+  done
+  tail -120 "$log_file" >&2 || true
+  return 1
+}
+
+run_gate() {
+  local name="$1"
+  if [[ "$SKIP_GATES" == "1" ]]; then
+    log "skipping $name inference gate"
+    return
+  fi
+  log "running $name inference gate"
+  ART="$ART/gate_$name" "$HOME/paged-inference-gates.sh" 2>&1 \
+    | tee "$ART/gate_$name.log" | tee -a "$ART/run.log"
+}
+
+run_paged() {
+  local arm_dir="$ART/paged"
+  mkdir -p "$arm_dir"
+  log "starting paged current-stack server"
+  cd "$BIN"
+  env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 \
+    ./llama-server \
+      -m "$MODEL" -ngl 99 -fa on -c "$CTX" -b "$BATCH" -ub "$UBATCH" \
+      --parallel "$PARALLEL" --host 127.0.0.1 --port "$LLAMA_PORT" --no-webui \
+      > "$arm_dir/server.log" 2>&1 &
+  SERVER_PID=$!
+  wait_http "http://127.0.0.1:$LLAMA_PORT/health" "ok" "$arm_dir/server.log" "$arm_dir/health.json"
+  python3 "$H2H" --url "http://127.0.0.1:$LLAMA_PORT/v1/completions" \
+    --model q36 -n 8 --ptok "$PTOK" --gen 16 --nonce "warm_paged_$(date +%s)" --no-cache >/dev/null
+  for n in $NPL; do
+    log "paged n=$n"
+    python3 "$H2H" --url "http://127.0.0.1:$LLAMA_PORT/v1/completions" \
+      --model q36 -n "$n" --ptok "$PTOK" --gen "$GEN" \
+      --nonce "paged_${n}_$(date +%s)" --no-cache > "$arm_dir/n${n}.json"
+    tee -a "$ART/run.log" < "$arm_dir/n${n}.json"
+  done
+  kill "$SERVER_PID" >/dev/null 2>&1 || true
+  wait "$SERVER_PID" >/dev/null 2>&1 || true
+  SERVER_PID=""
+  sleep 3
+}
+
+run_vllm() {
+  local arm_dir="$ART/vllm"
+  mkdir -p "$arm_dir"
+  export PATH="$(dirname "$VLLM_BIN"):$PATH"
+  export VLLM_LOGGING_LEVEL=${VLLM_LOGGING_LEVEL:-INFO}
+  export HF_HUB_OFFLINE=${HF_HUB_OFFLINE:-1}
+  log "starting vLLM server"
+  nohup "$VLLM_BIN" serve "$VLLM_MODEL" \
+    --served-model-name q36 --gpu-memory-utilization 0.85 --max-model-len 4096 \
+    --max-num-seqs 256 --host 127.0.0.1 --port "$VLLM_PORT" --tensor-parallel-size 1 \
+    > "$arm_dir/server.log" 2>&1 &
+  SERVER_PID=$!
+  wait_http "http://127.0.0.1:$VLLM_PORT/v1/models" "q36" "$arm_dir/server.log" "$arm_dir/models.json"
+  python3 "$H2H" --url "http://127.0.0.1:$VLLM_PORT/v1/completions" \
+    --model q36 -n 8 --ptok "$PTOK" --gen 16 --nonce "warm_vllm_$(date +%s)" --no-cache >/dev/null
+  for n in $NPL; do
+    log "vllm n=$n"
+    python3 "$H2H" --url "http://127.0.0.1:$VLLM_PORT/v1/completions" \
+      --model q36 -n "$n" --ptok "$PTOK" --gen "$GEN" \
+      --nonce "vllm_${n}_$(date +%s)" --no-cache > "$arm_dir/n${n}.json"
+    tee -a "$ART/run.log" < "$arm_dir/n${n}.json"
+  done
+  kill "$SERVER_PID" >/dev/null 2>&1 || true
+  pkill -9 -u "$(id -u)" -f "[v]llm serve" >/dev/null 2>&1 || true
+  wait "$SERVER_PID" >/dev/null 2>&1 || true
+  SERVER_PID=""
+  sleep 5
+}
+
+write_summary() {
+  python3 - "$ART" <<'PY' | tee "$ART/summary.tsv"
+import json
+import sys
+from pathlib import Path
+
+art = Path(sys.argv[1])
+rows = []
+for arm in ("paged", "vllm"):
+    for path in sorted((art / arm).glob("n*.json"), key=lambda p: int(p.stem[1:])):
+        data = json.loads(path.read_text())
+        rows.append((arm, data["n"], data["agg_tps"], data["decode_agg_tps"],
+                     data["decode_perseq_tps"], data["prefill_tps"],
+                     data["ttft_mean_ms"], data["wall_s"]))
+
+print("arm\tn\tagg_tps\tdecode_agg_tps\tdecode_perseq_tps\tprefill_tps\tttft_mean_ms\twall_s")
+for row in rows:
+    print("\t".join(str(x) for x in row))
+
+by_key = {(row[0], row[1]): row for row in rows}
+print("\nratio\tn\tpaged_decode_over_vllm\tpaged_perseq_over_vllm\tpaged_agg_over_vllm\tpaged_ttft_over_vllm")
+for n in sorted({row[1] for row in rows}):
+    paged = by_key.get(("paged", n))
+    vllm = by_key.get(("vllm", n))
+    if not paged or not vllm:
+        continue
+    print(f"ratio\t{n}\t{paged[3]/vllm[3]:.4f}\t{paged[4]/vllm[4]:.4f}\t{paged[2]/vllm[2]:.4f}\t{paged[6]/vllm[6]:.4f}")
+PY
+}
+
+require_path "$SRC"
+require_path "$BIN/llama-server"
+require_path "$BIN/llama-completion"
+require_path "$BIN/test-backend-ops"
+require_path "$MODEL"
+require_path "$VLLM_MODEL"
+require_path "$H2H"
+require_path "$VLLM_BIN"
+require_path "$HOME/paged-inference-gates.sh"
+
+preflight
+log "artifact=$ART"
+log "source=$(git -C "$SRC" log --oneline -1)"
+
+if [[ "$DRY_RUN" == "1" ]]; then
+  log "dry run only; commands validated"
+  log "would build: cmake --build $SRC/build-cuda --target llama-server llama-completion test-backend-ops -j 8"
+  log "would run paged NPL=[$NPL] PTOK=$PTOK GEN=$GEN"
+  log "would run vLLM NPL=[$NPL] PTOK=$PTOK GEN=$GEN"
+  exit 0
+fi
+
+log "building llama-server, llama-completion, and test-backend-ops"
+cmake --build "$SRC/build-cuda" --target llama-server llama-completion test-backend-ops -j 8 \
+  > "$ART/build.log" 2>&1
+
+run_gate pre
+acquire_lock
+trap release_lock EXIT
+run_paged
+run_vllm
+release_lock
+trap - EXIT
+run_gate post
+write_summary
+log "artifacts: $ART"
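Review note: `preflight` and `release_lock` above imply an owner-file convention in which the lock counts as free when `$LOCK_DIR/owner` is absent or its content begins with `FREE`, and busy otherwise. A minimal sketch of that check in isolation follows. The name `owner_state`, the temporary directory standing in for `~/gpu_bench_lock`, and the owner strings are all hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# owner_state FILE: print "free" when FILE is absent or its content starts
# with FREE, otherwise print "busy: <content>". Hypothetical helper that
# mirrors the case statement in preflight; it is not part of the harness.
owner_state() {
  local file="$1" owner="FREE-no-lock-file"
  if [[ -f "$file" ]]; then
    owner=$(cat "$file")
  fi
  case "$owner" in
    FREE*) echo "free" ;;
    *)     echo "busy: $owner" ;;
  esac
}

lock_dir=$(mktemp -d)                      # stand-in for ~/gpu_bench_lock
owner_state "$lock_dir/owner"              # prints: free (no owner file yet)
echo "example-owner 1" > "$lock_dir/owner"
owner_state "$lock_dir/owner"              # prints: busy: example-owner 1
echo "FREE released-by-example 2" > "$lock_dir/owner"
owner_state "$lock_dir/owner"              # prints: free
rm -rf "$lock_dir"
```

In the script the check (`preflight`) and the write (`acquire_lock`) are separate steps, so two runs started at the same moment could both observe a free lock. The sketch reproduces the convention as written and does not try to make it atomic.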
--- a/docs/superpowers/plans/2026-07-01-current-serving-harness-phase21.md
+++ b/docs/superpowers/plans/2026-07-01-current-serving-harness-phase21.md
@@ -0,0 +1,107 @@
+# Current Serving Harness Phase 21 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use
+> superpowers:verification-before-completion before recording the phase result.
+> Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** make the Phase 20 current-stack paged-vs-vLLM serving snapshot
+repeatable from the LocalAI backend tree.
+
+**Architecture:** add a standalone shell harness beside the existing paged
+inference gate and MTP serving harness. The script targets the clean
+`~/llama-phase6-source` mirror, uses the owner-file GPU lock, runs pre/post
+inference gates, compares paged and vLLM in one session, and writes ratio
+summaries.
+
+**Tech Stack:** Bash, llama.cpp `llama-server`, vLLM, `h2h_cli3.py`, DGX GB10.
+
+---
+
+## Task 1: Red Check
+
+- [x] **Step 1: Prove no reusable current-stack harness exists**
+
+  Command:
+
+  ```bash
+  test -e backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+  ```
+
+  Result:
+
+  - exited `1` before the patch, as expected.
+
+## Task 2: Add Harness
+
+- [x] **Step 1: Create script**
+
+  File:
+
+  - `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
+
+  Features:
+
+  - defaults to `~/llama-phase6-source`, not stale `~/llama-paged-dev`;
+  - checks docker, `local-ai-worker`, GPU compute processes, and owner-file lock;
+  - builds `llama-server`, `llama-completion`, and `test-backend-ops`;
+  - runs pre/post `paged-inference-gates.sh`;
+  - runs paged and vLLM serving arms with the same h2h client;
+  - writes `summary.tsv` with paged/vLLM ratios;
+  - supports `DRY_RUN=1` for path/preflight validation without servers.
+
+## Task 3: Verify Harness
+
+- [x] **Step 1: Local syntax/help checks**
+
+  Commands:
+
+  ```bash
+  test -x backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+  bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+  backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help
+  ```
+
+  Result:
+
+  - all passed.
+
+- [x] **Step 2: DGX dry run**
+
+  Command:
+
+  ```bash
+  DRY_RUN=1 ART=~/bench/phase21_harness_dryrun/20260701_051757 \
+    /tmp/paged-current-serving-snapshot.sh
+  ```
+
+  Result:
+
+  - verified `docker=0`, `local_ai_worker=0`, `compute=0`;
+  - verified owner file was free;
+  - found current source `f2521ab12`;
+  - validated required paths and printed the build/paged/vLLM commands without
+    launching servers.
+
+  Artifact:
+
+  - `/home/mudler/bench/phase21_harness_dryrun/20260701_051757`
+
+## Task 4: Future Use
+
+- [x] **Step 1: Prefer this harness for current snapshots**
+
+  Use this script for future current-stack GB10 parity snapshots:
+
+  ```bash
+  backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+  ```
+
+  Do not use the stale DGX `~/bench/combined_definitive.sh` without first
+  porting it to the clean mirror and owner-file lock discipline.
+
+## Self-Review
+
+- No llama.cpp source behavior changed.
+- The harness is repeatable and defaults to the current clean mirror.
+- The dry run covered path validation and DGX preflight without consuming GPU
+  benchmark time.