From 440129c98e4cb73876d29440bd7a6095eb871998 Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Wed, 1 Jul 2026 08:07:48 +0000 Subject: [PATCH] fix(paged): harden serving snapshot readiness Assisted-by: Codex:gpt-5 --- .../docs/GB10_PARITY_PHASE0_RESULTS.md | 78 +++++++++ .../docs/PARITY_HANDOFF.md | 19 ++ .../docs/VLLM_PARITY_LEVER_MAP.md | 27 +++ .../paged-current-serving-snapshot.sh | 45 +++-- ...26-07-01-dense-serving-snapshot-phase47.md | 82 +++++++++ ...07-01-serving-harness-readiness-phase48.md | 165 ++++++++++++++++++ 6 files changed, 401 insertions(+), 15 deletions(-) create mode 100644 docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md create mode 100644 docs/superpowers/plans/2026-07-01-serving-harness-readiness-phase48.md diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md index 4ae785c11..ab2aad514 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md @@ -2615,3 +2615,81 @@ Decision: - This does not claim a new parity result. Full runs still require the normal preflight, `hardware.txt`, pre/post md5 gates, `MUL_MAT`/`MUL_MAT_ID`, and KL-if-md5-changes gates before interpreting throughput. + +## Phase 47 Dense Serving Snapshot Attempt + +Phase 47 attempted to use the Phase46 model-name override for a dense +paged-vs-vLLM serving snapshot. The first full attempt is incomplete and must +not be used as a dense parity result. 
+ +Artifacts: + +- Dry-run: `/home/mudler/bench/phase47_dense_serving_dryrun/20260701_095141` +- Incomplete full attempt: + `/home/mudler/bench/phase47_dense_serving/20260701_095151` + +Run shape: + +- `MODEL=$HOME/bench/q36-27b-nvfp4.gguf` +- `VLLM_MODEL=$HOME/bench/q36-27b-nvfp4-vllm` +- `SERVED_MODEL_NAME=dense-q36` +- `NPL="1 8 32 128"`, `PARALLEL=128`, `CTX=131072`, `PTOK=128`, `GEN=64` +- `OPS=MUL_MAT,MUL_MAT_ID` + +Completed before failure: + +- Preflight was clean: docker `0`, `local-ai-worker` `0`, GPU compute `0`. +- Pre-gates were green: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense + md5 `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, + `MUL_MAT_ID` `806/806`. +- Paged dense arm completed through `n=128`: + +| n | paged decode agg t/s | paged per-seq t/s | paged agg t/s | paged TTFT ms | +|---|----------------------|-------------------|----------------|---------------| +| 1 | `13.3` | `13.14` | `12.5` | `312.3` | +| 8 | `85.5` | `10.35` | `62.5` | `2068.5` | +| 32 | `198.1` | `5.44` | `105.1` | `7608.5` | +| 128 | `361.8` | `1.89` | `143.0` | `20501.7` | + +Failure/root cause: + +- vLLM dense startup exceeded the old fixed `240` one-second readiness budget. + The server log showed weight loading alone took about `199.43s`, followed by + compile, autotune, CUDA graph capture, and multimodal warmup before the server + began listening. +- `vllm/models.json` is empty and `models.json.err` contains an initial + connection failure, so no vLLM result JSONs were produced. +- Cleanup then waited on the vLLM server PID after `SIGTERM`; manual cleanup was + required. DGX was returned to idle with owner + `FREE released-by-codex-phase47-cleanup 1782892962`. + +Decision: + +- Treat this artifact as a harness failure investigation, not a benchmark. +- Retry Phase47 only after the Phase48 readiness/cleanup hardening is present. 
+ +## Phase 48 Serving Harness Readiness Hardening + +Phase 48 fixes the harness behavior exposed by the failed dense snapshot +attempt. It is a harness reliability change, not an inference change. + +Changes: + +- Add `LLAMA_READY_ATTEMPTS` (default `240`) and `VLLM_READY_ATTEMPTS` (default + `600`) so slow vLLM model load/compile paths can be pre-budgeted. +- Bound each HTTP readiness probe with `curl --max-time 2` so a single probe + cannot hang the readiness loop. +- Replace direct `kill` plus unbounded `wait` with `stop_server_pid`, which + sends `SIGTERM`, waits up to 30 seconds, then sends `SIGKILL` before `wait`. +- Use the bounded cleanup helper for normal paged teardown, normal vLLM + teardown, and error-path `release_lock`. + +Verification: + +- Red checks first proved `VLLM_READY_ATTEMPTS`, bounded curl, and hard-kill + cleanup were absent. +- Green checks after the patch included `bash -n`, help-text grep, grep for + `curl --max-time 2 -fsS "$url"`, grep for `kill -9 "$SERVER_PID"`, and a DGX + dense dry-run with `VLLM_READY_ATTEMPTS=700`. +- DGX dry-run artifact: + `/home/mudler/bench/phase48_readiness_harness_dryrun/20260701_100533`. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index 5486c3dac..85d7e63dc 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -582,6 +582,22 @@ engines. DGX dry run: `SERVED_MODEL_NAME=dense-q36` printed during `DRY_RUN=1`. This is harness-only hardware-pivot readiness, not a throughput result. +Phase 47 attempted the first dense serving snapshot using the Phase46 override. +Dry-run artifact: +`/home/mudler/bench/phase47_dense_serving_dryrun/20260701_095141`; incomplete +full artifact: `/home/mudler/bench/phase47_dense_serving/20260701_095151`. 
+Pre-gates were green and the paged dense arm completed through `n=128`, but the +artifact is not a dense parity result because vLLM produced no result JSONs. +Root cause: dense vLLM startup exceeded the old fixed readiness budget, and the +cleanup path could wait indefinitely on the server PID after `SIGTERM`. + +Phase 48 hardens the serving snapshot harness for that failure mode. It adds +`LLAMA_READY_ATTEMPTS` and `VLLM_READY_ATTEMPTS`, bounds HTTP readiness probes +with `curl --max-time 2`, and uses bounded server cleanup that escalates from +`SIGTERM` to `SIGKILL`. Dry-run artifact: +`/home/mudler/bench/phase48_readiness_harness_dryrun/20260701_100533`, with +`VLLM_READY_ATTEMPTS=700` printed and clean DGX preflight. + --- ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes) @@ -669,6 +685,9 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual - `~/bench/phase44_hardware_pivot_harness_dryrun/20260701_094038` - harness-only dry-run artifact proving the vLLM serving config overrides are printed and preflighted before any server starts. - `~/bench/phase45_inference_gate_guard/20260701_094320` - post-Phase44 inference guard; MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` backend-op gates green. - `~/bench/phase46_served_model_name_dryrun/20260701_094849` - harness-only dry-run artifact proving `SERVED_MODEL_NAME` is printed and preflighted before any server starts. +- `~/bench/phase47_dense_serving_dryrun/20260701_095141` - dense serving dry-run with `SERVED_MODEL_NAME=dense-q36`. +- `~/bench/phase47_dense_serving/20260701_095151` - incomplete dense serving attempt; pre-gates and paged arm completed, vLLM did not produce result JSONs under the old readiness budget. +- `~/bench/phase48_readiness_harness_dryrun/20260701_100533` - harness dry-run proving configurable readiness budgets and clean preflight before retrying dense serving. 
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`. - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30. - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md index cd9ad24b4..52c071f63 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md @@ -1138,6 +1138,33 @@ Preflight was clean and the dry run printed harness-only portability step for dense or hardware-pivot snapshots; it does not change inference code or produce a new throughput result. +### Phase 47 dense serving snapshot attempt + +Phase 47 attempted a dense audited serving snapshot with +`MODEL=$HOME/bench/q36-27b-nvfp4.gguf`, +`VLLM_MODEL=$HOME/bench/q36-27b-nvfp4-vllm`, and +`SERVED_MODEL_NAME=dense-q36`. Dry-run artifact: +`/home/mudler/bench/phase47_dense_serving_dryrun/20260701_095141`. + +The full attempt at +`/home/mudler/bench/phase47_dense_serving/20260701_095151` is incomplete and is +not a parity result. Pre-gates passed and the paged dense arm completed through +`n=128`, but vLLM dense startup exceeded the old fixed readiness budget before +any vLLM result JSONs were produced. Use this artifact only as the root-cause +input for Phase48. + +### Phase 48 serving harness readiness hardening + +Phase 48 fixes the harness issue exposed by Phase47. It adds +`LLAMA_READY_ATTEMPTS` and `VLLM_READY_ATTEMPTS`, bounds each readiness probe +with `curl --max-time 2`, and replaces direct server waits with bounded cleanup +that escalates from `SIGTERM` to `SIGKILL`. + +DGX dry-run artifact: +`/home/mudler/bench/phase48_readiness_harness_dryrun/20260701_100533`. The dry +run printed `VLLM_READY_ATTEMPTS=700` with clean preflight. 
Retry dense serving +snapshots with this hardening before interpreting dense paged-vs-vLLM ratios. + Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`. ### Phase 10 GDN C32 slab update diff --git a/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh index dc44f5f72..0762d0b57 100755 --- a/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +++ b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh @@ -28,8 +28,10 @@ Environment overrides: BATCH llama-server logical batch (default: 2048) UBATCH llama-server physical batch (default: 512) LLAMA_PORT llama-server port (default: 8098) + LLAMA_READY_ATTEMPTS llama-server readiness attempts, one per second (default: 240) VLLM_PORT vLLM port (default: 8000) VLLM_BIN vLLM executable (default: ~/vllm-bench/bin/vllm) + VLLM_READY_ATTEMPTS vLLM readiness attempts, one per second (default: 600) VLLM_GPU_MEMORY_UTILIZATION vLLM --gpu-memory-utilization (default: 0.85) VLLM_MAX_MODEL_LEN vLLM --max-model-len (default: 4096) VLLM_MAX_NUM_SEQS vLLM --max-num-seqs (default: 256) @@ -80,8 +82,10 @@ PARALLEL=${PARALLEL:-128} BATCH=${BATCH:-2048} UBATCH=${UBATCH:-512} LLAMA_PORT=${LLAMA_PORT:-8098} +LLAMA_READY_ATTEMPTS=${LLAMA_READY_ATTEMPTS:-240} VLLM_PORT=${VLLM_PORT:-8000} 
VLLM_BIN=${VLLM_BIN:-"$HOME/vllm-bench/bin/vllm"} +VLLM_READY_ATTEMPTS=${VLLM_READY_ATTEMPTS:-600} VLLM_GPU_MEMORY_UTILIZATION=${VLLM_GPU_MEMORY_UTILIZATION:-0.85} VLLM_MAX_MODEL_LEN=${VLLM_MAX_MODEL_LEN:-4096} VLLM_MAX_NUM_SEQS=${VLLM_MAX_NUM_SEQS:-256} @@ -179,24 +183,38 @@ acquire_lock() { } release_lock() { - if [[ -n "$SERVER_PID" ]]; then - kill "$SERVER_PID" >/dev/null 2>&1 || true - wait "$SERVER_PID" >/dev/null 2>&1 || true - SERVER_PID="" - fi + stop_server_pid pkill -9 -f "[l]lama-server.*--port $LLAMA_PORT" >/dev/null 2>&1 || true pkill -9 -u "$(id -u)" -f "[v]llm serve" >/dev/null 2>&1 || true mkdir -p "$LOCK_DIR" echo "FREE released-by-codex-current-serving-snapshot $(date +%s)" > "$OWNER" } +stop_server_pid() { + if [[ -n "$SERVER_PID" ]]; then + kill "$SERVER_PID" >/dev/null 2>&1 || true + for _ in $(seq 1 30); do + if ! kill -0 "$SERVER_PID" >/dev/null 2>&1; then + break + fi + sleep 1 + done + if kill -0 "$SERVER_PID" >/dev/null 2>&1; then + kill -9 "$SERVER_PID" >/dev/null 2>&1 || true + fi + wait "$SERVER_PID" >/dev/null 2>&1 || true + SERVER_PID="" + fi +} + wait_http() { local url="$1" local pattern="$2" local log_file="$3" local health="$4" - for _ in $(seq 1 240); do - if curl -fsS "$url" > "$health" 2>"$health.err" && grep -q "$pattern" "$health"; then + local attempts="$5" + for _ in $(seq 1 "$attempts"); do + if curl --max-time 2 -fsS "$url" > "$health" 2>"$health.err" && grep -q "$pattern" "$health"; then return 0 fi if [[ -n "$SERVER_PID" ]] && ! kill -0 "$SERVER_PID" >/dev/null 2>&1; then @@ -231,7 +249,7 @@ run_paged() { --parallel "$PARALLEL" --host 127.0.0.1 --port "$LLAMA_PORT" --no-webui \ > "$arm_dir/server.log" 2>&1 & SERVER_PID=$! 
- wait_http "http://127.0.0.1:$LLAMA_PORT/health" "ok" "$arm_dir/server.log" "$arm_dir/health.json" + wait_http "http://127.0.0.1:$LLAMA_PORT/health" "ok" "$arm_dir/server.log" "$arm_dir/health.json" "$LLAMA_READY_ATTEMPTS" python3 "$H2H" --url "http://127.0.0.1:$LLAMA_PORT/v1/completions" \ --model "$SERVED_MODEL_NAME" -n 8 --ptok "$PTOK" --gen 16 --nonce "warm_paged_$(date +%s)" --no-cache >/dev/null for n in $NPL; do @@ -241,9 +259,7 @@ run_paged() { --nonce "paged_${n}_$(date +%s)" --no-cache > "$arm_dir/n${n}.json" cat "$arm_dir/n${n}.json" | tee -a "$ART/run.log" done - kill "$SERVER_PID" >/dev/null 2>&1 || true - wait "$SERVER_PID" >/dev/null 2>&1 || true - SERVER_PID="" + stop_server_pid sleep 3 } @@ -264,7 +280,7 @@ run_vllm() { "${extra_args[@]}" \ > "$arm_dir/server.log" 2>&1 & SERVER_PID=$! - wait_http "http://127.0.0.1:$VLLM_PORT/v1/models" "$SERVED_MODEL_NAME" "$arm_dir/server.log" "$arm_dir/models.json" + wait_http "http://127.0.0.1:$VLLM_PORT/v1/models" "$SERVED_MODEL_NAME" "$arm_dir/server.log" "$arm_dir/models.json" "$VLLM_READY_ATTEMPTS" python3 "$H2H" --url "http://127.0.0.1:$VLLM_PORT/v1/completions" \ --model "$SERVED_MODEL_NAME" -n 8 --ptok "$PTOK" --gen 16 --nonce "warm_vllm_$(date +%s)" --no-cache >/dev/null for n in $NPL; do @@ -274,10 +290,8 @@ run_vllm() { --nonce "vllm_${n}_$(date +%s)" --no-cache > "$arm_dir/n${n}.json" cat "$arm_dir/n${n}.json" | tee -a "$ART/run.log" done - kill "$SERVER_PID" >/dev/null 2>&1 || true + stop_server_pid pkill -9 -u "$(id -u)" -f "[v]llm serve" >/dev/null 2>&1 || true - wait "$SERVER_PID" >/dev/null 2>&1 || true - SERVER_PID="" sleep 5 } @@ -397,6 +411,7 @@ if [[ "$DRY_RUN" == "1" ]]; then log "dry run only; commands validated" log "would build: cmake --build $BUILD_DIR --target llama-server llama-completion test-backend-ops -j8" log "served model: SERVED_MODEL_NAME=$SERVED_MODEL_NAME" + log "readiness: LLAMA_READY_ATTEMPTS=$LLAMA_READY_ATTEMPTS VLLM_READY_ATTEMPTS=$VLLM_READY_ATTEMPTS" log "would run 
paged NPL=[$NPL] PTOK=$PTOK GEN=$GEN" log "would run vLLM NPL=[$NPL] PTOK=$PTOK GEN=$GEN" log "vLLM config: VLLM_GPU_MEMORY_UTILIZATION=$VLLM_GPU_MEMORY_UTILIZATION VLLM_MAX_MODEL_LEN=$VLLM_MAX_MODEL_LEN VLLM_MAX_NUM_SEQS=$VLLM_MAX_NUM_SEQS VLLM_TENSOR_PARALLEL_SIZE=$VLLM_TENSOR_PARALLEL_SIZE VLLM_EXTRA_ARGS=[$VLLM_EXTRA_ARGS]" diff --git a/docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md b/docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md new file mode 100644 index 000000000..b1984a6c1 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md @@ -0,0 +1,82 @@ +# Phase47 Dense Serving Snapshot Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Use the newly parameterized harness to collect an audited dense paged-vs-vLLM serving snapshot, without changing inference code. + +**Architecture:** Run `paged-current-serving-snapshot.sh` against the dense GGUF and dense vLLM model with `SERVED_MODEL_NAME=dense-q36`. Keep the standard pre/post paged inference gates and `MUL_MAT,MUL_MAT_ID` op checks. + +**Tech Stack:** Bash serving harness, DGX, LocalAI parity docs. 
+ +--- + +### Task 1: Dry-run dense snapshot inputs + +**Files:** +- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [x] **Step 1: Run DGX dry-run** + +```bash +ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase47_dense_serving_dryrun/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin MODEL=$HOME/bench/q36-27b-nvfp4.gguf VLLM_MODEL=$HOME/bench/q36-27b-nvfp4-vllm SERVED_MODEL_NAME=dense-q36 ART=$ART NPL="1" PARALLEL=1 CTX=4096 PTOK=16 GEN=4 DRY_RUN=1 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `0`, docker/local-ai-worker/GPU compute all zero, dense model paths validated, and `SERVED_MODEL_NAME=dense-q36` printed. + +### Task 2: Run audited dense serving snapshot + +**Files:** +- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [ ] **Step 1: Run full dense snapshot after Phase48 hardening** + +```bash +ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase47_dense_serving/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin MODEL=$HOME/bench/q36-27b-nvfp4.gguf VLLM_MODEL=$HOME/bench/q36-27b-nvfp4-vllm SERVED_MODEL_NAME=dense-q36 ART=$ART NPL="1 8 32 128" PARALLEL=128 CTX=131072 PTOK=128 GEN=64 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: full run exits `0`, pre/post gates are green, and `summary.tsv` contains paged-vs-vLLM ratios for `n=1/8/32/128`. + +First attempt status: incomplete at +`/home/mudler/bench/phase47_dense_serving/20260701_095151`. Pre-gates and the +paged arm completed, but vLLM startup exceeded the old fixed readiness budget +and produced no vLLM result JSONs. Retry only after Phase48 readiness hardening. 
+ +### Task 3: Record dense snapshot result + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify: `docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md` + +- [ ] **Step 1: Summarize artifact outputs** + +Record the dry-run artifact, full snapshot artifact, pre/post md5/op gate status, and the ratio rows from `summary.tsv`. + +- [ ] **Step 2: Mark completed plan items** + +Mark this plan's checkboxes complete only after the corresponding command or docs update has happened. + +### Task 4: Commit + +**Files:** +- Commit Phase47 docs and plan changes. + +- [ ] **Step 1: Run final checks** + +```bash +git diff --check +git status --short +``` + +Expected: no whitespace errors; only intended docs/plan changes plus the pre-existing untracked `.claude/`. + +- [ ] **Step 2: Commit** + +```bash +git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +git add -f docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md +git commit -m "docs(paged): record dense serving snapshot" -m "Assisted-by: Codex:gpt-5" +``` diff --git a/docs/superpowers/plans/2026-07-01-serving-harness-readiness-phase48.md b/docs/superpowers/plans/2026-07-01-serving-harness-readiness-phase48.md new file mode 100644 index 000000000..d8ec2b8f4 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-serving-harness-readiness-phase48.md @@ -0,0 +1,165 @@ +# Phase48 Serving Harness Readiness Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. 
Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Make the audited serving snapshot harness robust to slow vLLM dense startup and non-exiting server processes. + +**Architecture:** Keep the fix local to `paged-current-serving-snapshot.sh`: add separate llama/vLLM readiness budgets, bound each HTTP probe with `curl --max-time`, and replace unbounded server cleanup waits with a short graceful wait followed by `SIGKILL`. + +**Tech Stack:** Bash harness, DGX dry-run, LocalAI parity docs. + +--- + +### Task 1: Prove the robustness controls are absent + +**Files:** +- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [x] **Step 1: Run readiness-budget red check** + +```bash +backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help | grep -F 'VLLM_READY_ATTEMPTS' +``` + +Expected: exit `1`. + +- [x] **Step 2: Run bounded-curl red check** + +```bash +grep -F 'curl --max-time' backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `1`. + +- [x] **Step 3: Run cleanup hard-kill red check** + +```bash +grep -F 'kill -9 "$SERVER_PID"' backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `1`. 
+ +### Task 2: Patch readiness and cleanup + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [x] **Step 1: Add documented environment variables** + +Add: + +```bash + LLAMA_READY_ATTEMPTS llama-server readiness attempts, one per second (default: 240) + VLLM_READY_ATTEMPTS vLLM readiness attempts, one per second (default: 600) +``` + +- [x] **Step 2: Add defaults** + +```bash +LLAMA_READY_ATTEMPTS=${LLAMA_READY_ATTEMPTS:-240} +VLLM_READY_ATTEMPTS=${VLLM_READY_ATTEMPTS:-600} +``` + +- [x] **Step 3: Bound HTTP probes** + +Change `wait_http()` to accept an attempts argument and run: + +```bash +curl --max-time 2 -fsS "$url" > "$health" 2>"$health.err" +``` + +- [x] **Step 4: Use per-server readiness budgets** + +Call `wait_http` with `$LLAMA_READY_ATTEMPTS` for llama-server and `$VLLM_READY_ATTEMPTS` for vLLM. + +- [x] **Step 5: Add bounded process cleanup** + +Create `stop_server_pid()` that sends `SIGTERM`, waits up to 30 seconds, sends `SIGKILL` if needed, and only then calls `wait`. + +### Task 3: Verify the harness fix + +**Files:** +- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [x] **Step 1: Shell syntax check** + +```bash +bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `0`. + +- [x] **Step 2: Help-text green check** + +```bash +backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help | grep -F 'VLLM_READY_ATTEMPTS' +``` + +Expected: exit `0`. + +- [x] **Step 3: Bounded-curl green check** + +```bash +grep -F 'curl --max-time 2 -fsS "$url"' backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `0`. + +- [x] **Step 4: Cleanup hard-kill green check** + +```bash +grep -F 'kill -9 "$SERVER_PID"' backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `0`. 
+ +- [x] **Step 5: DGX dry-run with long vLLM readiness budget** + +```bash +ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase48_readiness_harness_dryrun/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin MODEL=$HOME/bench/q36-27b-nvfp4.gguf VLLM_MODEL=$HOME/bench/q36-27b-nvfp4-vllm SERVED_MODEL_NAME=dense-q36 ART=$ART NPL="1" PARALLEL=1 CTX=4096 PTOK=16 GEN=4 DRY_RUN=1 VLLM_READY_ATTEMPTS=700 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `0`, clean preflight, and dry-run output includes `VLLM_READY_ATTEMPTS=700`. + +### Task 4: Record Phase48 and failed Phase47 attempt + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify: `docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md` +- Modify: `docs/superpowers/plans/2026-07-01-serving-harness-readiness-phase48.md` + +- [x] **Step 1: Record Phase47 as failed/incomplete** + +Record the partial artifact and the root cause: vLLM dense startup exceeded the old 240-attempt readiness budget, and cleanup could hang waiting on the server PID. + +- [x] **Step 2: Record Phase48 fix** + +Record the new readiness variables, bounded curl probe, bounded cleanup, and dry-run artifact. + +### Task 5: Commit + +**Files:** +- Commit the Phase48 harness, docs, and plan changes. + +- [x] **Step 1: Run final checks** + +```bash +git diff --check +git status --short +``` + +Expected: no whitespace errors; only intended files changed plus the pre-existing untracked `.claude/`. 
+ +- [x] **Step 2: Commit** + +```bash +git add backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh \ + backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +git add -f docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md \ + docs/superpowers/plans/2026-07-01-serving-harness-readiness-phase48.md +git commit -m "fix(paged): harden serving snapshot readiness" -m "Assisted-by: Codex:gpt-5" +```
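
Reviewer note (not part of the commit): the bounded cleanup that Phase 48 describes can be exercised without a DGX or vLLM. The following is a minimal sketch under stated assumptions, not the harness code. `stop_pid_bounded`, its grace-seconds argument, and the SIGTERM-ignoring stand-in process are hypothetical names introduced only for illustration. The harness's own `stop_server_pid` works on the global `SERVER_PID` with a fixed 30-second wait.

```shell
#!/usr/bin/env bash
# Sketch only: a standalone variant of the SIGTERM -> bounded wait -> SIGKILL idea.
# stop_pid_bounded and its second argument are hypothetical, not harness names.
set -u

stop_pid_bounded() {
  local pid="$1"
  local grace="${2:-30}"   # seconds to wait after SIGTERM before escalating
  local waited=0
  kill "$pid" >/dev/null 2>&1 || true            # polite SIGTERM first
  while (( waited < grace )); do
    kill -0 "$pid" >/dev/null 2>&1 || break      # process is gone, stop polling
    sleep 1
    waited=$(( waited + 1 ))
  done
  if kill -0 "$pid" >/dev/null 2>&1; then
    kill -9 "$pid" >/dev/null 2>&1 || true       # still alive: escalate to SIGKILL
  fi
  # The process has exited or been sent SIGKILL, so this should return promptly.
  wait "$pid" >/dev/null 2>&1 || true
}

# Stand-in for a server that ignores SIGTERM (assumption for the demo).
bash -c 'trap "" TERM; while :; do sleep 1; done' >/dev/null 2>&1 &
stubborn_pid=$!
sleep 1   # give the child time to install its trap

start=$SECONDS
stop_pid_bounded "$stubborn_pid" 2   # short grace period keeps the demo fast
elapsed=$(( SECONDS - start ))

if kill -0 "$stubborn_pid" >/dev/null 2>&1; then
  still_running=yes
else
  still_running=no
fi
echo "still_running=$still_running elapsed=${elapsed}s"
```

With a direct `kill` followed by an unbounded `wait`, the same stand-in process would block the script indefinitely, which matches the hang reported for the Phase 47 cleanup. The polling loop relies on bash reaping background children as they exit, so `kill -0` starts failing once the process is gone.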