fix(paged): harden serving snapshot readiness

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 08:07:48 +00:00
parent e69ee0e867
commit 440129c98e
6 changed files with 401 additions and 15 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2615,3 +2615,81 @@ Decision:
 - This does not claim a new parity result. Full runs still require the normal
  preflight, `hardware.txt`, pre/post md5 gates, `MUL_MAT`/`MUL_MAT_ID`, and
  KL-if-md5-changes gates before interpreting throughput.
+
+## Phase 47 Dense Serving Snapshot Attempt
+
+Phase 47 attempted to use the Phase 46 model-name override for a dense
+paged-vs-vLLM serving snapshot. The first full attempt is incomplete and must
+not be used as a dense parity result.
+
+Artifacts:
+
+- Dry-run: `/home/mudler/bench/phase47_dense_serving_dryrun/20260701_095141`
+- Incomplete full attempt:
+  `/home/mudler/bench/phase47_dense_serving/20260701_095151`
+
+Run shape:
+
+- `MODEL=$HOME/bench/q36-27b-nvfp4.gguf`
+- `VLLM_MODEL=$HOME/bench/q36-27b-nvfp4-vllm`
+- `SERVED_MODEL_NAME=dense-q36`
+- `NPL="1 8 32 128"`, `PARALLEL=128`, `CTX=131072`, `PTOK=128`, `GEN=64`
+- `OPS=MUL_MAT,MUL_MAT_ID`
+
+Completed before failure:
+
+- Preflight was clean: docker `0`, `local-ai-worker` `0`, GPU compute `0`.
+- Pre-gates were green: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense
+  md5 `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`,
+  `MUL_MAT_ID` `806/806`.
+- Paged dense arm completed through `n=128`:
+
+| n | paged decode agg t/s | paged per-seq t/s | paged agg t/s | paged TTFT ms |
+|---|----------------------|-------------------|----------------|---------------|
+| 1 | `13.3` | `13.14` | `12.5` | `312.3` |
+| 8 | `85.5` | `10.35` | `62.5` | `2068.5` |
+| 32 | `198.1` | `5.44` | `105.1` | `7608.5` |
+| 128 | `361.8` | `1.89` | `143.0` | `20501.7` |
+
+Failure/root cause:
+
+- vLLM dense startup exceeded the old fixed readiness budget of `240` one-second
+  attempts. The server log showed weight loading alone took about `199.43s`,
+  followed by compile, autotune, CUDA graph capture, and multimodal warmup
+  before the server began listening.
+- Readiness never succeeded (`vllm/models.json` is empty, `models.json.err`
+  holds an initial connection failure), so no vLLM result JSONs were produced.
+- Cleanup then blocked in an unbounded `wait` on the vLLM server PID after
+  `SIGTERM`; manual cleanup was required. DGX was returned to idle with owner
+  `FREE released-by-codex-phase47-cleanup 1782892962`.
+
+Decision:
+
+- Treat this artifact as a harness failure investigation, not a benchmark.
+- Retry Phase 47 only after the Phase 48 readiness/cleanup hardening is present.
+
+## Phase 48 Serving Harness Readiness Hardening
+
+Phase 48 fixes the harness behavior exposed by the failed dense snapshot
+attempt. It is a harness reliability change, not an inference change.
+
+Changes:
+
+- Add `LLAMA_READY_ATTEMPTS` (default `240`) and `VLLM_READY_ATTEMPTS` (default
+  `600`) so slow vLLM model load/compile paths can be pre-budgeted.
+- Bound each HTTP readiness probe with `curl --max-time 2` so a single probe
+  cannot hang the readiness loop.
+- Replace direct `kill` plus unbounded `wait` with `stop_server_pid`, which
+  sends `SIGTERM`, waits up to 30 seconds, then sends `SIGKILL` before `wait`.
+- Use the bounded cleanup helper for normal paged teardown, normal vLLM
+  teardown, and error-path `release_lock`.
+
+Verification:
+
+- Red checks first proved `VLLM_READY_ATTEMPTS`, bounded curl, and hard-kill
+  cleanup were absent.
+- Green checks after the patch included `bash -n`, help-text grep, grep for
+  `curl --max-time 2 -fsS "$url"`, grep for `kill -9 "$SERVER_PID"`, and a DGX
+  dense dry-run with `VLLM_READY_ATTEMPTS=700`.
+- DGX dry-run artifact:
+  `/home/mudler/bench/phase48_readiness_harness_dryrun/20260701_100533`.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -582,6 +582,22 @@ engines. DGX dry run:
 `SERVED_MODEL_NAME=dense-q36` printed during `DRY_RUN=1`. This is harness-only
 hardware-pivot readiness, not a throughput result.

+Phase 47 attempted the first dense serving snapshot using the Phase 46 override.
+Dry-run artifact:
+`/home/mudler/bench/phase47_dense_serving_dryrun/20260701_095141`; incomplete
+full artifact: `/home/mudler/bench/phase47_dense_serving/20260701_095151`.
+Pre-gates were green and the paged dense arm completed through `n=128`, but the
+artifact is not a dense parity result because vLLM produced no result JSONs.
+Root cause: dense vLLM startup exceeded the old fixed readiness budget, and the
+cleanup path could wait indefinitely on the server PID after `SIGTERM`.
+
+Phase 48 hardens the serving snapshot harness for that failure mode. It adds
+`LLAMA_READY_ATTEMPTS` and `VLLM_READY_ATTEMPTS`, bounds HTTP readiness probes
+with `curl --max-time 2`, and uses bounded server cleanup that escalates from
+`SIGTERM` to `SIGKILL`. Dry-run artifact:
+`/home/mudler/bench/phase48_readiness_harness_dryrun/20260701_100533`, with
+`VLLM_READY_ATTEMPTS=700` printed and clean DGX preflight.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -669,6 +685,9 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase44_hardware_pivot_harness_dryrun/20260701_094038` - harness-only dry-run artifact proving the vLLM serving config overrides are printed and preflighted before any server starts.
 - `~/bench/phase45_inference_gate_guard/20260701_094320` - post-Phase44 inference guard; MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` backend-op gates green.
 - `~/bench/phase46_served_model_name_dryrun/20260701_094849` - harness-only dry-run artifact proving `SERVED_MODEL_NAME` is printed and preflighted before any server starts.
+- `~/bench/phase47_dense_serving_dryrun/20260701_095141` - dense serving dry-run with `SERVED_MODEL_NAME=dense-q36`.
+- `~/bench/phase47_dense_serving/20260701_095151` - incomplete dense serving attempt; pre-gates and paged arm completed, vLLM did not produce result JSONs under the old readiness budget.
+- `~/bench/phase48_readiness_harness_dryrun/20260701_100533` - harness dry-run proving configurable readiness budgets and clean preflight before retrying dense serving.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1138,6 +1138,33 @@ Preflight was clean and the dry run printed
 harness-only portability step for dense or hardware-pivot snapshots; it does not
 change inference code or produce a new throughput result.

+### Phase 47 dense serving snapshot attempt
+
+Phase 47 attempted a dense audited serving snapshot with
+`MODEL=$HOME/bench/q36-27b-nvfp4.gguf`,
+`VLLM_MODEL=$HOME/bench/q36-27b-nvfp4-vllm`, and
+`SERVED_MODEL_NAME=dense-q36`. Dry-run artifact:
+`/home/mudler/bench/phase47_dense_serving_dryrun/20260701_095141`.
+
+The full attempt at
+`/home/mudler/bench/phase47_dense_serving/20260701_095151` is incomplete and is
+not a parity result. Pre-gates passed and the paged dense arm completed through
+`n=128`, but vLLM dense startup exceeded the old fixed readiness budget before
+any vLLM result JSONs were produced. Use this artifact only as the root-cause
+input for Phase 48.
+
+### Phase 48 serving harness readiness hardening
+
+Phase 48 fixes the harness issue exposed by Phase 47. It adds
+`LLAMA_READY_ATTEMPTS` and `VLLM_READY_ATTEMPTS`, bounds each readiness probe
+with `curl --max-time 2`, and replaces direct server waits with bounded cleanup
+that escalates from `SIGTERM` to `SIGKILL`.
+
+DGX dry-run artifact:
+`/home/mudler/bench/phase48_readiness_harness_dryrun/20260701_100533`. The dry
+run printed `VLLM_READY_ATTEMPTS=700` with clean preflight. Retry dense serving
+snapshots with this hardening before interpreting dense paged-vs-vLLM ratios.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update
--- a/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+++ b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
@@ -28,8 +28,10 @@ Environment overrides:
  BATCH        llama-server logical batch (default: 2048)
  UBATCH       llama-server physical batch (default: 512)
  LLAMA_PORT   llama-server port (default: 8098)
+  LLAMA_READY_ATTEMPTS llama-server readiness attempts, one per second (default: 240)
  VLLM_PORT    vLLM port (default: 8000)
  VLLM_BIN     vLLM executable (default: ~/vllm-bench/bin/vllm)
+  VLLM_READY_ATTEMPTS  vLLM readiness attempts, one per second (default: 600)
  VLLM_GPU_MEMORY_UTILIZATION vLLM --gpu-memory-utilization (default: 0.85)
  VLLM_MAX_MODEL_LEN          vLLM --max-model-len (default: 4096)
  VLLM_MAX_NUM_SEQS           vLLM --max-num-seqs (default: 256)
@@ -80,8 +82,10 @@ PARALLEL=${PARALLEL:-128}
 BATCH=${BATCH:-2048}
 UBATCH=${UBATCH:-512}
 LLAMA_PORT=${LLAMA_PORT:-8098}
+LLAMA_READY_ATTEMPTS=${LLAMA_READY_ATTEMPTS:-240}
 VLLM_PORT=${VLLM_PORT:-8000}
 VLLM_BIN=${VLLM_BIN:-"$HOME/vllm-bench/bin/vllm"}
+VLLM_READY_ATTEMPTS=${VLLM_READY_ATTEMPTS:-600}
 VLLM_GPU_MEMORY_UTILIZATION=${VLLM_GPU_MEMORY_UTILIZATION:-0.85}
 VLLM_MAX_MODEL_LEN=${VLLM_MAX_MODEL_LEN:-4096}
 VLLM_MAX_NUM_SEQS=${VLLM_MAX_NUM_SEQS:-256}
@@ -179,24 +183,38 @@ acquire_lock() {
 }

 release_lock() {
-  if [[ -n "$SERVER_PID" ]]; then
-    kill "$SERVER_PID" >/dev/null 2>&1 || true
-    wait "$SERVER_PID" >/dev/null 2>&1 || true
-    SERVER_PID=""
-  fi
+  stop_server_pid
  pkill -9 -f "[l]lama-server.*--port $LLAMA_PORT" >/dev/null 2>&1 || true
  pkill -9 -u "$(id -u)" -f "[v]llm serve" >/dev/null 2>&1 || true
  mkdir -p "$LOCK_DIR"
  echo "FREE released-by-codex-current-serving-snapshot $(date +%s)" > "$OWNER"
 }

+stop_server_pid() {
+  if [[ -n "$SERVER_PID" ]]; then
+    kill "$SERVER_PID" >/dev/null 2>&1 || true
+    for _ in $(seq 1 30); do
+      if ! kill -0 "$SERVER_PID" >/dev/null 2>&1; then
+        break
+      fi
+      sleep 1
+    done
+    if kill -0 "$SERVER_PID" >/dev/null 2>&1; then
+      kill -9 "$SERVER_PID" >/dev/null 2>&1 || true
+    fi
+    wait "$SERVER_PID" >/dev/null 2>&1 || true
+    SERVER_PID=""
+  fi
+}
+
 wait_http() {
  local url="$1"
  local pattern="$2"
  local log_file="$3"
  local health="$4"
-  for _ in $(seq 1 240); do
-    if curl -fsS "$url" > "$health" 2>"$health.err" && grep -q "$pattern" "$health"; then
+  local attempts="$5"
+  for _ in $(seq 1 "$attempts"); do
+    if curl --max-time 2 -fsS "$url" > "$health" 2>"$health.err" && grep -q "$pattern" "$health"; then
      return 0
    fi
    if [[ -n "$SERVER_PID" ]] && ! kill -0 "$SERVER_PID" >/dev/null 2>&1; then
@@ -231,7 +249,7 @@ run_paged() {
      --parallel "$PARALLEL" --host 127.0.0.1 --port "$LLAMA_PORT" --no-webui \
      > "$arm_dir/server.log" 2>&1 &
  SERVER_PID=$!
-  wait_http "http://127.0.0.1:$LLAMA_PORT/health" "ok" "$arm_dir/server.log" "$arm_dir/health.json"
+  wait_http "http://127.0.0.1:$LLAMA_PORT/health" "ok" "$arm_dir/server.log" "$arm_dir/health.json" "$LLAMA_READY_ATTEMPTS"
  python3 "$H2H" --url "http://127.0.0.1:$LLAMA_PORT/v1/completions" \
    --model "$SERVED_MODEL_NAME" -n 8 --ptok "$PTOK" --gen 16 --nonce "warm_paged_$(date +%s)" --no-cache >/dev/null
  for n in $NPL; do
@@ -241,9 +259,7 @@ run_paged() {
      --nonce "paged_${n}_$(date +%s)" --no-cache > "$arm_dir/n${n}.json"
    cat "$arm_dir/n${n}.json" | tee -a "$ART/run.log"
  done
-  kill "$SERVER_PID" >/dev/null 2>&1 || true
-  wait "$SERVER_PID" >/dev/null 2>&1 || true
-  SERVER_PID=""
+  stop_server_pid
  sleep 3
 }

@@ -264,7 +280,7 @@ run_vllm() {
    "${extra_args[@]}" \
    > "$arm_dir/server.log" 2>&1 &
  SERVER_PID=$!
-  wait_http "http://127.0.0.1:$VLLM_PORT/v1/models" "$SERVED_MODEL_NAME" "$arm_dir/server.log" "$arm_dir/models.json"
+  wait_http "http://127.0.0.1:$VLLM_PORT/v1/models" "$SERVED_MODEL_NAME" "$arm_dir/server.log" "$arm_dir/models.json" "$VLLM_READY_ATTEMPTS"
  python3 "$H2H" --url "http://127.0.0.1:$VLLM_PORT/v1/completions" \
    --model "$SERVED_MODEL_NAME" -n 8 --ptok "$PTOK" --gen 16 --nonce "warm_vllm_$(date +%s)" --no-cache >/dev/null
  for n in $NPL; do
@@ -274,10 +290,8 @@ run_vllm() {
      --nonce "vllm_${n}_$(date +%s)" --no-cache > "$arm_dir/n${n}.json"
    cat "$arm_dir/n${n}.json" | tee -a "$ART/run.log"
  done
-  kill "$SERVER_PID" >/dev/null 2>&1 || true
+  stop_server_pid
  pkill -9 -u "$(id -u)" -f "[v]llm serve" >/dev/null 2>&1 || true
-  wait "$SERVER_PID" >/dev/null 2>&1 || true
-  SERVER_PID=""
  sleep 5
 }

@@ -397,6 +411,7 @@ if [[ "$DRY_RUN" == "1" ]]; then
  log "dry run only; commands validated"
  log "would build: cmake --build $BUILD_DIR --target llama-server llama-completion test-backend-ops -j8"
  log "served model: SERVED_MODEL_NAME=$SERVED_MODEL_NAME"
+  log "readiness: LLAMA_READY_ATTEMPTS=$LLAMA_READY_ATTEMPTS VLLM_READY_ATTEMPTS=$VLLM_READY_ATTEMPTS"
  log "would run paged NPL=[$NPL] PTOK=$PTOK GEN=$GEN"
  log "would run vLLM NPL=[$NPL] PTOK=$PTOK GEN=$GEN"
  log "vLLM config: VLLM_GPU_MEMORY_UTILIZATION=$VLLM_GPU_MEMORY_UTILIZATION VLLM_MAX_MODEL_LEN=$VLLM_MAX_MODEL_LEN VLLM_MAX_NUM_SEQS=$VLLM_MAX_NUM_SEQS VLLM_TENSOR_PARALLEL_SIZE=$VLLM_TENSOR_PARALLEL_SIZE VLLM_EXTRA_ARGS=[$VLLM_EXTRA_ARGS]"
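
Reviewer sketch (not part of the patch): the two patterns this commit introduces, a bounded readiness loop and a bounded `SIGTERM` to `SIGKILL` teardown, can be tried in isolation with the helpers below. The names `retry_bounded` and `stop_pid_bounded`, and their interval and grace parameters, are hypothetical and exist only for illustration; the harness itself uses `wait_http` and `stop_server_pid` as shown in the diff above.

```shell
# Hypothetical helpers for illustration; they are not defined in the harness.

# Run a probe command up to $1 times, sleeping $2 seconds after each failure.
# Returns 0 on the first successful probe, 1 once the budget is exhausted.
retry_bounded() {
  local attempts="$1" interval="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  return 1
}

# Stop a child process: send SIGTERM, poll for up to $2 seconds (default 30),
# send SIGKILL if it is still alive, then reap it so the final wait cannot hang.
stop_pid_bounded() {
  local pid="$1" grace="${2:-30}" i=0
  [ -n "$pid" ] || return 0
  kill "$pid" >/dev/null 2>&1 || true
  while [ "$i" -lt "$grace" ]; do
    kill -0 "$pid" >/dev/null 2>&1 || break
    sleep 1
    i=$((i + 1))
  done
  if kill -0 "$pid" >/dev/null 2>&1; then
    kill -9 "$pid" >/dev/null 2>&1 || true
  fi
  wait "$pid" >/dev/null 2>&1 || true
}
```

Under these assumptions, a vLLM-style readiness check would look like `retry_bounded 600 1 curl --max-time 2 -fsS "$url"`. Because each probe may itself take up to two seconds, the worst-case wall-clock budget is roughly three seconds per attempt rather than one.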