diff --git a/backend/cpp/llama-cpp-localai-paged/README.md b/backend/cpp/llama-cpp-localai-paged/README.md
index 426176294..f492af432 100644
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -614,3 +614,10 @@ DGX mirror `f2521ab12`, artifact
 | 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
 | 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
 | 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |
+
+Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving
+snapshots. It targets the clean `~/llama-phase6-source` mirror, checks
+docker/`local-ai-worker`/GPU-idle state, uses the owner-file lock, runs pre/post
+inference gates, and emits paged/vLLM ratios. Do not use the stale DGX
+`~/bench/combined_definitive.sh` without first porting it to the current mirror
+and lock discipline.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index 65fb1e00b..7488ab4b1 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -1405,3 +1405,39 @@ Decision:
 - Keep MTP scheduler work closed. The next credible parity path is either a
   datacenter-Blackwell rerun or a larger fused-kernel project outside the
   low-conflict GB10 patch stack.
+
+## Phase 21 Current-Stack Serving Harness
+
+Phase 21 made the Phase 20 current-stack serving snapshot repeatable from the
+LocalAI backend tree.
+
+New script:
+
+- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
+
+Purpose:
+
+- targets the clean `~/llama-phase6-source` mirror by default;
+- rejects busy docker, `local-ai-worker`, GPU compute, or owned GPU-lock state;
+- builds the current llama.cpp targets;
+- runs pre/post `paged-inference-gates.sh`;
+- runs paged and vLLM serving arms with the same h2h client;
+- writes paged/vLLM ratio summaries.
+
+Verification:
+
+- local `bash -n` passed;
+- local `--help` passed;
+- DGX `DRY_RUN=1` validated required paths and preflight without launching
+  servers.
+
+Dry-run artifact:
+
+- `/home/mudler/bench/phase21_harness_dryrun/20260701_051757`
+
+Decision:
+
+- Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving
+  snapshots.
+- Do not use stale DGX `~/bench/combined_definitive.sh` without porting it to
+  `~/llama-phase6-source` and the owner-file lock discipline.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 7167e39c2..9854e1eba 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -304,6 +304,17 @@ This keeps the GB10 shortcut closure intact: do not reopen MTP or small
 scheduler work. The credible next parity path is a datacenter-Blackwell rerun
 or a larger fused-kernel project outside this low-conflict patch stack.
 
+Phase 21 added a reusable current-stack serving harness:
+`backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`.
+It defaults to `~/llama-phase6-source`, validates docker/`local-ai-worker`/GPU
+idle state, uses the owner-file lock, runs pre/post inference gates, compares
+paged and vLLM with h2h, and writes ratio summaries. DGX dry run passed at
+`/home/mudler/bench/phase21_harness_dryrun/20260701_051757`.
+
+Use this harness for future current-stack GB10 snapshots. Do not reuse
+`~/bench/combined_definitive.sh` unless it is first ported away from stale
+`~/llama-paged-dev` paths and old lock assumptions.
+
 ---
 
 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 391356a11..9dd0e82d6 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -644,6 +644,23 @@ credible parity path is not another MTP/scheduler shortcut; it is either the
 documented datacenter-Blackwell rerun or a larger fused-kernel project outside
 the low-conflict GB10 patch stack.
 
+### Phase 21 current-stack harness
+
+Phase 21 added `paged-current-serving-snapshot.sh` so Phase 20 can be repeated
+without the stale DGX `combined_definitive.sh` assumptions. The script defaults
+to `~/llama-phase6-source`, enforces docker/`local-ai-worker`/GPU-idle preflight,
+uses the owner-file lock, runs pre/post md5/op gates, runs paged and vLLM in the
+same session, and emits ratio rows in `summary.tsv`.
+
+Verification:
+
+- local `bash -n` and `--help` passed;
+- DGX `DRY_RUN=1` passed and wrote
+  `/home/mudler/bench/phase21_harness_dryrun/20260701_051757`.
+
+Use this harness for future current-stack GB10 snapshots before making parity
+claims.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
diff --git a/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
new file mode 100755
index 000000000..730de4960
--- /dev/null
+++ b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
@@ -0,0 +1,268 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+usage() {
+  cat <<'EOF'
+Usage: paged-current-serving-snapshot.sh
+
+Run a current-stack paged llama.cpp vs vLLM MoE serving snapshot on DGX.
+
+This harness uses the clean llama.cpp mirror by default, not stale development
+trees. It runs pre/post paged inference gates, then a same-session serving
+comparison with the h2h client.
+
+Environment overrides:
+  SRC          llama.cpp source dir (default: ~/llama-phase6-source)
+  BIN          llama.cpp build bin dir (default: $SRC/build-cuda/bin)
+  MODEL        paged GGUF path (default: ~/bench/q36-35b-a3b-nvfp4.gguf)
+  VLLM_MODEL   vLLM model dir (default: ~/bench/q36-35b-a3b-nvfp4-vllm)
+  H2H          h2h client (default: ~/bench/h2h_cli3.py)
+  ART          artifact dir (default: ~/bench/phase_current_serving_snapshot/<timestamp>)
+  NPL          concurrency list (default: "8 32 128")
+  PTOK         prompt filler words (default: 128)
+  GEN          generated tokens (default: 64)
+  CTX          llama-server context (default: 131072)
+  PARALLEL     llama-server parallel slots (default: 128)
+  BATCH        llama-server logical batch (default: 2048)
+  UBATCH       llama-server physical batch (default: 512)
+  LLAMA_PORT   llama-server port (default: 8098)
+  VLLM_PORT    vLLM port (default: 8000)
+  VLLM_BIN     vLLM executable (default: ~/vllm-bench/bin/vllm)
+  SKIP_GATES=1 to skip pre/post paged inference gates
+  DRY_RUN=1    validate inputs/preflight and print commands without running servers
+EOF
+}
+
+if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
+  usage
+  exit 0
+fi
+
+SRC=${SRC:-"$HOME/llama-phase6-source"}
+BIN=${BIN:-"$SRC/build-cuda/bin"}
+MODEL=${MODEL:-"$HOME/bench/q36-35b-a3b-nvfp4.gguf"}
+VLLM_MODEL=${VLLM_MODEL:-"$HOME/bench/q36-35b-a3b-nvfp4-vllm"}
+H2H=${H2H:-"$HOME/bench/h2h_cli3.py"}
+ART=${ART:-"$HOME/bench/phase_current_serving_snapshot/$(date +%Y%m%d_%H%M%S)"}
+NPL=${NPL:-"8 32 128"}
+PTOK=${PTOK:-128}
+GEN=${GEN:-64}
+CTX=${CTX:-131072}
+PARALLEL=${PARALLEL:-128}
+BATCH=${BATCH:-2048}
+UBATCH=${UBATCH:-512}
+LLAMA_PORT=${LLAMA_PORT:-8098}
+VLLM_PORT=${VLLM_PORT:-8000}
+VLLM_BIN=${VLLM_BIN:-"$HOME/vllm-bench/bin/vllm"}
+SKIP_GATES=${SKIP_GATES:-0}
+DRY_RUN=${DRY_RUN:-0}
+
+LOCK_DIR="$HOME/gpu_bench_lock"
+OWNER="$LOCK_DIR/owner"
+SERVER_PID=""
+
+log() {
+  printf '[%s] %s\n' "$(date -Is)" "$*" | tee -a "$ART/run.log"
+}
+
+require_path() {
+  if [[ ! -e "$1" ]]; then
+    echo "missing required path: $1" >&2
+    exit 2
+  fi
+}
+
+preflight() {
+  mkdir -p "$ART"
+  local docker_count local_ai compute owner
+  docker_count=$(docker ps -q | wc -l)
+  local_ai=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true)
+  compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed '/^$/d' | wc -l)
+  owner="FREE-no-lock-file"
+  if [[ -f "$OWNER" ]]; then
+    owner=$(cat "$OWNER")
+  fi
+  {
+    echo "docker=$docker_count"
+    echo "local_ai_worker=$local_ai"
+    echo "compute=$compute"
+    echo "$owner"
+  } | tee "$ART/preflight.txt"
+  [[ "$docker_count" == "0" ]]
+  [[ "$local_ai" == "0" ]]
+  [[ "$compute" == "0" ]]
+  case "$owner" in
+    FREE*|FREE-no-lock-file) ;;
+    *) echo "GPU lock is busy: $owner" >&2; exit 3 ;;
+  esac
+}
+
+acquire_lock() {
+  mkdir -p "$LOCK_DIR"
+  echo "codex-current-serving-snapshot $(date +%s)" > "$OWNER"
+}
+
+release_lock() {
+  if [[ -n "$SERVER_PID" ]]; then
+    kill "$SERVER_PID" >/dev/null 2>&1 || true
+    wait "$SERVER_PID" >/dev/null 2>&1 || true
+    SERVER_PID=""
+  fi
+  pkill -9 -f "[l]lama-server.*--port $LLAMA_PORT" >/dev/null 2>&1 || true
+  pkill -9 -u "$(id -u)" -f "[v]llm serve" >/dev/null 2>&1 || true
+  mkdir -p "$LOCK_DIR"
+  echo "FREE released-by-codex-current-serving-snapshot $(date +%s)" > "$OWNER"
+}
+
+wait_http() {
+  local url="$1"
+  local pattern="$2"
+  local log_file="$3"
+  local health="$4"
+  for _ in $(seq 1 240); do
+    if curl -fsS "$url" > "$health" 2>"$health.err" && grep -q "$pattern" "$health"; then
+      return 0
+    fi
+    if [[ -n "$SERVER_PID" ]] && ! kill -0 "$SERVER_PID" >/dev/null 2>&1; then
+      tail -120 "$log_file" >&2 || true
+      return 1
+    fi
+    sleep 1
+  done
+  tail -120 "$log_file" >&2 || true
+  return 1
+}
+
+run_gate() {
+  local name="$1"
+  if [[ "$SKIP_GATES" == "1" ]]; then
+    log "skipping $name inference gate"
+    return
+  fi
+  log "running $name inference gate"
+  ART="$ART/gate_$name" "$HOME/paged-inference-gates.sh" > "$ART/gate_$name.log" 2>&1
+  cat "$ART/gate_$name.log" | tee -a "$ART/run.log"
+}
+
+run_paged() {
+  local arm_dir="$ART/paged"
+  mkdir -p "$arm_dir"
+  log "starting paged current-stack server"
+  cd "$BIN"
+  env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 \
+    ./llama-server \
+    -m "$MODEL" -ngl 99 -fa on -c "$CTX" -b "$BATCH" -ub "$UBATCH" \
+    --parallel "$PARALLEL" --host 127.0.0.1 --port "$LLAMA_PORT" --no-webui \
+    > "$arm_dir/server.log" 2>&1 &
+  SERVER_PID=$!
+  wait_http "http://127.0.0.1:$LLAMA_PORT/health" "ok" "$arm_dir/server.log" "$arm_dir/health.json"
+  python3 "$H2H" --url "http://127.0.0.1:$LLAMA_PORT/v1/completions" \
+    --model q36 -n 8 --ptok "$PTOK" --gen 16 --nonce "warm_paged_$(date +%s)" --no-cache >/dev/null
+  for n in $NPL; do
+    log "paged n=$n"
+    python3 "$H2H" --url "http://127.0.0.1:$LLAMA_PORT/v1/completions" \
+      --model q36 -n "$n" --ptok "$PTOK" --gen "$GEN" \
+      --nonce "paged_${n}_$(date +%s)" --no-cache > "$arm_dir/n${n}.json"
+    cat "$arm_dir/n${n}.json" | tee -a "$ART/run.log"
+  done
+  kill "$SERVER_PID" >/dev/null 2>&1 || true
+  wait "$SERVER_PID" >/dev/null 2>&1 || true
+  SERVER_PID=""
+  sleep 3
+}
+
+run_vllm() {
+  local arm_dir="$ART/vllm"
+  mkdir -p "$arm_dir"
+  export PATH="$(dirname "$VLLM_BIN"):$PATH"
+  export VLLM_LOGGING_LEVEL=${VLLM_LOGGING_LEVEL:-INFO}
+  export HF_HUB_OFFLINE=${HF_HUB_OFFLINE:-1}
+  log "starting vLLM server"
+  nohup "$VLLM_BIN" serve "$VLLM_MODEL" \
+    --served-model-name q36 --gpu-memory-utilization 0.85 --max-model-len 4096 \
+    --max-num-seqs 256 --host 127.0.0.1 --port "$VLLM_PORT" --tensor-parallel-size 1 \
+    > "$arm_dir/server.log" 2>&1 &
+  SERVER_PID=$!
+  wait_http "http://127.0.0.1:$VLLM_PORT/v1/models" "q36" "$arm_dir/server.log" "$arm_dir/models.json"
+  python3 "$H2H" --url "http://127.0.0.1:$VLLM_PORT/v1/completions" \
+    --model q36 -n 8 --ptok "$PTOK" --gen 16 --nonce "warm_vllm_$(date +%s)" --no-cache >/dev/null
+  for n in $NPL; do
+    log "vllm n=$n"
+    python3 "$H2H" --url "http://127.0.0.1:$VLLM_PORT/v1/completions" \
+      --model q36 -n "$n" --ptok "$PTOK" --gen "$GEN" \
+      --nonce "vllm_${n}_$(date +%s)" --no-cache > "$arm_dir/n${n}.json"
+    cat "$arm_dir/n${n}.json" | tee -a "$ART/run.log"
+  done
+  kill "$SERVER_PID" >/dev/null 2>&1 || true
+  pkill -9 -u "$(id -u)" -f "[v]llm serve" >/dev/null 2>&1 || true
+  wait "$SERVER_PID" >/dev/null 2>&1 || true
+  SERVER_PID=""
+  sleep 5
+}
+
+write_summary() {
+  python3 - "$ART" <<'PY' | tee "$ART/summary.tsv"
+import json
+import sys
+from pathlib import Path
+
+art = Path(sys.argv[1])
+rows = []
+for arm in ("paged", "vllm"):
+    for path in sorted((art / arm).glob("n*.json")):
+        data = json.loads(path.read_text())
+        rows.append((arm, data["n"], data["agg_tps"], data["decode_agg_tps"],
+                     data["decode_perseq_tps"], data["prefill_tps"],
+                     data["ttft_mean_ms"], data["wall_s"]))
+
+print("arm\tn\tagg_tps\tdecode_agg_tps\tdecode_perseq_tps\tprefill_tps\tttft_mean_ms\twall_s")
+for row in rows:
+    print("\t".join(str(x) for x in row))
+
+by_key = {(row[0], row[1]): row for row in rows}
+print("\nratio\tn\tpaged_decode_over_vllm\tpaged_perseq_over_vllm\tpaged_agg_over_vllm\tpaged_ttft_over_vllm")
+for n in sorted({row[1] for row in rows}):
+    paged = by_key.get(("paged", n))
+    vllm = by_key.get(("vllm", n))
+    if not paged or not vllm:
+        continue
+    print(f"ratio\t{n}\t{paged[3]/vllm[3]:.4f}\t{paged[4]/vllm[4]:.4f}\t{paged[2]/vllm[2]:.4f}\t{paged[6]/vllm[6]:.4f}")
+PY
+}
+
+require_path "$SRC"
+require_path "$BIN/llama-server"
+require_path "$BIN/llama-completion"
+require_path "$BIN/test-backend-ops"
+require_path "$MODEL"
+require_path "$VLLM_MODEL"
+require_path "$H2H"
+require_path "$VLLM_BIN"
+require_path "$HOME/paged-inference-gates.sh"
+
+preflight
+log "artifact=$ART"
+log "source=$(git -C "$SRC" log --oneline -1)"
+
+if [[ "$DRY_RUN" == "1" ]]; then
+  log "dry run only; commands validated"
+  log "would build: cmake --build $SRC/build-cuda --target llama-server llama-completion test-backend-ops -j8"
+  log "would run paged NPL=[$NPL] PTOK=$PTOK GEN=$GEN"
+  log "would run vLLM NPL=[$NPL] PTOK=$PTOK GEN=$GEN"
+  exit 0
+fi
+
+log "building llama-server, llama-completion, and test-backend-ops"
+cmake --build "$SRC/build-cuda" --target llama-server llama-completion test-backend-ops -j 8 \
+  > "$ART/build.log" 2>&1
+
+run_gate pre
+acquire_lock
+trap release_lock EXIT
+run_paged
+run_vllm
+release_lock
+trap - EXIT
+run_gate post
+write_summary
+log "artifacts: $ART"
diff --git a/docs/superpowers/plans/2026-07-01-current-serving-harness-phase21.md b/docs/superpowers/plans/2026-07-01-current-serving-harness-phase21.md
new file mode 100644
index 000000000..c4e035a86
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-current-serving-harness-phase21.md
@@ -0,0 +1,107 @@
+# Current Serving Harness Phase 21 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use
+> superpowers:verification-before-completion before recording the phase result.
+> Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** make the Phase 20 current-stack paged-vs-vLLM serving snapshot
+repeatable from the LocalAI backend tree.
+
+**Architecture:** add a standalone shell harness beside the existing paged
+inference gate and MTP serving harness. The script targets the clean
+`~/llama-phase6-source` mirror, uses the owner-file GPU lock, runs pre/post
+inference gates, compares paged and vLLM in one session, and writes ratio
+summaries.
+
+**Tech Stack:** Bash, llama.cpp `llama-server`, vLLM, `h2h_cli3.py`, DGX GB10.
+
+---
+
+## Task 1: Red Check
+
+- [x] **Step 1: Prove no reusable current-stack harness exists**
+
+  Command:
+
+  ```bash
+  test -e backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+  ```
+
+  Result:
+
+  - exited `1` before the patch, as expected.
+
+## Task 2: Add Harness
+
+- [x] **Step 1: Create script**
+
+  File:
+
+  - `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
+
+  Features:
+
+  - defaults to `~/llama-phase6-source`, not stale `~/llama-paged-dev`;
+  - checks docker, `local-ai-worker`, GPU compute processes, and owner-file lock;
+  - builds `llama-server`, `llama-completion`, and `test-backend-ops`;
+  - runs pre/post `paged-inference-gates.sh`;
+  - runs paged and vLLM serving arms with the same h2h client;
+  - writes `summary.tsv` with paged/vLLM ratios;
+  - supports `DRY_RUN=1` for path/preflight validation without servers.
+
+## Task 3: Verify Harness
+
+- [x] **Step 1: Local syntax/help checks**
+
+  Commands:
+
+  ```bash
+  test -x backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+  bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+  backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help
+  ```
+
+  Result:
+
+  - all passed.
+
+- [x] **Step 2: DGX dry run**
+
+  Command:
+
+  ```bash
+  DRY_RUN=1 ART=~/bench/phase21_harness_dryrun/20260701_051757 \
+    /tmp/paged-current-serving-snapshot.sh
+  ```
+
+  Result:
+
+  - verified `docker=0`, `local_ai_worker=0`, `compute=0`;
+  - verified owner file was free;
+  - found current source `f2521ab12`;
+  - validated required paths and printed the build/paged/vLLM commands without
+    launching servers.
+
+  Artifact:
+
+  - `/home/mudler/bench/phase21_harness_dryrun/20260701_051757`
+
+## Task 4: Future Use
+
+- [x] **Step 1: Prefer this harness for current snapshots**
+
+  Use this script for future current-stack GB10 parity snapshots:
+
+  ```bash
+  backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+  ```
+
+  Do not use the stale DGX `~/bench/combined_definitive.sh` without first
+  porting it to the clean mirror and owner-file lock discipline.
+
+## Self-Review
+
+- No llama.cpp source behavior changed.
+- The harness is repeatable and defaults to the current clean mirror.
+- The dry run covered path validation and DGX preflight without consuming GPU
+  benchmark time.
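
The owner-file lock discipline that the harness and the docs refer to can be exercised in isolation. The following is a minimal sketch, not part of the patch: it assumes only the convention visible in `preflight`, `acquire_lock`, and `release_lock` (a missing owner file or content starting with `FREE` means the lock is free; anything else means busy). The helper names `lock_state`, `lock_acquire`, and `lock_release` and the `demo-snapshot` tag are hypothetical, and the demo runs against a throwaway directory rather than `~/gpu_bench_lock`.

```bash
#!/usr/bin/env bash
# Sketch of the owner-file lock convention used by the harness preflight.
# lock_state, lock_acquire and lock_release are hypothetical helper names.
set -euo pipefail

# Print "free" when the owner file is absent or starts with FREE, else "busy".
lock_state() {
  local owner_file="$1" owner="FREE-no-lock-file"
  if [[ -f "$owner_file" ]]; then
    owner=$(cat "$owner_file")
  fi
  case "$owner" in
    FREE*) echo "free" ;;
    *)     echo "busy" ;;
  esac
}

# Record an owner tag plus epoch seconds, mirroring acquire_lock.
lock_acquire() {
  local owner_file="$1" tag="$2"
  mkdir -p "$(dirname "$owner_file")"
  echo "$tag $(date +%s)" > "$owner_file"
}

# Mark the lock free again, mirroring release_lock.
lock_release() {
  local owner_file="$1" tag="$2"
  mkdir -p "$(dirname "$owner_file")"
  echo "FREE released-by-$tag $(date +%s)" > "$owner_file"
}

# Demo against a temporary directory so no real lock is touched.
demo_dir=$(mktemp -d)
owner_file="$demo_dir/gpu_bench_lock/owner"
lock_state "$owner_file"                     # owner file does not exist yet
lock_acquire "$owner_file" "demo-snapshot"
lock_state "$owner_file"                     # owner file holds a non-FREE tag
lock_release "$owner_file" "demo-snapshot"
lock_state "$owner_file"                     # owner file starts with FREE again
rm -rf "$demo_dir"
```

As in the script itself, the check and the write are separate steps, so this is a cooperative convention between benchmark runs rather than an atomic lock.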