diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index 10ee7416d..8a5ac5429 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2747,3 +2747,25 @@ Verification:
  dense dry-run with `VLLM_READY_ATTEMPTS=700`.
 - DGX dry-run artifact:
  `/home/mudler/bench/phase48_readiness_harness_dryrun/20260701_100533`.
+
+## Phase 49 vLLM Env Hygiene
+
+Phase 49 cleans up benchmark log noise observed during the Phase47 retry. vLLM
+warned about harness-owned environment variables such as `VLLM_READY_ATTEMPTS`
+and `VLLM_MODEL` because they were inherited by the `vllm serve` process.
+
+Change:
+
+- Wrap `vllm serve` with `env -u` for harness-owned variables:
+ `VLLM_MODEL`, `VLLM_BIN`, `VLLM_READY_ATTEMPTS`,
+ `VLLM_GPU_MEMORY_UTILIZATION`, `VLLM_MAX_MODEL_LEN`, `VLLM_MAX_NUM_SEQS`,
+ `VLLM_TENSOR_PARALLEL_SIZE`, and `VLLM_EXTRA_ARGS`.
+- Keep intentional vLLM runtime variables such as `VLLM_LOGGING_LEVEL`.
+
+Verification:
+
+- Red grep first proved the scrub was absent.
+- Green checks after the patch included `bash -n`, grep for `-u VLLM_MODEL`,
+ and a DGX dense dry-run with `VLLM_READY_ATTEMPTS=700`.
+- DGX dry-run artifact:
+ `/home/mudler/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138`.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 09ec2a4f6..ea588bec8 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -606,6 +606,13 @@ gates were green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
 `1.1560x` at `n=8`) but falls behind at `n=32/128` (`0.9036x`, `0.7912x`), and
 TTFT remains `1.87x` to `4.05x` vLLM. This does not change the GB10 conclusion.

+Phase 49 removes vLLM log noise from harness-owned environment variables. The
+`vllm serve` child now unsets `VLLM_MODEL`, `VLLM_BIN`,
+`VLLM_READY_ATTEMPTS`, `VLLM_GPU_MEMORY_UTILIZATION`, `VLLM_MAX_MODEL_LEN`,
+`VLLM_MAX_NUM_SEQS`, `VLLM_TENSOR_PARALLEL_SIZE`, and `VLLM_EXTRA_ARGS` while
+preserving intentional vLLM runtime variables such as `VLLM_LOGGING_LEVEL`. Dry
+run: `/home/mudler/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138`.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -697,6 +704,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase47_dense_serving/20260701_095151` - incomplete dense serving attempt; pre-gates and paged arm completed, vLLM did not produce result JSONs under the old readiness budget.
 - `~/bench/phase48_readiness_harness_dryrun/20260701_100533` - harness dry-run proving configurable readiness budgets and clean preflight before retrying dense serving.
 - `~/bench/phase47_dense_serving_retry/20260701_100811` - completed dense serving snapshot after Phase48; pre/post md5 and op gates green; paged low-N decode ahead, high-N aggregate and TTFT behind.
+- `~/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138` - harness dry-run after scrubbing harness-owned `VLLM_*` variables from the `vllm serve` child environment.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
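Reviewer note: the doc changes above say the `vllm serve` child loses the harness-owned variables while the harness keeps using them on the command line. A minimal sketch of why both can hold: the parent shell expands `"$VLLM_MODEL"` into an argument before `env -u` runs, so only the child's environment is scrubbed. `child_report` and the `/models/demo` path are hypothetical, and `sh -c` stands in for the real `vllm serve` process. `env -u` is assumed to be available (GNU coreutils and BSD `env` provide it, but it is not required by POSIX).

```bash
#!/usr/bin/env bash
# Sketch only: `sh -c` stands in for `vllm serve`; all values are made up.

export VLLM_MODEL=/models/demo       # harness-owned: only feeds the command line
export VLLM_READY_ATTEMPTS=700       # harness-owned: only feeds the readiness loop
export VLLM_LOGGING_LEVEL=INFO       # vLLM runtime variable that should survive

# Run a child with the harness-owned variables removed from its environment.
# "$VLLM_MODEL" is expanded by this shell before env starts, so the child still
# receives the model path as its first argument.
child_report() {
  env -u VLLM_MODEL -u VLLM_READY_ATTEMPTS \
    sh -c '
      echo "arg=$1"
      for v in VLLM_MODEL VLLM_READY_ATTEMPTS VLLM_LOGGING_LEVEL; do
        if env | grep -q "^$v="; then echo "$v=set"; else echo "$v=unset"; fi
      done
    ' _ "$VLLM_MODEL"
}

child_report
```

Run as written, the child should report `arg=/models/demo`, show the two harness-owned variables as unset, and show `VLLM_LOGGING_LEVEL` as set.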
diff --git a/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
index 0762d0b57..e069765a4 100755
--- a/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+++ b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
@@ -274,7 +274,11 @@ run_vllm() {
  read -r -a extra_args <<< "$VLLM_EXTRA_ARGS"
  fi
  log "starting vLLM server"
- nohup "$VLLM_BIN" serve "$VLLM_MODEL" \
+ nohup env \
+ -u VLLM_MODEL -u VLLM_BIN -u VLLM_READY_ATTEMPTS \
+ -u VLLM_GPU_MEMORY_UTILIZATION -u VLLM_MAX_MODEL_LEN -u VLLM_MAX_NUM_SEQS \
+ -u VLLM_TENSOR_PARALLEL_SIZE -u VLLM_EXTRA_ARGS \
+ "$VLLM_BIN" serve "$VLLM_MODEL" \
  --served-model-name "$SERVED_MODEL_NAME" --gpu-memory-utilization "$VLLM_GPU_MEMORY_UTILIZATION" --max-model-len "$VLLM_MAX_MODEL_LEN" \
  --max-num-seqs "$VLLM_MAX_NUM_SEQS" --host 127.0.0.1 --port "$VLLM_PORT" --tensor-parallel-size "$VLLM_TENSOR_PARALLEL_SIZE" \
  "${extra_args[@]}" \
diff --git a/docs/superpowers/plans/2026-07-01-vllm-env-hygiene-phase49.md b/docs/superpowers/plans/2026-07-01-vllm-env-hygiene-phase49.md
new file mode 100644
index 000000000..35114f4b8
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-vllm-env-hygiene-phase49.md
@@ -0,0 +1,98 @@
+# Phase49 vLLM Env Hygiene Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Keep vLLM benchmark logs clean by preventing harness-only `VLLM_*` variables from being inherited by the vLLM server process.
+
+**Architecture:** Add an `env -u ...` wrapper around the `vllm serve` command in `paged-current-serving-snapshot.sh`. Only unset harness-owned variables (`VLLM_MODEL`, `VLLM_BIN`, `VLLM_READY_ATTEMPTS`, `VLLM_GPU_MEMORY_UTILIZATION`, `VLLM_MAX_MODEL_LEN`, `VLLM_MAX_NUM_SEQS`, `VLLM_TENSOR_PARALLEL_SIZE`, `VLLM_EXTRA_ARGS`) and keep intentional vLLM runtime variables like `VLLM_LOGGING_LEVEL`.
+
+**Tech Stack:** Bash serving harness, LocalAI parity docs.
+
+---
+
+### Task 1: Prove env scrubbing is absent
+
+**Files:**
+- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
+
+- [x] **Step 1: Run red grep**
+
+```bash
+grep -F 'env -u VLLM_MODEL' backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+```
+
+Expected: exit `1`.
+
+### Task 2: Add vLLM child env scrub
+
+**Files:**
+- Modify: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
+
+- [x] **Step 1: Wrap the vLLM command**
+
+Change:
+
+```bash
+nohup "$VLLM_BIN" serve "$VLLM_MODEL" \
+```
+
+to:
+
+```bash
+nohup env \
+ -u VLLM_MODEL -u VLLM_BIN -u VLLM_READY_ATTEMPTS \
+ -u VLLM_GPU_MEMORY_UTILIZATION -u VLLM_MAX_MODEL_LEN -u VLLM_MAX_NUM_SEQS \
+ -u VLLM_TENSOR_PARALLEL_SIZE -u VLLM_EXTRA_ARGS \
+ "$VLLM_BIN" serve "$VLLM_MODEL" \
+```
+
+### Task 3: Verify
+
+**Files:**
+- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
+
+- [x] **Step 1: Shell syntax check**
+
+```bash
+bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+```
+
+Expected: exit `0`.
+
+- [x] **Step 2: Green grep**
+
+```bash
+grep -F -- '-u VLLM_MODEL' backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+```
+
+Expected: exit `0`.
+
+- [x] **Step 3: DGX dry-run still passes**
+
+```bash
+ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase49_vllm_env_hygiene_dryrun/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin MODEL=$HOME/bench/q36-27b-nvfp4.gguf VLLM_MODEL=$HOME/bench/q36-27b-nvfp4-vllm SERVED_MODEL_NAME=dense-q36 ART=$ART NPL="1" PARALLEL=1 CTX=4096 PTOK=16 GEN=4 DRY_RUN=1 VLLM_READY_ATTEMPTS=700 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+```
+
+Expected: exit `0`, clean preflight, and dry-run output still prints `VLLM_READY_ATTEMPTS=700`.
+
+### Task 4: Record and commit
+
+**Files:**
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+- Modify: `docs/superpowers/plans/2026-07-01-vllm-env-hygiene-phase49.md`
+
+- [x] **Step 1: Record Phase49**
+
+Record the dry-run artifact and state that this is log hygiene only.
+
+- [x] **Step 2: Final checks and commit**
+
+```bash
+git diff --check
+git add backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh \
+ backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+ backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+git add -f docs/superpowers/plans/2026-07-01-vllm-env-hygiene-phase49.md
+git commit -m "fix(paged): scrub harness vars for vllm serve" -m "Assisted-by: Codex:gpt-5"
+```
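Reviewer note: the greps in this plan probe for a single variable. A one-line pattern such as `env -u VLLM_MODEL` would not match the wrapped command, because `env \` ends one line and the `-u` flags start on the next, so a check that looks for each `-u NAME` pair separately may be more robust. The sketch below is not part of the patch: `check_scrub` and `HARNESS_VARS` are hypothetical names, and the variable list is copied from the Phase 49 notes. It is a text-level check only, so a `-u NAME` inside a comment would also satisfy it.

```bash
#!/usr/bin/env bash
# Sketch only: check_scrub is a hypothetical helper, not part of the harness.

HARNESS_VARS=(
  VLLM_MODEL VLLM_BIN VLLM_READY_ATTEMPTS
  VLLM_GPU_MEMORY_UTILIZATION VLLM_MAX_MODEL_LEN VLLM_MAX_NUM_SEQS
  VLLM_TENSOR_PARALLEL_SIZE VLLM_EXTRA_ARGS
)

# Succeeds only if the script contains a `-u NAME` flag for every harness-owned
# variable; prints each missing one to stderr.
check_scrub() {
  local script=$1 name missing=0
  for name in "${HARNESS_VARS[@]}"; do
    # -F: literal match; -w: whole word, so VLLM_MODEL does not match VLLM_MODEL_X;
    # --: end of options, needed because the pattern starts with a dash.
    if ! grep -qFw -- "-u ${name}" "$script"; then
      echo "missing: -u ${name}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A possible invocation from the repository root would be `check_scrub backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`, which exits non-zero before the patch and zero after it.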