From ae8284f5fbc306793edc76cc127f37c4e7ba4ba2 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Wed, 1 Jul 2026 07:41:55 +0000
Subject: [PATCH] feat(paged): parameterize vllm serving snapshot

Assisted-by: Codex:gpt-5
---
 .../docs/GB10_PARITY_PHASE0_RESULTS.md        |  38 +++++
 .../docs/PARITY_HANDOFF.md                    |  11 ++
 .../docs/VLLM_PARITY_LEVER_MAP.md             |  15 ++
 .../paged-current-serving-snapshot.sh         |  20 ++-
 ...26-07-01-hardware-pivot-harness-phase44.md | 154 ++++++++++++++++++
 5 files changed, 236 insertions(+), 2 deletions(-)
 create mode 100644 docs/superpowers/plans/2026-07-01-hardware-pivot-harness-phase44.md

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index 283075431..7cba7fe61 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2501,3 +2501,41 @@ Decision:
 larger funded kernel/loader effort with its own design, or a hardware pivot
 benchmark. Any future implementation still needs the canonical MoE/dense md5,
 `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates before benchmarking.
+
+## Phase 44 Hardware-Pivot Harness Readiness
+
+Phase 44 prepares the audited current-stack serving snapshot for hardware-pivot
+runs without editing the harness between hosts. This is a harness-only change:
+it does not modify llama.cpp inference code, patch-series source, md5 gates, op
+gates, or any benchmark result.
+
+New vLLM serving overrides:
+
+| variable | default | vLLM flag |
+|----------|---------|-----------|
+| `VLLM_GPU_MEMORY_UTILIZATION` | `0.85` | `--gpu-memory-utilization` |
+| `VLLM_MAX_MODEL_LEN` | `4096` | `--max-model-len` |
+| `VLLM_MAX_NUM_SEQS` | `256` | `--max-num-seqs` |
+| `VLLM_TENSOR_PARALLEL_SIZE` | `1` | `--tensor-parallel-size` |
+| `VLLM_EXTRA_ARGS` | empty | whitespace-split args appended to `vllm serve` |
+
+Verification scope:
+
+- Red help-text check first proved `VLLM_MAX_NUM_SEQS` was absent from
+  `paged-current-serving-snapshot.sh --help`.
+- Red DGX dry-run check first proved the harness did not print
+  `VLLM_MAX_NUM_SEQS=512` when the override was supplied.
+- Green checks after the patch included `bash -n`, help-text grep, and DGX
+  `DRY_RUN=1` preflight with the override values printed before any server
+  starts. Artifact:
+  `/home/mudler/bench/phase44_hardware_pivot_harness_dryrun/20260701_094038`.
+
+Decision:
+
+- Use the same audited harness for a future datacenter-Blackwell or other
+  non-GB10 parity snapshot by overriding vLLM limits in the environment instead
+  of editing the script.
+- This does not reopen GB10 shortcut work and does not claim parity. A real
+  hardware-pivot benchmark still needs the normal preflight, `hardware.txt`,
+  pre/post MoE/dense md5 gates, `MUL_MAT`/`MUL_MAT_ID` checks, and
+  KL-if-md5-changes before interpreting throughput.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 4a8a3b187..71dff2cde 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -557,6 +557,16 @@ low-conflict GB10 shortcut justified by current evidence; future work is either
 a larger kernel/loader design or a hardware-pivot benchmark, still gated by
 MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` and KL if md5 changes.
 
+Phase 44 makes the current-stack serving snapshot harness ready for hardware
+pivots by parameterizing the vLLM side instead of hardcoding the GB10 defaults.
+`paged-current-serving-snapshot.sh` now accepts `VLLM_GPU_MEMORY_UTILIZATION`,
+`VLLM_MAX_MODEL_LEN`, `VLLM_MAX_NUM_SEQS`, `VLLM_TENSOR_PARALLEL_SIZE`, and
+whitespace-split `VLLM_EXTRA_ARGS`, and prints the resolved values during
+`DRY_RUN=1`. This is not a new benchmark and does not change inference code or
+gate behavior. Use it when the next parity run targets datacenter Blackwell or
+another non-GB10 vLLM serving shape, while keeping `hardware.txt`, pre/post
+MoE/dense md5, `MUL_MAT`/`MUL_MAT_ID`, and KL-if-md5-changes as mandatory gates.
+
 ---
 
 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -641,6 +651,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase39_gate_sgemm_profile/phase27_reanalysis` - Phase27 serving profile re-analysis used to reject graph-time fused gate weight concat; `concat_layout=459.84ms` (`2.29%`) in the serving kernel window.
 - `~/bench/phase40_max_concurrency/20260701_090012` - max-concurrency C1 check at `NPL=128/192/256`, `PTOK=128`, `GEN=64`, `PARALLEL=256`, `CTX=262144`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green, but vLLM also fit at `n=256` and stayed ahead (`paged_decode_over_vllm=0.6354`, `paged_agg_over_vllm=0.4721`).
 - `~/bench/phase41_low_concurrency/20260701_091437` - low-concurrency serving check at `NPL=1/8/32`, `PTOK=128`, `GEN=64`, `PARALLEL=32`, `CTX=32768`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green; paged is `0.7493`, `0.7518`, and `0.6649` of vLLM decode at `n=1/8/32`, with TTFT still much worse by `n=8/32`; does not reopen D1.
+- `~/bench/phase44_hardware_pivot_harness_dryrun/20260701_094038` - harness-only dry-run artifact proving the vLLM serving config overrides are printed and preflighted before any server starts.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 4183d89db..aa0276a58 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1096,6 +1096,21 @@ is justified by the current evidence. Future work needs either a larger funded
 kernel/loader design or a hardware-pivot benchmark, with the canonical MoE/dense
 md5, `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates.
 
+### Phase 44 hardware-pivot harness readiness
+
+Phase 44 makes `paged-current-serving-snapshot.sh` usable for hardware-pivot
+comparisons without editing the script for each vLLM deployment shape. It adds
+environment overrides for `VLLM_GPU_MEMORY_UTILIZATION`, `VLLM_MAX_MODEL_LEN`,
+`VLLM_MAX_NUM_SEQS`, `VLLM_TENSOR_PARALLEL_SIZE`, and whitespace-split
+`VLLM_EXTRA_ARGS`, then prints the resolved values in `DRY_RUN=1` output.
+
+This is deliberately a harness-only phase. It does not change inference code,
+does not regenerate the llama.cpp patch series, and does not produce a new
+throughput result. Its purpose is to keep the audited methodology portable:
+future non-GB10 snapshots can carry the same `hardware.txt`, pre/post md5,
+`MUL_MAT`/`MUL_MAT_ID`, and KL-if-md5-changes gates while using hardware-specific
+vLLM serving limits.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
diff --git a/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
index a6cf9d22b..21199a953 100755
--- a/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+++ b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
@@ -29,6 +29,11 @@ Environment overrides:
   LLAMA_PORT llama-server port (default: 8098)
   VLLM_PORT vLLM port (default: 8000)
   VLLM_BIN vLLM executable (default: ~/vllm-bench/bin/vllm)
+  VLLM_GPU_MEMORY_UTILIZATION vLLM --gpu-memory-utilization (default: 0.85)
+  VLLM_MAX_MODEL_LEN vLLM --max-model-len (default: 4096)
+  VLLM_MAX_NUM_SEQS vLLM --max-num-seqs (default: 256)
+  VLLM_TENSOR_PARALLEL_SIZE vLLM --tensor-parallel-size (default: 1)
+  VLLM_EXTRA_ARGS whitespace-split extra args appended to vLLM serve (default: empty)
   SKIP_GATES=1 to skip pre/post paged inference gates
   DRY_RUN=1 validate inputs/preflight, write hardware.txt, and print commands
   without running servers
@@ -75,6 +80,11 @@ UBATCH=${UBATCH:-512}
 LLAMA_PORT=${LLAMA_PORT:-8098}
 VLLM_PORT=${VLLM_PORT:-8000}
 VLLM_BIN=${VLLM_BIN:-"$HOME/vllm-bench/bin/vllm"}
+VLLM_GPU_MEMORY_UTILIZATION=${VLLM_GPU_MEMORY_UTILIZATION:-0.85}
+VLLM_MAX_MODEL_LEN=${VLLM_MAX_MODEL_LEN:-4096}
+VLLM_MAX_NUM_SEQS=${VLLM_MAX_NUM_SEQS:-256}
+VLLM_TENSOR_PARALLEL_SIZE=${VLLM_TENSOR_PARALLEL_SIZE:-1}
+VLLM_EXTRA_ARGS=${VLLM_EXTRA_ARGS:-}
 SKIP_GATES=${SKIP_GATES:-0}
 DRY_RUN=${DRY_RUN:-0}
 MOE_MD5_EXPECTED=8cb0ce23777bf55f92f63d0292c756b0
@@ -237,14 +247,19 @@ run_paged() {
 
 run_vllm() {
   local arm_dir="$ART/vllm"
+  local extra_args=()
   mkdir -p "$arm_dir"
   export PATH="$(dirname "$VLLM_BIN"):$PATH"
   export VLLM_LOGGING_LEVEL=${VLLM_LOGGING_LEVEL:-INFO}
   export HF_HUB_OFFLINE=${HF_HUB_OFFLINE:-1}
+  if [[ -n "$VLLM_EXTRA_ARGS" ]]; then
+    read -r -a extra_args <<< "$VLLM_EXTRA_ARGS"
+  fi
   log "starting vLLM server"
   nohup "$VLLM_BIN" serve "$VLLM_MODEL" \
-    --served-model-name q36 --gpu-memory-utilization 0.85 --max-model-len 4096 \
-    --max-num-seqs 256 --host 127.0.0.1 --port "$VLLM_PORT" --tensor-parallel-size 1 \
+    --served-model-name q36 --gpu-memory-utilization "$VLLM_GPU_MEMORY_UTILIZATION" --max-model-len "$VLLM_MAX_MODEL_LEN" \
+    --max-num-seqs "$VLLM_MAX_NUM_SEQS" --host 127.0.0.1 --port "$VLLM_PORT" --tensor-parallel-size "$VLLM_TENSOR_PARALLEL_SIZE" \
+    "${extra_args[@]}" \
     > "$arm_dir/server.log" 2>&1 &
   SERVER_PID=$!
   wait_http "http://127.0.0.1:$VLLM_PORT/v1/models" "q36" "$arm_dir/server.log" "$arm_dir/models.json"
@@ -381,6 +396,7 @@ if [[ "$DRY_RUN" == "1" ]]; then
   log "would build: cmake --build $BUILD_DIR --target llama-server llama-completion test-backend-ops -j8"
   log "would run paged NPL=[$NPL] PTOK=$PTOK GEN=$GEN"
   log "would run vLLM NPL=[$NPL] PTOK=$PTOK GEN=$GEN"
+  log "vLLM config: VLLM_GPU_MEMORY_UTILIZATION=$VLLM_GPU_MEMORY_UTILIZATION VLLM_MAX_MODEL_LEN=$VLLM_MAX_MODEL_LEN VLLM_MAX_NUM_SEQS=$VLLM_MAX_NUM_SEQS VLLM_TENSOR_PARALLEL_SIZE=$VLLM_TENSOR_PARALLEL_SIZE VLLM_EXTRA_ARGS=[$VLLM_EXTRA_ARGS]"
   exit 0
 fi
 
diff --git a/docs/superpowers/plans/2026-07-01-hardware-pivot-harness-phase44.md b/docs/superpowers/plans/2026-07-01-hardware-pivot-harness-phase44.md
new file mode 100644
index 000000000..ba3a23ec3
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-hardware-pivot-harness-phase44.md
@@ -0,0 +1,154 @@
+# Phase44 Hardware Pivot Harness Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Make the current-stack serving snapshot harness configurable enough to run the same audited paged-vs-vLLM methodology on hardware beyond the current GB10 defaults.
+
+**Architecture:** Keep this as a harness-only change: add environment overrides for vLLM serving limits and print them in `DRY_RUN=1` output. Do not touch llama.cpp inference code, patch-series source, md5 gates, or op gates.
+
+**Tech Stack:** Bash harness, DGX preflight over ssh, LocalAI parity documentation.
+
+---
+
+### Task 1: Prove the vLLM config knobs are absent
+
+**Files:**
+- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
+
+- [x] **Step 1: Run help-text red check**
+
+```bash
+backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help | grep -F 'VLLM_MAX_NUM_SEQS'
+```
+
+Expected: exit `1`, because the harness does not document the override yet.
+
+- [x] **Step 2: Run DGX dry-run red check**
+
+```bash
+ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase44_hardware_pivot_harness_dryrun_red/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin ART=$ART NPL="1" PARALLEL=1 CTX=4096 PTOK=16 GEN=4 DRY_RUN=1 VLLM_GPU_MEMORY_UTILIZATION=0.90 VLLM_MAX_MODEL_LEN=8192 VLLM_MAX_NUM_SEQS=512 VLLM_TENSOR_PARALLEL_SIZE=2 VLLM_EXTRA_ARGS="--disable-log-requests" OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh | grep -F 'VLLM_MAX_NUM_SEQS=512'
+```
+
+Expected: exit `1`, because `DRY_RUN=1` validates inputs but does not print the vLLM serving config yet.
+
+### Task 2: Add vLLM serving overrides
+
+**Files:**
+- Modify: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
+
+- [x] **Step 1: Document the environment variables in `usage()`**
+
+Add these lines under `VLLM_BIN`:
+
+```bash
+  VLLM_GPU_MEMORY_UTILIZATION vLLM --gpu-memory-utilization (default: 0.85)
+  VLLM_MAX_MODEL_LEN vLLM --max-model-len (default: 4096)
+  VLLM_MAX_NUM_SEQS vLLM --max-num-seqs (default: 256)
+  VLLM_TENSOR_PARALLEL_SIZE vLLM --tensor-parallel-size (default: 1)
+  VLLM_EXTRA_ARGS whitespace-split extra args appended to vLLM serve (default: empty)
+```
+
+- [x] **Step 2: Add conservative defaults beside `VLLM_BIN`**
+
+```bash
+VLLM_GPU_MEMORY_UTILIZATION=${VLLM_GPU_MEMORY_UTILIZATION:-0.85}
+VLLM_MAX_MODEL_LEN=${VLLM_MAX_MODEL_LEN:-4096}
+VLLM_MAX_NUM_SEQS=${VLLM_MAX_NUM_SEQS:-256}
+VLLM_TENSOR_PARALLEL_SIZE=${VLLM_TENSOR_PARALLEL_SIZE:-1}
+VLLM_EXTRA_ARGS=${VLLM_EXTRA_ARGS:-}
+```
+
+- [x] **Step 3: Use the variables in `run_vllm()`**
+
+Use an array for `VLLM_EXTRA_ARGS`:
+
+```bash
+  local extra_args=()
+  if [[ -n "$VLLM_EXTRA_ARGS" ]]; then
+    read -r -a extra_args <<< "$VLLM_EXTRA_ARGS"
+  fi
+```
+
+Then replace the hardcoded vLLM flags with:
+
+```bash
+    --served-model-name q36 --gpu-memory-utilization "$VLLM_GPU_MEMORY_UTILIZATION" --max-model-len "$VLLM_MAX_MODEL_LEN" \
+    --max-num-seqs "$VLLM_MAX_NUM_SEQS" --host 127.0.0.1 --port "$VLLM_PORT" --tensor-parallel-size "$VLLM_TENSOR_PARALLEL_SIZE" \
+    "${extra_args[@]}" \
+```
+
+- [x] **Step 4: Print the vLLM config during `DRY_RUN=1`**
+
+```bash
+  log "vLLM config: VLLM_GPU_MEMORY_UTILIZATION=$VLLM_GPU_MEMORY_UTILIZATION VLLM_MAX_MODEL_LEN=$VLLM_MAX_MODEL_LEN VLLM_MAX_NUM_SEQS=$VLLM_MAX_NUM_SEQS VLLM_TENSOR_PARALLEL_SIZE=$VLLM_TENSOR_PARALLEL_SIZE VLLM_EXTRA_ARGS=[$VLLM_EXTRA_ARGS]"
+```
+
+### Task 3: Verify the harness
+
+**Files:**
+- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
+
+- [x] **Step 1: Shell syntax check**
+
+```bash
+bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+```
+
+Expected: exit `0`.
+
+- [x] **Step 2: Help-text green check**
+
+```bash
+backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help | grep -F 'VLLM_MAX_NUM_SEQS'
+```
+
+Expected: exit `0`.
+
+- [x] **Step 3: DGX dry-run green check**
+
+```bash
+ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase44_hardware_pivot_harness_dryrun/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin ART=$ART NPL="1" PARALLEL=1 CTX=4096 PTOK=16 GEN=4 DRY_RUN=1 VLLM_GPU_MEMORY_UTILIZATION=0.90 VLLM_MAX_MODEL_LEN=8192 VLLM_MAX_NUM_SEQS=512 VLLM_TENSOR_PARALLEL_SIZE=2 VLLM_EXTRA_ARGS="--disable-log-requests" OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+```
+
+Expected: exit `0`, preflight shows docker/local-ai-worker/GPU compute idle, and output includes `VLLM_MAX_NUM_SEQS=512`.
+
+### Task 4: Record Phase44 in docs
+
+**Files:**
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+- Modify: `docs/superpowers/plans/2026-07-01-hardware-pivot-harness-phase44.md`
+
+- [x] **Step 1: Append the Phase44 result**
+
+Record that Phase44 is a harness-readiness change only. It does not claim a new performance result, does not run inference, and does not modify md5/op gate behavior.
+
+- [x] **Step 2: Mark all plan tasks complete**
+
+Change each remaining `- [ ]` entry in this file to `- [x]` only after the corresponding verification has been run.
+
+### Task 5: Commit
+
+**Files:**
+- Commit all Phase44 script, docs, and plan changes.
+
+- [x] **Step 1: Run final diff checks**
+
+```bash
+git diff --check
+git status --short
+```
+
+Expected: no whitespace errors; only intended files changed plus the pre-existing untracked `.claude/`.
+
+- [x] **Step 2: Commit**
+
+```bash
+git add backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh \
+  backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+  backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
+  backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+git add -f docs/superpowers/plans/2026-07-01-hardware-pivot-harness-phase44.md
+git commit -m "feat(paged): parameterize vllm serving snapshot" -m "Assisted-by: Codex:gpt-5"
+```
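
A note on `VLLM_EXTRA_ARGS`: the patch splits the value with `read -r -a`, which breaks the string on whitespace and does not interpret quotes. The sketch below is a minimal illustration of that behaviour, assuming bash. `split_extra_args` is a hypothetical helper written for this note and is not part of the harness, and `--flag` is a placeholder rather than a real vLLM option.

```shell
#!/usr/bin/env bash
# Hypothetical helper: mirrors the read -r -a split used in run_vllm(),
# then prints one resulting argument per line so the split is visible.
split_extra_args() {
  local raw=$1
  local -a parts=()
  if [[ -n "$raw" ]]; then
    # read -a splits on IFS (space, tab, newline by default). Quote characters
    # inside the string are kept literally and do not group words.
    read -r -a parts <<< "$raw"
  fi
  if (( ${#parts[@]} > 0 )); then
    printf '%s\n' "${parts[@]}"
  fi
}

# Simple flags split as expected: three separate arguments.
split_extra_args '--disable-log-requests --seed 7'

# A quoted value containing a space is split into two words, quotes included.
split_extra_args '--flag "two words"'
```

Under these assumptions, an extra argument whose value contains whitespace cannot be passed through `VLLM_EXTRA_ARGS` as a single word, which matches the "whitespace-split" wording in the help text.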