feat(paged): parameterize vllm serving snapshot

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 07:41:55 +00:00
parent ecaf406c0b
commit ae8284f5fb
5 changed files with 236 additions and 2 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2501,3 +2501,41 @@ Decision:
  larger funded kernel/loader effort with its own design, or a hardware pivot
  benchmark. Any future implementation still needs the canonical MoE/dense md5,
  `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates before benchmarking.
+
+## Phase 44 Hardware-Pivot Harness Readiness
+
+Phase 44 prepares the audited current-stack serving snapshot for hardware-pivot
+runs without editing the harness between hosts. This is a harness-only change:
+it does not modify llama.cpp inference code, patch-series source, md5 gates, op
+gates, or any benchmark result.
+
+New vLLM serving overrides:
+
+| variable | default | vLLM flag |
+|----------|---------|-----------|
+| `VLLM_GPU_MEMORY_UTILIZATION` | `0.85` | `--gpu-memory-utilization` |
+| `VLLM_MAX_MODEL_LEN` | `4096` | `--max-model-len` |
+| `VLLM_MAX_NUM_SEQS` | `256` | `--max-num-seqs` |
+| `VLLM_TENSOR_PARALLEL_SIZE` | `1` | `--tensor-parallel-size` |
+| `VLLM_EXTRA_ARGS` | empty | whitespace-split args appended to `vllm serve` |
+
+Verification scope:
+
+- Before the patch, a red (failing) help-text check confirmed that
+  `VLLM_MAX_NUM_SEQS` was absent from `paged-current-serving-snapshot.sh --help`.
+- Before the patch, a red DGX dry-run check confirmed that the harness did not
+  print `VLLM_MAX_NUM_SEQS=512` when the override was supplied.
+- Green checks after the patch included `bash -n`, help-text grep, and DGX
+  `DRY_RUN=1` preflight with the override values printed before any server
+  starts. Artifact:
+  `/home/mudler/bench/phase44_hardware_pivot_harness_dryrun/20260701_094038`.
+
+Decision:
+
+- Use the same audited harness for a future datacenter-Blackwell or other
+  non-GB10 parity snapshot by overriding vLLM limits in the environment instead
+  of editing the script.
+- This does not reopen GB10 shortcut work and does not claim parity. A real
+  hardware-pivot benchmark still needs the normal preflight, `hardware.txt`,
+  pre/post MoE/dense md5 gates, `MUL_MAT`/`MUL_MAT_ID` checks, and
+  KL-if-md5-changes before interpreting throughput.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -557,6 +557,16 @@ low-conflict GB10 shortcut justified by current evidence; future work is either
 a larger kernel/loader design or a hardware-pivot benchmark, still gated by
 MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` and KL if md5 changes.

+Phase 44 makes the current-stack serving snapshot harness ready for hardware
+pivots by parameterizing the vLLM side instead of hardcoding the GB10 defaults.
+`paged-current-serving-snapshot.sh` now accepts `VLLM_GPU_MEMORY_UTILIZATION`,
+`VLLM_MAX_MODEL_LEN`, `VLLM_MAX_NUM_SEQS`, `VLLM_TENSOR_PARALLEL_SIZE`, and
+whitespace-split `VLLM_EXTRA_ARGS`, and prints the resolved values during
+`DRY_RUN=1`. This is not a new benchmark and does not change inference code or
+gate behavior. Use it when the next parity run targets datacenter Blackwell or
+another non-GB10 vLLM serving shape, while keeping `hardware.txt`, pre/post
+MoE/dense md5, `MUL_MAT`/`MUL_MAT_ID`, and KL-if-md5-changes as mandatory gates.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -641,6 +651,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase39_gate_sgemm_profile/phase27_reanalysis` - Phase27 serving profile re-analysis used to reject graph-time fused gate weight concat; `concat_layout=459.84ms` (`2.29%`) in the serving kernel window.
 - `~/bench/phase40_max_concurrency/20260701_090012` - max-concurrency C1 check at `NPL=128/192/256`, `PTOK=128`, `GEN=64`, `PARALLEL=256`, `CTX=262144`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green, but vLLM also fit at `n=256` and stayed ahead (`paged_decode_over_vllm=0.6354`, `paged_agg_over_vllm=0.4721`).
 - `~/bench/phase41_low_concurrency/20260701_091437` - low-concurrency serving check at `NPL=1/8/32`, `PTOK=128`, `GEN=64`, `PARALLEL=32`, `CTX=32768`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green; paged is `0.7493`, `0.7518`, and `0.6649` of vLLM decode at `n=1/8/32`, with TTFT still much worse by `n=8/32`; does not reopen D1.
+- `~/bench/phase44_hardware_pivot_harness_dryrun/20260701_094038` - harness-only dry-run artifact proving the vLLM serving config overrides are printed and preflighted before any server starts.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1096,6 +1096,21 @@ is justified by the current evidence. Future work needs either a larger funded
 kernel/loader design or a hardware-pivot benchmark, with the canonical
 MoE/dense md5, `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates.

+### Phase 44 hardware-pivot harness readiness
+
+Phase 44 makes `paged-current-serving-snapshot.sh` usable for hardware-pivot
+comparisons without editing the script for each vLLM deployment shape. It adds
+environment overrides for `VLLM_GPU_MEMORY_UTILIZATION`, `VLLM_MAX_MODEL_LEN`,
+`VLLM_MAX_NUM_SEQS`, `VLLM_TENSOR_PARALLEL_SIZE`, and whitespace-split
+`VLLM_EXTRA_ARGS`, then prints the resolved values in `DRY_RUN=1` output.
+
+This is deliberately a harness-only phase. It does not change inference code,
+does not regenerate the llama.cpp patch series, and does not produce a new
+throughput result. Its purpose is to keep the audited methodology portable:
+future non-GB10 snapshots can carry the same `hardware.txt`, pre/post md5,
+`MUL_MAT`/`MUL_MAT_ID`, and KL-if-md5-changes gates while using hardware-specific
+vLLM serving limits.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update
--- a/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+++ b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
@@ -29,6 +29,11 @@ Environment overrides:
  LLAMA_PORT   llama-server port (default: 8098)
  VLLM_PORT    vLLM port (default: 8000)
  VLLM_BIN     vLLM executable (default: ~/vllm-bench/bin/vllm)
+  VLLM_GPU_MEMORY_UTILIZATION vLLM --gpu-memory-utilization (default: 0.85)
+  VLLM_MAX_MODEL_LEN          vLLM --max-model-len (default: 4096)
+  VLLM_MAX_NUM_SEQS           vLLM --max-num-seqs (default: 256)
+  VLLM_TENSOR_PARALLEL_SIZE   vLLM --tensor-parallel-size (default: 1)
+  VLLM_EXTRA_ARGS             whitespace-split extra args appended to vLLM serve (default: empty)
  SKIP_GATES=1 to skip pre/post paged inference gates
  DRY_RUN=1    validate inputs/preflight, write hardware.txt, and print commands without running servers

@@ -75,6 +80,11 @@ UBATCH=${UBATCH:-512}
 LLAMA_PORT=${LLAMA_PORT:-8098}
 VLLM_PORT=${VLLM_PORT:-8000}
 VLLM_BIN=${VLLM_BIN:-"$HOME/vllm-bench/bin/vllm"}
+VLLM_GPU_MEMORY_UTILIZATION=${VLLM_GPU_MEMORY_UTILIZATION:-0.85}
+VLLM_MAX_MODEL_LEN=${VLLM_MAX_MODEL_LEN:-4096}
+VLLM_MAX_NUM_SEQS=${VLLM_MAX_NUM_SEQS:-256}
+VLLM_TENSOR_PARALLEL_SIZE=${VLLM_TENSOR_PARALLEL_SIZE:-1}
+VLLM_EXTRA_ARGS=${VLLM_EXTRA_ARGS:-}
 SKIP_GATES=${SKIP_GATES:-0}
 DRY_RUN=${DRY_RUN:-0}
 MOE_MD5_EXPECTED=8cb0ce23777bf55f92f63d0292c756b0
@@ -237,14 +247,19 @@ run_paged() {

 run_vllm() {
  local arm_dir="$ART/vllm"
+  local extra_args=()
  mkdir -p "$arm_dir"
  export PATH="$(dirname "$VLLM_BIN"):$PATH"
  export VLLM_LOGGING_LEVEL=${VLLM_LOGGING_LEVEL:-INFO}
  export HF_HUB_OFFLINE=${HF_HUB_OFFLINE:-1}
+  if [[ -n "$VLLM_EXTRA_ARGS" ]]; then
+    read -r -a extra_args <<< "$VLLM_EXTRA_ARGS"
+  fi
  log "starting vLLM server"
  nohup "$VLLM_BIN" serve "$VLLM_MODEL" \
-    --served-model-name q36 --gpu-memory-utilization 0.85 --max-model-len 4096 \
-    --max-num-seqs 256 --host 127.0.0.1 --port "$VLLM_PORT" --tensor-parallel-size 1 \
+    --served-model-name q36 --gpu-memory-utilization "$VLLM_GPU_MEMORY_UTILIZATION" --max-model-len "$VLLM_MAX_MODEL_LEN" \
+    --max-num-seqs "$VLLM_MAX_NUM_SEQS" --host 127.0.0.1 --port "$VLLM_PORT" --tensor-parallel-size "$VLLM_TENSOR_PARALLEL_SIZE" \
+    ${extra_args[@]+"${extra_args[@]}"} \
    > "$arm_dir/server.log" 2>&1 &
  SERVER_PID=$!
  wait_http "http://127.0.0.1:$VLLM_PORT/v1/models" "q36" "$arm_dir/server.log" "$arm_dir/models.json"
@@ -381,6 +396,7 @@ if [[ "$DRY_RUN" == "1" ]]; then
  log "would build: cmake --build $BUILD_DIR --target llama-server llama-completion test-backend-ops -j8"
  log "would run paged NPL=[$NPL] PTOK=$PTOK GEN=$GEN"
  log "would run vLLM NPL=[$NPL] PTOK=$PTOK GEN=$GEN"
+  log "vLLM config: VLLM_GPU_MEMORY_UTILIZATION=$VLLM_GPU_MEMORY_UTILIZATION VLLM_MAX_MODEL_LEN=$VLLM_MAX_MODEL_LEN VLLM_MAX_NUM_SEQS=$VLLM_MAX_NUM_SEQS VLLM_TENSOR_PARALLEL_SIZE=$VLLM_TENSOR_PARALLEL_SIZE VLLM_EXTRA_ARGS=[$VLLM_EXTRA_ARGS]"
  exit 0
 fi
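
Note on `VLLM_EXTRA_ARGS` splitting: `run_vllm` splits the variable with `read -r -a`, which breaks on whitespace but does not interpret shell quotes, so an argument value that contains spaces cannot be passed this way. The standalone sketch below is not part of this commit; it only reproduces that splitting behavior. `split_extra_args` and the `--example-*` flags are hypothetical placeholders, not harness functions or real vLLM options.

```shell
#!/usr/bin/env bash
# Standalone sketch, not part of the patch: mirrors the harness's
# `read -r -a` splitting of VLLM_EXTRA_ARGS.
# `split_extra_args` is a hypothetical helper name used only here.
split_extra_args() {
  local raw=$1
  local -a parts=()
  if [[ -n "$raw" ]]; then
    # Splits on IFS whitespace; quotes and backslashes stay literal.
    read -r -a parts <<< "$raw"
  fi
  # Print one resulting argv element per line (nothing if empty).
  if (( ${#parts[@]} > 0 )); then
    printf '%s\n' "${parts[@]}"
  fi
}

# Placeholder flags, not real vLLM options.
split_extra_args '--example-flag value'
split_extra_args '--example-name "two words"'
```

The first call yields two elements. The second yields three (`--example-name`, `"two`, `words"`), with the quote characters kept literally, so values with embedded spaces need a different mechanism than `VLLM_EXTRA_ARGS`. `read` also stops at the first newline, so only the first line of a multi-line value is used.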