chore(paged): record snapshot hardware class

Add a hardware report to the current serving snapshot harness so every
paged-vs-vLLM artifact records the GPU identity and conservative hardware
class before any server starts. Document the Phase 24 dry run and the
GB10 classification for future parity comparisons.

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 03:31:11 +00:00
parent 7aa15ce539
commit 7108b68a70
6 changed files with 233 additions and 2 deletions
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -618,6 +618,9 @@ DGX mirror `f2521ab12`, artifact
 Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving
 snapshots. It targets the clean `~/llama-phase6-source` mirror, checks
 docker/`local-ai-worker`/GPU-idle state, uses the owner-file lock, runs pre/post
-inference gates, and emits paged/vLLM ratios. Do not use the stale DGX
+inference gates, writes `hardware.txt`, and emits paged/vLLM ratios.
+`hardware.txt` records the GPU identity and hardware class so GB10/workstation
+Blackwell evidence is not confused with a future datacenter-Blackwell rerun.
+Do not use the stale DGX
 `~/bench/combined_definitive.sh` without first porting it to the current mirror
 and lock discipline.
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -1467,3 +1467,50 @@ Decision:

 - The patch series is drift-free against fork branch `localai-paged` at
  `fb9402661 feat(server): trace speculative batch shapes`.
+
+## Phase 24 Snapshot Hardware Report
+
+Phase 24 made the current-stack serving harness record hardware identity before
+any server starts. This keeps GB10/workstation Blackwell evidence separate from
+future datacenter-Blackwell reruns.
+
+Script change:
+
+- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` now
+  writes `hardware.txt` after preflight and before the `DRY_RUN=1` exit.
+
+Recorded fields:
+
+- `nvidia-smi -L`;
+- `nvidia-smi --query-gpu=name,driver_version,memory.total,compute_cap`, with
+  fallback to name/driver/memory if `compute_cap` is unavailable;
+- `gpu_name`;
+- `hardware_class`;
+- parity note for that hardware class.
+
+Verification:
+
+- local `bash -n` passed;
+- local `--help` passed;
+- DGX `DRY_RUN=1` validated preflight and wrote `hardware.txt` without launching
+  servers.
+
+Dry-run artifact:
+
+- `/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741`
+
+DGX hardware result:
+
+```text
+GPU 0: NVIDIA GB10
+driver=580.159.03
+compute_cap=12.1
+hardware_class=gb10_or_workstation_blackwell
+```
+
+Decision:
+
+- Future snapshot artifacts are self-describing enough to prevent accidental
+  GB10-to-datacenter generalization.
+- The Phase 20 GB10 closure still applies to `gb10_or_workstation_blackwell`;
+  datacenter Blackwell needs a fresh run of the same methodology.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -129,6 +129,9 @@ python3 /home/mudler/bench/h2h_cli3.py   # OpenAI /v1/completions, ignore_eos, f
 **vLLM side** (for both-engine parity): `~/vllm-bench/bin/vllm` (version **0.23.0**), served `gpu-util 0.85 max-model-len 4096 max-num-seqs 256 tp1`, models `~/bench/q36-35b-a3b-nvfp4-vllm/` and `~/bench/q36-27b-nvfp4-vllm/`.

 **Current-stack serving snapshots use `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`.** It targets the clean `~/llama-phase6-source` mirror, checks docker/`local-ai-worker`/GPU-idle state, uses the owner-file lock, runs pre/post inference gates, then compares paged and vLLM with the same h2h client. The older `dgx:~/bench/combined_definitive.sh` is historical: do not reuse it without first porting away from stale `~/llama-paged-dev` paths and old lock assumptions.
+The harness also writes `hardware.txt` before any server starts, including
+`DRY_RUN=1`, so every new snapshot records the GPU model, driver, compute
+capability when exposed by `nvidia-smi`, and a conservative hardware class.

 ### 3.4 THE DECODE-PROFILING RULE (this trap caused 4 wrong analyses)
 Decode runs as a **replayed CUDA graph**. `nsys` **without** `--cuda-graph-trace=node` collapses each graph replay into ONE opaque launch, so every per-kernel attribution becomes an artifact. This is exactly what made the old "paged 159 us/tok, GPU ~16% busy, host-bound, 5.4x more GPU-efficient" story wrong, and produced the wrong ~56% headline.
@@ -321,6 +324,14 @@ Makefile pin `0ed235ea2c17a19fc8238668653946721ed136fd` produced tree
 `5bdbf8ea3d750fe6fa1f85175fd6357d36222edb`, exactly matching fork branch
 `localai-paged` HEAD `fb9402661 feat(server): trace speculative batch shapes`.

+Phase 24 extended `paged-current-serving-snapshot.sh` to write the snapshot
+hardware report. DGX dry run passed at
+`/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741`; it recorded
+`GPU 0: NVIDIA GB10`, driver `580.159.03`, compute capability `12.1`, and
+`hardware_class=gb10_or_workstation_blackwell`. This makes future parity
+artifacts self-describing: GB10/workstation Blackwell results must not be used
+as datacenter-Blackwell parity evidence.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -384,6 +395,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - historical same-session both-engine run.
 - `~/bench/phase20_current_snapshot/20260701_050621` - current clean-stack paged-vs-vLLM MoE serving snapshot.
 - `~/bench/phase21_harness_dryrun/20260701_051757` - current snapshot harness dry-run artifact.
+- `~/bench/phase24_hardware_report_dryrun/20260701_052741` - current snapshot harness dry run proving `hardware.txt` captures the DGX as `hardware_class=gb10_or_workstation_blackwell`.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -661,6 +661,21 @@ Verification:
 Use this harness for future current-stack GB10 snapshots before making parity
 claims.

+### Phase 24 snapshot hardware report
+
+Phase 24 extended `paged-current-serving-snapshot.sh` to write `hardware.txt`
+after preflight and before any server launch, including in `DRY_RUN=1`. The
+report records `nvidia-smi -L`, GPU name, driver, memory, compute capability
+when available, `hardware_class`, and a parity note for that class.
+
+DGX dry run passed and wrote
+`/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741`. It
+classified the current DGX as `hardware_class=gb10_or_workstation_blackwell`
+with `GPU 0: NVIDIA GB10`, driver `580.159.03`, and compute capability `12.1`.
+
+Use `hardware.txt` when comparing future snapshots. GB10/workstation Blackwell
+results do not establish datacenter-Blackwell parity.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update
--- a/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+++ b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
@@ -29,7 +29,7 @@ Environment overrides:
  VLLM_PORT    vLLM port (default: 8000)
  VLLM_BIN     vLLM executable (default: ~/vllm-bench/bin/vllm)
  SKIP_GATES=1 to skip pre/post paged inference gates
-  DRY_RUN=1    validate inputs/preflight and print commands without running servers
+  DRY_RUN=1    validate inputs/preflight, write hardware.txt, and print commands without running servers
 EOF
 }

@@ -97,6 +97,47 @@ preflight() {
  esac
 }

+write_hardware_report() {
+  local out="$ART/hardware.txt"
+  local gpu_name hardware_class
+
+  gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1 || true)
+  hardware_class="unknown"
+  case "$gpu_name" in
+    *B200*|*B100*|*GB200*) hardware_class="datacenter_blackwell" ;;
+    *H200*|*H100*) hardware_class="datacenter_other" ;;
+    *GB10*|*"DGX Spark"*|*RTX*|*"PRO 6000"*) hardware_class="gb10_or_workstation_blackwell" ;;
+  esac
+
+  {
+    echo "nvidia_smi_L:"
+    nvidia-smi -L || true
+    echo
+    echo "nvidia_smi_query:"
+    if ! nvidia-smi --query-gpu=name,driver_version,memory.total,compute_cap --format=csv,noheader; then
+      nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader || true
+    fi
+    echo
+    echo "gpu_name=$gpu_name"
+    echo "hardware_class=$hardware_class"
+    case "$hardware_class" in
+      datacenter_blackwell)
+        echo "parity_note=datacenter Blackwell hardware: full parity methodology can choose new levers"
+        ;;
+      datacenter_other)
+        echo "parity_note=datacenter non-Blackwell hardware: do not generalize GB10 parity decisions"
+        ;;
+      gb10_or_workstation_blackwell)
+        echo "parity_note=GB10/workstation Blackwell hardware: GB10 shortcut closures apply unless new evidence says otherwise"
+        ;;
+      *)
+        echo "parity_note=unknown hardware: classify before making parity claims"
+        ;;
+    esac
+  } > "$out"
+  log "hardware report: $out"
+}
+
 acquire_lock() {
  mkdir -p "$LOCK_DIR"
  echo "codex-current-serving-snapshot $(date +%s)" > "$OWNER"
@@ -241,6 +282,7 @@ require_path "$VLLM_BIN"
 require_path "$HOME/paged-inference-gates.sh"

 preflight
+write_hardware_report
 log "artifact=$ART"
 log "source=$(git -C "$SRC" log --oneline -1)"

--- a/docs/superpowers/plans/2026-07-01-snapshot-hardware-report-phase24.md
+++ b/docs/superpowers/plans/2026-07-01-snapshot-hardware-report-phase24.md
@@ -0,0 +1,112 @@
+# Snapshot Hardware Report Phase 24 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use
+> superpowers:verification-before-completion before recording the phase result.
+> Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** make current-stack paged-vs-vLLM serving snapshots record the hardware
+class so GB10/workstation Blackwell results are not confused with future
+datacenter-Blackwell parity runs.
+
+**Architecture:** extend the existing current serving snapshot harness with a
+small pre-server hardware report. Keep it additive and outside llama.cpp source:
+no patch-series change, no inference behavior change, and no GPU server launch
+in dry-run mode.
+
+**Tech Stack:** Bash, `nvidia-smi`, DGX GB10.
+
+---
+
+## Task 1: Red Check
+
+- [x] **Step 1: Prove the previous dry-run artifact lacks hardware identity**
+
+  Command:
+
+  ```bash
+  ssh dgx.casa 'test -e ~/bench/phase21_harness_dryrun/20260701_051757/hardware.txt'
+  ```
+
+  Result:
+
+  - exited `1`, confirming the existing harness did not write a hardware report.
+
+## Task 2: Add Hardware Report
+
+- [x] **Step 1: Extend `paged-current-serving-snapshot.sh`**
+
+  File:
+
+  - `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
+
+  Behavior:
+
+  - writes `$ART/hardware.txt` immediately after preflight;
+  - records `nvidia-smi -L`;
+  - records GPU name, driver, memory, and compute capability when available;
+  - falls back to name/driver/memory if `compute_cap` is unavailable;
+  - classifies hardware as `datacenter_blackwell`, `datacenter_other`,
+    `gb10_or_workstation_blackwell`, or `unknown`;
+  - writes a parity note for the detected hardware class;
+  - also runs under `DRY_RUN=1`, before the dry-run exit.
+
+## Task 3: Verify
+
+- [x] **Step 1: Local syntax/help checks**
+
+  Commands:
+
+  ```bash
+  bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+  backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help
+  ```
+
+  Result:
+
+  - both passed.
+
+- [x] **Step 2: DGX dry run**
+
+  Command:
+
+  ```bash
+  DRY_RUN=1 ART=~/bench/phase24_hardware_report_dryrun/20260701_052741 \
+    /tmp/paged-current-serving-snapshot.sh
+  ```
+
+  Result:
+
+  - preflight verified `docker=0`, `local_ai_worker=0`, `compute=0`;
+  - no paged or vLLM server launched;
+  - `hardware.txt` was written.
+
+  Artifact:
+
+  - `/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741`
+
+  Hardware report:
+
+  ```text
+  GPU 0: NVIDIA GB10
+  driver=580.159.03
+  compute_cap=12.1
+  hardware_class=gb10_or_workstation_blackwell
+  ```
+
+## Task 4: Record Result
+
+- [x] **Step 1: Update parity docs**
+
+  Updated files:
+
+  - `backend/cpp/llama-cpp-localai-paged/README.md`
+  - `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+  - `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+  - `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+
+## Self-Review
+
+- No llama.cpp source behavior changed.
+- The harness remains dry-run safe.
+- Future snapshot artifacts now carry enough hardware identity to separate GB10
+  closure evidence from datacenter-Blackwell parity evidence.
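
Review note: the classification added in `write_hardware_report` is a glob match on the GPU name reported by `nvidia-smi`. The sketch below restates that `case` as a standalone function so the patterns can be exercised without a GPU. `classify_gpu` is a hypothetical name that does not appear in the commit, and apart from `NVIDIA GB10` (taken from the dry run) the sample name strings are assumptions about what `nvidia-smi` would print.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): the same case patterns as
# write_hardware_report, but taking the GPU name as an argument instead of
# calling nvidia-smi, so it can run on a machine without a GPU.
classify_gpu() {
  local gpu_name="$1"
  case "$gpu_name" in
    *B200*|*B100*|*GB200*) echo "datacenter_blackwell" ;;
    *H200*|*H100*) echo "datacenter_other" ;;
    *GB10*|*"DGX Spark"*|*RTX*|*"PRO 6000"*) echo "gb10_or_workstation_blackwell" ;;
    *) echo "unknown" ;;
  esac
}

# Illustrative inputs; only "NVIDIA GB10" comes from the recorded dry run.
for name in "NVIDIA GB10" "NVIDIA B200" "NVIDIA H100 80GB HBM3" \
            "NVIDIA GeForce RTX 4090" ""; do
  printf '%s -> %s\n' "${name:-<empty>}" "$(classify_gpu "$name")"
done
```

Because `*RTX*` matches any RTX-branded name, a pre-Blackwell card such as an RTX 4090 would also be labeled `gb10_or_workstation_blackwell` under these patterns. When a snapshot comes from such a machine, the `compute_cap` value recorded in `hardware.txt` is the field to check before relying on the class label.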