diff --git a/backend/cpp/llama-cpp-localai-paged/README.md b/backend/cpp/llama-cpp-localai-paged/README.md index f492af432..838ac17a2 100644 --- a/backend/cpp/llama-cpp-localai-paged/README.md +++ b/backend/cpp/llama-cpp-localai-paged/README.md @@ -618,6 +618,9 @@ DGX mirror `f2521ab12`, artifact Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving snapshots. It targets the clean `~/llama-phase6-source` mirror, checks docker/`local-ai-worker`/GPU-idle state, uses the owner-file lock, runs pre/post -inference gates, and emits paged/vLLM ratios. Do not use the stale DGX +inference gates, writes `hardware.txt`, and emits paged/vLLM ratios. +`hardware.txt` records the GPU identity and hardware class so GB10/workstation +Blackwell evidence is not confused with a future datacenter-Blackwell rerun. +Do not use the stale DGX `~/bench/combined_definitive.sh` without first porting it to the current mirror and lock discipline. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md index 9cce412d4..58761a7cd 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md @@ -1467,3 +1467,50 @@ Decision: - The patch series is drift-free against fork branch `localai-paged` at `fb9402661 feat(server): trace speculative batch shapes`. + +## Phase 24 Snapshot Hardware Report + +Phase 24 made the current-stack serving harness record hardware identity before +any server starts. This keeps GB10/workstation Blackwell evidence separate from +future datacenter-Blackwell reruns. + +Script change: + +- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` now + writes `hardware.txt` after preflight and before the `DRY_RUN=1` exit. 
+ +Recorded fields: + +- `nvidia-smi -L`; +- `nvidia-smi --query-gpu=name,driver_version,memory.total,compute_cap`, with + fallback to name/driver/memory if `compute_cap` is unavailable; +- `gpu_name`; +- `hardware_class`; +- parity note for that hardware class. + +Verification: + +- local `bash -n` passed; +- local `--help` passed; +- DGX `DRY_RUN=1` validated preflight and wrote `hardware.txt` without launching + servers. + +Dry-run artifact: + +- `/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741` + +DGX hardware result: + +```text +GPU 0: NVIDIA GB10 +driver=580.159.03 +compute_cap=12.1 +hardware_class=gb10_or_workstation_blackwell +``` + +Decision: + +- Future snapshot artifacts are self-describing enough to prevent accidental + GB10-to-datacenter generalization. +- The Phase 20 GB10 closure still applies to `gb10_or_workstation_blackwell`; + datacenter Blackwell needs a fresh run of the same methodology. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index 6e09a23d7..5c756a99f 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -129,6 +129,9 @@ python3 /home/mudler/bench/h2h_cli3.py # OpenAI /v1/completions, ignore_eos, f **vLLM side** (for both-engine parity): `~/vllm-bench/bin/vllm` (version **0.23.0**), served `gpu-util 0.85 max-model-len 4096 max-num-seqs 256 tp1`, models `~/bench/q36-35b-a3b-nvfp4-vllm/` and `~/bench/q36-27b-nvfp4-vllm/`. **Current-stack serving snapshots use `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`.** It targets the clean `~/llama-phase6-source` mirror, checks docker/`local-ai-worker`/GPU-idle state, uses the owner-file lock, runs pre/post inference gates, then compares paged and vLLM with the same h2h client. 
The older `dgx:~/bench/combined_definitive.sh` is historical: do not reuse it without first porting away from stale `~/llama-paged-dev` paths and old lock assumptions. +The harness also writes `hardware.txt` before any server starts, including +`DRY_RUN=1`, so every new snapshot records the GPU model, driver, compute +capability when exposed by `nvidia-smi`, and a conservative hardware class. ### 3.4 THE DECODE-PROFILING RULE (this trap caused 4 wrong analyses) Decode runs as a **replayed CUDA graph**. `nsys` **without** `--cuda-graph-trace=node` collapses each graph replay into ONE opaque launch, so every per-kernel attribution becomes an artifact. This is exactly what made the old "paged 159 us/tok, GPU ~16% busy, host-bound, 5.4x more GPU-efficient" story wrong, and produced the wrong ~56% headline. @@ -321,6 +324,14 @@ Makefile pin `0ed235ea2c17a19fc8238668653946721ed136fd` produced tree `5bdbf8ea3d750fe6fa1f85175fd6357d36222edb`, exactly matching fork branch `localai-paged` HEAD `fb9402661 feat(server): trace speculative batch shapes`. +Phase 24 extended `paged-current-serving-snapshot.sh` to write the snapshot +hardware report. DGX dry run passed at +`/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741`; it recorded +`GPU 0: NVIDIA GB10`, driver `580.159.03`, compute capability `12.1`, and +`hardware_class=gb10_or_workstation_blackwell`. This makes future parity +artifacts self-describing: GB10/workstation Blackwell results must not be used +as datacenter-Blackwell parity evidence. + --- ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes) @@ -384,6 +395,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual - `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - historical same-session both-engine run. - `~/bench/phase20_current_snapshot/20260701_050621` - current clean-stack paged-vs-vLLM MoE serving snapshot. 
- `~/bench/phase21_harness_dryrun/20260701_051757` - current snapshot harness dry-run artifact. +- `~/bench/phase24_hardware_report_dryrun/20260701_052741` - current snapshot harness dry run proving `hardware.txt` captures the DGX as `hardware_class=gb10_or_workstation_blackwell`. - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`. - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30. - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md index 9dd0e82d6..2b3a0e176 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md @@ -661,6 +661,21 @@ Verification: Use this harness for future current-stack GB10 snapshots before making parity claims. +### Phase 24 snapshot hardware report + +Phase 24 extended `paged-current-serving-snapshot.sh` to write `hardware.txt` +after preflight and before any server launch, including in `DRY_RUN=1`. The +report records `nvidia-smi -L`, GPU name, driver, memory, compute capability +when available, `hardware_class`, and a parity note for that class. + +DGX dry run passed and wrote +`/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741`. It +classified the current DGX as `hardware_class=gb10_or_workstation_blackwell` +with `GPU 0: NVIDIA GB10`, driver `580.159.03`, and compute capability `12.1`. + +Use `hardware.txt` when comparing future snapshots. GB10/workstation Blackwell +results do not establish datacenter-Blackwell parity. 
+ Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`. ### Phase 10 GDN C32 slab update diff --git a/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh index 730de4960..9ed6277c1 100755 --- a/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +++ b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh @@ -29,7 +29,7 @@ Environment overrides: VLLM_PORT vLLM port (default: 8000) VLLM_BIN vLLM executable (default: ~/vllm-bench/bin/vllm) SKIP_GATES=1 to skip pre/post paged inference gates - DRY_RUN=1 validate inputs/preflight and print commands without running servers + DRY_RUN=1 validate inputs/preflight, write hardware.txt, and print commands without running servers EOF } @@ -97,6 +97,47 @@ preflight() { esac } +write_hardware_report() { + local out="$ART/hardware.txt" + local gpu_name hardware_class + + gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1 || true) + hardware_class="unknown" + case "$gpu_name" in + *B200*|*B100*|*GB200*) hardware_class="datacenter_blackwell" ;; + *H200*|*H100*) hardware_class="datacenter_other" ;; + *GB10*|*"DGX Spark"*|*RTX*|*"PRO 6000"*) hardware_class="gb10_or_workstation_blackwell" ;; + esac + + { + echo "nvidia_smi_L:" + nvidia-smi -L || true + echo + 
echo "nvidia_smi_query:" + if ! nvidia-smi --query-gpu=name,driver_version,memory.total,compute_cap --format=csv,noheader; then + nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader || true + fi + echo + echo "gpu_name=$gpu_name" + echo "hardware_class=$hardware_class" + case "$hardware_class" in + datacenter_blackwell) + echo "parity_note=datacenter Blackwell hardware: full parity methodology can choose new levers" + ;; + datacenter_other) + echo "parity_note=datacenter non-Blackwell hardware: do not generalize GB10 parity decisions" + ;; + gb10_or_workstation_blackwell) + echo "parity_note=GB10/workstation Blackwell hardware: GB10 shortcut closures apply unless new evidence says otherwise" + ;; + *) + echo "parity_note=unknown hardware: classify before making parity claims" + ;; + esac + } > "$out" + log "hardware report: $out" +} + acquire_lock() { mkdir -p "$LOCK_DIR" echo "codex-current-serving-snapshot $(date +%s)" > "$OWNER" @@ -241,6 +282,7 @@ require_path "$VLLM_BIN" require_path "$HOME/paged-inference-gates.sh" preflight +write_hardware_report log "artifact=$ART" log "source=$(git -C "$SRC" log --oneline -1)" diff --git a/docs/superpowers/plans/2026-07-01-snapshot-hardware-report-phase24.md b/docs/superpowers/plans/2026-07-01-snapshot-hardware-report-phase24.md new file mode 100644 index 000000000..411ce1a23 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-snapshot-hardware-report-phase24.md @@ -0,0 +1,112 @@ +# Snapshot Hardware Report Phase 24 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use +> superpowers:verification-before-completion before recording the phase result. +> Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** make current-stack paged-vs-vLLM serving snapshots record the hardware +class so GB10/workstation Blackwell results are not confused with future +datacenter-Blackwell parity runs. 
+ +**Architecture:** extend the existing current serving snapshot harness with a +small pre-server hardware report. Keep it additive and outside llama.cpp source: +no patch-series change, no inference behavior change, and no GPU server launch +in dry-run mode. + +**Tech Stack:** Bash, `nvidia-smi`, DGX GB10. + +--- + +## Task 1: Red Check + +- [x] **Step 1: Prove the previous dry-run artifact lacks hardware identity** + + Command: + + ```bash + ssh dgx.casa 'test -e ~/bench/phase21_harness_dryrun/20260701_051757/hardware.txt' + ``` + + Result: + + - exited `1`, confirming the existing harness did not write a hardware report. + +## Task 2: Add Hardware Report + +- [x] **Step 1: Extend `paged-current-serving-snapshot.sh`** + + File: + + - `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + + Behavior: + + - writes `$ART/hardware.txt` immediately after preflight; + - records `nvidia-smi -L`; + - records GPU name, driver, memory, and compute capability when available; + - falls back if `compute_cap` is unavailable in `nvidia-smi`; + - classifies hardware as `datacenter_blackwell`, `datacenter_other`, + `gb10_or_workstation_blackwell`, or `unknown`; + - writes a parity note for the detected hardware class; + - runs in `DRY_RUN=1` before the script exits. + +## Task 3: Verify + +- [x] **Step 1: Local syntax/help checks** + + Commands: + + ```bash + bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh + backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help + ``` + + Result: + + - both passed. + +- [x] **Step 2: DGX dry run** + + Command: + + ```bash + DRY_RUN=1 ART=~/bench/phase24_hardware_report_dryrun/20260701_052741 \ + /tmp/paged-current-serving-snapshot.sh + ``` + + Result: + + - preflight verified `docker=0`, `local_ai_worker=0`, `compute=0`; + - no paged or vLLM server launched; + - `hardware.txt` was written. 
+ + Artifact: + + - `/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741` + + Hardware report: + + ```text + GPU 0: NVIDIA GB10 + driver=580.159.03 + compute_cap=12.1 + hardware_class=gb10_or_workstation_blackwell + ``` + +## Task 4: Record Result + +- [x] **Step 1: Update parity docs** + + Updated files: + + - `backend/cpp/llama-cpp-localai-paged/README.md` + - `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` + - `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + - `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` + +## Self-Review + +- No llama.cpp source behavior changed. +- The harness remains dry-run safe. +- Future snapshot artifacts now carry enough hardware identity to separate GB10 + closure evidence from datacenter-Blackwell parity evidence.
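Reviewer note: the docs in this change ask readers to use `hardware.txt` when comparing future snapshots. A minimal sketch of how that check might look is below. It assumes only the `hardware_class=<value>` line that `write_hardware_report` emits; the helper names `read_hardware_class` and `comparable_snapshots` are hypothetical and are not part of the harness.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are hypothetical and do not exist in
# paged-current-serving-snapshot.sh. They rely solely on the
# "hardware_class=<value>" line that the harness writes to hardware.txt.

# Print the hardware_class recorded in <snapshot_dir>/hardware.txt,
# or "unknown" when the file or the key is missing.
read_hardware_class() {
  local file="$1/hardware.txt"
  local class=""
  if [ -r "$file" ]; then
    # Take the first matching line; other lines (nvidia-smi output,
    # gpu_name=, parity_note=) are ignored.
    class=$(sed -n 's/^hardware_class=//p' "$file" | head -n 1)
  fi
  printf '%s\n' "${class:-unknown}"
}

# Succeed only when both snapshots record the same, known hardware class.
comparable_snapshots() {
  local a b
  a=$(read_hardware_class "$1")
  b=$(read_hardware_class "$2")
  [ "$a" != "unknown" ] && [ "$a" = "$b" ]
}
```

A comparison script could call `comparable_snapshots "$OLD_ART" "$NEW_ART" || exit 1` before computing paged/vLLM ratios across runs (`OLD_ART` and `NEW_ART` are placeholder variables). Artifacts written before Phase 24, such as the Phase 21 dry run, have no `hardware.txt`, so this sketch treats them as `unknown` and refuses the comparison, which is the conservative choice.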