chore(paged): record snapshot hardware class

Add a hardware report to the current serving snapshot harness so every paged-vs-vLLM artifact records the GPU identity and conservative hardware class before any server starts. Document the Phase 24 dry run and the GB10 classification for future parity comparisons.

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 03:31:11 +00:00
parent 7aa15ce539
commit 7108b68a70
6 changed files with 233 additions and 2 deletions


@@ -618,6 +618,9 @@ DGX mirror `f2521ab12`, artifact
Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving
snapshots. It targets the clean `~/llama-phase6-source` mirror, checks
docker/`local-ai-worker`/GPU-idle state, uses the owner-file lock, runs pre/post
-inference gates, and emits paged/vLLM ratios. Do not use the stale DGX
+inference gates, writes `hardware.txt`, and emits paged/vLLM ratios.
+`hardware.txt` records the GPU identity and hardware class so GB10/workstation
+Blackwell evidence is not confused with a future datacenter-Blackwell rerun.
+Do not use the stale DGX
`~/bench/combined_definitive.sh` without first porting it to the current mirror
and lock discipline.


@@ -1467,3 +1467,50 @@ Decision:
- The patch series is drift-free against fork branch `localai-paged` at
`fb9402661 feat(server): trace speculative batch shapes`.
## Phase 24 Snapshot Hardware Report

Phase 24 made the current-stack serving harness record hardware identity before
any server starts. This keeps GB10/workstation Blackwell evidence separate from
future datacenter-Blackwell reruns.

Script change:

- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` now
  writes `hardware.txt` after preflight and before the `DRY_RUN=1` exit.

Recorded fields:

- `nvidia-smi -L`;
- `nvidia-smi --query-gpu=name,driver_version,memory.total,compute_cap`, with
  fallback to name/driver/memory if `compute_cap` is unavailable;
- `gpu_name`;
- `hardware_class`;
- parity note for that hardware class.

Verification:

- local `bash -n` passed;
- local `--help` passed;
- DGX `DRY_RUN=1` validated preflight and wrote `hardware.txt` without launching
  servers.

Dry-run artifact:

- `/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741`

DGX hardware result (condensed from `hardware.txt`):
```text
GPU 0: NVIDIA GB10
driver=580.159.03
compute_cap=12.1
hardware_class=gb10_or_workstation_blackwell
```
Decision:
- Future snapshot artifacts are self-describing enough to prevent accidental
GB10-to-datacenter generalization.
- The Phase 20 GB10 closure still applies to `gb10_or_workstation_blackwell`;
datacenter Blackwell needs a fresh run of the same methodology.
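
One way a comparison script could enforce this decision is to refuse to compute paged/vLLM ratios across snapshots whose `hardware_class` differs. The sketch below is illustrative and not part of the harness: `hardware_class_of` and `same_hardware_class` are hypothetical helper names, and the only thing it assumes about `hardware.txt` is the `hardware_class=` line that `write_hardware_report` emits.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not in the harness): read hardware_class from a
# snapshot directory and compare two snapshots before any parity math.

hardware_class_of() {
  # Print the value of the first hardware_class= line, or "unknown" when
  # the file or the key is missing.
  local file="$1/hardware.txt" cls=""
  if [ -r "$file" ]; then
    cls=$(sed -n 's/^hardware_class=//p' "$file" | head -n 1)
  fi
  echo "${cls:-unknown}"
}

same_hardware_class() {
  # Succeed only when both snapshots report the same, known class.
  local a b
  a=$(hardware_class_of "$1")
  b=$(hardware_class_of "$2")
  [ "$a" != "unknown" ] && [ "$a" = "$b" ]
}
```

A caller might guard its ratio step with `same_hardware_class "$OLD_ART" "$NEW_ART" || exit 1`, where both variables are placeholders for snapshot directories. Treating `unknown` as a mismatch is a deliberately conservative choice in this sketch.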


@@ -129,6 +129,9 @@ python3 /home/mudler/bench/h2h_cli3.py # OpenAI /v1/completions, ignore_eos, f
**vLLM side** (for both-engine parity): `~/vllm-bench/bin/vllm` (version **0.23.0**), served `gpu-util 0.85 max-model-len 4096 max-num-seqs 256 tp1`, models `~/bench/q36-35b-a3b-nvfp4-vllm/` and `~/bench/q36-27b-nvfp4-vllm/`.
**Current-stack serving snapshots use `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`.** It targets the clean `~/llama-phase6-source` mirror, checks docker/`local-ai-worker`/GPU-idle state, uses the owner-file lock, runs pre/post inference gates, then compares paged and vLLM with the same h2h client. The older `dgx:~/bench/combined_definitive.sh` is historical: do not reuse it without first porting away from stale `~/llama-paged-dev` paths and old lock assumptions.
The harness also writes `hardware.txt` before any server starts, including
`DRY_RUN=1`, so every new snapshot records the GPU model, driver, compute
capability when exposed by `nvidia-smi`, and a conservative hardware class.
### 3.4 THE DECODE-PROFILING RULE (this trap caused 4 wrong analyses)
Decode runs as a **replayed CUDA graph**. `nsys` **without** `--cuda-graph-trace=node` collapses each graph replay into ONE opaque launch, so every per-kernel attribution becomes an artifact. This is exactly what made the old "paged 159 us/tok, GPU ~16% busy, host-bound, 5.4x more GPU-efficient" story wrong, and produced the wrong ~56% headline.
@@ -321,6 +324,14 @@ Makefile pin `0ed235ea2c17a19fc8238668653946721ed136fd` produced tree
`5bdbf8ea3d750fe6fa1f85175fd6357d36222edb`, exactly matching fork branch
`localai-paged` HEAD `fb9402661 feat(server): trace speculative batch shapes`.
Phase 24 extended `paged-current-serving-snapshot.sh` to write the snapshot
hardware report. DGX dry run passed at
`/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741`; it recorded
`GPU 0: NVIDIA GB10`, driver `580.159.03`, compute capability `12.1`, and
`hardware_class=gb10_or_workstation_blackwell`. This makes future parity
artifacts self-describing: GB10/workstation Blackwell results must not be used
as datacenter-Blackwell parity evidence.
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -384,6 +395,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - historical same-session both-engine run.
- `~/bench/phase20_current_snapshot/20260701_050621` - current clean-stack paged-vs-vLLM MoE serving snapshot.
- `~/bench/phase21_harness_dryrun/20260701_051757` - current snapshot harness dry-run artifact.
- `~/bench/phase24_hardware_report_dryrun/20260701_052741` - current snapshot harness dry run proving `hardware.txt` captures the DGX as `hardware_class=gb10_or_workstation_blackwell`.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.


@@ -661,6 +661,21 @@ Verification:
Use this harness for future current-stack GB10 snapshots before making parity
claims.
### Phase 24 snapshot hardware report
Phase 24 extended `paged-current-serving-snapshot.sh` to write `hardware.txt`
after preflight and before any server launch, including in `DRY_RUN=1`. The
report records `nvidia-smi -L`, GPU name, driver, memory, compute capability
when available, `hardware_class`, and a parity note for that class.
DGX dry run passed and wrote
`/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741`. It
classified the current DGX as `hardware_class=gb10_or_workstation_blackwell`
with `GPU 0: NVIDIA GB10`, driver `580.159.03`, and compute capability `12.1`.
Use `hardware.txt` when comparing future snapshots. GB10/workstation Blackwell
results do not establish datacenter-Blackwell parity.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update


@@ -29,7 +29,7 @@ Environment overrides:
VLLM_PORT vLLM port (default: 8000)
VLLM_BIN vLLM executable (default: ~/vllm-bench/bin/vllm)
SKIP_GATES=1 to skip pre/post paged inference gates
-DRY_RUN=1 validate inputs/preflight and print commands without running servers
+DRY_RUN=1 validate inputs/preflight, write hardware.txt, and print commands without running servers
EOF
}
@@ -97,6 +97,47 @@ preflight() {
esac
}
write_hardware_report() {
  local out="$ART/hardware.txt"
  local gpu_name hardware_class
  gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1 || true)
  hardware_class="unknown"
  case "$gpu_name" in
    *B200*|*B100*|*GB200*) hardware_class="datacenter_blackwell" ;;
    *H200*|*H100*) hardware_class="datacenter_other" ;;
    *GB10*|*"DGX Spark"*|*RTX*|*"PRO 6000"*) hardware_class="gb10_or_workstation_blackwell" ;;
  esac
  {
    echo "nvidia_smi_L:"
    nvidia-smi -L || true
    echo
    echo "nvidia_smi_query:"
    if ! nvidia-smi --query-gpu=name,driver_version,memory.total,compute_cap --format=csv,noheader; then
      nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader || true
    fi
    echo
    echo "gpu_name=$gpu_name"
    echo "hardware_class=$hardware_class"
    case "$hardware_class" in
      datacenter_blackwell)
        echo "parity_note=datacenter Blackwell hardware: full parity methodology can choose new levers"
        ;;
      datacenter_other)
        echo "parity_note=datacenter non-Blackwell hardware: do not generalize GB10 parity decisions"
        ;;
      gb10_or_workstation_blackwell)
        echo "parity_note=GB10/workstation Blackwell hardware: GB10 shortcut closures apply unless new evidence says otherwise"
        ;;
      *)
        echo "parity_note=unknown hardware: classify before making parity claims"
        ;;
    esac
  } > "$out"
  log "hardware report: $out"
}
acquire_lock() {
mkdir -p "$LOCK_DIR"
echo "codex-current-serving-snapshot $(date +%s)" > "$OWNER"
@@ -241,6 +282,7 @@ require_path "$VLLM_BIN"
require_path "$HOME/paged-inference-gates.sh"
preflight
write_hardware_report
log "artifact=$ART"
log "source=$(git -C "$SRC" log --oneline -1)"
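
The classification step in `write_hardware_report` can be exercised without a GPU by lifting its `case` statement into a function that takes the GPU name as an argument. The sketch below does that. `classify_gpu` is a hypothetical name that does not exist in the harness, the patterns are copied from the hunk above, and the sample GPU names in the loop are illustrative inputs rather than recorded results.

```bash
#!/usr/bin/env bash
# Hypothetical standalone copy of the harness's classification patterns.
classify_gpu() {
  case "$1" in
    *B200*|*B100*|*GB200*) echo "datacenter_blackwell" ;;
    *H200*|*H100*) echo "datacenter_other" ;;
    *GB10*|*"DGX Spark"*|*RTX*|*"PRO 6000"*) echo "gb10_or_workstation_blackwell" ;;
    *) echo "unknown" ;;
  esac
}

# Illustrative names only; the DGX dry run recorded "NVIDIA GB10".
for name in "NVIDIA GB10" "NVIDIA B200" "NVIDIA H100 80GB HBM3" "Tesla T4"; do
  printf '%s -> %s\n' "$name" "$(classify_gpu "$name")"
done
```

Because `*RTX*` matches any name containing `RTX`, a pre-Blackwell RTX card would also be labeled `gb10_or_workstation_blackwell`, so it is worth reading `gpu_name` alongside `hardware_class` when interpreting a report.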


@@ -0,0 +1,112 @@
# Snapshot Hardware Report Phase 24 Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use
> superpowers:verification-before-completion before recording the phase result.
> Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** make current-stack paged-vs-vLLM serving snapshots record the hardware
class so GB10/workstation Blackwell results are not confused with future
datacenter-Blackwell parity runs.

**Architecture:** extend the existing current serving snapshot harness with a
small pre-server hardware report. Keep it additive and outside llama.cpp source:
no patch-series change, no inference behavior change, and no GPU server launch
in dry-run mode.

**Tech Stack:** Bash, `nvidia-smi`, DGX GB10.

---

## Task 1: Red Check
- [x] **Step 1: Prove the previous dry-run artifact lacks hardware identity**
Command:
```bash
ssh dgx.casa 'test -e ~/bench/phase21_harness_dryrun/20260701_051757/hardware.txt'
```
Result:
- exited `1`, confirming the existing harness did not write a hardware report.
## Task 2: Add Hardware Report
- [x] **Step 1: Extend `paged-current-serving-snapshot.sh`**
File:
- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
Behavior:
- writes `$ART/hardware.txt` immediately after preflight;
- records `nvidia-smi -L`;
- records GPU name, driver, memory, and compute capability when available;
- falls back if `compute_cap` is unavailable in `nvidia-smi`;
- classifies hardware as `datacenter_blackwell`, `datacenter_other`,
`gb10_or_workstation_blackwell`, or `unknown`;
- writes a parity note for the detected hardware class;
- runs in `DRY_RUN=1` before the script exits.
## Task 3: Verify
- [x] **Step 1: Local syntax/help checks**
Commands:
```bash
bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help
```
Result:
- both passed.
- [x] **Step 2: DGX dry run**
Command:
```bash
DRY_RUN=1 ART=~/bench/phase24_hardware_report_dryrun/20260701_052741 \
/tmp/paged-current-serving-snapshot.sh
```
Result:
- preflight verified `docker=0`, `local_ai_worker=0`, `compute=0`;
- no paged or vLLM server launched;
- `hardware.txt` was written.
Artifact:
- `/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741`
Hardware report (condensed from `hardware.txt`):
```text
GPU 0: NVIDIA GB10
driver=580.159.03
compute_cap=12.1
hardware_class=gb10_or_workstation_blackwell
```
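
The red check in Task 1 has a natural green counterpart: confirm that the new artifact contains `hardware.txt` with the keys the harness writes. This is a minimal sketch, assuming the `gpu_name=`, `hardware_class=`, and `parity_note=` lines emitted by `write_hardware_report`. `check_hardware_report` is a hypothetical helper, not part of the harness.

```bash
#!/usr/bin/env bash
# Hypothetical green check: succeed only if the artifact directory holds a
# non-empty hardware.txt containing every expected key.
check_hardware_report() {
  local file="$1/hardware.txt" key
  # The report must exist and be non-empty.
  [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
  # Each key must start a line in KEY=value form.
  for key in gpu_name hardware_class parity_note; do
    grep -q "^${key}=" "$file" || { echo "missing key: $key" >&2; return 1; }
  done
}
```

On the DGX this could be run as `check_hardware_report ~/bench/phase24_hardware_report_dryrun/20260701_052741`; against the Phase 21 artifact it would be expected to fail, matching the red check.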
## Task 4: Record Result
- [x] **Step 1: Update parity docs**
Updated files:
- `backend/cpp/llama-cpp-localai-paged/README.md`
- `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
- `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
## Self-Review
- No llama.cpp source behavior changed.
- The harness remains dry-run safe.
- Future snapshot artifacts now carry enough hardware identity to separate GB10
closure evidence from datacenter-Blackwell parity evidence.