mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
chore(paged): record snapshot hardware class
Add a hardware report to the current serving snapshot harness so every paged-vs-vLLM artifact records the GPU identity and conservative hardware class before any server starts. Document the Phase 24 dry run and the GB10 classification for future parity comparisons.

Assisted-by: Codex:gpt-5
@@ -618,6 +618,9 @@ DGX mirror `f2521ab12`, artifact
 Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving
 snapshots. It targets the clean `~/llama-phase6-source` mirror, checks
 docker/`local-ai-worker`/GPU-idle state, uses the owner-file lock, runs pre/post
-inference gates, and emits paged/vLLM ratios. Do not use the stale DGX
+inference gates, writes `hardware.txt`, and emits paged/vLLM ratios.
+`hardware.txt` records the GPU identity and hardware class so GB10/workstation
+Blackwell evidence is not confused with a future datacenter-Blackwell rerun.
+Do not use the stale DGX
 `~/bench/combined_definitive.sh` without first porting it to the current mirror
 and lock discipline.

@@ -1467,3 +1467,50 @@ Decision:

- The patch series is drift-free against fork branch `localai-paged` at
  `fb9402661 feat(server): trace speculative batch shapes`.

## Phase 24 Snapshot Hardware Report

Phase 24 made the current-stack serving harness record hardware identity before
any server starts. This keeps GB10/workstation Blackwell evidence separate from
future datacenter-Blackwell reruns.

Script change:

- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` now
  writes `hardware.txt` after preflight and before the `DRY_RUN=1` exit.

Recorded fields:

- `nvidia-smi -L`;
- `nvidia-smi --query-gpu=name,driver_version,memory.total,compute_cap`, with
  fallback to name/driver/memory if `compute_cap` is unavailable;
- `gpu_name`;
- `hardware_class`;
- parity note for that hardware class.

Verification:

- local `bash -n` passed;
- local `--help` passed;
- DGX `DRY_RUN=1` validated preflight and wrote `hardware.txt` without launching
  servers.

Dry-run artifact:

- `/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741`

DGX hardware result:

```text
GPU 0: NVIDIA GB10
driver=580.159.03
compute_cap=12.1
hardware_class=gb10_or_workstation_blackwell
```

Decision:

- Future snapshot artifacts are self-describing enough to prevent accidental
  GB10-to-datacenter generalization.
- The Phase 20 GB10 closure still applies to `gb10_or_workstation_blackwell`;
  datacenter Blackwell needs a fresh run of the same methodology.

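The decision above can be enforced mechanically before two snapshots are compared. The following is a minimal sketch, assuming each artifact directory contains the `hardware.txt` written by the harness with a `hardware_class=` line. The helper names `hardware_class_of` and `same_hardware_class` are hypothetical and are not part of the harness.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of paged-current-serving-snapshot.sh).

# Print the hardware_class recorded in an artifact directory's hardware.txt.
hardware_class_of() {
    local report="$1/hardware.txt"
    [ -r "$report" ] || return 1
    # Take the first hardware_class=... line and strip the key.
    sed -n 's/^hardware_class=//p' "$report" | head -n 1
}

# Succeed only if both artifacts record the same, known hardware class.
same_hardware_class() {
    local a b
    a=$(hardware_class_of "$1") || return 1
    b=$(hardware_class_of "$2") || return 1
    [ -n "$a" ] && [ "$a" != "unknown" ] && [ "$a" = "$b" ]
}
```

A caller might guard a comparison with `same_hardware_class "$ART_A" "$ART_B" || echo "hardware classes differ: do not compare"`, where `ART_A` and `ART_B` are placeholder artifact paths. Treating `unknown` as a mismatch is a design choice of this sketch, in line with the "classify before making parity claims" note.
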
@@ -129,6 +129,9 @@ python3 /home/mudler/bench/h2h_cli3.py # OpenAI /v1/completions, ignore_eos, f
**vLLM side** (for both-engine parity): `~/vllm-bench/bin/vllm` (version **0.23.0**), served `gpu-util 0.85 max-model-len 4096 max-num-seqs 256 tp1`, models `~/bench/q36-35b-a3b-nvfp4-vllm/` and `~/bench/q36-27b-nvfp4-vllm/`.

**Current-stack serving snapshots use `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`.** It targets the clean `~/llama-phase6-source` mirror, checks docker/`local-ai-worker`/GPU-idle state, uses the owner-file lock, runs pre/post inference gates, then compares paged and vLLM with the same h2h client. The older `dgx:~/bench/combined_definitive.sh` is historical: do not reuse it without first porting away from stale `~/llama-paged-dev` paths and old lock assumptions.
The harness also writes `hardware.txt` before any server starts, including
`DRY_RUN=1`, so every new snapshot records the GPU model, driver, compute
capability when exposed by `nvidia-smi`, and a conservative hardware class.

### 3.4 THE DECODE-PROFILING RULE (this trap caused 4 wrong analyses)
Decode runs as a **replayed CUDA graph**. `nsys` **without** `--cuda-graph-trace=node` collapses each graph replay into ONE opaque launch, so every per-kernel attribution becomes an artifact. This is exactly what made the old "paged 159 us/tok, GPU ~16% busy, host-bound, 5.4x more GPU-efficient" story wrong, and produced the wrong ~56% headline.
@@ -321,6 +324,14 @@ Makefile pin `0ed235ea2c17a19fc8238668653946721ed136fd` produced tree
`5bdbf8ea3d750fe6fa1f85175fd6357d36222edb`, exactly matching fork branch
`localai-paged` HEAD `fb9402661 feat(server): trace speculative batch shapes`.

Phase 24 extended `paged-current-serving-snapshot.sh` to write the snapshot
hardware report. DGX dry run passed at
`/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741`; it recorded
`GPU 0: NVIDIA GB10`, driver `580.159.03`, compute capability `12.1`, and
`hardware_class=gb10_or_workstation_blackwell`. This makes future parity
artifacts self-describing: GB10/workstation Blackwell results must not be used
as datacenter-Blackwell parity evidence.

---

## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -384,6 +395,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - historical same-session both-engine run.
- `~/bench/phase20_current_snapshot/20260701_050621` - current clean-stack paged-vs-vLLM MoE serving snapshot.
- `~/bench/phase21_harness_dryrun/20260701_051757` - current snapshot harness dry-run artifact.
- `~/bench/phase24_hardware_report_dryrun/20260701_052741` - current snapshot harness dry run proving `hardware.txt` captures the DGX as `hardware_class=gb10_or_workstation_blackwell`.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.

@@ -661,6 +661,21 @@ Verification:
Use this harness for future current-stack GB10 snapshots before making parity
claims.

### Phase 24 snapshot hardware report

Phase 24 extended `paged-current-serving-snapshot.sh` to write `hardware.txt`
after preflight and before any server launch, including in `DRY_RUN=1`. The
report records `nvidia-smi -L`, GPU name, driver, memory, compute capability
when available, `hardware_class`, and a parity note for that class.

DGX dry run passed and wrote
`/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741`. It
classified the current DGX as `hardware_class=gb10_or_workstation_blackwell`
with `GPU 0: NVIDIA GB10`, driver `580.159.03`, and compute capability `12.1`.

Use `hardware.txt` when comparing future snapshots. GB10/workstation Blackwell
results do not establish datacenter-Blackwell parity.

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update

@@ -29,7 +29,7 @@ Environment overrides:
   VLLM_PORT      vLLM port (default: 8000)
   VLLM_BIN       vLLM executable (default: ~/vllm-bench/bin/vllm)
   SKIP_GATES=1   to skip pre/post paged inference gates
-  DRY_RUN=1      validate inputs/preflight and print commands without running servers
+  DRY_RUN=1      validate inputs/preflight, write hardware.txt, and print commands without running servers
 EOF
 }

@@ -97,6 +97,47 @@ preflight() {
  esac
}

write_hardware_report() {
  local out="$ART/hardware.txt"
  local gpu_name hardware_class

  gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1 || true)
  hardware_class="unknown"
  case "$gpu_name" in
    *B200*|*B100*|*GB200*) hardware_class="datacenter_blackwell" ;;
    *H200*|*H100*) hardware_class="datacenter_other" ;;
    *GB10*|*"DGX Spark"*|*RTX*|*"PRO 6000"*) hardware_class="gb10_or_workstation_blackwell" ;;
  esac

  {
    echo "nvidia_smi_L:"
    nvidia-smi -L || true
    echo
    echo "nvidia_smi_query:"
    if ! nvidia-smi --query-gpu=name,driver_version,memory.total,compute_cap --format=csv,noheader; then
      nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader || true
    fi
    echo
    echo "gpu_name=$gpu_name"
    echo "hardware_class=$hardware_class"
    case "$hardware_class" in
      datacenter_blackwell)
        echo "parity_note=datacenter Blackwell hardware: full parity methodology can choose new levers"
        ;;
      datacenter_other)
        echo "parity_note=datacenter non-Blackwell hardware: do not generalize GB10 parity decisions"
        ;;
      gb10_or_workstation_blackwell)
        echo "parity_note=GB10/workstation Blackwell hardware: GB10 shortcut closures apply unless new evidence says otherwise"
        ;;
      *)
        echo "parity_note=unknown hardware: classify before making parity claims"
        ;;
    esac
  } > "$out"
  log "hardware report: $out"
}

acquire_lock() {
  mkdir -p "$LOCK_DIR"
  echo "codex-current-serving-snapshot $(date +%s)" > "$OWNER"
@@ -241,6 +282,7 @@ require_path "$VLLM_BIN"
require_path "$HOME/paged-inference-gates.sh"

preflight
write_hardware_report
log "artifact=$ART"
log "source=$(git -C "$SRC" log --oneline -1)"

@@ -0,0 +1,112 @@
# Snapshot Hardware Report Phase 24 Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use
> superpowers:verification-before-completion before recording the phase result.
> Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** make current-stack paged-vs-vLLM serving snapshots record the hardware
class so GB10/workstation Blackwell results are not confused with future
datacenter-Blackwell parity runs.

**Architecture:** extend the existing current serving snapshot harness with a
small pre-server hardware report. Keep it additive and outside llama.cpp source:
no patch-series change, no inference behavior change, and no GPU server launch
in dry-run mode.

**Tech Stack:** Bash, `nvidia-smi`, DGX GB10.

---

## Task 1: Red Check

- [x] **Step 1: Prove the previous dry-run artifact lacks hardware identity**

Command:

```bash
ssh dgx.casa 'test -e ~/bench/phase21_harness_dryrun/20260701_051757/hardware.txt'
```

Result:

- exited `1`, confirming the existing harness did not write a hardware report.

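The red check relies on `test -e` exiting non-zero when the file is absent. The same idea can be applied to every run under one directory at once. This is a minimal sketch, assuming the layout of one subdirectory per run; the function name `missing_hardware_reports` is hypothetical and not part of the harness.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list run directories under ROOT that have no hardware.txt.
missing_hardware_reports() {
    local root="$1" dir
    for dir in "$root"/*/; do
        # If the glob matched nothing, $dir is the literal pattern; skip it.
        [ -d "$dir" ] || continue
        # Print the directory (without the trailing slash) when the report is absent.
        [ -e "${dir}hardware.txt" ] || printf '%s\n' "${dir%/}"
    done
}
```

Running `missing_hardware_reports ~/bench/phase21_harness_dryrun` on the DGX would be one way to list pre-Phase-24 runs that cannot be classified after the fact.
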
## Task 2: Add Hardware Report

- [x] **Step 1: Extend `paged-current-serving-snapshot.sh`**

File:

- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`

Behavior:

- writes `$ART/hardware.txt` immediately after preflight;
- records `nvidia-smi -L`;
- records GPU name, driver, memory, and compute capability when available;
- falls back if `compute_cap` is unavailable in `nvidia-smi`;
- classifies hardware as `datacenter_blackwell`, `datacenter_other`,
  `gb10_or_workstation_blackwell`, or `unknown`;
- writes a parity note for the detected hardware class;
- runs in `DRY_RUN=1` before the script exits.

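The classification step can be exercised without a GPU by pulling the name-to-class mapping into a pure function. The sketch below copies the `case` patterns from `write_hardware_report`; the function name `classify_gpu_name` is hypothetical and the harness itself keeps the mapping inline.

```bash
#!/usr/bin/env bash
# Hypothetical pure-function copy of the harness's GPU-name-to-class mapping.
# It takes the GPU name as an argument instead of calling nvidia-smi.
classify_gpu_name() {
    case "$1" in
        # First match wins, so "GB200" is caught here before the GB10 branch.
        *B200*|*B100*|*GB200*) echo "datacenter_blackwell" ;;
        *H200*|*H100*) echo "datacenter_other" ;;
        *GB10*|*"DGX Spark"*|*RTX*|*"PRO 6000"*) echo "gb10_or_workstation_blackwell" ;;
        # An empty or unrecognized name stays unclassified.
        *) echo "unknown" ;;
    esac
}
```

For example, `classify_gpu_name "NVIDIA GB10"` prints `gb10_or_workstation_blackwell`. Because the third branch matches on `*RTX*`, any RTX-branded name lands in that class as well, including a non-Blackwell name such as `NVIDIA GeForce RTX 4090`, so the recorded `gpu_name` is worth reading alongside `hardware_class`.
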
## Task 3: Verify

- [x] **Step 1: Local syntax/help checks**

Commands:

```bash
bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help
```

Result:

- both passed.

- [x] **Step 2: DGX dry run**

Command:

```bash
DRY_RUN=1 ART=~/bench/phase24_hardware_report_dryrun/20260701_052741 \
/tmp/paged-current-serving-snapshot.sh
```

Result:

- preflight verified `docker=0`, `local_ai_worker=0`, `compute=0`;
- no paged or vLLM server launched;
- `hardware.txt` was written.

Artifact:

- `/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741`

Hardware report:

```text
GPU 0: NVIDIA GB10
driver=580.159.03
compute_cap=12.1
hardware_class=gb10_or_workstation_blackwell
```

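Beyond checking that `hardware.txt` exists, a dry-run artifact can be checked for the section headers and keys that `write_hardware_report` emits. This is a minimal sketch under that assumption; `verify_hardware_report` is a hypothetical name and not part of the harness or its gates.

```bash
#!/usr/bin/env bash
# Hypothetical check: confirm a hardware.txt carries every field the harness emits.
verify_hardware_report() {
    local report="$1" key
    # -s: the file must exist and be non-empty.
    [ -s "$report" ] || { echo "missing or empty: $report" >&2; return 1; }
    for key in 'nvidia_smi_L:' 'nvidia_smi_query:' 'gpu_name=' 'hardware_class=' 'parity_note='; do
        # Each key must start a line somewhere in the report.
        grep -q "^${key}" "$report" || { echo "missing field: $key" >&2; return 1; }
    done
}
```

A call such as `verify_hardware_report "$ART/hardware.txt"` returns non-zero and names the first missing field on stderr, which could serve as an additional verification step after a `DRY_RUN=1` invocation.
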
## Task 4: Record Result

- [x] **Step 1: Update parity docs**

Updated files:

- `backend/cpp/llama-cpp-localai-paged/README.md`
- `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
- `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`

## Self-Review

- No llama.cpp source behavior changed.
- The harness remains dry-run safe.
- Future snapshot artifacts now carry enough hardware identity to separate GB10
  closure evidence from datacenter-Blackwell parity evidence.