docs(paged): record audited current snapshot

Record the Phase 26 current-stack paged-vs-vLLM serving snapshot with hardware classification and compact pre/post inference gates.

Assisted-by: Codex:gpt-5
2026-07-03 12:57:02 -04:00 · 2026-07-01 03:48:27 +00:00
parent a0194125f5
commit ace1ffab28
5 changed files with 189 additions and 4 deletions
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -607,13 +607,15 @@ the structural floors, the parity verdict - is

 Latest current-stack MoE serving snapshot (`PTOK=128`, `GEN=64`, current clean
 DGX mirror `f2521ab12`, artifact
-`/home/mudler/bench/phase20_current_snapshot/20260701_050621`):
+`/home/mudler/bench/phase26_audited_snapshot/20260701_053650`). This run
+includes `hardware.txt` and `gate_summary.tsv`; all pre/post gate rows are
+`ok`. Serving results:

 | n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
 |---|------------------|-----------------|-------------------|-----------|----------|----------------|
-| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
-| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
-| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |
+| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% |
+| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% |
+| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% |

 Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving
 snapshots. It targets the clean `~/llama-phase6-source` mirror, checks
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -1571,3 +1571,69 @@ Decision:
  stayed green before and after the paged-vs-vLLM run.
 - Treat `gate_summary.tsv` plus `hardware.txt` as the quick audit surface before
  accepting a parity snapshot.
+
+## Phase 26 Audited Current-Stack Serving Snapshot
+
+Phase 26 ran a full current-stack paged-vs-vLLM MoE serving snapshot with the
+Phase 24/25 audit files enabled.
+
+Artifact:
+
+- `/home/mudler/bench/phase26_audited_snapshot/20260701_053650`
+
+Current source:
+
+- `/home/mudler/llama-phase6-source`
+- `f2521ab12 feat(server): trace speculative batch shapes`
+
+Hardware report:
+
+- `hardware_class=gb10_or_workstation_blackwell`
+- `GPU 0: NVIDIA GB10`
+- driver `580.159.03`
+- compute capability `12.1`
+
+Pre/post gate summary:
+
+| phase | check | status | actual |
+|-------|-------|--------|--------|
+| pre | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| pre | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| pre | `MUL_MAT_ID` | ok | `806/806` |
+| post | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| post | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| post | `MUL_MAT_ID` | ok | `806/806` |
+
+Serving snapshot:
+
+| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
+|---|------------------|-----------------|-------------------|-----------|----------|----------------|
+| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% |
+| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% |
+| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% |
+
+Latency/prefill snapshot:
+
+| n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps |
+|---|---------------|--------------|------------------|--------------------|------------------|
+| 8 | 778.6 | 271.1 | 2.87x | 1679.9 | 4485.6 |
+| 32 | 2607.4 | 749.4 | 3.48x | 1698.8 | 5427.8 |
+| 128 | 7569.6 | 2534.3 | 2.99x | 1668.7 | 5122.0 |
+
+vLLM startup notes:
+
+- vLLM selected the expected GB10 backend mix: FlashInfer FP8 projection
+  kernels, Triton/FLA GDN prefill, FlashAttention, and MARLIN NVFP4 MoE.
+- Startup was long because the server loaded three checkpoint shards, loaded
+  cached torch-compile graphs, ran FlashInfer fp8 GEMM autotuning, and captured
+  CUDA graphs before the API became ready.
+
+Decision:
+
+- The audited current stack is still not at vLLM serving parity on GB10.
+- The Phase 20 conclusion is reproduced with stronger audit artifacts:
+  `hardware.txt`, `gate_summary.tsv`, pre/post full gates, and same-session
+  paged/vLLM ratios.
+- Current paged/vLLM decode ratios are `81.5%` at n8, `69.0%` at n32, and
+  `65.7%` at n128; end-to-end aggregate ratios are `70.6%`, `54.6%`, and
+  `49.4%`.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -342,6 +342,28 @@ backfilled on the Phase 20 artifact at
 it records pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806` as `ok`.

+Phase 26 ran the full audited current-stack snapshot with `hardware.txt`,
+pre/post gates, same-session paged and vLLM serving runs, `summary.tsv`, and
+`gate_summary.tsv`. Artifact:
+`/home/mudler/bench/phase26_audited_snapshot/20260701_053650`. Hardware was
+recorded as `hardware_class=gb10_or_workstation_blackwell`, GPU `NVIDIA GB10`,
+driver `580.159.03`, compute capability `12.1`. Every compact gate row was
+`ok`: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`, both before and
+after the serving run.
+
+Audited current MoE serving snapshot (`PTOK=128`, `GEN=64`):
+
+| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
+|---|------------------|-----------------|-------------------|-----------|----------|----------------|
+| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% |
+| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% |
+| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% |
+
+Use Phase 26 as the current audit-grade GB10 snapshot. It keeps the Phase 20
+verdict intact, but the artifact is more useful for future regressions because
+it carries hardware classification and compact pre/post inference gates.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -407,6 +429,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase21_harness_dryrun/20260701_051757` - current snapshot harness dry-run artifact.
 - `~/bench/phase24_hardware_report_dryrun/20260701_052741` - current snapshot harness dry run proving `hardware.txt` captures the DGX as `hardware_class=gb10_or_workstation_blackwell`.
 - `~/bench/phase25_gate_summary_dryrun/20260701_053353` - dry run after adding `gate_summary.tsv` support; normal dry-run still writes `hardware.txt` and does not emit a gate summary before gates exist.
+- `~/bench/phase26_audited_snapshot/20260701_053650` - current audit-grade full paged-vs-vLLM MoE serving snapshot with `hardware.txt`, pre/post gates, `summary.tsv`, and `gate_summary.tsv`.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -691,6 +691,31 @@ It records pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
 Use `hardware.txt` plus `gate_summary.tsv` as the quick audit surface before
 accepting any new parity snapshot.

+### Phase 26 audited current-stack snapshot
+
+Phase 26 ran the full current-stack paged-vs-vLLM MoE serving snapshot with the
+Phase 24/25 audit files enabled:
+`/home/mudler/bench/phase26_audited_snapshot/20260701_053650`.
+
+The artifact records `hardware_class=gb10_or_workstation_blackwell` on GPU
+`NVIDIA GB10` with driver `580.159.03` and compute capability `12.1`.
+`gate_summary.tsv` reports every pre/post gate as `ok`: MoE md5
+`8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+
+Audited MoE serving result (`PTOK=128`, `GEN=64`):
+
+| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
+|---|------------------|-----------------|-------------------|-----------|----------|----------------|
+| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% |
+| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% |
+| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% |
+
+Decision: the latest audited clean-stack run still does not reach vLLM serving
+parity on GB10. Treat Phase 26 as the current benchmark baseline before funding
+new kernel work, and keep md5/op gates as the first check when changing the
+patch stack.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update
--- a/docs/superpowers/plans/2026-07-01-audited-current-stack-snapshot-phase26.md
+++ b/docs/superpowers/plans/2026-07-01-audited-current-stack-snapshot-phase26.md
@@ -0,0 +1,69 @@
+# Audited Current Stack Snapshot Phase 26 Plan
+
+**Date:** 2026-07-01
+**Phase:** 26
+**Goal:** run the reusable current-stack paged-vs-vLLM serving harness end to
+end on the DGX, with hardware and compact inference gates attached to the
+artifact, so throughput comparisons cannot hide an inference regression.
+
+## Context
+
+Phase 20 refreshed the current-stack serving numbers. Phase 24 added
+`hardware.txt`; Phase 25 added `gate_summary.tsv`. Phase 26 is the first full
+serving run that uses both audit surfaces in one artifact.
+
+## Checklist
+
+- [x] **Step 1: Preflight DGX**
+  - Verified no running docker containers before launch.
+  - Verified no `local-ai-worker` container before launch.
+  - Verified no active GPU compute processes before launch.
+  - Used the owner-file GPU lock protocol.
+
+- [x] **Step 2: Launch full current-stack snapshot**
+  - Ran `paged-current-serving-snapshot.sh` from the LocalAI worktree copy.
+  - Target source: `dgx:~/llama-phase6-source`.
+  - Source HEAD: `f2521ab12 feat(server): trace speculative batch shapes`.
+  - Artifact: `/home/mudler/bench/phase26_audited_snapshot/20260701_053650`.
+
+- [x] **Step 3: Preserve hardware evidence**
+  - `hardware.txt` recorded `hardware_class=gb10_or_workstation_blackwell`.
+  - `hardware.txt` recorded `GPU 0: NVIDIA GB10`.
+  - Driver: `580.159.03`.
+  - Compute capability: `12.1`.
+
+- [x] **Step 4: Gate inference before and after serving**
+  - Pre MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Pre dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+  - Pre `MUL_MAT_ID`: `806/806`.
+  - Post MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Post dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+  - Post `MUL_MAT_ID`: `806/806`.
+  - `gate_summary.tsv` records all rows as `ok`.
+
+- [x] **Step 5: Capture same-session serving numbers**
+  - Paged and vLLM ran in the same session and artifact with the same h2h client.
+  - `summary.tsv` records the aggregate, decode, per-sequence, TTFT, and prefill
+    rows plus ratios.
+
+- [x] **Step 6: Record results in project docs**
+  - Updated `README.md` with Phase 26 as the latest current-stack snapshot.
+  - Updated `GB10_PARITY_PHASE0_RESULTS.md` with the full audited result.
+  - Updated `PARITY_HANDOFF.md` with the operational handoff result and artifact
+    index.
+  - Updated `VLLM_PARITY_LEVER_MAP.md` with the current benchmark baseline.
+
+## Result
+
+Phase 26 confirms that the current clean stack still does not reach vLLM serving
+parity on GB10, while the inference gates remain green before and after the
+serving benchmark.
+
+| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
+|---|------------------|-----------------|-------------------|-----------|----------|----------------|
+| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% |
+| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% |
+| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% |
+
+Treat `/home/mudler/bench/phase26_audited_snapshot/20260701_053650` as the
+current audit-grade GB10 baseline.