From ace1ffab28cb42c3aa1423df5bb51a0148cfaead Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Wed, 1 Jul 2026 03:48:27 +0000
Subject: [PATCH] docs(paged): record audited current snapshot

Record the Phase 26 current-stack paged-vs-vLLM serving snapshot with
hardware classification and compact pre/post inference gates.

Assisted-by: Codex:gpt-5
---
 backend/cpp/llama-cpp-localai-paged/README.md | 10 +--
 .../docs/GB10_PARITY_PHASE0_RESULTS.md | 66 ++++++++++++++++++
 .../docs/PARITY_HANDOFF.md | 23 +++++++
 .../docs/VLLM_PARITY_LEVER_MAP.md | 25 +++++++
 ...-audited-current-stack-snapshot-phase26.md | 69 +++++++++++++++++++
 5 files changed, 189 insertions(+), 4 deletions(-)
 create mode 100644 docs/superpowers/plans/2026-07-01-audited-current-stack-snapshot-phase26.md

diff --git a/backend/cpp/llama-cpp-localai-paged/README.md b/backend/cpp/llama-cpp-localai-paged/README.md
index 6755f001f..6b8eee31e 100644
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -607,13 +607,15 @@ the structural floors, the parity verdict - is
 
 Latest current-stack MoE serving snapshot (`PTOK=128`, `GEN=64`, current clean
 DGX mirror `f2521ab12`, artifact
-`/home/mudler/bench/phase20_current_snapshot/20260701_050621`):
+`/home/mudler/bench/phase26_audited_snapshot/20260701_053650`). This run
+includes `hardware.txt` and `gate_summary.tsv`; all pre/post gate rows are
+`ok`:
 
 | n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
 |---|------------------|-----------------|-------------------|-----------|----------|----------------|
-| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
-| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
-| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |
+| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% |
+| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% |
+| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% |
 
 Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving
 snapshots. It targets the clean `~/llama-phase6-source` mirror, checks
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index 8a777ad92..04909be9d 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -1571,3 +1571,69 @@ Decision:
   stayed green before and after the paged-vs-vLLM run.
 - Treat `gate_summary.tsv` plus `hardware.txt` as the quick audit surface before
   accepting a parity snapshot.
+
+## Phase 26 Audited Current-Stack Serving Snapshot
+
+Phase 26 ran a full current-stack paged-vs-vLLM MoE serving snapshot with the
+Phase 24/25 audit files enabled.
+
+Artifact:
+
+- `/home/mudler/bench/phase26_audited_snapshot/20260701_053650`
+
+Current source:
+
+- `/home/mudler/llama-phase6-source`
+- `f2521ab12 feat(server): trace speculative batch shapes`
+
+Hardware report:
+
+- `hardware_class=gb10_or_workstation_blackwell`
+- `GPU 0: NVIDIA GB10`
+- driver `580.159.03`
+- compute capability `12.1`
+
+Pre/post gate summary:
+
+| phase | check | status | actual |
+|-------|-------|--------|--------|
+| pre | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| pre | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| pre | `MUL_MAT_ID` | ok | `806/806` |
+| post | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| post | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| post | `MUL_MAT_ID` | ok | `806/806` |
+
+Serving snapshot:
+
+| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
+|---|------------------|-----------------|-------------------|-----------|----------|----------------|
+| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% |
+| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% |
+| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% |
+
+Latency/prefill snapshot:
+
+| n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps |
+|---|---------------|--------------|------------------|--------------------|------------------|
+| 8 | 778.6 | 271.1 | 2.87x | 1679.9 | 4485.6 |
+| 32 | 2607.4 | 749.4 | 3.48x | 1698.8 | 5427.8 |
+| 128 | 7569.6 | 2534.3 | 2.99x | 1668.7 | 5122.0 |
+
+vLLM startup notes:
+
+- vLLM selected the expected GB10 backend mix: FlashInfer FP8 projection
+  kernels, Triton/FLA GDN prefill, FlashAttention, and MARLIN NVFP4 MoE.
+- Startup was long because the server loaded three checkpoint shards, loaded
+  cached torch-compile graphs, ran FlashInfer fp8 GEMM autotuning, and captured
+  CUDA graphs before the API became ready.
+
+Decision:
+
+- The audited current stack still is not at vLLM serving parity on GB10.
+- The Phase 20 conclusion is reproduced with stronger audit artifacts:
+  `hardware.txt`, `gate_summary.tsv`, pre/post full gates, and same-session
+  paged/vLLM ratios.
+- Current paged/vLLM decode ratios remain about `81.5%` at n8, `69.0%` at n32,
+  and `65.7%` at n128; e2e aggregate ratios remain about `70.6%`, `54.6%`,
+  and `49.4%`.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index cda3918bd..8b0b5a26e 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -342,6 +342,28 @@ backfilled on the Phase 20 artifact at
 it records pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806` as `ok`.
 
+Phase 26 ran the full audited current-stack snapshot with `hardware.txt`,
+pre/post gates, same-session paged and vLLM serving runs, `summary.tsv`, and
+`gate_summary.tsv`. Artifact:
+`/home/mudler/bench/phase26_audited_snapshot/20260701_053650`. Hardware was
+recorded as `hardware_class=gb10_or_workstation_blackwell`, GPU `NVIDIA GB10`,
+driver `580.159.03`, compute capability `12.1`. Every compact gate row was
+`ok`: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`, both before and
+after the serving run.
+
+Audited current MoE serving snapshot (`PTOK=128`, `GEN=64`):
+
+| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
+|---|------------------|-----------------|-------------------|-----------|----------|----------------|
+| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% |
+| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% |
+| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% |
+
+Use Phase 26 as the current audit-grade GB10 snapshot. It keeps the Phase 20
+verdict intact, but the artifact is more useful for future regressions because
+it carries hardware classification and compact pre/post inference gates.
+
 ---
 
 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -407,6 +429,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase21_harness_dryrun/20260701_051757` - current snapshot harness dry-run artifact.
 - `~/bench/phase24_hardware_report_dryrun/20260701_052741` - current snapshot harness dry run proving `hardware.txt` captures the DGX as `hardware_class=gb10_or_workstation_blackwell`.
 - `~/bench/phase25_gate_summary_dryrun/20260701_053353` - dry run after adding `gate_summary.tsv` support; normal dry-run still writes `hardware.txt` and does not emit a gate summary before gates exist.
+- `~/bench/phase26_audited_snapshot/20260701_053650` - current audit-grade full paged-vs-vLLM MoE serving snapshot with `hardware.txt`, pre/post gates, `summary.tsv`, and `gate_summary.tsv`.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 552ef6c0b..25ec6b3b8 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -691,6 +691,31 @@ It records pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
 Use `hardware.txt` plus `gate_summary.tsv` as the quick audit surface before
 accepting any new parity snapshot.
 
+### Phase 26 audited current-stack snapshot
+
+Phase 26 ran the full current-stack paged-vs-vLLM MoE serving snapshot with the
+Phase 24/25 audit files enabled:
+`/home/mudler/bench/phase26_audited_snapshot/20260701_053650`.
+
+The artifact records `hardware_class=gb10_or_workstation_blackwell` on GPU
+`NVIDIA GB10` with driver `580.159.03` and compute capability `12.1`.
+`gate_summary.tsv` reports every pre/post gate as `ok`: MoE md5
+`8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+
+Audited MoE serving result (`PTOK=128`, `GEN=64`):
+
+| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
+|---|------------------|-----------------|-------------------|-----------|----------|----------------|
+| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% |
+| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% |
+| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% |
+
+Decision: the latest audited clean-stack run still does not reach vLLM serving
+parity on GB10. Treat Phase 26 as the current benchmark baseline before funding
+new kernel work, and keep md5/op gates as the first check when changing the
+patch stack.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
diff --git a/docs/superpowers/plans/2026-07-01-audited-current-stack-snapshot-phase26.md b/docs/superpowers/plans/2026-07-01-audited-current-stack-snapshot-phase26.md
new file mode 100644
index 000000000..57673cedd
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-audited-current-stack-snapshot-phase26.md
@@ -0,0 +1,69 @@
+# Audited Current Stack Snapshot Phase 26 Plan
+
+**Date:** 2026-07-01
+**Phase:** 26
+**Goal:** run the reusable current-stack paged-vs-vLLM serving harness end to
+end on the DGX, with hardware and compact inference gates attached to the
+artifact, so throughput comparisons cannot hide an inference regression.
+
+## Context
+
+Phase 20 refreshed the current-stack serving numbers. Phase 24 added
+`hardware.txt`; Phase 25 added `gate_summary.tsv`. Phase 26 is the first full
+serving run that uses both audit surfaces in one artifact.
+
+## Checklist
+
+- [x] **Step 1: Preflight DGX**
+  - Verified no running docker containers before launch.
+  - Verified no `local-ai-worker` container before launch.
+  - Verified no active GPU compute processes before launch.
+  - Used the owner-file GPU lock protocol.
+
+- [x] **Step 2: Launch full current-stack snapshot**
+  - Ran `paged-current-serving-snapshot.sh` from the LocalAI worktree copy.
+  - Target source: `dgx:~/llama-phase6-source`.
+  - Source HEAD: `f2521ab12 feat(server): trace speculative batch shapes`.
+  - Artifact: `/home/mudler/bench/phase26_audited_snapshot/20260701_053650`.
+
+- [x] **Step 3: Preserve hardware evidence**
+  - `hardware.txt` recorded `hardware_class=gb10_or_workstation_blackwell`.
+  - `hardware.txt` recorded `GPU 0: NVIDIA GB10`.
+  - Driver: `580.159.03`.
+  - Compute capability: `12.1`.
+
+- [x] **Step 4: Gate inferencing before and after serving**
+  - Pre MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Pre dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+  - Pre `MUL_MAT_ID`: `806/806`.
+  - Post MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Post dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+  - Post `MUL_MAT_ID`: `806/806`.
+  - `gate_summary.tsv` records all rows as `ok`.
+
+- [x] **Step 5: Capture same-session serving numbers**
+  - Paged and vLLM were run in the same artifact with the same h2h client.
+  - `summary.tsv` records the aggregate, decode, per-sequence, TTFT, and prefill
+    rows plus ratios.
+
+- [x] **Step 6: Record results in project docs**
+  - Updated `README.md` with Phase 26 as the latest current-stack snapshot.
+  - Updated `GB10_PARITY_PHASE0_RESULTS.md` with the full audited result.
+  - Updated `PARITY_HANDOFF.md` with the operational handoff result and artifact
+    index.
+  - Updated `VLLM_PARITY_LEVER_MAP.md` with the current benchmark baseline.
+
+## Result
+
+Phase 26 confirms that the current clean stack still does not reach vLLM serving
+parity on GB10, while the inference gates remain green before and after the
+serving benchmark.
+
+| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
+|---|------------------|-----------------|-------------------|-----------|----------|----------------|
+| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% |
+| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% |
+| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% |
+
+Treat `/home/mudler/bench/phase26_audited_snapshot/20260701_053650` as the
+current audit-grade GB10 baseline.
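
Reviewer aid (not part of the patch): the paged/vLLM percentages recorded above can be recomputed from the throughput columns. The following is a minimal Python sketch, assuming each ratio is simply paged throughput divided by vLLM throughput, rounded to one decimal place. The names `SERVING`, `ratio_pct`, and `serving_ratios` are hypothetical and do not come from the harness or from `summary.tsv`.

```python
# Throughput rows copied from the Phase 26 serving snapshot table:
# n -> (paged decode_agg, vLLM decode_agg, paged agg, vLLM agg)
SERVING = {
    8: (230.8, 283.2, 170.6, 241.6),
    32: (420.0, 609.0, 254.6, 466.7),
    128: (673.4, 1025.0, 324.0, 656.5),
}


def ratio_pct(paged: float, vllm: float) -> float:
    """Return paged throughput as a percentage of vLLM, rounded to 0.1."""
    return round(100.0 * paged / vllm, 1)


def serving_ratios(rows):
    """Map each n to its (decode ratio %, aggregate ratio %) pair."""
    return {
        n: (ratio_pct(p_dec, v_dec), ratio_pct(p_agg, v_agg))
        for n, (p_dec, v_dec, p_agg, v_agg) in rows.items()
    }


if __name__ == "__main__":
    for n, (decode, agg) in serving_ratios(SERVING).items():
        print(f"n={n}: paged/vLLM decode {decode}%, agg {agg}%")
```

Under that assumption the recomputed values match the recorded ratio columns, so the table is internally consistent.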