docs(paged): record audited current snapshot

Record the Phase 26 current-stack paged-vs-vLLM serving snapshot with hardware classification and compact pre/post inference gates.

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 03:48:27 +00:00
parent a0194125f5
commit ace1ffab28
5 changed files with 189 additions and 4 deletions


@@ -607,13 +607,15 @@ the structural floors, the parity verdict - is
Latest current-stack MoE serving snapshot (`PTOK=128`, `GEN=64`, current clean
DGX mirror `f2521ab12`, artifact
-`/home/mudler/bench/phase20_current_snapshot/20260701_050621`):
+`/home/mudler/bench/phase26_audited_snapshot/20260701_053650`). This run
+includes `hardware.txt` and `gate_summary.tsv`; all pre/post gate rows are
+`ok`:
| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
|---|------------------|-----------------|-------------------|-----------|----------|----------------|
-| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
-| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
-| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |
+| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% |
+| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% |
+| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% |
Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving
snapshots. It targets the clean `~/llama-phase6-source` mirror, checks


@@ -1571,3 +1571,69 @@ Decision:
stayed green before and after the paged-vs-vLLM run.
- Treat `gate_summary.tsv` plus `hardware.txt` as the quick audit surface before
accepting a parity snapshot.
## Phase 26 Audited Current-Stack Serving Snapshot
Phase 26 ran a full current-stack paged-vs-vLLM MoE serving snapshot with the
Phase 24/25 audit files enabled.
Artifact:
- `/home/mudler/bench/phase26_audited_snapshot/20260701_053650`
Current source:
- `/home/mudler/llama-phase6-source`
- `f2521ab12 feat(server): trace speculative batch shapes`
Hardware report:
- `hardware_class=gb10_or_workstation_blackwell`
- `GPU 0: NVIDIA GB10`
- driver `580.159.03`
- compute capability `12.1`
Pre/post gate summary:
| phase | check | status | actual |
|-------|-------|--------|--------|
| pre | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
| pre | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
| pre | `MUL_MAT_ID` | ok | `806/806` |
| post | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
| post | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
| post | `MUL_MAT_ID` | ok | `806/806` |
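
The compact gate check above can be scripted. The following is a minimal sketch, assuming `gate_summary.tsv` is tab-separated with a header row `phase`, `check`, `status`, `actual`; that layout is inferred from the table and has not been confirmed against the file. `failing_gates` and `pre_post_mismatches` are hypothetical helper names, and `SAMPLE_TSV` just restates the rows recorded above.

```python
import csv
import io

# Assumed layout: tab-separated, header row matching the table columns above.
SAMPLE_TSV = "\n".join([
    "phase\tcheck\tstatus\tactual",
    "pre\tMoE md5\tok\t8cb0ce23777bf55f92f63d0292c756b0",
    "pre\tdense md5\tok\t5951a5b4d624ce891e22ab5fca9bc439",
    "pre\tMUL_MAT_ID\tok\t806/806",
    "post\tMoE md5\tok\t8cb0ce23777bf55f92f63d0292c756b0",
    "post\tdense md5\tok\t5951a5b4d624ce891e22ab5fca9bc439",
    "post\tMUL_MAT_ID\tok\t806/806",
]) + "\n"


def read_rows(tsv_text):
    """Parse the TSV text into a list of dicts keyed by the header names."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def failing_gates(tsv_text):
    """Return (phase, check) for every row whose status is not 'ok'."""
    return [
        (row["phase"], row["check"])
        for row in read_rows(tsv_text)
        if row["status"].strip() != "ok"
    ]


def pre_post_mismatches(tsv_text):
    """Return checks whose 'actual' value differs between pre and post."""
    actual = {}
    for row in read_rows(tsv_text):
        actual.setdefault(row["check"], {})[row["phase"]] = row["actual"]
    return sorted(
        check
        for check, by_phase in actual.items()
        if by_phase.get("pre") != by_phase.get("post")
    )


if __name__ == "__main__":
    print("failing gates:", failing_gates(SAMPLE_TSV))
    print("pre/post mismatches:", pre_post_mismatches(SAMPLE_TSV))
```

Checking pre/post equality separately from `status` catches the case where both gates pass individually but the run changed an output hash in between.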
Serving snapshot:
| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
|---|------------------|-----------------|-------------------|-----------|----------|----------------|
| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% |
| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% |
| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% |
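
The ratio columns follow directly from the throughput columns. As a quick sanity check, here is a small sketch that recomputes them from the values in the table; `SNAPSHOT`, `pct`, and `ratios` are illustrative names, not part of the snapshot harness.

```python
# (n, paged decode_agg, vLLM decode_agg, paged agg, vLLM agg), copied from the table above.
SNAPSHOT = [
    (8, 230.8, 283.2, 170.6, 241.6),
    (32, 420.0, 609.0, 254.6, 466.7),
    (128, 673.4, 1025.0, 324.0, 656.5),
]


def pct(paged, vllm):
    """Paged throughput as a percentage of vLLM throughput, one decimal place."""
    return round(100.0 * paged / vllm, 1)


def ratios(rows):
    """Return (n, decode ratio %, aggregate ratio %) for each row."""
    return [
        (n, pct(paged_dec, vllm_dec), pct(paged_agg, vllm_agg))
        for n, paged_dec, vllm_dec, paged_agg, vllm_agg in rows
    ]


if __name__ == "__main__":
    for n, decode_pct, agg_pct in ratios(SNAPSHOT):
        print(f"n={n}: decode {decode_pct}%, agg {agg_pct}%")
```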
Latency/prefill snapshot:
| n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps |
|---|---------------|--------------|------------------|--------------------|------------------|
| 8 | 778.6 | 271.1 | 2.87x | 1679.9 | 4485.6 |
| 32 | 2607.4 | 749.4 | 3.48x | 1698.8 | 5427.8 |
| 128 | 7569.6 | 2534.3 | 2.99x | 1668.7 | 5122.0 |
vLLM startup notes:
- vLLM selected the expected GB10 backend mix: FlashInfer FP8 projection
kernels, Triton/FLA GDN prefill, FlashAttention, and MARLIN NVFP4 MoE.
- Startup was long because the server loaded three checkpoint shards, loaded
cached torch-compile graphs, ran FlashInfer fp8 GEMM autotuning, and captured
CUDA graphs before the API became ready.
Decision:
- The audited current stack is still not at vLLM serving parity on GB10.
- The Phase 20 conclusion is reproduced with stronger audit artifacts:
`hardware.txt`, `gate_summary.tsv`, pre/post full gates, and same-session
paged/vLLM ratios.
- Current paged/vLLM decode ratios are about `81.5%` at n8, `69.0%` at n32,
and `65.7%` at n128; e2e aggregate ratios are about `70.6%`, `54.6%`,
and `49.4%`.


@@ -342,6 +342,28 @@ backfilled on the Phase 20 artifact at
it records pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806` as `ok`.
Phase 26 ran the full audited current-stack snapshot with `hardware.txt`,
pre/post gates, same-session paged and vLLM serving runs, `summary.tsv`, and
`gate_summary.tsv`. Artifact:
`/home/mudler/bench/phase26_audited_snapshot/20260701_053650`. Hardware was
recorded as `hardware_class=gb10_or_workstation_blackwell`, GPU `NVIDIA GB10`,
driver `580.159.03`, compute capability `12.1`. Every compact gate row was
`ok`: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`, both before and
after the serving run.
Audited current MoE serving snapshot (`PTOK=128`, `GEN=64`):
| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
|---|------------------|-----------------|-------------------|-----------|----------|----------------|
| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% |
| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% |
| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% |
Use Phase 26 as the current audit-grade GB10 snapshot. It keeps the Phase 20
verdict intact, but the artifact is more useful for future regressions because
it carries hardware classification and compact pre/post inference gates.
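
One way to use the snapshot for regression tracking is to compare paged/vLLM ratios between two runs and flag drops beyond a tolerance. The sketch below is illustrative only: the Phase 20 figures are the rows this commit replaces in the first hunk, the Phase 26 figures come from the table above, and the function names and tolerance values are hypothetical rather than project policy.

```python
# n -> (paged decode_agg, vLLM decode_agg, paged agg, vLLM agg)
PHASE20 = {
    8: (220.8, 290.5, 164.8, 245.5),
    32: (411.1, 594.7, 252.1, 456.0),
    128: (670.0, 1022.7, 322.4, 662.4),
}
PHASE26 = {
    8: (230.8, 283.2, 170.6, 241.6),
    32: (420.0, 609.0, 254.6, 466.7),
    128: (673.4, 1025.0, 324.0, 656.5),
}


def ratio_pcts(row):
    """Unrounded paged/vLLM percentages for decode and aggregate throughput."""
    paged_dec, vllm_dec, paged_agg, vllm_agg = row
    return {
        "decode": 100.0 * paged_dec / vllm_dec,
        "agg": 100.0 * paged_agg / vllm_agg,
    }


def regressions(old, new, tolerance_points):
    """Return (n, metric) pairs whose ratio dropped by more than the tolerance.

    The tolerance is in percentage points; the comparison uses ratios
    rather than raw throughput so that vLLM drift between sessions cancels.
    """
    flagged = []
    for n in sorted(old.keys() & new.keys()):
        before, after = ratio_pcts(old[n]), ratio_pcts(new[n])
        for metric in ("decode", "agg"):
            if before[metric] - after[metric] > tolerance_points:
                flagged.append((n, metric))
    return flagged


if __name__ == "__main__":
    print("dropped by more than 1.0 point:", regressions(PHASE20, PHASE26, 1.0))
```

Comparing same-session ratios keeps the check robust to both engines shifting together, which raw tokens-per-second comparisons would misreport as a regression.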
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -407,6 +429,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase21_harness_dryrun/20260701_051757` - current snapshot harness dry-run artifact.
- `~/bench/phase24_hardware_report_dryrun/20260701_052741` - current snapshot harness dry run proving `hardware.txt` captures the DGX as `hardware_class=gb10_or_workstation_blackwell`.
- `~/bench/phase25_gate_summary_dryrun/20260701_053353` - dry run after adding `gate_summary.tsv` support; normal dry-run still writes `hardware.txt` and does not emit a gate summary before gates exist.
- `~/bench/phase26_audited_snapshot/20260701_053650` - current audit-grade full paged-vs-vLLM MoE serving snapshot with `hardware.txt`, pre/post gates, `summary.tsv`, and `gate_summary.tsv`.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.


@@ -691,6 +691,31 @@ It records pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
Use `hardware.txt` plus `gate_summary.tsv` as the quick audit surface before
accepting any new parity snapshot.
### Phase 26 audited current-stack snapshot
Phase 26 ran the full current-stack paged-vs-vLLM MoE serving snapshot with the
Phase 24/25 audit files enabled:
`/home/mudler/bench/phase26_audited_snapshot/20260701_053650`.
The artifact records `hardware_class=gb10_or_workstation_blackwell` on GPU
`NVIDIA GB10` with driver `580.159.03` and compute capability `12.1`.
`gate_summary.tsv` reports every pre/post gate as `ok`: MoE md5
`8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
Audited MoE serving result (`PTOK=128`, `GEN=64`):
| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
|---|------------------|-----------------|-------------------|-----------|----------|----------------|
| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% |
| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% |
| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% |
Decision: the latest audited clean-stack run still does not reach vLLM serving
parity on GB10. Treat Phase 26 as the current benchmark baseline before funding
new kernel work, and keep md5/op gates as the first check when changing the
patch stack.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update