mirror of https://github.com/mudler/LocalAI.git (synced 2026-07-03 12:57:02 -04:00)
docs(paged): record audited current snapshot
Record the Phase 26 current-stack paged-vs-vLLM serving snapshot with hardware classification and compact pre/post inference gates.

Assisted-by: Codex:gpt-5
@@ -607,13 +607,15 @@ the structural floors, the parity verdict - is

 Latest current-stack MoE serving snapshot (`PTOK=128`, `GEN=64`, current clean
 DGX mirror `f2521ab12`, artifact
-`/home/mudler/bench/phase20_current_snapshot/20260701_050621`):
+`/home/mudler/bench/phase26_audited_snapshot/20260701_053650`). This run
+includes `hardware.txt` and `gate_summary.tsv`; all pre/post gate rows are
+`ok`:

 | n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
 |---|------------------|-----------------|-------------------|-----------|----------|----------------|
-| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
-| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
-| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |
+| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% |
+| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% |
+| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% |

 Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving
 snapshots. It targets the clean `~/llama-phase6-source` mirror, checks
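Review note: the two ratio columns in the updated table can be recomputed from the throughput columns. The sketch below is not part of the commit. `ratio_pct` and `ratios` are hypothetical helper names, and the numbers are copied from the Phase 26 rows in this hunk.

```python
# Recompute the paged/vLLM percentage columns from the Phase 26 rows.
# ratio_pct and ratios are hypothetical helpers, not part of the harness.

def ratio_pct(paged: float, vllm: float) -> float:
    """Paged throughput as a percentage of vLLM throughput, one decimal."""
    return round(100.0 * paged / vllm, 1)

# n: (paged decode_agg, vLLM decode_agg, paged agg, vLLM agg)
PHASE26 = {
    8: (230.8, 283.2, 170.6, 241.6),
    32: (420.0, 609.0, 254.6, 466.7),
    128: (673.4, 1025.0, 324.0, 656.5),
}

def ratios(rows):
    """Map n -> (decode ratio %, end-to-end aggregate ratio %)."""
    return {
        n: (ratio_pct(p_dec, v_dec), ratio_pct(p_agg, v_agg))
        for n, (p_dec, v_dec, p_agg, v_agg) in rows.items()
    }

if __name__ == "__main__":
    for n, (decode, agg) in ratios(PHASE26).items():
        print(f"n={n}: decode {decode}%  agg {agg}%")
```

The recomputed values agree with the percentages recorded in the table, so the ratio columns are consistent with the raw throughput columns.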
@@ -1571,3 +1571,69 @@ Decision:
 stayed green before and after the paged-vs-vLLM run.
 - Treat `gate_summary.tsv` plus `hardware.txt` as the quick audit surface before
 accepting a parity snapshot.
+
+## Phase 26 Audited Current-Stack Serving Snapshot
+
+Phase 26 ran a full current-stack paged-vs-vLLM MoE serving snapshot with the
+Phase 24/25 audit files enabled.
+
+Artifact:
+
+- `/home/mudler/bench/phase26_audited_snapshot/20260701_053650`
+
+Current source:
+
+- `/home/mudler/llama-phase6-source`
+- `f2521ab12 feat(server): trace speculative batch shapes`
+
+Hardware report:
+
+- `hardware_class=gb10_or_workstation_blackwell`
+- `GPU 0: NVIDIA GB10`
+- driver `580.159.03`
+- compute capability `12.1`
+
+Pre/post gate summary:
+
+| phase | check | status | actual |
+|-------|-------|--------|--------|
+| pre | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| pre | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| pre | `MUL_MAT_ID` | ok | `806/806` |
+| post | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| post | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| post | `MUL_MAT_ID` | ok | `806/806` |
+
+Serving snapshot:
+
+| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
+|---|------------------|-----------------|-------------------|-----------|----------|----------------|
+| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% |
+| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% |
+| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% |
+
+Latency/prefill snapshot:
+
+| n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps |
+|---|---------------|--------------|------------------|--------------------|------------------|
+| 8 | 778.6 | 271.1 | 2.87x | 1679.9 | 4485.6 |
+| 32 | 2607.4 | 749.4 | 3.48x | 1698.8 | 5427.8 |
+| 128 | 7569.6 | 2534.3 | 2.99x | 1668.7 | 5122.0 |
+
+vLLM startup notes:
+
+- vLLM selected the expected GB10 backend mix: FlashInfer FP8 projection
+kernels, Triton/FLA GDN prefill, FlashAttention, and MARLIN NVFP4 MoE.
+- Startup was long because the server loaded three checkpoint shards, loaded
+cached torch-compile graphs, ran FlashInfer fp8 GEMM autotuning, and captured
+CUDA graphs before the API became ready.
+
+Decision:
+
+- The audited current stack still is not at vLLM serving parity on GB10.
+- The Phase 20 conclusion is reproduced with stronger audit artifacts:
+`hardware.txt`, `gate_summary.tsv`, pre/post full gates, and same-session
+paged/vLLM ratios.
+- Current paged/vLLM decode ratios remain about `81.5%` at n8, `69.0%` at n32,
+and `65.7%` at n128; e2e aggregate ratios remain about `70.6%`, `54.6%`,
+and `49.4%`.
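Review note: unlike the throughput ratios, the `paged/vLLM TTFT` column is a slowdown factor, so larger is worse. The sketch below recomputes it from the latency table in this hunk and also derives a prefill throughput percentage, which the document does not tabulate. `ttft_slowdown` and `prefill_pct` are hypothetical helper names, not harness functions.

```python
# Recompute the TTFT slowdown column from the latency/prefill table above.
# Helper names are hypothetical; the numbers are copied from the table.

def ttft_slowdown(paged_ms: float, vllm_ms: float) -> float:
    """How many times longer paged takes to first token than vLLM."""
    return round(paged_ms / vllm_ms, 2)

def prefill_pct(paged_tps: float, vllm_tps: float) -> float:
    """Paged prefill throughput as a percentage of vLLM (derived, not in the doc)."""
    return round(100.0 * paged_tps / vllm_tps, 1)

# n: (paged TTFT ms, vLLM TTFT ms, paged prefill_tps, vLLM prefill_tps)
LATENCY = {
    8: (778.6, 271.1, 1679.9, 4485.6),
    32: (2607.4, 749.4, 1698.8, 5427.8),
    128: (7569.6, 2534.3, 1668.7, 5122.0),
}

if __name__ == "__main__":
    for n, (p_ttft, v_ttft, p_pre, v_pre) in LATENCY.items():
        print(
            f"n={n}: TTFT {ttft_slowdown(p_ttft, v_ttft)}x slower, "
            f"prefill at {prefill_pct(p_pre, v_pre)}% of vLLM"
        )
```

Because TTFT here is dominated by prefill, the two derived figures move together: the paged prefill rate stays near 1.7k tokens/s at every concurrency while vLLM's is roughly three times higher.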
@@ -342,6 +342,28 @@ backfilled on the Phase 20 artifact at
 it records pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806` as `ok`.

+Phase 26 ran the full audited current-stack snapshot with `hardware.txt`,
+pre/post gates, same-session paged and vLLM serving runs, `summary.tsv`, and
+`gate_summary.tsv`. Artifact:
+`/home/mudler/bench/phase26_audited_snapshot/20260701_053650`. Hardware was
+recorded as `hardware_class=gb10_or_workstation_blackwell`, GPU `NVIDIA GB10`,
+driver `580.159.03`, compute capability `12.1`. Every compact gate row was
+`ok`: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`, both before and
+after the serving run.
+
+Audited current MoE serving snapshot (`PTOK=128`, `GEN=64`):
+
+| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
+|---|------------------|-----------------|-------------------|-----------|----------|----------------|
+| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% |
+| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% |
+| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% |
+
+Use Phase 26 as the current audit-grade GB10 snapshot. It keeps the Phase 20
+verdict intact, but the artifact is more useful for future regressions because
+it carries hardware classification and compact pre/post inference gates.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -407,6 +429,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase21_harness_dryrun/20260701_051757` - current snapshot harness dry-run artifact.
 - `~/bench/phase24_hardware_report_dryrun/20260701_052741` - current snapshot harness dry run proving `hardware.txt` captures the DGX as `hardware_class=gb10_or_workstation_blackwell`.
 - `~/bench/phase25_gate_summary_dryrun/20260701_053353` - dry run after adding `gate_summary.tsv` support; normal dry-run still writes `hardware.txt` and does not emit a gate summary before gates exist.
+- `~/bench/phase26_audited_snapshot/20260701_053650` - current audit-grade full paged-vs-vLLM MoE serving snapshot with `hardware.txt`, pre/post gates, `summary.tsv`, and `gate_summary.tsv`.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
@@ -691,6 +691,31 @@ It records pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
 Use `hardware.txt` plus `gate_summary.tsv` as the quick audit surface before
 accepting any new parity snapshot.

+### Phase 26 audited current-stack snapshot
+
+Phase 26 ran the full current-stack paged-vs-vLLM MoE serving snapshot with the
+Phase 24/25 audit files enabled:
+`/home/mudler/bench/phase26_audited_snapshot/20260701_053650`.
+
+The artifact records `hardware_class=gb10_or_workstation_blackwell` on GPU
+`NVIDIA GB10` with driver `580.159.03` and compute capability `12.1`.
+`gate_summary.tsv` reports every pre/post gate as `ok`: MoE md5
+`8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+
+Audited MoE serving result (`PTOK=128`, `GEN=64`):
+
+| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
+|---|------------------|-----------------|-------------------|-----------|----------|----------------|
+| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% |
+| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% |
+| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% |
+
+Decision: the latest audited clean-stack run still does not reach vLLM serving
+parity on GB10. Treat Phase 26 as the current benchmark baseline before funding
+new kernel work, and keep md5/op gates as the first check when changing the
+patch stack.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update
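Review note: the "md5/op gates as the first check" rule could be automated along the following lines. This is a minimal sketch under a stated assumption: it supposes `gate_summary.tsv` is tab-separated with a header row `phase`, `check`, `status`, `actual`, mirroring the Markdown gate table in these docs. The real file layout is not shown in this commit, and `audit_gates` is a hypothetical name.

```python
import csv
import io

def audit_gates(tsv_text: str) -> list[str]:
    """Return a list of problems; an empty list means the quick audit passes.

    ASSUMPTION: columns are phase, check, status, actual (tab-separated,
    with a header row). The actual gate_summary.tsv layout may differ.
    """
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    problems: list[str] = []
    by_check: dict[str, dict[str, str]] = {}
    for row in rows:
        # Any non-ok status is a failure on its own.
        if row["status"] != "ok":
            problems.append(f"{row['phase']} {row['check']}: status={row['status']}")
        by_check.setdefault(row["check"], {})[row["phase"]] = row["actual"]
    for check, phases in by_check.items():
        # Every gate must be measured both before and after the serving run,
        # and the serving run must not change the measured value.
        if set(phases) != {"pre", "post"}:
            problems.append(f"{check}: missing pre or post row")
        elif phases["pre"] != phases["post"]:
            problems.append(f"{check}: pre {phases['pre']} != post {phases['post']}")
    return problems

# Sample input built from the gate values recorded for Phase 26.
SAMPLE = "\n".join([
    "phase\tcheck\tstatus\tactual",
    "pre\tMoE md5\tok\t8cb0ce23777bf55f92f63d0292c756b0",
    "pre\tdense md5\tok\t5951a5b4d624ce891e22ab5fca9bc439",
    "pre\tMUL_MAT_ID\tok\t806/806",
    "post\tMoE md5\tok\t8cb0ce23777bf55f92f63d0292c756b0",
    "post\tdense md5\tok\t5951a5b4d624ce891e22ab5fca9bc439",
    "post\tMUL_MAT_ID\tok\t806/806",
])

if __name__ == "__main__":
    print(audit_gates(SAMPLE) or "all gates ok")
```

Comparing pre against post in addition to checking `status` catches the case where a gate still reports `ok` against a stale expectation but the output changed during the serving run.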
@@ -0,0 +1,69 @@
+# Audited Current Stack Snapshot Phase 26 Plan
+
+**Date:** 2026-07-01
+**Phase:** 26
+**Goal:** run the reusable current-stack paged-vs-vLLM serving harness end to
+end on the DGX, with hardware and compact inference gates attached to the
+artifact, so throughput comparisons cannot hide an inference regression.
+
+## Context
+
+Phase 20 refreshed the current-stack serving numbers. Phase 24 added
+`hardware.txt`; Phase 25 added `gate_summary.tsv`. Phase 26 is the first full
+serving run that uses both audit surfaces in one artifact.
+
+## Checklist
+
+- [x] **Step 1: Preflight DGX**
+  - Verified no running docker containers before launch.
+  - Verified no `local-ai-worker` container before launch.
+  - Verified no active GPU compute processes before launch.
+  - Used the owner-file GPU lock protocol.
+
+- [x] **Step 2: Launch full current-stack snapshot**
+  - Ran `paged-current-serving-snapshot.sh` from the LocalAI worktree copy.
+  - Target source: `dgx:~/llama-phase6-source`.
+  - Source HEAD: `f2521ab12 feat(server): trace speculative batch shapes`.
+  - Artifact: `/home/mudler/bench/phase26_audited_snapshot/20260701_053650`.
+
+- [x] **Step 3: Preserve hardware evidence**
+  - `hardware.txt` recorded `hardware_class=gb10_or_workstation_blackwell`.
+  - `hardware.txt` recorded `GPU 0: NVIDIA GB10`.
+  - Driver: `580.159.03`.
+  - Compute capability: `12.1`.
+
+- [x] **Step 4: Gate inferencing before and after serving**
+  - Pre MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Pre dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+  - Pre `MUL_MAT_ID`: `806/806`.
+  - Post MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Post dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+  - Post `MUL_MAT_ID`: `806/806`.
+  - `gate_summary.tsv` records all rows as `ok`.
+
+- [x] **Step 5: Capture same-session serving numbers**
+  - Paged and vLLM were run in the same artifact with the same h2h client.
+  - `summary.tsv` records the aggregate, decode, per-sequence, TTFT, and prefill
+    rows plus ratios.
+
+- [x] **Step 6: Record results in project docs**
+  - Updated `README.md` with Phase 26 as the latest current-stack snapshot.
+  - Updated `GB10_PARITY_PHASE0_RESULTS.md` with the full audited result.
+  - Updated `PARITY_HANDOFF.md` with the operational handoff result and artifact
+    index.
+  - Updated `VLLM_PARITY_LEVER_MAP.md` with the current benchmark baseline.
+
+## Result
+
+Phase 26 confirms that the current clean stack still does not reach vLLM serving
+parity on GB10, while the inference gates remain green before and after the
+serving benchmark.
+
+| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
+|---|------------------|-----------------|-------------------|-----------|----------|----------------|
+| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% |
+| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% |
+| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% |
+
+Treat `/home/mudler/bench/phase26_audited_snapshot/20260701_053650` as the
+current audit-grade GB10 baseline.
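Review note: the Step 1 preflight checks in the plan could be scripted roughly as follows. This is a sketch, not what `paged-current-serving-snapshot.sh` actually does: `preflight` and `run` are hypothetical names, whether the original check also looked at stopped containers is an assumption, and the owner-file GPU lock protocol is deliberately left out because the plan does not describe it. The `docker ps` and `nvidia-smi` options used are standard CLI flags.

```python
import subprocess
from typing import Callable

def run(cmd: list[str]) -> str:
    """Run a command and return its stdout; raises if the command fails."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def preflight(run_cmd: Callable[[list[str]], str] = run) -> list[str]:
    """Return a list of blockers; an empty list means the box looks idle.

    run_cmd is injectable so the logic can be exercised without docker or a GPU.
    """
    blockers: list[str] = []

    # 1. No running docker containers at all.
    running = run_cmd(["docker", "ps", "--format", "{{.Names}}"]).split()
    if running:
        blockers.append("running containers: " + ", ".join(running))

    # 2. No local-ai-worker container, including stopped ones (-a).
    #    ASSUMPTION: checking stopped containers too is our choice here.
    workers = run_cmd(
        ["docker", "ps", "-a", "--filter", "name=local-ai-worker",
         "--format", "{{.Names}}"]
    ).split()
    if workers:
        blockers.append("local-ai-worker present: " + ", ".join(workers))

    # 3. No active GPU compute processes.
    gpu_apps = run_cmd(
        ["nvidia-smi", "--query-compute-apps=pid,process_name",
         "--format=csv,noheader"]
    ).strip()
    if gpu_apps:
        blockers.append("GPU compute processes: " + gpu_apps.replace("\n", "; "))

    return blockers
```

Calling `preflight()` with no arguments runs the real commands, while passing a stub as `run_cmd` lets the same logic be checked on a machine without docker or an NVIDIA driver.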