docs(paged): record dense serving snapshot

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 08:20:26 +00:00
parent 440129c98e
commit 96825a224e
4 changed files with 93 additions and 5 deletions


@@ -2668,6 +2668,60 @@ Decision:
- Treat this artifact as a harness failure investigation, not a benchmark.
- Retry Phase47 only after the Phase48 readiness/cleanup hardening is present.
## Phase 47 Dense Serving Snapshot Retry
After Phase48 hardening, Phase47 was retried and completed successfully.
Artifact:
- `/home/mudler/bench/phase47_dense_serving_retry/20260701_100811`
Run shape:
- `MODEL=$HOME/bench/q36-27b-nvfp4.gguf`
- `VLLM_MODEL=$HOME/bench/q36-27b-nvfp4-vllm`
- `SERVED_MODEL_NAME=dense-q36`
- `NPL="1 8 32 128"`, `PARALLEL=128`, `CTX=131072`, `PTOK=128`, `GEN=64`
- `OPS=MUL_MAT,MUL_MAT_ID`, `VLLM_READY_ATTEMPTS=700`
Pre/post gates:
| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
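
The gate comparison above can be checked mechanically. The following is a minimal sketch, assuming that "green" means the md5 values are unchanged between pre and post and that every op counter reports all cases passing. `Gate`, `all_passed`, and `gates_green` are hypothetical names for illustration and are not part of the harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """One row of the pre/post gate table (hypothetical structure)."""
    moe_md5: str
    dense_md5: str
    mul_mat: str     # "passed/total", e.g. "1146/1146"
    mul_mat_id: str  # "passed/total", e.g. "806/806"


def all_passed(counter: str) -> bool:
    """Return True when a 'passed/total' counter reports every case passing."""
    passed, total = (int(part) for part in counter.split("/"))
    return total > 0 and passed == total


def gates_green(pre: Gate, post: Gate) -> bool:
    """Assumed definition of green: md5s unchanged and all op counters full."""
    unchanged = pre.moe_md5 == post.moe_md5 and pre.dense_md5 == post.dense_md5
    ops_ok = all(
        all_passed(counter)
        for gate in (pre, post)
        for counter in (gate.mul_mat, gate.mul_mat_id)
    )
    return unchanged and ops_ok


# Values copied from the gate table above.
pre = Gate(
    "8cb0ce23777bf55f92f63d0292c756b0",
    "5951a5b4d624ce891e22ab5fca9bc439",
    "1146/1146",
    "806/806",
)
post = Gate(
    "8cb0ce23777bf55f92f63d0292c756b0",
    "5951a5b4d624ce891e22ab5fca9bc439",
    "1146/1146",
    "806/806",
)
print("gates green:", gates_green(pre, post))
```
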
Results:
| arm | n | agg t/s | decode agg t/s | decode per-seq t/s | prefill t/s | TTFT ms |
|-----|---|---------|-----------------|---------------------|-------------|---------|
| paged | 1 | `12.5` | `13.3` | `13.11` | `515.1` | `312.5` |
| vLLM | 1 | `9.6` | `9.9` | `9.72` | `983.6` | `166.7` |
| paged | 8 | `61.8` | `85.2` | `10.39` | `579.5` | `2201.4` |
| vLLM | 8 | `67.6` | `73.7` | `9.04` | `2147.7` | `544.0` |
| paged | 32 | `105.9` | `198.7` | `5.44` | `595.8` | `7442.7` |
| vLLM | 32 | `171.7` | `219.9` | `6.49` | `2094.4` | `2041.9` |
| paged | 128 | `139.6` | `360.8` | `1.86` | `608.1` | `21177.2` |
| vLLM | 128 | `275.3` | `456.0` | `2.89` | `1889.6` | `6615.7` |
Ratios:
| n | paged decode / vLLM | paged per-seq / vLLM | paged agg / vLLM | paged TTFT / vLLM |
|---|---------------------|----------------------|------------------|-------------------|
| 1 | `1.3434` | `1.3488` | `1.3021` | `1.8746` |
| 8 | `1.1560` | `1.1493` | `0.9142` | `4.0467` |
| 32 | `0.9036` | `0.8382` | `0.6168` | `3.6450` |
| 128 | `0.7912` | `0.6436` | `0.5071` | `3.2011` |
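
Each ratio is the paged value divided by the vLLM value for the same `n`, so a TTFT ratio above 1 means paged is slower. The following sketch recomputes the table from the rounded Results rows. It assumes the rounded table values are an adequate stand-in for the raw `summary.tsv` numbers, and the `PAGED`, `VLLM`, and `ratios` names are illustrative only.

```python
# Rows copied from the Results table.
# n -> (decode agg t/s, decode per-seq t/s, agg t/s, TTFT ms)
PAGED = {
    1: (13.3, 13.11, 12.5, 312.5),
    8: (85.2, 10.39, 61.8, 2201.4),
    32: (198.7, 5.44, 105.9, 7442.7),
    128: (360.8, 1.86, 139.6, 21177.2),
}
VLLM = {
    1: (9.9, 9.72, 9.6, 166.7),
    8: (73.7, 9.04, 67.6, 544.0),
    32: (219.9, 6.49, 171.7, 2041.9),
    128: (456.0, 2.89, 275.3, 6615.7),
}


def ratios(n: int) -> tuple:
    """Return paged / vLLM for each metric at concurrency n, rounded to 4 places."""
    return tuple(round(p / v, 4) for p, v in zip(PAGED[n], VLLM[n]))


# Print rows in the same column order as the Ratios table.
for n in sorted(PAGED):
    print(n, *(f"{r:.4f}" for r in ratios(n)))
```
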
Decision:
- Dense decode is ahead of vLLM at low concurrency (`1.3434x` at `n=1`,
  `1.1560x` at `n=8`) but falls behind at `n=32` and `n=128` (`0.9036x`,
  `0.7912x`); this mirrors the broader conclusion that low-N decode can be
  strong while prefill/TTFT and higher-concurrency serving remain gaps.
- Dense paged TTFT is `1.87x` to `4.05x` vLLM's at every tested concurrency
  point, so dense serving does not change the GB10 conclusion or reopen closed
  shortcut work.
## Phase 48 Serving Harness Readiness Hardening
Phase 48 fixes the harness behavior exposed by the failed dense snapshot


@@ -598,6 +598,14 @@ with `curl --max-time 2`, and uses bounded server cleanup that escalates from
`/home/mudler/bench/phase48_readiness_harness_dryrun/20260701_100533`, with
`VLLM_READY_ATTEMPTS=700` printed and clean DGX preflight.
Phase 47 retry completed after Phase48. Artifact:
`/home/mudler/bench/phase47_dense_serving_retry/20260701_100811`. Pre/post
gates were green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
`806/806`. Dense paged decode beats vLLM at low concurrency (`1.3434x` at `n=1`,
`1.1560x` at `n=8`) but falls behind at `n=32/128` (`0.9036x`, `0.7912x`), and
paged TTFT remains `1.87x` to `4.05x` that of vLLM (higher is worse). This does not change the GB10 conclusion.
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -688,6 +696,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase47_dense_serving_dryrun/20260701_095141` - dense serving dry-run with `SERVED_MODEL_NAME=dense-q36`.
- `~/bench/phase47_dense_serving/20260701_095151` - incomplete dense serving attempt; pre-gates and paged arm completed, vLLM did not produce result JSONs under the old readiness budget.
- `~/bench/phase48_readiness_harness_dryrun/20260701_100533` - harness dry-run proving configurable readiness budgets and clean preflight before retrying dense serving.
- `~/bench/phase47_dense_serving_retry/20260701_100811` - completed dense serving snapshot after Phase48; pre/post md5 and op gates green; paged low-N decode ahead, high-N aggregate and TTFT behind.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.


@@ -1165,6 +1165,27 @@ DGX dry-run artifact:
run printed `VLLM_READY_ATTEMPTS=700` with clean preflight. Retry dense serving
snapshots with this hardening before interpreting dense paged-vs-vLLM ratios.
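
The readiness budget described above amounts to a bounded polling loop. The following is a minimal sketch of that idea only: the real harness is a shell script whose internals are not shown here, and `wait_until_ready`, `probe`, `delay_s`, and the injected `sleep` are hypothetical names. In the harness the probe would correspond to the `curl --max-time 2` check and the budget to `VLLM_READY_ATTEMPTS`.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll probe() at most `attempts` times.

    Returns the 1-based attempt number that succeeded and raises
    TimeoutError once the budget is exhausted.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay_s)  # no sleep after the final failed attempt
    raise TimeoutError(f"server not ready after {attempts} attempts")


# Simulated server that becomes ready on the third probe; sleeping is
# stubbed out so the demo returns immediately.
answers = iter([False, False, True])
used = wait_until_ready(lambda: next(answers), attempts=700, sleep=lambda s: None)
print("ready after", used, "attempts")
```

A larger attempt budget only matters when startup is slow; a server that is ready early returns on the first successful probe.
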
### Phase 47 dense serving snapshot retry
After Phase48, the dense snapshot completed at
`/home/mudler/bench/phase47_dense_serving_retry/20260701_100811` with pre/post
gates green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
`806/806`.
Dense paged-vs-vLLM ratios:
| n | paged decode / vLLM | paged per-seq / vLLM | paged agg / vLLM | paged TTFT / vLLM |
|---|---------------------|----------------------|------------------|-------------------|
| 1 | `1.3434` | `1.3488` | `1.3021` | `1.8746` |
| 8 | `1.1560` | `1.1493` | `0.9142` | `4.0467` |
| 32 | `0.9036` | `0.8382` | `0.6168` | `3.6450` |
| 128 | `0.7912` | `0.6436` | `0.5071` | `3.2011` |
Decision: dense low-N decode remains a real paged strength (`1.3434x` at `n=1`,
`1.1560x` at `n=8`), but dense serving still does not close GB10 parity: paged
TTFT is `1.87x` to `4.05x` vLLM's, and aggregate throughput drops to `0.6168x`
at `n=32` and `0.5071x` at `n=128`.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update


@@ -28,7 +28,7 @@ Expected: exit `0`, docker/local-ai-worker/GPU compute all zero, dense model pat
**Files:**
- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
- [ ] **Step 1: Run full dense snapshot after Phase48 hardening**
- [x] **Step 1: Run full dense snapshot after Phase48 hardening**
```bash
ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase47_dense_serving/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin MODEL=$HOME/bench/q36-27b-nvfp4.gguf VLLM_MODEL=$HOME/bench/q36-27b-nvfp4-vllm SERVED_MODEL_NAME=dense-q36 ART=$ART NPL="1 8 32 128" PARALLEL=128 CTX=131072 PTOK=128 GEN=64 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
@@ -41,6 +41,10 @@ First attempt status: incomplete at
paged arm completed, but vLLM startup exceeded the old fixed readiness budget
and produced no vLLM result JSONs. Retry only after Phase48 readiness hardening.
Retry status: completed at
`/home/mudler/bench/phase47_dense_serving_retry/20260701_100811` after Phase48
with `VLLM_READY_ATTEMPTS=700`.
### Task 3: Record dense snapshot result
**Files:**
@@ -49,11 +53,11 @@ and produced no vLLM result JSONs. Retry only after Phase48 readiness hardening.
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
- Modify: `docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md`
- [ ] **Step 1: Summarize artifact outputs**
- [x] **Step 1: Summarize artifact outputs**
Record the dry-run artifact, full snapshot artifact, pre/post md5/op gate status, and the ratio rows from `summary.tsv`.
- [ ] **Step 2: Mark completed plan items**
- [x] **Step 2: Mark completed plan items**
Mark this plan's checkboxes complete only after the corresponding command or docs update has happened.
@@ -62,7 +66,7 @@ Mark this plan's checkboxes complete only after the corresponding command or doc
**Files:**
- Commit Phase47 docs and plan changes.
- [ ] **Step 1: Run final checks**
- [x] **Step 1: Run final checks**
```bash
git diff --check
@@ -71,7 +75,7 @@ git status --short
Expected: no whitespace errors; only intended docs/plan changes plus the pre-existing untracked `.claude/`.
- [ ] **Step 2: Commit**
- [x] **Step 2: Commit**
```bash
git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \