diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index ab2aad514..10ee7416d 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2668,6 +2668,60 @@ Decision:
 - Treat this artifact as a harness failure investigation, not a benchmark.
 - Retry Phase47 only after the Phase48 readiness/cleanup hardening is present.
 
+## Phase 47 Dense Serving Snapshot Retry
+
+After Phase48 hardening, Phase47 was retried and completed successfully.
+
+Artifact:
+
+- `/home/mudler/bench/phase47_dense_serving_retry/20260701_100811`
+
+Run shape:
+
+- `MODEL=$HOME/bench/q36-27b-nvfp4.gguf`
+- `VLLM_MODEL=$HOME/bench/q36-27b-nvfp4-vllm`
+- `SERVED_MODEL_NAME=dense-q36`
+- `NPL="1 8 32 128"`, `PARALLEL=128`, `CTX=131072`, `PTOK=128`, `GEN=64`
+- `OPS=MUL_MAT,MUL_MAT_ID`, `VLLM_READY_ATTEMPTS=700`
+
+Pre/post gates:
+
+| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|-------|---------|-----------|-----------|--------------|
+| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+Results:
+
+| arm | n | agg t/s | decode agg t/s | decode per-seq t/s | prefill t/s | TTFT ms |
+|-----|---|---------|-----------------|---------------------|-------------|---------|
+| paged | 1 | `12.5` | `13.3` | `13.11` | `515.1` | `312.5` |
+| vLLM | 1 | `9.6` | `9.9` | `9.72` | `983.6` | `166.7` |
+| paged | 8 | `61.8` | `85.2` | `10.39` | `579.5` | `2201.4` |
+| vLLM | 8 | `67.6` | `73.7` | `9.04` | `2147.7` | `544.0` |
+| paged | 32 | `105.9` | `198.7` | `5.44` | `595.8` | `7442.7` |
+| vLLM | 32 | `171.7` | `219.9` | `6.49` | `2094.4` | `2041.9` |
+| paged | 128 | `139.6` | `360.8` | `1.86` | `608.1` | `21177.2` |
+| vLLM | 128 | `275.3` | `456.0` | `2.89` | `1889.6` | `6615.7` |
+
+Ratios:
+
+| n | paged decode / vLLM | paged per-seq / vLLM | paged agg / vLLM | paged TTFT / vLLM |
+|---|---------------------|----------------------|------------------|-------------------|
+| 1 | `1.3434` | `1.3488` | `1.3021` | `1.8746` |
+| 8 | `1.1560` | `1.1493` | `0.9142` | `4.0467` |
+| 32 | `0.9036` | `0.8382` | `0.6168` | `3.6450` |
+| 128 | `0.7912` | `0.6436` | `0.5071` | `3.2011` |
+
+Decision:
+
+- Dense decode is ahead of vLLM at low concurrency (`n=1/8`) but falls behind
+  at `n=32/128`; this mirrors the broader conclusion that low-N decode can be
+  strong while prefill/TTFT and higher-concurrency serving remain gaps.
+- Dense TTFT remains much worse than vLLM at all tested concurrency points, so
+  dense serving does not change the GB10 conclusion or reopen closed shortcut
+  work.
+
 ## Phase 48 Serving Harness Readiness Hardening
 
 Phase 48 fixes the harness behavior exposed by the failed dense snapshot
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 85d7e63dc..09ec2a4f6 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -598,6 +598,14 @@ with `curl --max-time 2`, and uses bounded server cleanup that escalates from
 `/home/mudler/bench/phase48_readiness_harness_dryrun/20260701_100533`, with
 `VLLM_READY_ATTEMPTS=700` printed and clean DGX preflight.
 
+Phase 47 retry completed after Phase48. Artifact:
+`/home/mudler/bench/phase47_dense_serving_retry/20260701_100811`. Pre/post
+gates were green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
+`806/806`. Dense paged decode beats vLLM at low concurrency (`1.3434x` at `n=1`,
+`1.1560x` at `n=8`) but falls behind at `n=32/128` (`0.9036x`, `0.7912x`), and
+TTFT remains `1.87x` to `4.05x` vLLM. This does not change the GB10 conclusion.
+
 ---
 
 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -688,6 +696,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase47_dense_serving_dryrun/20260701_095141` - dense serving dry-run with `SERVED_MODEL_NAME=dense-q36`.
 - `~/bench/phase47_dense_serving/20260701_095151` - incomplete dense serving attempt; pre-gates and paged arm completed, vLLM did not produce result JSONs under the old readiness budget.
 - `~/bench/phase48_readiness_harness_dryrun/20260701_100533` - harness dry-run proving configurable readiness budgets and clean preflight before retrying dense serving.
+- `~/bench/phase47_dense_serving_retry/20260701_100811` - completed dense serving snapshot after Phase48; pre/post md5 and op gates green; paged low-N decode ahead, high-N aggregate and TTFT behind.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 52c071f63..7b95ee648 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1165,6 +1165,27 @@ DGX dry-run artifact:
 run printed `VLLM_READY_ATTEMPTS=700` with clean preflight. Retry dense serving
 snapshots with this hardening before interpreting dense paged-vs-vLLM ratios.
 
+### Phase 47 dense serving snapshot retry
+
+After Phase48, the dense snapshot completed at
+`/home/mudler/bench/phase47_dense_serving_retry/20260701_100811` with pre/post
+gates green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
+`806/806`.
+
+Dense paged-vs-vLLM ratios:
+
+| n | paged decode / vLLM | paged per-seq / vLLM | paged agg / vLLM | paged TTFT / vLLM |
+|---|---------------------|----------------------|------------------|-------------------|
+| 1 | `1.3434` | `1.3488` | `1.3021` | `1.8746` |
+| 8 | `1.1560` | `1.1493` | `0.9142` | `4.0467` |
+| 32 | `0.9036` | `0.8382` | `0.6168` | `3.6450` |
+| 128 | `0.7912` | `0.6436` | `0.5071` | `3.2011` |
+
+Decision: dense low-N decode remains a real paged strength, but dense serving
+still does not close GB10 parity because TTFT and high-concurrency aggregate
+throughput remain substantially behind vLLM.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
diff --git a/docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md b/docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md
index b1984a6c1..38175c7d4 100644
--- a/docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md
+++ b/docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md
@@ -28,7 +28,7 @@ Expected: exit `0`, docker/local-ai-worker/GPU compute all zero, dense model pat
 **Files:**
 - Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
 
-- [ ] **Step 1: Run full dense snapshot after Phase48 hardening**
+- [x] **Step 1: Run full dense snapshot after Phase48 hardening**
 
 ```bash
 ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase47_dense_serving/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin MODEL=$HOME/bench/q36-27b-nvfp4.gguf VLLM_MODEL=$HOME/bench/q36-27b-nvfp4-vllm SERVED_MODEL_NAME=dense-q36 ART=$ART NPL="1 8 32 128" PARALLEL=128 CTX=131072 PTOK=128 GEN=64 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
@@ -41,6 +41,10 @@ First attempt status: incomplete at
 paged arm completed, but vLLM startup exceeded the old fixed readiness budget
 and produced no vLLM result JSONs. Retry only after Phase48 readiness hardening.
 
+Retry status: completed at
+`/home/mudler/bench/phase47_dense_serving_retry/20260701_100811` after Phase48
+with `VLLM_READY_ATTEMPTS=700`.
+
 ### Task 3: Record dense snapshot result
 
 **Files:**
@@ -49,11 +53,11 @@ and produced no vLLM result JSONs. Retry only after Phase48 readiness hardening.
 - Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
 - Modify: `docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md`
 
-- [ ] **Step 1: Summarize artifact outputs**
+- [x] **Step 1: Summarize artifact outputs**
 
 Record the dry-run artifact, full snapshot artifact, pre/post md5/op gate status, and the ratio rows from `summary.tsv`.
 
-- [ ] **Step 2: Mark completed plan items**
+- [x] **Step 2: Mark completed plan items**
 
 Mark this plan's checkboxes complete only after the corresponding command or docs update has happened.
 
@@ -62,7 +66,7 @@ Mark this plan's checkboxes complete only after the corresponding command or doc
 **Files:**
 - Commit Phase47 docs and plan changes.
 
-- [ ] **Step 1: Run final checks**
+- [x] **Step 1: Run final checks**
 
 ```bash
 git diff --check
@@ -71,7 +75,7 @@ git status --short
 
 Expected: no whitespace errors; only intended docs/plan changes plus the pre-existing untracked `.claude/`.
 
-- [ ] **Step 2: Commit**
+- [x] **Step 2: Commit**
 
 ```bash
 git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \