From d44e164c962192616fe4bd591a03f8ce47d4ab69 Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Wed, 1 Jul 2026 07:13:38 +0000 Subject: [PATCH] docs(paged): record max-concurrency parity check Assisted-by: Codex:gpt-5 --- .../docs/GB10_PARITY_PHASE0_RESULTS.md | 75 +++++++++ .../docs/PARITY_HANDOFF.md | 13 ++ .../docs/VLLM_PARITY_LEVER_MAP.md | 26 +++ .../paged-current-serving-snapshot.sh | 8 +- .../2026-07-01-max-concurrency-phase40.md | 152 ++++++++++++++++++ 5 files changed, 271 insertions(+), 3 deletions(-) create mode 100644 docs/superpowers/plans/2026-07-01-max-concurrency-phase40.md diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md index 8a0cf5c8e..81123ae58 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md @@ -2295,3 +2295,78 @@ Decision: default-off, keep gate weights in F32, avoid graph-time weight concat, and pass MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` gates before any serving benchmark. If md5 changes, run KL first and reject on KL regression. + +## Phase 40 Max-Concurrency C1 Check + +Phase 40 tested the remaining C1 hypothesis from the lever map: use paged KV's +lower memory footprint to run a higher-concurrency serving point where vLLM +falls behind or fails to fit. + +Artifacts: + +- `/home/mudler/bench/phase40_max_concurrency_dryrun/20260701_090002` +- `/home/mudler/bench/phase40_max_concurrency/20260701_090012` + +Preflight: + +| check | actual | +|-------|--------| +| GPU | `NVIDIA GB10, 580.159.03` | +| docker containers | `0` | +| `local-ai-worker` containers | `0` | +| GPU compute apps | `0` | +| GPU lock owner | `FREE phase39-gate-sgemm-profile-done 1782888737` | + +Harness change: + +- `paged-current-serving-snapshot.sh` now accepts `BUILD_DIR` and defaults + `BIN` from that same directory. 
This keeps the benchmark build step and runtime + binaries pointed at the same CMake tree. +- Phase 40 used `BUILD_DIR=$HOME/llama-phase6-source/build-phase36`, + `BIN=$HOME/llama-phase6-source/build-phase36/bin`, + `OPS=MUL_MAT,MUL_MAT_ID`, `PARALLEL=256`, `CTX=262144`, `PTOK=128`, + `GEN=64`, `NPL="128 192 256"`. + +Pre/post inference gates: + +| phase | check | status | actual | +|-------|-------|--------|--------| +| pre | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| pre | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| pre | `MUL_MAT` | ok | `1146/1146` | +| pre | `MUL_MAT_ID` | ok | `806/806` | +| post | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| post | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| post | `MUL_MAT` | ok | `1146/1146` | +| post | `MUL_MAT_ID` | ok | `806/806` | + +Serving result: + +| arm | n | agg t/s | decode agg t/s | decode per-seq t/s | prefill t/s | TTFT mean ms | +|-----|---|---------|----------------|--------------------|-------------|--------------| +| paged | 128 | `326.3` | `671.8` | `3.97` | `1695.2` | `8182.3` | +| paged | 192 | `318.3` | `679.9` | `2.50` | `1605.2` | `11151.6` | +| paged | 256 | `337.1` | `829.9` | `2.09` | `1525.7` | `15065.7` | +| vLLM | 128 | `654.4` | `1013.3` | `6.72` | `5206.0` | `2582.6` | +| vLLM | 192 | `697.7` | `1185.2` | `4.88` | `4787.1` | `3690.6` | +| vLLM | 256 | `714.1` | `1306.1` | `3.90` | `4471.0` | `5124.2` | + +Ratios: + +| n | paged decode / vLLM | paged per-seq / vLLM | paged agg / vLLM | paged TTFT / vLLM | +|---|---------------------|----------------------|------------------|-------------------| +| 128 | `0.6630` | `0.5908` | `0.4986` | `3.1682` | +| 192 | `0.5737` | `0.5123` | `0.4562` | `3.0216` | +| 256 | `0.6354` | `0.5359` | `0.4721` | `2.9401` | + +Decision: + +- C1 does not close GB10 parity for this workload. 
Paged safely serves `n=256` + with canonical md5/op gates green before and after the run, but vLLM also + fits and remains materially faster. +- Do not claim a GB10 parity win from higher max concurrency at + `PTOK=128`, `GEN=64`, `n<=256`. +- The next GB10 work should stay on the profile-validated root causes: + prefill GDN, prefill MoE GEMM, and low-concurrency/full-step graph capture. + Any future C1 rerun must push beyond this tested point and keep the same + md5 plus `MUL_MAT`/`MUL_MAT_ID` gates. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index 6c190c2fb..41d0f61b9 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -513,6 +513,18 @@ The only future fused-gate design worth scoping is a persistent/load-time F32 combined gate weight with output views, default-off until MoE/dense md5, `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates pass. +Phase 40 closes the tested GB10 max-concurrency C1 shortcut. Artifact: +`/home/mudler/bench/phase40_max_concurrency/20260701_090012`. The snapshot ran +with `PARALLEL=256`, `CTX=262144`, `PTOK=128`, `GEN=64`, `NPL="128 192 256"`, +and `OPS=MUL_MAT,MUL_MAT_ID`. Pre/post gates stayed green: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` +`806/806`. Paged safely served `n=256`, but vLLM also fit and remained faster: +`paged_decode_over_vllm=0.6354`, `paged_agg_over_vllm=0.4721`, +`paged_ttft_over_vllm=2.9401`. Do not claim GB10 parity from higher max +concurrency at this prompt/gen length and `n<=256`; a future C1 retry must push +beyond this tested point and keep the same md5/op gates. + --- ## 5. 
METHODOLOGY LESSONS (so you do not repeat the mistakes) @@ -595,6 +607,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual - `~/bench/phase38_gate_baseline/20260701_084410` - current Phase37 build baseline before gate-projection policy work; docker/local-ai-worker/GPU idle preflight green; MoE/dense md5 green; `MUL_MAT` `1146/1146`; `MUL_MAT_ID` `806/806`. - `~/bench/phase39_gate_sgemm_profile/20260701_085211` - short completion profile, diagnostic only because `-n 32` is not a canonical md5 gate; useful for confirming graph-time concat is a real kernel path. - `~/bench/phase39_gate_sgemm_profile/phase27_reanalysis` - Phase27 serving profile re-analysis used to reject graph-time fused gate weight concat; `concat_layout=459.84ms` (`2.29%`) in the serving kernel window. +- `~/bench/phase40_max_concurrency/20260701_090012` - max-concurrency C1 check at `NPL=128/192/256`, `PTOK=128`, `GEN=64`, `PARALLEL=256`, `CTX=262144`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green, but vLLM also fit at `n=256` and stayed ahead (`paged_decode_over_vllm=0.6354`, `paged_agg_over_vllm=0.4721`). - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`. - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30. - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md index 680e19e3f..754be9781 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md @@ -1012,6 +1012,32 @@ into `ffn_moe_logits` and `shared_expert_gate`. That is a model-loader/weight layout feature, not a graph shortcut. 
It must stay default-off until MoE/dense md5, `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates pass. +### Phase 40 max-concurrency C1 check + +Phase 40 tested whether paged KV's memory advantage creates a higher-concurrency +GB10 serving point that closes the vLLM gap. Artifact: +`/home/mudler/bench/phase40_max_concurrency/20260701_090012`. The run used +`PARALLEL=256`, `CTX=262144`, `PTOK=128`, `GEN=64`, `NPL="128 192 256"`, and +`OPS=MUL_MAT,MUL_MAT_ID`. + +Pre/post gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` +`806/806`. + +Result: + +| n | paged decode / vLLM | paged per-seq / vLLM | paged agg / vLLM | paged TTFT / vLLM | +|---|---------------------|----------------------|------------------|-------------------| +| 128 | `0.6630` | `0.5908` | `0.4986` | `3.1682` | +| 192 | `0.5737` | `0.5123` | `0.4562` | `3.0216` | +| 256 | `0.6354` | `0.5359` | `0.4721` | `2.9401` | + +Decision: C1 does not close GB10 parity at `PTOK=128`, `GEN=64`, and `n<=256`. +Paged safely serves `n=256`, but vLLM also fits and remains faster. Do not use +the memory-footprint advantage as a parity claim at this tested point; any +future C1 retry must push beyond it and keep md5 plus `MUL_MAT`/`MUL_MAT_ID` +gates. + Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`. 
### Phase 10 GDN C32 slab update diff --git a/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh index af1a7aac1..a6cf9d22b 100755 --- a/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +++ b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh @@ -13,6 +13,7 @@ comparison with the h2h client. Environment overrides: SRC llama.cpp source dir (default: ~/llama-phase6-source) + BUILD_DIR llama.cpp CMake build dir (default: $SRC/build-cuda) BIN llama.cpp build bin dir (default: $SRC/build-cuda/bin) MODEL paged GGUF path (default: ~/bench/q36-35b-a3b-nvfp4.gguf) VLLM_MODEL vLLM model dir (default: ~/bench/q36-35b-a3b-nvfp4-vllm) @@ -58,7 +59,8 @@ case "${1:-}" in esac SRC=${SRC:-"$HOME/llama-phase6-source"} -BIN=${BIN:-"$SRC/build-cuda/bin"} +BUILD_DIR=${BUILD_DIR:-"$SRC/build-cuda"} +BIN=${BIN:-"$BUILD_DIR/bin"} MODEL=${MODEL:-"$HOME/bench/q36-35b-a3b-nvfp4.gguf"} VLLM_MODEL=${VLLM_MODEL:-"$HOME/bench/q36-35b-a3b-nvfp4-vllm"} H2H=${H2H:-"$HOME/bench/h2h_cli3.py"} @@ -376,14 +378,14 @@ log "source=$(git -C "$SRC" log --oneline -1)" if [[ "$DRY_RUN" == "1" ]]; then log "dry run only; commands validated" - log "would build: cmake --build $SRC/build-cuda --target llama-server llama-completion test-backend-ops -j8" + log "would build: cmake --build $BUILD_DIR --target llama-server llama-completion test-backend-ops -j8" log "would run paged NPL=[$NPL] PTOK=$PTOK GEN=$GEN" log "would run vLLM NPL=[$NPL] PTOK=$PTOK GEN=$GEN" exit 0 fi log "building llama-server, llama-completion, and test-backend-ops" -cmake --build "$SRC/build-cuda" --target llama-server llama-completion test-backend-ops -j 8 \ +cmake --build "$BUILD_DIR" --target llama-server llama-completion test-backend-ops -j 8 \ > "$ART/build.log" 2>&1 run_gate pre diff --git a/docs/superpowers/plans/2026-07-01-max-concurrency-phase40.md 
b/docs/superpowers/plans/2026-07-01-max-concurrency-phase40.md new file mode 100644 index 000000000..8dbe486b4 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-max-concurrency-phase40.md @@ -0,0 +1,152 @@ +# Max Concurrency Phase40 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Test whether the paged llama.cpp GB10 memory advantage produces a higher-concurrency serving operating point that closes or beats vLLM. + +**Architecture:** Use the existing same-session serving snapshot harness with pre/post inference gates. Add only a harness-level `BUILD_DIR` override so the benchmark builds and runs the same selected CMake tree. + +**Tech Stack:** Bash harness, DGX GB10, llama.cpp `llama-server`, vLLM OpenAI-compatible server, h2h client, `paged-inference-gates.sh`. + +--- + +### Task 1: Make The Snapshot Harness Build The Selected Tree + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [x] **Step 1: Write the failing check** + +Run: + +```bash +backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help | grep -F 'BUILD_DIR' +``` + +Expected before the change: exit `1`. + +- [x] **Step 2: Add `BUILD_DIR`** + +Change the harness so: + +```bash +BUILD_DIR=${BUILD_DIR:-"$SRC/build-cuda"} +BIN=${BIN:-"$BUILD_DIR/bin"} +``` + +and build with: + +```bash +cmake --build "$BUILD_DIR" --target llama-server llama-completion test-backend-ops -j 8 +``` + +- [x] **Step 3: Verify locally** + +Run: + +```bash +bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help | grep -F 'BUILD_DIR llama.cpp CMake build dir' +``` + +Expected: both exit `0`. 
+ +- [x] **Step 4: Verify on DGX dry-run** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase40_max_concurrency_dryrun/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin ART=$ART NPL="128 192 256" PARALLEL=256 CTX=262144 PTOK=128 GEN=64 DRY_RUN=1 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Observed artifact: `/home/mudler/bench/phase40_max_concurrency_dryrun/20260701_090002`. + +Expected evidence: + +```text +docker=0 +local_ai_worker=0 +compute=0 +would build: cmake --build /home/mudler/llama-phase6-source/build-phase36 --target llama-server llama-completion test-backend-ops -j8 +``` + +### Task 2: Run Max-Concurrency Snapshot With Correctness Gates + +**Files:** +- Read: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` +- Artifact: `dgx:~/bench/phase40_max_concurrency/20260701_090012` + +- [x] **Step 1: Run the gated snapshot** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase40_max_concurrency/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin ART=$ART NPL="128 192 256" PARALLEL=256 CTX=262144 PTOK=128 GEN=64 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +- [x] **Step 2: Confirm pre/post inference gates** + +Observed: + +```text +pre moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0 +pre dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439 +pre op_MUL_MAT ok 1146/1146 +pre op_MUL_MAT_ID ok 806/806 +post moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0 +post dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439 +post op_MUL_MAT ok 1146/1146 +post op_MUL_MAT_ID ok 806/806 +``` + +- [x] **Step 3: Record serving result** + +Observed: + +```text +arm n agg_tps decode_agg_tps decode_perseq_tps 
prefill_tps ttft_mean_ms +paged 128 326.3 671.8 3.97 1695.2 8182.3 +paged 192 318.3 679.9 2.50 1605.2 11151.6 +paged 256 337.1 829.9 2.09 1525.7 15065.7 +vllm 128 654.4 1013.3 6.72 5206.0 2582.6 +vllm 192 697.7 1185.2 4.88 4787.1 3690.6 +vllm 256 714.1 1306.1 3.90 4471.0 5124.2 +``` + +- [x] **Step 4: Record decision** + +Decision: C1 does not close GB10 parity for the tested `PTOK=128`, `GEN=64`, `NPL=128/192/256` workload. Paged runs safely at `n=256`, but vLLM also fits and remains faster (`paged_decode_over_vllm=0.6354`, `paged_agg_over_vllm=0.4721`). + +### Task 3: Update Handoff Docs + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify: `docs/superpowers/plans/2026-07-01-max-concurrency-phase40.md` + +- [x] **Step 1: Add Phase40 sections** + +Record artifact paths, gate evidence, throughput table, and C1 decision in all three handoff documents. + +- [x] **Step 2: Verify docs and script** + +Run: + +```bash +bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +git diff --check +``` + +- [x] **Step 3: Commit** + +Run: + +```bash +git add backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh \ + backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \ + docs/superpowers/plans/2026-07-01-max-concurrency-phase40.md +git commit -m "docs(paged): record max-concurrency parity check" +```
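
The ratio tables recorded in this patch appear to be the paged value divided by the vLLM value at the same `n`. The following is a minimal sketch for rechecking them from the serving table, assuming the whitespace-separated column layout shown in Task 2, Step 3. The function name `paged_over_vllm` and the here-doc input are illustrative assumptions; neither is part of `paged-current-serving-snapshot.sh` or the h2h client.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the harness): recompute paged/vLLM ratios
# from the serving table recorded in this patch.
# Assumed columns:
#   arm n agg_tps decode_agg_tps decode_perseq_tps prefill_tps ttft_mean_ms
paged_over_vllm() {
  awk '
    # Collect paged rows keyed by n, remembering first-seen order.
    $1 == "paged" { agg[$2] = $3; dec[$2] = $4; seq[$2] = $5; ttft[$2] = $7; order[++k] = $2 }
    # Collect vLLM rows keyed by the same n.
    $1 == "vllm"  { vagg[$2] = $3; vdec[$2] = $4; vseq[$2] = $5; vttft[$2] = $7 }
    END {
      # Every paged n is assumed to have a matching vllm row.
      for (i = 1; i <= k; i++) {
        n = order[i]
        printf "%s decode=%.4f perseq=%.4f agg=%.4f ttft=%.4f\n",
               n, dec[n] / vdec[n], seq[n] / vseq[n], agg[n] / vagg[n], ttft[n] / vttft[n]
      }
    }'
}

# Values copied from the "Record serving result" step.
paged_over_vllm <<'EOF'
arm n agg_tps decode_agg_tps decode_perseq_tps prefill_tps ttft_mean_ms
paged 128 326.3 671.8 3.97 1695.2 8182.3
paged 192 318.3 679.9 2.50 1605.2 11151.6
paged 256 337.1 829.9 2.09 1525.7 15065.7
vllm 128 654.4 1013.3 6.72 5206.0 2582.6
vllm 192 697.7 1185.2 4.88 4787.1 3690.6
vllm 256 714.1 1306.1 3.90 4471.0 5124.2
EOF
```

Comparing the printed lines against the Ratios table is a quick consistency check on the recorded numbers. Values below `1` for the throughput columns and above `1` for TTFT both mean paged is behind vLLM at that `n`.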