docs(paged): record max-concurrency parity check

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 07:13:38 +00:00
parent 52c11b1ce5
commit d44e164c96
5 changed files with 271 additions and 3 deletions


@@ -2295,3 +2295,78 @@ Decision:
default-off, keep gate weights in F32, avoid graph-time weight concat, and
pass MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` gates before any serving
benchmark. If md5 changes, run KL first and reject on KL regression.
## Phase 40 Max-Concurrency C1 Check
Phase 40 tested the remaining C1 hypothesis from the lever map: use paged KV's
lower memory footprint to run a higher-concurrency serving point where vLLM
falls behind or fails to fit.
Artifacts:
- `/home/mudler/bench/phase40_max_concurrency_dryrun/20260701_090002`
- `/home/mudler/bench/phase40_max_concurrency/20260701_090012`
Preflight:
| check | actual |
|-------|--------|
| GPU | `NVIDIA GB10, 580.159.03` |
| docker containers | `0` |
| `local-ai-worker` containers | `0` |
| GPU compute apps | `0` |
| GPU lock owner | `FREE phase39-gate-sgemm-profile-done 1782888737` |
Harness change:
- `paged-current-serving-snapshot.sh` now accepts `BUILD_DIR` and defaults
`BIN` from that same directory. This keeps the benchmark build step and runtime
binaries pointed at the same CMake tree.
- Phase 40 used `BUILD_DIR=$HOME/llama-phase6-source/build-phase36`,
`BIN=$HOME/llama-phase6-source/build-phase36/bin`,
`OPS=MUL_MAT,MUL_MAT_ID`, `PARALLEL=256`, `CTX=262144`, `PTOK=128`,
`GEN=64`, `NPL="128 192 256"`.
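The default chain described above can be sketched as follows. This is a minimal illustration, assuming the `${VAR:-default}` pattern the harness uses; `resolve_bin` and the `/src` and `/opt/bin` paths are hypothetical and exist only for this example.

```bash
# Hypothetical helper, not part of the harness: prints the BIN directory
# that the default chain would select for the current environment.
resolve_bin() (
  # The ( ) body runs in a subshell, so the caller's variables stay untouched.
  SRC=${SRC:-"$HOME/llama-phase6-source"}
  BUILD_DIR=${BUILD_DIR:-"$SRC/build-cuda"}
  BIN=${BIN:-"$BUILD_DIR/bin"}
  printf '%s\n' "$BIN"
)

# Nothing but SRC set: both defaults derive from SRC.
default_bin=$(unset BIN BUILD_DIR; SRC=/src; resolve_bin)
# Only BUILD_DIR overridden: BIN follows the selected build tree.
only_build_dir=$(unset BIN; SRC=/src; BUILD_DIR=/src/build-phase36; resolve_bin)
# An explicit BIN still wins over the derived default.
explicit_bin=$(SRC=/src; BUILD_DIR=/src/build-phase36; BIN=/opt/bin; resolve_bin)

printf '%s\n%s\n%s\n' "$default_bin" "$only_build_dir" "$explicit_bin"
```

Setting `BUILD_DIR` alone is enough to keep the build step and the runtime binaries on one tree; passing `BIN` as well, as Phase 40 did, is redundant but harmless when both point at the same directory.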
Pre/post inference gates:
| phase | check | status | actual |
|-------|-------|--------|--------|
| pre | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
| pre | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
| pre | `MUL_MAT` | ok | `1146/1146` |
| pre | `MUL_MAT_ID` | ok | `806/806` |
| post | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
| post | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
| post | `MUL_MAT` | ok | `1146/1146` |
| post | `MUL_MAT_ID` | ok | `806/806` |
Serving result:
| arm | n | agg t/s | decode agg t/s | decode per-seq t/s | prefill t/s | TTFT mean ms |
|-----|---|---------|----------------|--------------------|-------------|--------------|
| paged | 128 | `326.3` | `671.8` | `3.97` | `1695.2` | `8182.3` |
| paged | 192 | `318.3` | `679.9` | `2.50` | `1605.2` | `11151.6` |
| paged | 256 | `337.1` | `829.9` | `2.09` | `1525.7` | `15065.7` |
| vLLM | 128 | `654.4` | `1013.3` | `6.72` | `5206.0` | `2582.6` |
| vLLM | 192 | `697.7` | `1185.2` | `4.88` | `4787.1` | `3690.6` |
| vLLM | 256 | `714.1` | `1306.1` | `3.90` | `4471.0` | `5124.2` |
Ratios:
| n | paged decode / vLLM | paged per-seq / vLLM | paged agg / vLLM | paged TTFT / vLLM |
|---|---------------------|----------------------|------------------|-------------------|
| 128 | `0.6630` | `0.5908` | `0.4986` | `3.1682` |
| 192 | `0.5737` | `0.5123` | `0.4562` | `3.0216` |
| 256 | `0.6354` | `0.5359` | `0.4721` | `2.9401` |
Decision:
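- C1 does not close GB10 parity for this workload. Paged safely serves `n=256`
with canonical md5/op gates green before and after the run, but vLLM also
fits and remains faster: at `n=256` paged reaches `0.6354` of vLLM decode
throughput and `0.4721` of vLLM aggregate throughput, with `2.9401`x the mean
TTFT.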
- Do not claim a GB10 parity win from higher max concurrency at
`PTOK=128`, `GEN=64`, `n<=256`.
- The next GB10 work should stay on the profile-validated root causes:
prefill GDN, prefill MoE GEMM, and low-concurrency/full-step graph capture.
Any future C1 rerun must push beyond this tested point and keep the same
md5 plus `MUL_MAT`/`MUL_MAT_ID` gates.


@@ -513,6 +513,18 @@ The only future fused-gate design worth scoping is a persistent/load-time F32
combined gate weight with output views, default-off until MoE/dense md5,
`MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates pass.
Phase 40 rules out the tested GB10 max-concurrency C1 shortcut. Artifact:
`/home/mudler/bench/phase40_max_concurrency/20260701_090012`. The snapshot ran
with `PARALLEL=256`, `CTX=262144`, `PTOK=128`, `GEN=64`, `NPL="128 192 256"`,
and `OPS=MUL_MAT,MUL_MAT_ID`. Pre/post gates stayed green: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
`806/806`. Paged safely served `n=256`, but vLLM also fit and remained faster:
`paged_decode_over_vllm=0.6354`, `paged_agg_over_vllm=0.4721`,
`paged_ttft_over_vllm=2.9401`. Do not claim GB10 parity from higher max
concurrency at this prompt/gen length and `n<=256`; a future C1 retry must push
beyond this tested point and keep the same md5/op gates.
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -595,6 +607,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase38_gate_baseline/20260701_084410` - current Phase37 build baseline before gate-projection policy work; docker/local-ai-worker/GPU idle preflight green; MoE/dense md5 green; `MUL_MAT` `1146/1146`; `MUL_MAT_ID` `806/806`.
- `~/bench/phase39_gate_sgemm_profile/20260701_085211` - short completion profile, diagnostic only because `-n 32` is not a canonical md5 gate; useful for confirming graph-time concat is a real kernel path.
- `~/bench/phase39_gate_sgemm_profile/phase27_reanalysis` - Phase27 serving profile re-analysis used to reject graph-time fused gate weight concat; `concat_layout=459.84ms` (`2.29%`) in the serving kernel window.
- `~/bench/phase40_max_concurrency/20260701_090012` - max-concurrency C1 check at `NPL=128/192/256`, `PTOK=128`, `GEN=64`, `PARALLEL=256`, `CTX=262144`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green, but vLLM also fit at `n=256` and stayed ahead (`paged_decode_over_vllm=0.6354`, `paged_agg_over_vllm=0.4721`).
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.


@@ -1012,6 +1012,32 @@ into `ffn_moe_logits` and `shared_expert_gate`. That is a model-loader/weight
layout feature, not a graph shortcut. It must stay default-off until MoE/dense
md5, `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates pass.
### Phase 40 max-concurrency C1 check
Phase 40 tested whether paged KV's memory advantage creates a higher-concurrency
GB10 serving point that closes the vLLM gap. Artifact:
`/home/mudler/bench/phase40_max_concurrency/20260701_090012`. The run used
`PARALLEL=256`, `CTX=262144`, `PTOK=128`, `GEN=64`, `NPL="128 192 256"`, and
`OPS=MUL_MAT,MUL_MAT_ID`.
Pre/post gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
`806/806`.
Result:
| n | paged decode / vLLM | paged per-seq / vLLM | paged agg / vLLM | paged TTFT / vLLM |
|---|---------------------|----------------------|------------------|-------------------|
| 128 | `0.6630` | `0.5908` | `0.4986` | `3.1682` |
| 192 | `0.5737` | `0.5123` | `0.4562` | `3.0216` |
| 256 | `0.6354` | `0.5359` | `0.4721` | `2.9401` |
Decision: C1 does not close GB10 parity at `PTOK=128`, `GEN=64`, and `n<=256`.
Paged safely serves `n=256`, but vLLM also fits and remains faster. Do not use
the memory-footprint advantage as a parity claim at this tested point; any
future C1 retry must push beyond it and keep md5 plus `MUL_MAT`/`MUL_MAT_ID`
gates.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update


@@ -13,6 +13,7 @@ comparison with the h2h client.
Environment overrides:
SRC llama.cpp source dir (default: ~/llama-phase6-source)
BUILD_DIR llama.cpp CMake build dir (default: $SRC/build-cuda)
BIN llama.cpp build bin dir (default: $SRC/build-cuda/bin)
MODEL paged GGUF path (default: ~/bench/q36-35b-a3b-nvfp4.gguf)
VLLM_MODEL vLLM model dir (default: ~/bench/q36-35b-a3b-nvfp4-vllm)
@@ -58,7 +59,8 @@ case "${1:-}" in
esac
SRC=${SRC:-"$HOME/llama-phase6-source"}
-BIN=${BIN:-"$SRC/build-cuda/bin"}
+BUILD_DIR=${BUILD_DIR:-"$SRC/build-cuda"}
+BIN=${BIN:-"$BUILD_DIR/bin"}
MODEL=${MODEL:-"$HOME/bench/q36-35b-a3b-nvfp4.gguf"}
VLLM_MODEL=${VLLM_MODEL:-"$HOME/bench/q36-35b-a3b-nvfp4-vllm"}
H2H=${H2H:-"$HOME/bench/h2h_cli3.py"}
@@ -376,14 +378,14 @@ log "source=$(git -C "$SRC" log --oneline -1)"
if [[ "$DRY_RUN" == "1" ]]; then
log "dry run only; commands validated"
-log "would build: cmake --build $SRC/build-cuda --target llama-server llama-completion test-backend-ops -j8"
+log "would build: cmake --build $BUILD_DIR --target llama-server llama-completion test-backend-ops -j8"
log "would run paged NPL=[$NPL] PTOK=$PTOK GEN=$GEN"
log "would run vLLM NPL=[$NPL] PTOK=$PTOK GEN=$GEN"
exit 0
fi
log "building llama-server, llama-completion, and test-backend-ops"
-cmake --build "$SRC/build-cuda" --target llama-server llama-completion test-backend-ops -j 8 \
+cmake --build "$BUILD_DIR" --target llama-server llama-completion test-backend-ops -j 8 \
> "$ART/build.log" 2>&1
run_gate pre


@@ -0,0 +1,152 @@
# Max Concurrency Phase40 Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Test whether the paged llama.cpp GB10 memory advantage produces a higher-concurrency serving operating point that closes or beats vLLM.
**Architecture:** Use the existing same-session serving snapshot harness with pre/post inference gates. Add only a harness-level `BUILD_DIR` override so the benchmark builds and runs the same selected CMake tree.
**Tech Stack:** Bash harness, DGX GB10, llama.cpp `llama-server`, vLLM OpenAI-compatible server, h2h client, `paged-inference-gates.sh`.
---
### Task 1: Make The Snapshot Harness Build The Selected Tree
**Files:**
- Modify: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
- [x] **Step 1: Write the failing check**
Run:
```bash
backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help | grep -F 'BUILD_DIR'
```
Expected before the change: exit `1`.
- [x] **Step 2: Add `BUILD_DIR`**
Change the harness so:
```bash
BUILD_DIR=${BUILD_DIR:-"$SRC/build-cuda"}
BIN=${BIN:-"$BUILD_DIR/bin"}
```
and build with:
```bash
cmake --build "$BUILD_DIR" --target llama-server llama-completion test-backend-ops -j 8
```
- [x] **Step 3: Verify locally**
Run:
```bash
bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help | grep -F 'BUILD_DIR llama.cpp CMake build dir'
```
Expected: both exit `0`.
- [x] **Step 4: Verify on DGX dry-run**
Run:
```bash
ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase40_max_concurrency_dryrun/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin ART=$ART NPL="128 192 256" PARALLEL=256 CTX=262144 PTOK=128 GEN=64 DRY_RUN=1 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
```
Observed artifact: `/home/mudler/bench/phase40_max_concurrency_dryrun/20260701_090002`.
Expected evidence:
```text
docker=0
local_ai_worker=0
compute=0
would build: cmake --build /home/mudler/llama-phase6-source/build-phase36 --target llama-server llama-completion test-backend-ops -j8
```
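A check like the one below could confirm that a captured dry-run log carries the expected evidence. It is only a sketch: `dryrun_evidence_ok` is a hypothetical helper, it assumes the log is supplied on stdin, and it uses substring matching because the exact prefix the harness puts on each log line is not shown here.

```bash
# Hypothetical helper: reads a dry-run log on stdin and verifies that the
# idle-preflight lines and the build command for the selected tree appear.
dryrun_evidence_ok() {
  local build_dir=$1 out needle
  out=$(cat)
  for needle in 'docker=0' 'local_ai_worker=0' 'compute=0' \
                "would build: cmake --build $build_dir --target"; do
    # -F matches the needle literally; -q suppresses output.
    grep -qF -- "$needle" <<<"$out" || { echo "missing: $needle" >&2; return 1; }
  done
}
```

For the dry-run above, the argument would be `/home/mudler/llama-phase6-source/build-phase36`; a log that still names `build-cuda` in the build line fails the check.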
### Task 2: Run Max-Concurrency Snapshot With Correctness Gates
**Files:**
- Read: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
- Artifact: `dgx:~/bench/phase40_max_concurrency/20260701_090012`
- [x] **Step 1: Run the gated snapshot**
Run:
```bash
ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase40_max_concurrency/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin ART=$ART NPL="128 192 256" PARALLEL=256 CTX=262144 PTOK=128 GEN=64 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
```
- [x] **Step 2: Confirm pre/post inference gates**
Observed:
```text
pre moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0
pre dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439
pre op_MUL_MAT ok 1146/1146
pre op_MUL_MAT_ID ok 806/806
post moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0
post dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439
post op_MUL_MAT ok 1146/1146
post op_MUL_MAT_ID ok 806/806
```
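The pre/post comparison can also be checked mechanically. The sketch below assumes gate lines in the four-column `phase check status actual` form shown above; `gates_stable` is a hypothetical helper, not part of `paged-inference-gates.sh`.

```bash
# Hypothetical helper: reads gate lines on stdin and succeeds only if every
# status is "ok" and each post value equals the pre value for the same check.
gates_stable() {
  awk '
    $3 != "ok"   { bad = 1 }                       # any failed gate rejects the run
    $1 == "pre"  { pre[$2] = $4 }                  # remember the pre-run value
    $1 == "post" {
      if (!($2 in pre) || pre[$2] != $4) bad = 1   # post must match its pre value
      seen++
    }
    END { if (bad || seen == 0) exit 1 }           # no post lines also fails
  '
}
```

Fed the eight observed lines, this exits `0`; a changed md5 or op count between `pre` and `post` makes it exit `1`.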
- [x] **Step 3: Record serving result**
Observed:
```text
arm n agg_tps decode_agg_tps decode_perseq_tps prefill_tps ttft_mean_ms
paged 128 326.3 671.8 3.97 1695.2 8182.3
paged 192 318.3 679.9 2.50 1605.2 11151.6
paged 256 337.1 829.9 2.09 1525.7 15065.7
vllm 128 654.4 1013.3 6.72 5206.0 2582.6
vllm 192 697.7 1185.2 4.88 4787.1 3690.6
vllm 256 714.1 1306.1 3.90 4471.0 5124.2
```
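The ratios quoted in the decision follow from dividing each paged column by the vLLM column at the same `n`. A minimal sketch, assuming the whitespace-separated layout shown above; `serving_ratios` is a hypothetical helper, and the output key names simply mirror those used in the decision text.

```bash
# Hypothetical helper: reads rows of
#   arm n agg_tps decode_agg_tps decode_perseq_tps prefill_tps ttft_mean_ms
# on stdin and prints paged/vLLM ratios for each concurrency level.
serving_ratios() {
  awk '
    $1 == "paged" { pa[$2] = $3; pd[$2] = $4; pt[$2] = $7; order[++k] = $2 }
    $1 == "vllm"  { va[$2] = $3; vd[$2] = $4; vt[$2] = $7 }
    END {
      for (i = 1; i <= k; i++) {
        c = order[i]
        if (!(c in vd)) continue   # skip points without a vLLM row
        printf "n=%s paged_decode_over_vllm=%.4f paged_agg_over_vllm=%.4f paged_ttft_over_vllm=%.4f\n",
               c, pd[c] / vd[c], pa[c] / va[c], pt[c] / vt[c]
      }
    }
  '
}
```

Throughput ratios below `1` and a TTFT ratio above `1` both mean paged is behind vLLM at that point. The header row is ignored because its first field is neither `paged` nor `vllm`.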
- [x] **Step 4: Record decision**
Decision: C1 does not close GB10 parity for the tested `PTOK=128`, `GEN=64`, `NPL=128/192/256` workload. Paged runs safely at `n=256`, but vLLM also fits and remains faster (`paged_decode_over_vllm=0.6354`, `paged_agg_over_vllm=0.4721`).
### Task 3: Update Handoff Docs
**Files:**
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
- Modify: `docs/superpowers/plans/2026-07-01-max-concurrency-phase40.md`
- [x] **Step 1: Add Phase40 sections**
Record artifact paths, gate evidence, throughput table, and C1 decision in all three handoff documents.
- [x] **Step 2: Verify docs and script**
Run:
```bash
bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
git diff --check
```
- [x] **Step 3: Commit**
Run:
```bash
git add backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh \
backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \
docs/superpowers/plans/2026-07-01-max-concurrency-phase40.md
git commit -m "docs(paged): record max-concurrency parity check"
```