From aa848d5afb6d7923f88b28d5ae3df0d559aecd5f Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Wed, 1 Jul 2026 07:24:18 +0000 Subject: [PATCH] docs(paged): record low-concurrency serving check Assisted-by: Codex:gpt-5 --- .../docs/GB10_PARITY_PHASE0_RESULTS.md | 71 ++++++++++ .../docs/PARITY_HANDOFF.md | 12 ++ .../docs/VLLM_PARITY_LEVER_MAP.md | 25 ++++ .../2026-07-01-low-concurrency-phase41.md | 128 ++++++++++++++++++ 4 files changed, 236 insertions(+) create mode 100644 docs/superpowers/plans/2026-07-01-low-concurrency-phase41.md diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md index 81123ae58..d51ddea36 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md @@ -2370,3 +2370,74 @@ Decision: prefill GDN, prefill MoE GEMM, and low-concurrency/full-step graph capture. Any future C1 rerun must push beyond this tested point and keep the same md5 plus `MUL_MAT`/`MUL_MAT_ID` gates. + +## Phase 41 Low-Concurrency D1 Check + +Phase 41 measured the opposite serving regime after Phase40 rejected the tested +max-concurrency shortcut: low concurrency and latency-sensitive decode. This is +the regime where the D1/full-step graph-capture direction should matter most. 
+ +Artifacts: + +- `/home/mudler/bench/phase41_low_concurrency_dryrun/20260701_091429` +- `/home/mudler/bench/phase41_low_concurrency/20260701_091437` + +Preflight: + +| check | actual | +|-------|--------| +| GPU | `NVIDIA GB10, 580.159.03` | +| docker containers | `0` | +| `local-ai-worker` containers | `0` | +| GPU compute apps | `0` | +| GPU lock owner | `FREE released-by-codex-current-serving-snapshot 1782889704` | + +Run shape: + +- `BUILD_DIR=$HOME/llama-phase6-source/build-phase36` +- `BIN=$HOME/llama-phase6-source/build-phase36/bin` +- `OPS=MUL_MAT,MUL_MAT_ID` +- `PARALLEL=32`, `CTX=32768`, `PTOK=128`, `GEN=64`, `NPL="1 8 32"` + +Pre/post inference gates: + +| phase | check | status | actual | +|-------|-------|--------|--------| +| pre | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| pre | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| pre | `MUL_MAT` | ok | `1146/1146` | +| pre | `MUL_MAT_ID` | ok | `806/806` | +| post | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| post | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| post | `MUL_MAT` | ok | `1146/1146` | +| post | `MUL_MAT_ID` | ok | `806/806` | + +Serving result: + +| arm | n | agg t/s | decode agg t/s | decode per-seq t/s | prefill t/s | TTFT mean ms | +|-----|---|---------|----------------|--------------------|-------------|--------------| +| paged | 1 | `50.6` | `56.5` | `55.61` | `1221.5` | `131.8` | +| paged | 8 | `159.5` | `222.9` | `26.72` | `1438.8` | `835.9` | +| paged | 32 | `240.1` | `393.9` | `11.15` | `1615.7` | `2784.4` | +| vLLM | 1 | `67.5` | `75.4` | `74.14` | `1720.4` | `95.3` | +| vLLM | 8 | `251.8` | `296.5` | `36.12` | `4558.8` | `266.0` | +| vLLM | 32 | `454.6` | `592.4` | `17.43` | `5376.5` | `818.6` | + +Ratios: + +| n | paged decode / vLLM | paged per-seq / vLLM | paged agg / vLLM | paged TTFT / vLLM | +|---|---------------------|----------------------|------------------|-------------------| +| 1 | `0.7493` | `0.7501` | `0.7496` | `1.3830` 
| +| 8 | `0.7518` | `0.7398` | `0.6334` | `3.1425` | +| 32 | `0.6649` | `0.6397` | `0.5282` | `3.4014` | + +Decision: + +- D1/full-step graph capture remains relevant for low-concurrency and latency + work, but this current-stack snapshot does not show an easy parity bridge: + paged is about `0.75x` vLLM decode at `n=1/8` and `0.665x` at `n=32`. +- TTFT is the bigger user-visible low-concurrency gap, especially by `n=8/32`; + prefill GDN and MoE GEMM work therefore still matters even in a decode-focused + serving discussion. +- The next implementation phase should require a separately built A/B and the + same md5 plus `MUL_MAT`/`MUL_MAT_ID` gates before claiming any D1 improvement. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index 41d0f61b9..15e87c511 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -525,6 +525,17 @@ and `OPS=MUL_MAT,MUL_MAT_ID`. Pre/post gates stayed green: MoE concurrency at this prompt/gen length and `n<=256`; a future C1 retry must push beyond this tested point and keep the same md5/op gates. +Phase 41 records the low-concurrency counterpart for D1. Artifact: +`/home/mudler/bench/phase41_low_concurrency/20260701_091437`. The snapshot ran +with `PARALLEL=32`, `CTX=32768`, `PTOK=128`, `GEN=64`, `NPL="1 8 32"`, and +`OPS=MUL_MAT,MUL_MAT_ID`. Pre/post gates stayed green: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` +`806/806`. Paged is about `0.75x` vLLM decode at `n=1/8` and `0.665x` at +`n=32`; TTFT is `1.38x`, `3.14x`, and `3.40x` vLLM respectively. Keep D1 in +scope for low-concurrency/latency, but require a separately built A/B and the +same md5/op gates before claiming improvement. + --- ## 5. 
METHODOLOGY LESSONS (so you do not repeat the mistakes) @@ -608,6 +619,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual - `~/bench/phase39_gate_sgemm_profile/20260701_085211` - short completion profile, diagnostic only because `-n 32` is not a canonical md5 gate; useful for confirming graph-time concat is a real kernel path. - `~/bench/phase39_gate_sgemm_profile/phase27_reanalysis` - Phase27 serving profile re-analysis used to reject graph-time fused gate weight concat; `concat_layout=459.84ms` (`2.29%`) in the serving kernel window. - `~/bench/phase40_max_concurrency/20260701_090012` - max-concurrency C1 check at `NPL=128/192/256`, `PTOK=128`, `GEN=64`, `PARALLEL=256`, `CTX=262144`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green, but vLLM also fit at `n=256` and stayed ahead (`paged_decode_over_vllm=0.6354`, `paged_agg_over_vllm=0.4721`). +- `~/bench/phase41_low_concurrency/20260701_091437` - low-concurrency D1 check at `NPL=1/8/32`, `PTOK=128`, `GEN=64`, `PARALLEL=32`, `CTX=32768`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green; paged is `0.7493`, `0.7518`, and `0.6649` of vLLM decode at `n=1/8/32`, with TTFT still much worse by `n=8/32`. - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`. - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30. - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`. 
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md index 754be9781..7764614c5 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md @@ -1038,6 +1038,31 @@ the memory-footprint advantage as a parity claim at this tested point; any future C1 retry must push beyond it and keep md5 plus `MUL_MAT`/`MUL_MAT_ID` gates. +### Phase 41 low-concurrency D1 check + +Phase 41 measured the low-concurrency serving regime where D1/full-step graph +capture should be most useful. Artifact: +`/home/mudler/bench/phase41_low_concurrency/20260701_091437`. The run used +`PARALLEL=32`, `CTX=32768`, `PTOK=128`, `GEN=64`, `NPL="1 8 32"`, and +`OPS=MUL_MAT,MUL_MAT_ID`. + +Pre/post gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` +`806/806`. + +Result: + +| n | paged decode / vLLM | paged per-seq / vLLM | paged agg / vLLM | paged TTFT / vLLM | +|---|---------------------|----------------------|------------------|-------------------| +| 1 | `0.7493` | `0.7501` | `0.7496` | `1.3830` | +| 8 | `0.7518` | `0.7398` | `0.6334` | `3.1425` | +| 32 | `0.6649` | `0.6397` | `0.5282` | `3.4014` | + +Decision: D1 remains a real low-concurrency/latency lever, but Phase41 does not +make it a shortcut to parity. The implementation gate remains a separately built +A/B with md5 plus `MUL_MAT`/`MUL_MAT_ID` checks, and TTFT evidence keeps prefill +GDN/MoE work in scope for serving quality. 
+ Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`. ### Phase 10 GDN C32 slab update diff --git a/docs/superpowers/plans/2026-07-01-low-concurrency-phase41.md b/docs/superpowers/plans/2026-07-01-low-concurrency-phase41.md new file mode 100644 index 000000000..15bd380b4 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-low-concurrency-phase41.md @@ -0,0 +1,128 @@ +# Low Concurrency Phase41 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Quantify the low-concurrency GB10 serving gap after Phase40 rejected the max-concurrency C1 shortcut. + +**Architecture:** Reuse the same current-stack serving harness and canonical pre/post inference gates, changing only the concurrency list and llama-server parallel/context sizing. + +**Tech Stack:** Bash harness, DGX GB10, llama.cpp `llama-server`, vLLM OpenAI-compatible server, h2h client, `paged-inference-gates.sh`. 
+ +--- + +### Task 1: Define Low-Concurrency Snapshot + +**Files:** +- Read: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [x] **Step 1: Select the run shape** + +Use: + +```bash +NPL="1 8 32" +PARALLEL=32 +CTX=32768 +PTOK=128 +GEN=64 +OPS=MUL_MAT,MUL_MAT_ID +``` + +- [x] **Step 2: Validate on DGX dry-run** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase41_low_concurrency_dryrun/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin ART=$ART NPL="1 8 32" PARALLEL=32 CTX=32768 PTOK=128 GEN=64 DRY_RUN=1 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Observed artifact: `/home/mudler/bench/phase41_low_concurrency_dryrun/20260701_091429`. + +Expected evidence: + +```text +docker=0 +local_ai_worker=0 +compute=0 +would build: cmake --build /home/mudler/llama-phase6-source/build-phase36 --target llama-server llama-completion test-backend-ops -j8 +would run paged NPL=[1 8 32] PTOK=128 GEN=64 +would run vLLM NPL=[1 8 32] PTOK=128 GEN=64 +``` + +### Task 2: Run Low-Concurrency Snapshot With Gates + +**Files:** +- Artifact: `dgx:~/bench/phase41_low_concurrency/20260701_091437` + +- [x] **Step 1: Run the snapshot** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase41_low_concurrency/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin ART=$ART NPL="1 8 32" PARALLEL=32 CTX=32768 PTOK=128 GEN=64 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +- [x] **Step 2: Confirm pre/post gates** + +Observed: + +```text +pre moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0 +pre dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439 +pre op_MUL_MAT ok 1146/1146 +pre op_MUL_MAT_ID ok 806/806 +post moe_md5 ok 
8cb0ce23777bf55f92f63d0292c756b0 +post dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439 +post op_MUL_MAT ok 1146/1146 +post op_MUL_MAT_ID ok 806/806 +``` + +- [x] **Step 3: Record serving result** + +Observed: + +```text +arm n agg_tps decode_agg_tps decode_perseq_tps prefill_tps ttft_mean_ms +paged 1 50.6 56.5 55.61 1221.5 131.8 +paged 8 159.5 222.9 26.72 1438.8 835.9 +paged 32 240.1 393.9 11.15 1615.7 2784.4 +vllm 1 67.5 75.4 74.14 1720.4 95.3 +vllm 8 251.8 296.5 36.12 4558.8 266.0 +vllm 32 454.6 592.4 17.43 5376.5 818.6 +``` + +- [x] **Step 4: Record decision** + +Decision: Phase41 confirms D1 remains relevant for low-concurrency/latency work, but the measured current-stack gap is around `0.75x` vLLM at `n=1/8` and `0.665x` at `n=32`, not an immediate parity bridge. TTFT remains the larger user-visible gap. + +### Task 3: Update Handoff Docs + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify: `docs/superpowers/plans/2026-07-01-low-concurrency-phase41.md` + +- [x] **Step 1: Add Phase41 sections** + +Record artifact paths, preflight, gate evidence, serving table, and the D1/TTFT implication in all three handoff documents. + +- [x] **Step 2: Verify docs** + +Run: + +```bash +git diff --check +``` + +- [x] **Step 3: Commit** + +Run: + +```bash +git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +git add -f docs/superpowers/plans/2026-07-01-low-concurrency-phase41.md +git commit -m "docs(paged): record low-concurrency serving check" +```
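The ratio tables recorded in this patch follow directly from the serving result rows. As a minimal sketch for re-checking them by hand, the snippet below divides a paged value by the matching vLLM value and rounds to four decimals. The `ratio` helper is a hypothetical name introduced here for illustration, not part of `paged-current-serving-snapshot.sh`, and it assumes a POSIX shell with `awk` available.

```bash
# Hypothetical helper (assumption, not part of the serving harness):
# print paged/vLLM as a 4-decimal ratio, matching the tables' precision.
ratio() {
    awk -v paged="$1" -v vllm="$2" 'BEGIN { printf "%.4f\n", paged / vllm }'
}

# n=1 values copied from the Phase 41 serving result table.
ratio 56.5  75.4    # decode agg t/s      -> 0.7493
ratio 55.61 74.14   # decode per-seq t/s  -> 0.7501
ratio 50.6  67.5    # agg t/s             -> 0.7496
ratio 131.8 95.3    # TTFT mean ms        -> 1.3830 (above 1.0 means paged is slower)
```

Feeding the `n=8` and `n=32` rows through the same helper reproduces the remaining table entries. For throughput columns a value below `1.0` means paged trails vLLM, while for TTFT a value above `1.0` means paged is slower.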