docs(paged): record low-concurrency serving check

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 07:24:18 +00:00
parent d44e164c96
commit aa848d5afb
4 changed files with 236 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2370,3 +2370,74 @@ Decision:
  prefill GDN, prefill MoE GEMM, and low-concurrency/full-step graph capture.
  Any future C1 rerun must push beyond this tested point and keep the same
  md5 plus `MUL_MAT`/`MUL_MAT_ID` gates.
+
+## Phase 41 Low-Concurrency D1 Check
+
+Phase 41 measured the opposite serving regime after Phase 40 rejected the tested
+max-concurrency shortcut: low concurrency and latency-sensitive decode. This is
+the regime where the D1/full-step graph-capture direction should matter most.
+
+Artifacts:
+
+- `/home/mudler/bench/phase41_low_concurrency_dryrun/20260701_091429`
+- `/home/mudler/bench/phase41_low_concurrency/20260701_091437`
+
+Preflight:
+
+| check | actual |
+|-------|--------|
+| GPU | `NVIDIA GB10, 580.159.03` |
+| docker containers | `0` |
+| `local-ai-worker` containers | `0` |
+| GPU compute apps | `0` |
+| GPU lock owner | `FREE released-by-codex-current-serving-snapshot 1782889704` |
+
+Run shape:
+
+- `BUILD_DIR=$HOME/llama-phase6-source/build-phase36`
+- `BIN=$HOME/llama-phase6-source/build-phase36/bin`
+- `OPS=MUL_MAT,MUL_MAT_ID`
+- `PARALLEL=32`, `CTX=32768`, `PTOK=128`, `GEN=64`, `NPL="1 8 32"`
+
+Pre/post inference gates:
+
+| phase | check | status | actual |
+|-------|-------|--------|--------|
+| pre | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| pre | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| pre | `MUL_MAT` | ok | `1146/1146` |
+| pre | `MUL_MAT_ID` | ok | `806/806` |
+| post | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| post | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| post | `MUL_MAT` | ok | `1146/1146` |
+| post | `MUL_MAT_ID` | ok | `806/806` |
+
+Serving result:
+
+| arm | n | agg t/s | decode agg t/s | decode per-seq t/s | prefill t/s | TTFT mean ms |
+|-----|---|---------|----------------|--------------------|-------------|--------------|
+| paged | 1 | `50.6` | `56.5` | `55.61` | `1221.5` | `131.8` |
+| paged | 8 | `159.5` | `222.9` | `26.72` | `1438.8` | `835.9` |
+| paged | 32 | `240.1` | `393.9` | `11.15` | `1615.7` | `2784.4` |
+| vLLM | 1 | `67.5` | `75.4` | `74.14` | `1720.4` | `95.3` |
+| vLLM | 8 | `251.8` | `296.5` | `36.12` | `4558.8` | `266.0` |
+| vLLM | 32 | `454.6` | `592.4` | `17.43` | `5376.5` | `818.6` |
+
+Ratios:
+
+| n | paged decode / vLLM | paged per-seq / vLLM | paged agg / vLLM | paged TTFT / vLLM |
+|---|---------------------|----------------------|------------------|-------------------|
+| 1 | `0.7493` | `0.7501` | `0.7496` | `1.3830` |
+| 8 | `0.7518` | `0.7398` | `0.6334` | `3.1425` |
+| 32 | `0.6649` | `0.6397` | `0.5282` | `3.4014` |
+
+Decision:
+
+- D1/full-step graph capture remains relevant for low-concurrency and latency
+  work, but this current-stack snapshot does not show an easy parity bridge:
+  paged is about `0.75x` vLLM decode at `n=1` and `n=8`, and `0.665x` at `n=32`.
+- TTFT is the bigger user-visible low-concurrency gap: `1.38x` vLLM at `n=1`,
+  `3.14x` at `n=8`, and `3.40x` at `n=32`. Prefill GDN and MoE GEMM work
+  therefore still matters even in a decode-focused serving discussion.
+- The next implementation phase should require a separately built A/B and the
+  same md5 plus `MUL_MAT`/`MUL_MAT_ID` gates before claiming any D1 improvement.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -525,6 +525,17 @@ and `OPS=MUL_MAT,MUL_MAT_ID`. Pre/post gates stayed green: MoE
 concurrency at this prompt/gen length and `n<=256`; a future C1 retry must push
 beyond this tested point and keep the same md5/op gates.

+Phase 41 records the low-concurrency counterpart for D1. Artifact:
+`/home/mudler/bench/phase41_low_concurrency/20260701_091437`. The snapshot ran
+with `PARALLEL=32`, `CTX=32768`, `PTOK=128`, `GEN=64`, `NPL="1 8 32"`, and
+`OPS=MUL_MAT,MUL_MAT_ID`. Pre/post gates stayed green: MoE
+`8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
+`806/806`. Paged is about `0.75x` vLLM decode at `n=1` and `n=8`, and `0.665x`
+at `n=32`; TTFT is `1.38x`, `3.14x`, and `3.40x` vLLM at `n=1`, `8`, and `32`.
+Keep D1 in scope for low-concurrency/latency, but require a separately built
+A/B and the same md5/op gates before claiming improvement.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -608,6 +619,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase39_gate_sgemm_profile/20260701_085211` - short completion profile, diagnostic only because `-n 32` is not a canonical md5 gate; useful for confirming graph-time concat is a real kernel path.
 - `~/bench/phase39_gate_sgemm_profile/phase27_reanalysis` - Phase27 serving profile re-analysis used to reject graph-time fused gate weight concat; `concat_layout=459.84ms` (`2.29%`) in the serving kernel window.
 - `~/bench/phase40_max_concurrency/20260701_090012` - max-concurrency C1 check at `NPL=128/192/256`, `PTOK=128`, `GEN=64`, `PARALLEL=256`, `CTX=262144`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green, but vLLM also fit at `n=256` and stayed ahead (`paged_decode_over_vllm=0.6354`, `paged_agg_over_vllm=0.4721`).
+- `~/bench/phase41_low_concurrency/20260701_091437` - low-concurrency D1 check at `NPL=1/8/32`, `PTOK=128`, `GEN=64`, `PARALLEL=32`, `CTX=32768`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green; paged is `0.7493`, `0.7518`, and `0.6649` of vLLM decode at `n=1/8/32`, with TTFT at `1.3830`, `3.1425`, and `3.4014` of vLLM.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1038,6 +1038,31 @@ the memory-footprint advantage as a parity claim at this tested point; any
 future C1 retry must push beyond it and keep md5 plus `MUL_MAT`/`MUL_MAT_ID`
 gates.

+### Phase 41 low-concurrency D1 check
+
+Phase 41 measured the low-concurrency serving regime where D1/full-step graph
+capture should be most useful. Artifact:
+`/home/mudler/bench/phase41_low_concurrency/20260701_091437`. The run used
+`PARALLEL=32`, `CTX=32768`, `PTOK=128`, `GEN=64`, `NPL="1 8 32"`, and
+`OPS=MUL_MAT,MUL_MAT_ID`.
+
+Pre/post gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
+`806/806`.
+
+Result:
+
+| n | paged decode / vLLM | paged per-seq / vLLM | paged agg / vLLM | paged TTFT / vLLM |
+|---|---------------------|----------------------|------------------|-------------------|
+| 1 | `0.7493` | `0.7501` | `0.7496` | `1.3830` |
+| 8 | `0.7518` | `0.7398` | `0.6334` | `3.1425` |
+| 32 | `0.6649` | `0.6397` | `0.5282` | `3.4014` |
+
+Decision: D1 remains a real low-concurrency/latency lever, but Phase 41 does not
+make it a shortcut to parity. The implementation gate remains a separately built
+A/B with md5 plus `MUL_MAT`/`MUL_MAT_ID` checks, and TTFT evidence keeps prefill
+GDN/MoE work in scope for serving quality.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update