docs(paged): record dense true decode profile

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 08:55:23 +00:00
parent cd59e5d61f
commit c299dcd231
4 changed files with 523 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2769,3 +2769,67 @@ Verification:
  and a DGX dense dry-run with `VLLM_READY_ATTEMPTS=700`.
 - DGX dry-run artifact:
  `/home/mudler/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138`.
+
+## Phase 50 Dense True Decode Profile
+
+Phase 50 measures dense high-concurrency decode separately from the Phase47 h2h
+serving window. The Phase47 h2h `decode_agg_tps` metric can count tokens that
+early requests generate while later requests are still in prefill, then divide
+by a window that starts only at the last request's first token. That is useful
+serving telemetry, but it is not a pure steady-state decode measurement.
+
+Artifact:
+
+- `/home/mudler/bench/phase50_dense_true_decode/20260701_103120`
+
+Preflight:
+
+- Docker containers: `0`
+- `local-ai-worker`: `0`
+- GPU compute apps: `0`
+- GPU: `NVIDIA GB10`, driver `580.159.03`
+
+Inference gates:
+
+| phase | build | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|-------|-------|---------|-----------|-----------|--------------|
+| pre | `build-phase36` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| pre | `build-cuda` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post | `build-cuda` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+`build-phase36/bin` had the completion and op-test binaries but not
+`llama-batched-bench`, so the actual profiled llama.cpp decode binary came from
+`~/llama-phase6-source/build-cuda/bin`. That build was gated before and after
+the profile.
+
+Profile method:
+
+- Shape: dense Qwen3.5, `npl=128`, `npp=128`, `ntg=16` and `ntg=64`.
+- Paged command: `llama-batched-bench` with `LLAMA_KV_PAGED=1`,
+  `LLAMA_MOE_FORCE_GRAPHS=1`, `-c 131072 -b 2048 -ub 512 -ngl 99 -fa on`.
+- vLLM command: in-process `LLM.generate`, `max_model_len=4096`,
+  `max_num_seqs=256`, `gpu_memory_utilization=0.85`, prefix caching disabled.
+- Both profiles used `nsys --cuda-graph-trace=node`.
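+- Difference method: `(ntg64 tokens - ntg16 tokens) / (ntg64 wall - ntg16 wall)`,
+  so costs common to both runs (startup, prefill) cancel. At `npl=128` the
+  token delta is `128 * (64 - 16) = 6144`.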
+
+Results:
+
+| engine | ntg16 wall s | ntg64 wall s | delta tokens | delta wall s | true decode t/s |
+|--------|--------------|--------------|--------------|--------------|-----------------|
+| paged | `5.754` | `21.768` | `6144` | `16.014` | `383.66` |
+| vLLM | `13.041` | `27.165` | `6144` | `14.124` | `435.00` |
+| ratio | | | | | `0.8820` |
+
+Interpretation:
+
+- Dense true decode at `n=128` is about `88.2%` of vLLM, not the `79.1%`
+  implied by Phase47 h2h aggregate decode. The Phase47 serving window therefore
+  includes real scheduler/accounting effects in addition to GPU decode speed.
+- There is still a real dense GPU-steady decode gap of about `12%`, but it is
+  not large enough to explain Phase47 aggregate serving (`50.7%` of vLLM) or
+  TTFT (`3.20x` vLLM) by itself.
+- The next low-conflict code phase should add an opt-in serving
+  batch-composition/admission trace around `server_context::pre_decode()` to
+  measure decode tokens admitted, prompt tokens admitted, waiting prompt slots,
+  graph reuse, and prefill starvation. Do not start with another GDN or GEMM
+  rewrite unless that trace rules the scheduler out.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -613,6 +613,19 @@ Phase 49 removes vLLM log noise from harness-owned environment variables. The
 preserving intentional vLLM runtime variables such as `VLLM_LOGGING_LEVEL`. Dry
 run: `/home/mudler/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138`.

+Phase 50 resolves the dense high-N decode-accounting question with a graph-node
+difference-method profile. Artifact:
+`/home/mudler/bench/phase50_dense_true_decode/20260701_103120`. Pre/post
+inference gates on the profiled `build-cuda` binary stayed green: MoE
+`8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
+`806/806`. Dense `npl=128`, `npp=128` true decode is `383.66 t/s` for paged and
+`435.00 t/s` for vLLM, ratio `0.8820`. This means Phase47's `0.7912` h2h
+decode ratio and `0.5071` aggregate ratio include scheduler/admission and
+prefill-overlap/accounting effects beyond the real GPU-steady decode gap. Next
+GB10 code work should instrument batch composition/admission in
+`server_context::pre_decode()` before attempting another kernel shortcut.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -705,6 +718,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase48_readiness_harness_dryrun/20260701_100533` - harness dry-run proving configurable readiness budgets and clean preflight before retrying dense serving.
 - `~/bench/phase47_dense_serving_retry/20260701_100811` - completed dense serving snapshot after Phase48; pre/post md5 and op gates green; paged low-N decode ahead, high-N aggregate and TTFT behind.
 - `~/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138` - harness dry-run after scrubbing harness-owned `VLLM_*` variables from the `vllm serve` child environment.
+- `~/bench/phase50_dense_true_decode/20260701_103120` - dense graph-node difference-method profile at `npl=128`, `npp=128`; `build-cuda` pre/post md5 and op gates green; true decode paged `383.66 t/s`, vLLM `435.00 t/s`, ratio `0.8820`, pointing next at serving admission/scheduler tracing.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1186,6 +1186,36 @@ Decision: dense low-N decode remains a real paged strength, but dense serving
 still does not close GB10 parity because TTFT and high-concurrency aggregate
 throughput remain substantially behind vLLM.

+### Phase 50 dense true decode profile
+
+Phase 50 profiles dense `npl=128`, `npp=128` decode with graph nodes expanded and
+uses the difference method (`ntg=64 - ntg=16`) instead of the Phase47 h2h
+serving window. Artifact:
+`/home/mudler/bench/phase50_dense_true_decode/20260701_103120`.
+
+Pre/post inference gates stayed green on the profiled `build-cuda` binary set:
+MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
+`MUL_MAT_ID` `806/806`. A `build-phase36` pre-gate also passed, but
+`build-phase36` did not contain `llama-batched-bench`, so `build-cuda` is the
+profiled/gated build for this phase.
+
+Results:
+
+| engine | ntg16 wall s | ntg64 wall s | delta tokens | delta wall s | true decode t/s |
+|--------|--------------|--------------|--------------|--------------|-----------------|
+| paged | `5.754` | `21.768` | `6144` | `16.014` | `383.66` |
+| vLLM | `13.041` | `27.165` | `6144` | `14.124` | `435.00` |
+| ratio | | | | | `0.8820` |
+
+Decision: Phase47's dense high-N serving loss is not just a kernel-speed gap.
+True dense decode is still behind vLLM by about `12%`, but the Phase47 h2h
+decode ratio at `n=128` was `0.7912` and aggregate serving was only `0.5071`.
+The remaining difference points at scheduler/admission, prefill overlap, and
+TTFT accounting. The next implementation target should be an opt-in
+batch-composition/admission trace in `server_context::pre_decode()` before any
+new GDN/GEMM shortcut.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update