docs(paged): record dense true decode profile

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 08:55:23 +00:00
parent cd59e5d61f
commit c299dcd231
4 changed files with 523 additions and 0 deletions

@@ -2769,3 +2769,67 @@ Verification:
and a DGX dense dry-run with `VLLM_READY_ATTEMPTS=700`.
- DGX dry-run artifact:
`/home/mudler/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138`.
## Phase 50 Dense True Decode Profile
Phase 50 separates dense high-concurrency decode from the Phase47 h2h serving
window. The Phase47 h2h `decode_agg_tps` metric can count tokens generated by
early requests while later requests are still in prefill, then divide by a
window that starts at the last request's first token. That is useful serving
telemetry, but it is not a pure steady-state decode measurement.
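
To make the accounting concern concrete, here is a toy sketch. It assumes the windowed metric is "all generated tokens divided by the span from the last first-token to the last finish", which is only a reading of the description above and not the harness's actual formula. The `Request` type, the uniform token spacing, and the two toy requests are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical per-request timings, in seconds from the start of the run.
    first_token_s: float  # arrival time of the first generated token
    finish_s: float       # arrival time of the last generated token
    n_tokens: int         # generated tokens, assumed evenly spaced in time


def window(reqs):
    """Window that starts at the last first-token and ends at the last finish."""
    return max(r.first_token_s for r in reqs), max(r.finish_s for r in reqs)


def windowed_decode_tps(reqs):
    """Assumed serving-style metric: every generated token over the window."""
    start, end = window(reqs)
    return sum(r.n_tokens for r in reqs) / (end - start)


def in_window_decode_tps(reqs):
    """Only the tokens actually emitted inside the same window."""
    start, end = window(reqs)
    total = 0.0
    for r in reqs:
        span = r.finish_s - r.first_token_s
        overlap = max(0.0, min(r.finish_s, end) - max(r.first_token_s, start))
        total += r.n_tokens * overlap / span
    return total / (end - start)


# Request A starts decoding while request B is still in prefill.
toy_requests = [
    Request(first_token_s=1.0, finish_s=9.0, n_tokens=64),
    Request(first_token_s=5.0, finish_s=13.0, n_tokens=64),
]

print(windowed_decode_tps(toy_requests), in_window_decode_tps(toy_requests))
```

In this toy case half of request A's tokens land before the window opens, so the two numbers disagree. How large the effect is for a real engine depends on how staggered its first tokens are, which is why a serving-window ratio and a steady-decode ratio need not match.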
Artifact:
- `/home/mudler/bench/phase50_dense_true_decode/20260701_103120`
Preflight:
- Docker containers: `0`
- `local-ai-worker`: `0`
- GPU compute apps: `0`
- GPU: `NVIDIA GB10`, driver `580.159.03`
Inference gates:
| phase | build | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|-------|---------|-----------|-----------|--------------|
| pre | `build-phase36` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| pre | `build-cuda` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `build-cuda` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
`build-phase36/bin` had the completion and op-test binaries but not
`llama-batched-bench`, so the actual profiled llama.cpp decode binary came from
`~/llama-phase6-source/build-cuda/bin`. That build was gated before and after
the profile.
Profile method:
- Shape: dense Qwen3.5, `npl=128`, `npp=128`, `ntg=16` and `ntg=64`.
- Paged command: `llama-batched-bench` with `LLAMA_KV_PAGED=1`,
`LLAMA_MOE_FORCE_GRAPHS=1`, `-c 131072 -b 2048 -ub 512 -ngl 99 -fa on`.
- vLLM command: in-process `LLM.generate`, `max_model_len=4096`,
`max_num_seqs=256`, `gpu_memory_utilization=0.85`, prefix caching disabled.
- Both profiles used `nsys --cuda-graph-trace=node`.
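- Difference method: `(ntg64 tokens - ntg16 tokens) / (ntg64 wall - ntg16 wall)`.
  Subtracting the two runs cancels the fixed cost they share (prefill and
  setup), leaving only the extra `128 * (64 - 16) = 6144` decode tokens and
  the wall time spent generating them.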
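
The difference method can be checked with a few lines of arithmetic. This is a minimal sketch, not the harness code: the function name `true_decode_tps` is hypothetical, and the wall times are copied from the Results table in this section.

```python
def true_decode_tps(npl, ntg_short, ntg_long, wall_short_s, wall_long_s):
    """Decode throughput from two runs that differ only in generated length.

    Costs common to both runs (prefill, setup) cancel in the subtraction.
    """
    delta_tokens = npl * (ntg_long - ntg_short)
    delta_wall_s = wall_long_s - wall_short_s
    if delta_wall_s <= 0:
        raise ValueError("longer run must take more wall time than shorter run")
    return delta_tokens / delta_wall_s


# npl=128, ntg=16 vs ntg=64; wall times from the Results table.
paged_tps = true_decode_tps(128, 16, 64, 5.754, 21.768)
vllm_tps = true_decode_tps(128, 16, 64, 13.041, 27.165)
ratio = paged_tps / vllm_tps

print(f"paged {paged_tps:.2f} t/s, vLLM {vllm_tps:.2f} t/s, ratio {ratio:.4f}")
```

With the recorded wall times this reproduces the table values to the printed precision.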
Results:
| engine | ntg16 wall s | ntg64 wall s | delta tokens | delta wall s | true decode t/s |
|--------|--------------|--------------|--------------|--------------|-----------------|
| paged | `5.754` | `21.768` | `6144` | `16.014` | `383.66` |
| vLLM | `13.041` | `27.165` | `6144` | `14.124` | `435.00` |
| ratio | | | | | `0.8820` |
Interpretation:
- Dense true decode at `n=128` is about `88.2%` of vLLM, not the `79.1%`
implied by Phase47 h2h aggregate decode. The Phase47 serving window therefore
includes real scheduler/accounting effects in addition to GPU decode speed.
- There is still a real dense GPU-steady decode gap of about `12%`, but it is
not large enough to explain Phase47 aggregate serving (`50.7%` of vLLM) or
TTFT (`3.20x` vLLM) by itself.
- The next low-conflict code phase should add an opt-in serving
batch-composition/admission trace around `server_context::pre_decode()` to
measure decode tokens admitted, prompt tokens admitted, waiting prompt slots,
graph reuse, and prefill starvation. Do not start with another GDN or GEMM
rewrite unless that trace rules the scheduler out.
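
As a sketch of how such a trace might be summarized offline, assume each decode step emits one record. Every name below (`BatchTrace`, its fields, `summarize`) is hypothetical and illustrative only. Nothing here exists in the server today, and "starved" is just one possible proxy: a step where prompt slots were waiting but no prompt tokens were admitted.

```python
from dataclasses import dataclass


@dataclass
class BatchTrace:
    # Hypothetical per-step record; field names are illustrative only.
    decode_tokens: int         # decode tokens admitted into this batch
    prompt_tokens: int         # prompt (prefill) tokens admitted into this batch
    waiting_prompt_slots: int  # slots still holding unprocessed prompt tokens
    graph_reused: bool         # whether the previous compute graph was reused


def summarize(steps):
    """Aggregate a list of BatchTrace records into a few scheduler indicators."""
    n = len(steps)
    total_tokens = sum(s.decode_tokens + s.prompt_tokens for s in steps)
    decode_tokens = sum(s.decode_tokens for s in steps)
    starved = sum(
        1 for s in steps if s.waiting_prompt_slots > 0 and s.prompt_tokens == 0
    )
    reused = sum(1 for s in steps if s.graph_reused)
    return {
        "decode_token_share": decode_tokens / total_tokens if total_tokens else 0.0,
        "graph_reuse_rate": reused / n if n else 0.0,
        "prefill_starved_steps": starved,
    }


# Toy trace, not measured data.
toy_steps = [
    BatchTrace(128, 0, 0, True),
    BatchTrace(96, 32, 4, False),
    BatchTrace(128, 0, 4, True),
    BatchTrace(64, 64, 0, False),
]
summary = summarize(toy_steps)
print(summary)
```

A low graph reuse rate or many starved steps in a real trace would point at the scheduler; clean numbers would shift attention back to kernels.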

@@ -613,6 +613,19 @@ Phase 49 removes vLLM log noise from harness-owned environment variables. The
preserving intentional vLLM runtime variables such as `VLLM_LOGGING_LEVEL`. Dry
run: `/home/mudler/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138`.
Phase 50 resolves the dense high-N decode-accounting question with a graph-node
difference-method profile. Artifact:
`/home/mudler/bench/phase50_dense_true_decode/20260701_103120`. Pre/post
inference gates on the profiled `build-cuda` binary stayed green: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
`806/806`. Dense `npl=128`, `npp=128` true decode is `383.66 t/s` for paged and
`435.00 t/s` for vLLM, ratio `0.8820`. This means Phase47's `0.7912` h2h
decode ratio and `0.5071` aggregate ratio include scheduler/admission and
prefill-overlap/accounting effects beyond the real GPU-steady decode gap. Next
GB10 code work should instrument batch composition/admission in
`server_context::pre_decode()` before attempting another kernel shortcut.
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -705,6 +718,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase48_readiness_harness_dryrun/20260701_100533` - harness dry-run proving configurable readiness budgets and clean preflight before retrying dense serving.
- `~/bench/phase47_dense_serving_retry/20260701_100811` - completed dense serving snapshot after Phase48; pre/post md5 and op gates green; paged low-N decode ahead, high-N aggregate and TTFT behind.
- `~/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138` - harness dry-run after scrubbing harness-owned `VLLM_*` variables from the `vllm serve` child environment.
- `~/bench/phase50_dense_true_decode/20260701_103120` - dense graph-node difference-method profile at `npl=128`, `npp=128`; `build-cuda` pre/post md5 and op gates green; true decode paged `383.66 t/s`, vLLM `435.00 t/s`, ratio `0.8820`, pointing next at serving admission/scheduler tracing.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.

@@ -1186,6 +1186,36 @@ Decision: dense low-N decode remains a real paged strength, but dense serving
still does not close GB10 parity because TTFT and high-concurrency aggregate
throughput remain substantially behind vLLM.
### Phase 50 dense true decode profile
Phase 50 profiles dense `npl=128`, `npp=128` decode with graph nodes expanded and
uses the difference method (`ntg=64 - ntg=16`) instead of the Phase47 h2h
serving window. Artifact:
`/home/mudler/bench/phase50_dense_true_decode/20260701_103120`.
Pre/post inference gates stayed green on the profiled `build-cuda` binary set:
MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
`MUL_MAT_ID` `806/806`. A `build-phase36` pre-gate also passed, but
`build-phase36` did not contain `llama-batched-bench`, so `build-cuda` is the
profiled/gated build for this phase.
Results:
| engine | ntg16 wall s | ntg64 wall s | delta tokens | delta wall s | true decode t/s |
|--------|--------------|--------------|--------------|--------------|-----------------|
| paged | `5.754` | `21.768` | `6144` | `16.014` | `383.66` |
| vLLM | `13.041` | `27.165` | `6144` | `14.124` | `435.00` |
| ratio | | | | | `0.8820` |
Decision: Phase47's dense high-N serving loss is not just a kernel-speed gap.
True dense decode is still behind vLLM by about `12%`, but the Phase47 h2h
decode ratio at `n=128` was `0.7912` and aggregate serving was only `0.5071`.
The remaining difference points at scheduler/admission, prefill overlap, and
TTFT accounting. The next implementation target should be an opt-in
batch-composition/admission trace in `server_context::pre_decode()` before any
new GDN/GEMM shortcut.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update