docs(paged): record dense true decode profile

Assisted-by: Codex:gpt-5
This commit is contained in:
Ettore Di Giacinto
2026-07-01 08:55:23 +00:00
parent cd59e5d61f
commit c299dcd231
4 changed files with 523 additions and 0 deletions

View File

@@ -2769,3 +2769,67 @@ Verification:
and a DGX dense dry-run with `VLLM_READY_ATTEMPTS=700`.
- DGX dry-run artifact:
`/home/mudler/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138`.
## Phase 50 Dense True Decode Profile
Phase 50 separates dense high-concurrency decode from the Phase47 h2h serving
window. The Phase47 h2h `decode_agg_tps` metric can count tokens generated by
early requests while later requests are still in prefill, then divide by a
window that starts at the last first-token. That is useful serving telemetry,
but it is not a pure steady-decode measurement.
Artifact:
- `/home/mudler/bench/phase50_dense_true_decode/20260701_103120`
Preflight:
- Docker containers: `0`
- `local-ai-worker`: `0`
- GPU compute apps: `0`
- GPU: `NVIDIA GB10`, driver `580.159.03`
Inference gates:
| phase | build | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|-------|---------|-----------|-----------|--------------|
| pre | `build-phase36` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| pre | `build-cuda` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `build-cuda` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
`build-phase36/bin` had the completion and op-test binaries but not
`llama-batched-bench`, so the actual profiled llama.cpp decode binary came from
`~/llama-phase6-source/build-cuda/bin`. That build was gated before and after
the profile.
Profile method:
- Shape: dense Qwen3.5, `npl=128`, `npp=128`, `ntg=16` and `ntg=64`.
- Paged command: `llama-batched-bench` with `LLAMA_KV_PAGED=1`,
`LLAMA_MOE_FORCE_GRAPHS=1`, `-c 131072 -b 2048 -ub 512 -ngl 99 -fa on`.
- vLLM command: in-process `LLM.generate`, `max_model_len=4096`,
`max_num_seqs=256`, `gpu_memory_utilization=0.85`, prefix caching disabled.
- Both profiles used `nsys --cuda-graph-trace=node`.
- Difference method: `(ntg64 tokens - ntg16 tokens) / (ntg64 wall - ntg16 wall)`.
Results:
| engine | ntg16 wall s | ntg64 wall s | delta tokens | delta wall s | true decode t/s |
|--------|--------------|--------------|--------------|--------------|-----------------|
| paged | `5.754` | `21.768` | `6144` | `16.014` | `383.66` |
| vLLM | `13.041` | `27.165` | `6144` | `14.124` | `435.00` |
| ratio | | | | | `0.8820` |
Interpretation:
- Dense true decode at `n=128` is about `88.2%` of vLLM, not the `79.1%`
implied by Phase47 h2h aggregate decode. The Phase47 serving window therefore
includes real scheduler/accounting effects in addition to GPU decode speed.
- There is still a real dense GPU-steady decode gap of about `12%`, but it is
not large enough to explain Phase47 aggregate serving (`50.7%` of vLLM) or
TTFT (`3.20x` vLLM) by itself.
- The next low-conflict code phase should add an opt-in serving
batch-composition/admission trace around `server_context::pre_decode()` to
measure decode tokens admitted, prompt tokens admitted, waiting prompt slots,
graph reuse, and prefill starvation. Do not start with another GDN or GEMM
rewrite unless that trace rules the scheduler out.

View File

@@ -613,6 +613,19 @@ Phase 49 removes vLLM log noise from harness-owned environment variables. The
preserving intentional vLLM runtime variables such as `VLLM_LOGGING_LEVEL`. Dry
run: `/home/mudler/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138`.
Phase 50 resolves the dense high-N decode-accounting question with a graph-node
difference-method profile. Artifact:
`/home/mudler/bench/phase50_dense_true_decode/20260701_103120`. Pre/post
inference gates on the profiled `build-cuda` binary stayed green: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
`806/806`. Dense `npl=128`, `npp=128` true decode is `383.66 t/s` for paged and
`435.00 t/s` for vLLM, ratio `0.8820`. This means Phase47's `0.7912` h2h
decode ratio and `0.5071` aggregate ratio include scheduler/admission and
prefill-overlap/accounting effects beyond the real GPU-steady decode gap. Next
GB10 code work should instrument batch composition/admission in
`server_context::pre_decode()` before attempting another kernel shortcut.
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -705,6 +718,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase48_readiness_harness_dryrun/20260701_100533` - harness dry-run proving configurable readiness budgets and clean preflight before retrying dense serving.
- `~/bench/phase47_dense_serving_retry/20260701_100811` - completed dense serving snapshot after Phase48; pre/post md5 and op gates green; paged low-N decode ahead, high-N aggregate and TTFT behind.
- `~/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138` - harness dry-run after scrubbing harness-owned `VLLM_*` variables from the `vllm serve` child environment.
- `~/bench/phase50_dense_true_decode/20260701_103120` - dense graph-node difference-method profile at `npl=128`, `npp=128`; `build-cuda` pre/post md5 and op gates green; true decode paged `383.66 t/s`, vLLM `435.00 t/s`, ratio `0.8820`, pointing next at serving admission/scheduler tracing.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.

View File

@@ -1186,6 +1186,36 @@ Decision: dense low-N decode remains a real paged strength, but dense serving
still does not close GB10 parity because TTFT and high-concurrency aggregate
throughput remain substantially behind vLLM.
### Phase 50 dense true decode profile
Phase50 profiles dense `npl=128`, `npp=128` decode with graph nodes expanded and
uses the difference method (`ntg=64 - ntg=16`) instead of the Phase47 h2h
serving window. Artifact:
`/home/mudler/bench/phase50_dense_true_decode/20260701_103120`.
Pre/post inference gates stayed green on the profiled `build-cuda` binary set:
MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
`MUL_MAT_ID` `806/806`. A `build-phase36` pre-gate also passed, but
`build-phase36` did not contain `llama-batched-bench`, so `build-cuda` is the
profiled/gated build for this phase.
Results:
| engine | ntg16 wall s | ntg64 wall s | delta tokens | delta wall s | true decode t/s |
|--------|--------------|--------------|--------------|--------------|-----------------|
| paged | `5.754` | `21.768` | `6144` | `16.014` | `383.66` |
| vLLM | `13.041` | `27.165` | `6144` | `14.124` | `435.00` |
| ratio | | | | | `0.8820` |
Decision: Phase47's dense high-N serving loss is not just a kernel-speed gap.
True dense decode is still behind vLLM by about `12%`, but the Phase47 h2h
decode ratio at `n=128` was `0.7912` and aggregate serving was only `0.5071`.
The remaining difference points at scheduler/admission, prefill overlap, and
TTFT accounting. Next implementation target should be an opt-in
batch-composition/admission trace in `server_context::pre_decode()` before any
new GDN/GEMM shortcut.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update

View File

@@ -0,0 +1,415 @@
# Phase50 Dense True Decode Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Measure dense Qwen3.5 true steady decode on GB10 for paged llama.cpp and vLLM, separated from h2h TTFT/prefill-overlap accounting, while proving inference output and touched backend ops remain unchanged before and after the run.
**Architecture:** Do not change inference code. Run canonical pre/post paged inference gates, then collect graph-node-traced nsys profiles for dense paged llama.cpp and dense vLLM using the difference method: `ntg=64 - ntg=16` at the same `npl=128`, `npp=128` shape. Record the result in the parity docs and keep the next code target limited to scheduler/admission tracing only if true steady decode does not explain the Phase47 high-N serving gap.
**Tech Stack:** DGX GB10 over `ssh dgx.casa`, llama.cpp fork build in `~/llama-phase6-source/build-cuda` for `llama-batched-bench`, vLLM 0.23.0 in `~/vllm-bench`, `nsys --cuda-graph-trace=node`, LocalAI parity docs.
---
### Task 1: Confirm DGX is idle and acquire an artifact directory
**Files:**
- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/preflight.txt`
- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/hardware.txt`
- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/run.log`
- Modify later: `docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md`
- [x] **Step 1: Check the DGX preflight**
Run:
```bash
ssh dgx.casa 'set -euo pipefail
ART=$HOME/bench/phase50_dense_true_decode/$(date +%Y%m%d_%H%M%S)
mkdir -p "$ART"
{
printf "docker="; docker ps -q | wc -l
printf "local_ai_worker="; docker ps --format "{{.Names}}" | grep -c local-ai-worker || true
printf "compute="; nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l
printf "owner="; if [ -f "$HOME/gpu_bench_lock/owner" ]; then cat "$HOME/gpu_bench_lock/owner"; else echo FREE-no-lock-file; fi
} | tee "$ART/preflight.txt"
nvidia-smi -L | tee "$ART/hardware.txt"
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader | tee -a "$ART/hardware.txt"
echo "$ART"'
```
Expected: `docker=0`, `local_ai_worker=0`, `compute=0`, and `owner=FREE...`.
- [x] **Step 2: Acquire the owner-file lock**
Run with `ART` set to the printed artifact directory:
```bash
ssh dgx.casa 'set -euo pipefail
ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
mkdir -p "$HOME/gpu_bench_lock"
echo "codex-phase50-dense-true-decode $(date +%s)" > "$HOME/gpu_bench_lock/owner"
cat "$HOME/gpu_bench_lock/owner" | tee -a "$ART/run.log"'
```
Expected: owner starts with `codex-phase50-dense-true-decode`.
Actual artifact: `/home/mudler/bench/phase50_dense_true_decode/20260701_103120`.
Preflight was clean: docker `0`, `local-ai-worker` `0`, compute `0`, owner
`FREE released-by-codex-current-serving-snapshot 1782893824`.
### Task 2: Run pre-profile inference gates
**Files:**
- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/gate_pre/`
- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/gate_pre.log`
- [x] **Step 1: Run the canonical paged gate helper**
Run:
```bash
ssh dgx.casa 'set -euo pipefail
ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
BIN="$HOME/llama-phase6-source/build-phase36/bin" \
ART="$ART/gate_pre" \
OPS=MUL_MAT,MUL_MAT_ID \
"$HOME/paged-inference-gates.sh" 2>&1 | tee "$ART/gate_pre.log"'
```
Expected:
```text
paged inference gates OK
```
Required values:
- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
- `MUL_MAT`: `1146/1146`
- `MUL_MAT_ID`: `806/806`
Actual: `build-phase36` pre-gate passed, then `build-cuda` pre-gate also
passed because `build-phase36/bin` does not contain `llama-batched-bench`.
The profiled binary set is therefore `~/llama-phase6-source/build-cuda/bin`.
### Task 3: Profile dense paged llama.cpp true decode
**Files:**
- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/paged_dense_n128_ntg16.nsys-rep`
- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/paged_dense_n128_ntg16.bench.log`
- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/paged_dense_n128_ntg64.nsys-rep`
- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/paged_dense_n128_ntg64.bench.log`
- [x] **Step 1: Run ntg=16 graph-node profile**
Run:
```bash
ssh dgx.casa 'set -euo pipefail
ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
BIN="$HOME/llama-phase6-source/build-cuda/bin/llama-batched-bench"
MODEL="$HOME/bench/q36-27b-nvfp4.gguf"
REP="$ART/paged_dense_n128_ntg16"
rm -f "$REP.nsys-rep" "$REP.sqlite"
nsys profile --cuda-graph-trace=node --trace=cuda,nvtx --sample=none --cpuctxsw=none \
--force-overwrite=true -o "$REP" \
env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 \
"$BIN" -m "$MODEL" -c 131072 -b 2048 -ub 512 -ngl 99 -fa on \
-npp 128 -ntg 16 -npl 128 > "$REP.bench.log" 2>&1
grep -E "model|\\| *128|llama_perf|error|Error|Traceback" "$REP.bench.log" | tail -40'
```
Expected: command exits 0 and writes `paged_dense_n128_ntg16.nsys-rep`.
Actual: `T_TG=5.754s`, `S_TG=355.93 t/s`; report
`paged_dense_n128_ntg16.nsys-rep` written.
- [x] **Step 2: Run ntg=64 graph-node profile**
Run:
```bash
ssh dgx.casa 'set -euo pipefail
ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
BIN="$HOME/llama-phase6-source/build-cuda/bin/llama-batched-bench"
MODEL="$HOME/bench/q36-27b-nvfp4.gguf"
REP="$ART/paged_dense_n128_ntg64"
rm -f "$REP.nsys-rep" "$REP.sqlite"
nsys profile --cuda-graph-trace=node --trace=cuda,nvtx --sample=none --cpuctxsw=none \
--force-overwrite=true -o "$REP" \
env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 \
"$BIN" -m "$MODEL" -c 131072 -b 2048 -ub 512 -ngl 99 -fa on \
-npp 128 -ntg 64 -npl 128 > "$REP.bench.log" 2>&1
grep -E "model|\\| *128|llama_perf|error|Error|Traceback" "$REP.bench.log" | tail -40'
```
Expected: command exits 0 and writes `paged_dense_n128_ntg64.nsys-rep`.
Actual: `T_TG=21.768s`, `S_TG=376.33 t/s`; report
`paged_dense_n128_ntg64.nsys-rep` written.
### Task 4: Profile dense vLLM true decode
**Files:**
- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/vllm_dense_decode_prof.py`
- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/vllm_dense_n128_ntg16.nsys-rep`
- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/vllm_dense_n128_ntg16.run.log`
- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/vllm_dense_n128_ntg64.nsys-rep`
- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/vllm_dense_n128_ntg64.run.log`
- [x] **Step 1: Write the vLLM dense profile driver**
Run:
```bash
ssh dgx.casa 'set -euo pipefail
ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
cat > "$ART/vllm_dense_decode_prof.py" <<'"'"'PY'"'"'
import os, time, torch
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["VLLM_LOGGING_LEVEL"] = "WARNING"
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt
MODEL = os.environ.get("MODEL", "/home/mudler/bench/q36-27b-nvfp4-vllm")
NSEQ = int(os.environ.get("NSEQ", "128"))
PROMPT_TOKS = int(os.environ.get("PT", "128"))
GEN = int(os.environ.get("GEN", "64"))
llm = LLM(
model=MODEL,
enforce_eager=False,
max_model_len=4096,
gpu_memory_utilization=0.85,
max_num_seqs=256,
tensor_parallel_size=1,
enable_prefix_caching=False,
disable_log_stats=True,
)
prompts = [
TokensPrompt(prompt_token_ids=[1000 + (i * 7 + j * 13) % 30000 for j in range(PROMPT_TOKS)])
for i in range(NSEQ)
]
sp = SamplingParams(temperature=0.0, max_tokens=GEN, ignore_eos=True, min_tokens=GEN)
print(f"dense vLLM NSEQ={NSEQ} PT={PROMPT_TOKS} GEN={GEN} warmup...", flush=True)
llm.generate(prompts, sp, use_tqdm=False)
torch.cuda.synchronize()
print("PROFILED GENERATE START", flush=True)
torch.cuda.cudart().cudaProfilerStart()
t0 = time.time()
outs = llm.generate(prompts, sp, use_tqdm=False)
torch.cuda.synchronize()
t1 = time.time()
torch.cuda.cudart().cudaProfilerStop()
ntok = sum(len(o.outputs[0].token_ids) for o in outs)
print(f"PROFILED END seqs={len(outs)} gen_tok={ntok} wall={t1-t0:.3f}s tok/s={ntok/(t1-t0):.1f} incl_prefill", flush=True)
PY'
```
Expected: `vllm_dense_decode_prof.py` exists in the artifact directory.
Actual: used an equivalent self-contained `python -c` target under nsys instead
of writing a DGX source script. No inference code or repo file was changed.
- [x] **Step 2: Run ntg=16 graph-node profile**
Run:
```bash
ssh dgx.casa 'set -euo pipefail
ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
REP="$ART/vllm_dense_n128_ntg16"
rm -f "$REP.nsys-rep" "$REP.sqlite"
PATH="$HOME/vllm-bench/bin:$PATH" HF_HUB_OFFLINE=1 NSEQ=128 PT=128 GEN=16 \
nsys profile --cuda-graph-trace=node --capture-range=cudaProfilerApi --capture-range-end=stop \
--trace=cuda --sample=none --cpuctxsw=none --force-overwrite=true -o "$REP" \
"$HOME/vllm-bench/bin/python" "$ART/vllm_dense_decode_prof.py" > "$REP.run.log" 2>&1
grep -E "PROFILED|Error|error|Traceback" "$REP.run.log" | tail -20'
```
Expected: command exits 0 and writes `vllm_dense_n128_ntg16.nsys-rep`.
Actual: profiled generate `2048` tokens in `13.041s`; report
`vllm_dense_n128_ntg16.nsys-rep` written.
- [x] **Step 3: Run ntg=64 graph-node profile**
Run:
```bash
ssh dgx.casa 'set -euo pipefail
ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
REP="$ART/vllm_dense_n128_ntg64"
rm -f "$REP.nsys-rep" "$REP.sqlite"
PATH="$HOME/vllm-bench/bin:$PATH" HF_HUB_OFFLINE=1 NSEQ=128 PT=128 GEN=64 \
nsys profile --cuda-graph-trace=node --capture-range=cudaProfilerApi --capture-range-end=stop \
--trace=cuda --sample=none --cpuctxsw=none --force-overwrite=true -o "$REP" \
"$HOME/vllm-bench/bin/python" "$ART/vllm_dense_decode_prof.py" > "$REP.run.log" 2>&1
grep -E "PROFILED|Error|error|Traceback" "$REP.run.log" | tail -20'
```
Expected: command exits 0 and writes `vllm_dense_n128_ntg64.nsys-rep`.
Actual: profiled generate `8192` tokens in `27.165s`; report
`vllm_dense_n128_ntg64.nsys-rep` written.
### Task 5: Compute the difference-method summary
**Files:**
- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/summary.tsv`
- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/profile_files.txt`
- [x] **Step 1: Parse paged and vLLM throughput rows**
Run:
```bash
ssh dgx.casa 'set -euo pipefail
ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
python3 - "$ART" <<'"'"'PY'"'"'
import pathlib, re, sys
art = pathlib.Path(sys.argv[1])
def paged_ttg(name):
text = (art / f"{name}.bench.log").read_text(errors="replace")
rows = [line for line in text.splitlines() if "| 128 |" in line or "| 128 |" in line]
if not rows:
rows = [line for line in text.splitlines() if re.search(r"\|\s*128\s*\|", line)]
if not rows:
raise SystemExit(f"missing paged row in {name}.bench.log")
parts = [p.strip() for p in rows[-1].split("|") if p.strip()]
# columns: PP, TG, B, N_KV, T_PP, S_PP, T_TG, S_TG, T, S
return float(parts[6]), float(parts[7])
def vllm_wall(name):
text = (art / f"{name}.run.log").read_text(errors="replace")
m = re.search(r"PROFILED END seqs=(\d+) gen_tok=(\d+) wall=([0-9.]+)s", text)
if not m:
raise SystemExit(f"missing vLLM PROFILED END in {name}.run.log")
return int(m.group(1)), int(m.group(2)), float(m.group(3))
p16_ttg, p16_stg = paged_ttg("paged_dense_n128_ntg16")
p64_ttg, p64_stg = paged_ttg("paged_dense_n128_ntg64")
v16_seq, v16_tok, v16_wall = vllm_wall("vllm_dense_n128_ntg16")
v64_seq, v64_tok, v64_wall = vllm_wall("vllm_dense_n128_ntg64")
paged_delta_tokens = 128 * (64 - 16)
paged_delta_wall = p64_ttg - p16_ttg
vllm_delta_tokens = v64_tok - v16_tok
vllm_delta_wall = v64_wall - v16_wall
paged_decode = paged_delta_tokens / paged_delta_wall
vllm_decode = vllm_delta_tokens / vllm_delta_wall
with (art / "summary.tsv").open("w") as f:
f.write("engine\tshape\tntg16_wall_s\tntg64_wall_s\tdelta_tokens\tdelta_wall_s\ttrue_decode_tps\n")
f.write(f"paged\tdense_n128_pt128\t{p16_ttg:.3f}\t{p64_ttg:.3f}\t{paged_delta_tokens}\t{paged_delta_wall:.3f}\t{paged_decode:.2f}\n")
f.write(f"vllm\tdense_n128_pt128\t{v16_wall:.3f}\t{v64_wall:.3f}\t{vllm_delta_tokens}\t{vllm_delta_wall:.3f}\t{vllm_decode:.2f}\n")
f.write(f"ratio\tpaged_over_vllm\t\t\t\t\t{paged_decode / vllm_decode:.4f}\n")
print((art / "summary.tsv").read_text())
PY
ls -1 "$ART"/*.nsys-rep "$ART"/*.log > "$ART/profile_files.txt"'
```
Expected: `summary.tsv` contains `paged`, `vllm`, and `ratio` rows.
Actual:
| engine | ntg16 wall s | ntg64 wall s | delta tokens | delta wall s | true decode t/s |
|--------|--------------|--------------|--------------|--------------|-----------------|
| paged | `5.754` | `21.768` | `6144` | `16.014` | `383.66` |
| vLLM | `13.041` | `27.165` | `6144` | `14.124` | `435.00` |
| ratio | | | | | `0.8820` |
### Task 6: Run post-profile inference gates and release DGX
**Files:**
- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/gate_post/`
- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/gate_post.log`
- Modify later: `docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md`
- [x] **Step 1: Run the canonical paged gate helper again**
Run:
```bash
ssh dgx.casa 'set -euo pipefail
ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
BIN="$HOME/llama-phase6-source/build-cuda/bin" \
ART="$ART/gate_post" \
OPS=MUL_MAT,MUL_MAT_ID \
"$HOME/paged-inference-gates.sh" 2>&1 | tee "$ART/gate_post.log"'
```
Expected:
```text
paged inference gates OK
```
Actual: `build-cuda` post-gate passed with MoE md5
`8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
`MUL_MAT_ID` `806/806`.
- [x] **Step 2: Release the owner-file lock**
Run:
```bash
ssh dgx.casa 'set -euo pipefail
ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
echo "FREE released-by-codex-phase50-dense-true-decode $(date +%s)" > "$HOME/gpu_bench_lock/owner"
cat "$HOME/gpu_bench_lock/owner" | tee -a "$ART/run.log"'
```
Expected: owner starts with `FREE released-by-codex-phase50-dense-true-decode`.
Actual: owner `FREE released-by-codex-phase50-dense-true-decode 1782895927`;
docker `0`, `local-ai-worker` `0`, compute `0`.
### Task 7: Record the result and choose the next code target
**Files:**
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
- Modify: `docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md`
- [x] **Step 1: Mark completed plan steps**
Update every completed checkbox in this file. Leave failed or skipped steps unchecked and add a short note with the artifact path and failure reason.
- [x] **Step 2: Add the Phase50 result to the parity docs**
Record:
- artifact directory
- preflight result
- pre/post gate md5 and op-count values
- paged true decode, vLLM true decode, and ratio from `summary.tsv`
- whether Phase47 high-N serving loss is a true GPU decode gap or mostly scheduler/accounting
Actual: recorded the artifact, preflight, gates, true-decode table, and
decision in `GB10_PARITY_PHASE0_RESULTS.md`, `VLLM_PARITY_LEVER_MAP.md`, and
`PARITY_HANDOFF.md`. Interpretation: a real dense decode gap remains, but it is
about `12%`; the larger Phase47 high-N serving loss points at
scheduler/admission and prefill-overlap/accounting.
- [x] **Step 3: Commit the documentation-only result**
Run:
```bash
git status --short
git add docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md \
backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
git commit -m "docs(paged): record dense true decode profile" -m "Assisted-by: Codex:gpt-5"
```
Expected: commit succeeds and `.claude/` remains the only unrelated untracked path.
## Self-Review
- Spec coverage: covers inference safety via pre/post md5 and op checks, true steady decode via graph-node nsys difference method, and docs/plan phase tracking.
- Placeholder scan: no `TBD`, `TODO`, or unspecified test commands.
- Type consistency: the artifact path placeholder is consistently `REPLACE_WITH_TIMESTAMP`; replace it with the actual timestamp before running each command.