docs(paged): record dense true decode profile

Assisted-by: Codex:gpt-5
2026-07-02 20:37:03 -04:00 · 2026-07-01 08:55:23 +00:00
parent cd59e5d61f
commit c299dcd231
4 changed files with 523 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2769,3 +2769,67 @@ Verification:
  and a DGX dense dry-run with `VLLM_READY_ATTEMPTS=700`.
 - DGX dry-run artifact:
  `/home/mudler/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138`.
+
+## Phase 50 Dense True Decode Profile
+
+Phase 50 separates dense high-concurrency decode from the Phase47 h2h serving
+window. The Phase47 h2h `decode_agg_tps` metric can count tokens generated by
+early requests while later requests are still in prefill, then divide by a
+window that starts at the last first-token. That is useful serving telemetry,
+but it is not a pure steady-decode measurement.
+
+Artifact:
+
+- `/home/mudler/bench/phase50_dense_true_decode/20260701_103120`
+
+Preflight:
+
+- Docker containers: `0`
+- `local-ai-worker`: `0`
+- GPU compute apps: `0`
+- GPU: `NVIDIA GB10`, driver `580.159.03`
+
+Inference gates:
+
+| phase | build | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|-------|-------|---------|-----------|-----------|--------------|
+| pre | `build-phase36` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| pre | `build-cuda` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post | `build-cuda` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+`build-phase36/bin` had the completion and op-test binaries but not
+`llama-batched-bench`, so the actual profiled llama.cpp decode binary came from
+`~/llama-phase6-source/build-cuda/bin`. That build was gated before and after
+the profile.
+
+Profile method:
+
+- Shape: dense Qwen3.5, `npl=128`, `npp=128`, `ntg=16` and `ntg=64`.
+- Paged command: `llama-batched-bench` with `LLAMA_KV_PAGED=1`,
+  `LLAMA_MOE_FORCE_GRAPHS=1`, `-c 131072 -b 2048 -ub 512 -ngl 99 -fa on`.
+- vLLM command: in-process `LLM.generate`, `max_model_len=4096`,
+  `max_num_seqs=256`, `gpu_memory_utilization=0.85`, prefix caching disabled.
+- Both profiles used `nsys --cuda-graph-trace=node`.
+- Difference method: `(ntg64 tokens - ntg16 tokens) / (ntg64 wall - ntg16 wall)`; paged wall is `T_TG`, vLLM wall is the profiled `generate` wall.
+
+Results:
+
+| engine | ntg16 wall s | ntg64 wall s | delta tokens | delta wall s | true decode t/s |
+|--------|--------------|--------------|--------------|--------------|-----------------|
+| paged | `5.754` | `21.768` | `6144` | `16.014` | `383.66` |
+| vLLM | `13.041` | `27.165` | `6144` | `14.124` | `435.00` |
+| ratio | | | | | `0.8820` |
+
+Interpretation:
+
+- Dense true decode at `n=128` is about `88.2%` of vLLM, not the `79.1%`
+  implied by Phase47 h2h aggregate decode. The Phase47 serving window therefore
+  includes real scheduler/accounting effects in addition to GPU decode speed.
+- There is still a real dense GPU-steady decode gap of about `12%`, but it is
+  not large enough to explain Phase47 aggregate serving (`50.7%` of vLLM) or
+  TTFT (`3.20x` vLLM) by itself.
+- The next low-conflict code phase should add an opt-in serving
+  batch-composition/admission trace around `server_context::pre_decode()` to
+  measure decode tokens admitted, prompt tokens admitted, waiting prompt slots,
+  graph reuse, and prefill starvation. Do not start with another GDN or GEMM
+  rewrite unless that trace rules the scheduler out.
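
The difference-method arithmetic in the Phase 50 results table can be checked by hand. Below is a minimal sketch, not part of the committed docs, that recomputes the true decode rates and the ratio from the recorded wall times. The `Run` dataclass and `marginal_tps` helper are illustrative names, not taken from the harness, and the sketch assumes wall time grows linearly with generated tokens between `ntg=16` and `ntg=64`.

```python
from dataclasses import dataclass

NPL = 128  # parallel sequences, from the Phase 50 shape (npl=128)


@dataclass(frozen=True)
class Run:
    """One profiled run: tokens generated per sequence and measured time."""
    ntg: int
    wall_s: float

    @property
    def tokens(self) -> int:
        # Total generated tokens across all parallel sequences.
        return NPL * self.ntg

    @property
    def single_run_tps(self) -> float:
        # Naive rate for one run; still contains any fixed per-run cost.
        return self.tokens / self.wall_s


def marginal_tps(short: Run, long: Run) -> float:
    """Difference method: extra tokens divided by extra wall time.

    Any cost common to both runs (prefill, warm start, launch overhead)
    cancels in the subtraction, under the linearity assumption above.
    """
    return (long.tokens - short.tokens) / (long.wall_s - short.wall_s)


# Wall times copied from the results table (paged uses T_TG).
paged = (Run(16, 5.754), Run(64, 21.768))
vllm = (Run(16, 13.041), Run(64, 27.165))

paged_tps = marginal_tps(*paged)
vllm_tps = marginal_tps(*vllm)
ratio = paged_tps / vllm_tps

print(f"paged true decode: {paged_tps:.2f} t/s")
print(f"vLLM  true decode: {vllm_tps:.2f} t/s")
print(f"paged / vLLM:      {ratio:.4f}")
print(f"paged single-run S_TG: {paged[0].single_run_tps:.2f}, "
      f"{paged[1].single_run_tps:.2f} t/s")
```

The rounded values agree with the table (`383.66`, `435.00`, `0.8820`). The paged single-run rates reproduce the `S_TG` values in the bench logs (`355.93` and `376.33`), and both sit below the marginal rate, which is what one would expect if each run carries some fixed cost that the subtraction removes.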
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -613,6 +613,19 @@ Phase 49 removes vLLM log noise from harness-owned environment variables. The
 preserving intentional vLLM runtime variables such as `VLLM_LOGGING_LEVEL`. Dry
 run: `/home/mudler/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138`.

+Phase 50 resolves the dense high-N decode-accounting question with a graph-node
+difference-method profile. Artifact:
+`/home/mudler/bench/phase50_dense_true_decode/20260701_103120`. Pre/post
+inference gates on the profiled `build-cuda` binary stayed green: MoE
+`8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
+`806/806`. Dense `npl=128`, `npp=128` true decode is `383.66 t/s` for paged and
+`435.00 t/s` for vLLM, ratio `0.8820`. This means Phase47's `0.7912` h2h
+decode ratio and `0.5071` aggregate ratio include scheduler/admission and
+prefill-overlap/accounting effects beyond the real GPU-steady decode gap. Next
+GB10 code work should instrument batch composition/admission in
+`server_context::pre_decode()` before attempting another kernel shortcut.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -705,6 +718,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase48_readiness_harness_dryrun/20260701_100533` - harness dry-run proving configurable readiness budgets and clean preflight before retrying dense serving.
 - `~/bench/phase47_dense_serving_retry/20260701_100811` - completed dense serving snapshot after Phase48; pre/post md5 and op gates green; paged low-N decode ahead, high-N aggregate and TTFT behind.
 - `~/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138` - harness dry-run after scrubbing harness-owned `VLLM_*` variables from the `vllm serve` child environment.
+- `~/bench/phase50_dense_true_decode/20260701_103120` - dense graph-node difference-method profile at `npl=128`, `npp=128`; `build-cuda` pre/post md5 and op gates green; true decode paged `383.66 t/s`, vLLM `435.00 t/s`, ratio `0.8820`, pointing next at serving admission/scheduler tracing.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1186,6 +1186,36 @@ Decision: dense low-N decode remains a real paged strength, but dense serving
 still does not close GB10 parity because TTFT and high-concurrency aggregate
 throughput remain substantially behind vLLM.

+### Phase 50 dense true decode profile
+
+Phase50 profiles dense `npl=128`, `npp=128` decode with graph nodes expanded and
+uses the difference method (`ntg=64 - ntg=16`) instead of the Phase47 h2h
+serving window. Artifact:
+`/home/mudler/bench/phase50_dense_true_decode/20260701_103120`.
+
+Pre/post inference gates stayed green on the profiled `build-cuda` binary set:
+MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
+`MUL_MAT_ID` `806/806`. A `build-phase36` pre-gate also passed, but
+`build-phase36` did not contain `llama-batched-bench`, so `build-cuda` is the
+profiled/gated build for this phase.
+
+Results:
+
+| engine | ntg16 wall s | ntg64 wall s | delta tokens | delta wall s | true decode t/s |
+|--------|--------------|--------------|--------------|--------------|-----------------|
+| paged | `5.754` | `21.768` | `6144` | `16.014` | `383.66` |
+| vLLM | `13.041` | `27.165` | `6144` | `14.124` | `435.00` |
+| ratio | | | | | `0.8820` |
+
+Decision: Phase47's dense high-N serving loss is not just a kernel-speed gap.
+True dense decode is still behind vLLM by about `12%`, but the Phase47 h2h
+decode ratio at `n=128` was `0.7912` and aggregate serving was only `0.5071`.
+The remaining difference points at scheduler/admission, prefill overlap, and
+TTFT accounting. The next implementation target should be an opt-in
+batch-composition/admission trace in `server_context::pre_decode()` before any
+new GDN/GEMM shortcut.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update
--- a/docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md
+++ b/docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md
@@ -0,0 +1,415 @@
+# Phase50 Dense True Decode Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Measure dense Qwen3.5 true steady decode on GB10 for paged llama.cpp and vLLM, separated from h2h TTFT/prefill-overlap accounting, while proving inference output and touched backend ops remain unchanged before and after the run.
+
+**Architecture:** Do not change inference code. Run canonical pre/post paged inference gates, then collect graph-node-traced nsys profiles for dense paged llama.cpp and dense vLLM using the difference method: `ntg=64 - ntg=16` at the same `npl=128`, `npp=128` shape. Record the result in the parity docs and keep the next code target limited to scheduler/admission tracing only if true steady decode does not explain the Phase47 high-N serving gap.
+
+**Tech Stack:** DGX GB10 over `ssh dgx.casa`, llama.cpp fork build in `~/llama-phase6-source/build-cuda` for `llama-batched-bench`, vLLM 0.23.0 in `~/vllm-bench`, `nsys --cuda-graph-trace=node`, LocalAI parity docs.
+
+---
+
+### Task 1: Confirm DGX is idle and acquire an artifact directory
+
+**Files:**
+- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/preflight.txt`
+- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/hardware.txt`
+- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/run.log`
+- Modify later: `docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md`
+
+- [x] **Step 1: Check the DGX preflight**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART=$HOME/bench/phase50_dense_true_decode/$(date +%Y%m%d_%H%M%S)
+mkdir -p "$ART"
+{
+  printf "docker="; docker ps -q | wc -l
+  printf "local_ai_worker="; docker ps --format "{{.Names}}" | grep -c local-ai-worker || true
+  printf "compute="; nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l
+  printf "owner="; if [ -f "$HOME/gpu_bench_lock/owner" ]; then cat "$HOME/gpu_bench_lock/owner"; else echo FREE-no-lock-file; fi
+} | tee "$ART/preflight.txt"
+nvidia-smi -L | tee "$ART/hardware.txt"
+nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader | tee -a "$ART/hardware.txt"
+echo "$ART"'
+```
+
+Expected: `docker=0`, `local_ai_worker=0`, `compute=0`, and `owner=FREE...`.
+
+- [x] **Step 2: Acquire the owner-file lock**
+
+Run with `REPLACE_WITH_TIMESTAMP` replaced by the timestamp of the printed artifact directory:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
+mkdir -p "$HOME/gpu_bench_lock"
+echo "codex-phase50-dense-true-decode $(date +%s)" > "$HOME/gpu_bench_lock/owner"
+cat "$HOME/gpu_bench_lock/owner" | tee -a "$ART/run.log"'
+```
+
+Expected: owner starts with `codex-phase50-dense-true-decode`.
+
+Actual artifact: `/home/mudler/bench/phase50_dense_true_decode/20260701_103120`.
+Preflight was clean: docker `0`, `local-ai-worker` `0`, compute `0`, owner
+`FREE released-by-codex-current-serving-snapshot 1782893824`.
+
+### Task 2: Run pre-profile inference gates
+
+**Files:**
+- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/gate_pre/`
+- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/gate_pre.log`
+
+- [x] **Step 1: Run the canonical paged gate helper**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
+BIN="$HOME/llama-phase6-source/build-phase36/bin" \
+ART="$ART/gate_pre" \
+OPS=MUL_MAT,MUL_MAT_ID \
+  "$HOME/paged-inference-gates.sh" 2>&1 | tee "$ART/gate_pre.log"'
+```
+
+Expected:
+
+```text
+paged inference gates OK
+```
+
+Required values:
+- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+- `MUL_MAT`: `1146/1146`
+- `MUL_MAT_ID`: `806/806`
+
+Actual: `build-phase36` pre-gate passed. Because `build-phase36/bin` does not
+contain `llama-batched-bench`, the `build-cuda` pre-gate was also run and passed.
+The profiled binary set is therefore `~/llama-phase6-source/build-cuda/bin`.
+
+### Task 3: Profile dense paged llama.cpp true decode
+
+**Files:**
+- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/paged_dense_n128_ntg16.nsys-rep`
+- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/paged_dense_n128_ntg16.bench.log`
+- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/paged_dense_n128_ntg64.nsys-rep`
+- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/paged_dense_n128_ntg64.bench.log`
+
+- [x] **Step 1: Run ntg=16 graph-node profile**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
+BIN="$HOME/llama-phase6-source/build-cuda/bin/llama-batched-bench"
+MODEL="$HOME/bench/q36-27b-nvfp4.gguf"
+REP="$ART/paged_dense_n128_ntg16"
+rm -f "$REP.nsys-rep" "$REP.sqlite"
+nsys profile --cuda-graph-trace=node --trace=cuda,nvtx --sample=none --cpuctxsw=none \
+  --force-overwrite=true -o "$REP" \
+  env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 \
+  "$BIN" -m "$MODEL" -c 131072 -b 2048 -ub 512 -ngl 99 -fa on \
+    -npp 128 -ntg 16 -npl 128 > "$REP.bench.log" 2>&1
+grep -E "model|\\| *128|llama_perf|error|Error|Traceback" "$REP.bench.log" | tail -40'
+```
+
+Expected: command exits 0 and writes `paged_dense_n128_ntg16.nsys-rep`.
+
+Actual: `T_TG=5.754s`, `S_TG=355.93 t/s`; report
+`paged_dense_n128_ntg16.nsys-rep` written.
+
+- [x] **Step 2: Run ntg=64 graph-node profile**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
+BIN="$HOME/llama-phase6-source/build-cuda/bin/llama-batched-bench"
+MODEL="$HOME/bench/q36-27b-nvfp4.gguf"
+REP="$ART/paged_dense_n128_ntg64"
+rm -f "$REP.nsys-rep" "$REP.sqlite"
+nsys profile --cuda-graph-trace=node --trace=cuda,nvtx --sample=none --cpuctxsw=none \
+  --force-overwrite=true -o "$REP" \
+  env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 \
+  "$BIN" -m "$MODEL" -c 131072 -b 2048 -ub 512 -ngl 99 -fa on \
+    -npp 128 -ntg 64 -npl 128 > "$REP.bench.log" 2>&1
+grep -E "model|\\| *128|llama_perf|error|Error|Traceback" "$REP.bench.log" | tail -40'
+```
+
+Expected: command exits 0 and writes `paged_dense_n128_ntg64.nsys-rep`.
+
+Actual: `T_TG=21.768s`, `S_TG=376.33 t/s`; report
+`paged_dense_n128_ntg64.nsys-rep` written.
+
+### Task 4: Profile dense vLLM true decode
+
+**Files:**
+- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/vllm_dense_decode_prof.py`
+- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/vllm_dense_n128_ntg16.nsys-rep`
+- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/vllm_dense_n128_ntg16.run.log`
+- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/vllm_dense_n128_ntg64.nsys-rep`
+- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/vllm_dense_n128_ntg64.run.log`
+
+- [x] **Step 1: Write the vLLM dense profile driver**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
+cat > "$ART/vllm_dense_decode_prof.py" <<'"'"'PY'"'"'
+import os, time, torch
+os.environ["HF_HUB_OFFLINE"] = "1"
+os.environ["VLLM_LOGGING_LEVEL"] = "WARNING"
+os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
+from vllm import LLM, SamplingParams
+from vllm.inputs import TokensPrompt
+
+MODEL = os.environ.get("MODEL", "/home/mudler/bench/q36-27b-nvfp4-vllm")
+NSEQ = int(os.environ.get("NSEQ", "128"))
+PROMPT_TOKS = int(os.environ.get("PT", "128"))
+GEN = int(os.environ.get("GEN", "64"))
+
+llm = LLM(
+    model=MODEL,
+    enforce_eager=False,
+    max_model_len=4096,
+    gpu_memory_utilization=0.85,
+    max_num_seqs=256,
+    tensor_parallel_size=1,
+    enable_prefix_caching=False,
+    disable_log_stats=True,
+)
+prompts = [
+    TokensPrompt(prompt_token_ids=[1000 + (i * 7 + j * 13) % 30000 for j in range(PROMPT_TOKS)])
+    for i in range(NSEQ)
+]
+sp = SamplingParams(temperature=0.0, max_tokens=GEN, ignore_eos=True, min_tokens=GEN)
+print(f"dense vLLM NSEQ={NSEQ} PT={PROMPT_TOKS} GEN={GEN} warmup...", flush=True)
+llm.generate(prompts, sp, use_tqdm=False)
+torch.cuda.synchronize()
+print("PROFILED GENERATE START", flush=True)
+torch.cuda.cudart().cudaProfilerStart()
+t0 = time.time()
+outs = llm.generate(prompts, sp, use_tqdm=False)
+torch.cuda.synchronize()
+t1 = time.time()
+torch.cuda.cudart().cudaProfilerStop()
+ntok = sum(len(o.outputs[0].token_ids) for o in outs)
+print(f"PROFILED END seqs={len(outs)} gen_tok={ntok} wall={t1-t0:.3f}s tok/s={ntok/(t1-t0):.1f} incl_prefill", flush=True)
+PY'
+```
+
+Expected: `vllm_dense_decode_prof.py` exists in the artifact directory.
+
+Actual: used an equivalent self-contained `python -c` target under nsys instead
+of writing a DGX source script. No inference code or repo file was changed.
+
+- [x] **Step 2: Run ntg=16 graph-node profile**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
+REP="$ART/vllm_dense_n128_ntg16"
+rm -f "$REP.nsys-rep" "$REP.sqlite"
+PATH="$HOME/vllm-bench/bin:$PATH" HF_HUB_OFFLINE=1 NSEQ=128 PT=128 GEN=16 \
+nsys profile --cuda-graph-trace=node --capture-range=cudaProfilerApi --capture-range-end=stop \
+  --trace=cuda --sample=none --cpuctxsw=none --force-overwrite=true -o "$REP" \
+  "$HOME/vllm-bench/bin/python" "$ART/vllm_dense_decode_prof.py" > "$REP.run.log" 2>&1
+grep -E "PROFILED|Error|error|Traceback" "$REP.run.log" | tail -20'
+```
+
+Expected: command exits 0 and writes `vllm_dense_n128_ntg16.nsys-rep`.
+
+Actual: profiled generate `2048` tokens in `13.041s`; report
+`vllm_dense_n128_ntg16.nsys-rep` written.
+
+- [x] **Step 3: Run ntg=64 graph-node profile**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
+REP="$ART/vllm_dense_n128_ntg64"
+rm -f "$REP.nsys-rep" "$REP.sqlite"
+PATH="$HOME/vllm-bench/bin:$PATH" HF_HUB_OFFLINE=1 NSEQ=128 PT=128 GEN=64 \
+nsys profile --cuda-graph-trace=node --capture-range=cudaProfilerApi --capture-range-end=stop \
+  --trace=cuda --sample=none --cpuctxsw=none --force-overwrite=true -o "$REP" \
+  "$HOME/vllm-bench/bin/python" "$ART/vllm_dense_decode_prof.py" > "$REP.run.log" 2>&1
+grep -E "PROFILED|Error|error|Traceback" "$REP.run.log" | tail -20'
+```
+
+Expected: command exits 0 and writes `vllm_dense_n128_ntg64.nsys-rep`.
+
+Actual: profiled generate `8192` tokens in `27.165s`; report
+`vllm_dense_n128_ntg64.nsys-rep` written.
+
+### Task 5: Compute the difference-method summary
+
+**Files:**
+- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/summary.tsv`
+- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/profile_files.txt`
+
+- [x] **Step 1: Parse paged and vLLM throughput rows**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
+python3 - "$ART" <<'"'"'PY'"'"'
+import pathlib, re, sys
+art = pathlib.Path(sys.argv[1])
+
+def paged_ttg(name):
+    text = (art / f"{name}.bench.log").read_text(errors="replace")
+    rows = [line for line in text.splitlines() if "|   128 |" in line or "|  128 |" in line]
+    if not rows:
+        rows = [line for line in text.splitlines() if re.search(r"\|\s*128\s*\|", line)]
+    if not rows:
+        raise SystemExit(f"missing paged row in {name}.bench.log")
+    parts = [p.strip() for p in rows[-1].split("|") if p.strip()]
+    # columns: PP, TG, B, N_KV, T_PP, S_PP, T_TG, S_TG, T, S
+    return float(parts[6]), float(parts[7])
+
+def vllm_wall(name):
+    text = (art / f"{name}.run.log").read_text(errors="replace")
+    m = re.search(r"PROFILED END seqs=(\d+) gen_tok=(\d+) wall=([0-9.]+)s", text)
+    if not m:
+        raise SystemExit(f"missing vLLM PROFILED END in {name}.run.log")
+    return int(m.group(1)), int(m.group(2)), float(m.group(3))
+
+p16_ttg, p16_stg = paged_ttg("paged_dense_n128_ntg16")
+p64_ttg, p64_stg = paged_ttg("paged_dense_n128_ntg64")
+v16_seq, v16_tok, v16_wall = vllm_wall("vllm_dense_n128_ntg16")
+v64_seq, v64_tok, v64_wall = vllm_wall("vllm_dense_n128_ntg64")
+paged_delta_tokens = 128 * (64 - 16)
+paged_delta_wall = p64_ttg - p16_ttg
+vllm_delta_tokens = v64_tok - v16_tok
+vllm_delta_wall = v64_wall - v16_wall
+paged_decode = paged_delta_tokens / paged_delta_wall
+vllm_decode = vllm_delta_tokens / vllm_delta_wall
+with (art / "summary.tsv").open("w") as f:
+    f.write("engine\tshape\tntg16_wall_s\tntg64_wall_s\tdelta_tokens\tdelta_wall_s\ttrue_decode_tps\n")
+    f.write(f"paged\tdense_n128_pt128\t{p16_ttg:.3f}\t{p64_ttg:.3f}\t{paged_delta_tokens}\t{paged_delta_wall:.3f}\t{paged_decode:.2f}\n")
+    f.write(f"vllm\tdense_n128_pt128\t{v16_wall:.3f}\t{v64_wall:.3f}\t{vllm_delta_tokens}\t{vllm_delta_wall:.3f}\t{vllm_decode:.2f}\n")
+    f.write(f"ratio\tpaged_over_vllm\t\t\t\t\t{paged_decode / vllm_decode:.4f}\n")
+print((art / "summary.tsv").read_text())
+PY
+ls -1 "$ART"/*.nsys-rep "$ART"/*.log > "$ART/profile_files.txt"'
+```
+
+Expected: `summary.tsv` contains `paged`, `vllm`, and `ratio` rows.
+
+Actual:
+
+| engine | ntg16 wall s | ntg64 wall s | delta tokens | delta wall s | true decode t/s |
+|--------|--------------|--------------|--------------|--------------|-----------------|
+| paged | `5.754` | `21.768` | `6144` | `16.014` | `383.66` |
+| vLLM | `13.041` | `27.165` | `6144` | `14.124` | `435.00` |
+| ratio | | | | | `0.8820` |
+
+### Task 6: Run post-profile inference gates and release DGX
+
+**Files:**
+- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/gate_post/`
+- Create on DGX: `~/bench/phase50_dense_true_decode/<timestamp>/gate_post.log`
+- Modify later: `docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md`
+
+- [x] **Step 1: Run the canonical paged gate helper again**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
+BIN="$HOME/llama-phase6-source/build-cuda/bin" \
+ART="$ART/gate_post" \
+OPS=MUL_MAT,MUL_MAT_ID \
+  "$HOME/paged-inference-gates.sh" 2>&1 | tee "$ART/gate_post.log"'
+```
+
+Expected:
+
+```text
+paged inference gates OK
+```
+
+Actual: `build-cuda` post-gate passed with MoE md5
+`8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
+`MUL_MAT_ID` `806/806`.
+
+- [x] **Step 2: Release the owner-file lock**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
+echo "FREE released-by-codex-phase50-dense-true-decode $(date +%s)" > "$HOME/gpu_bench_lock/owner"
+cat "$HOME/gpu_bench_lock/owner" | tee -a "$ART/run.log"'
+```
+
+Expected: owner starts with `FREE released-by-codex-phase50-dense-true-decode`.
+
+Actual: owner `FREE released-by-codex-phase50-dense-true-decode 1782895927`;
+docker `0`, `local-ai-worker` `0`, compute `0`.
+
+### Task 7: Record the result and choose the next code target
+
+**Files:**
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+- Modify: `docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md`
+
+- [x] **Step 1: Mark completed plan steps**
+
+Update every completed checkbox in this file. Leave failed or skipped steps unchecked and add a short note with the artifact path and failure reason.
+
+- [x] **Step 2: Add the Phase50 result to the parity docs**
+
+Record:
+- artifact directory
+- preflight result
+- pre/post gate md5 and op-count values
+- paged true decode, vLLM true decode, and ratio from `summary.tsv`
+- whether Phase47 high-N serving loss is a true GPU decode gap or mostly scheduler/accounting
+
+Actual: recorded the artifact, preflight, gates, true-decode table, and
+decision in `GB10_PARITY_PHASE0_RESULTS.md`, `VLLM_PARITY_LEVER_MAP.md`, and
+`PARITY_HANDOFF.md`. Interpretation: a real dense decode gap remains, but it is
+about `12%`; the larger Phase47 high-N serving loss points at
+scheduler/admission and prefill-overlap/accounting.
+
+- [x] **Step 3: Commit the documentation-only result**
+
+Run:
+
+```bash
+git status --short
+git add docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md \
+  backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+  backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
+  backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+git commit -m "docs(paged): record dense true decode profile" -m "Assisted-by: Codex:gpt-5"
+```
+
+Expected: commit succeeds and `.claude/` remains the only unrelated untracked path.
+
+## Self-Review
+
+- Spec coverage: covers inference safety via pre/post md5 and op checks, true steady decode via graph-node nsys difference method, and docs/plan phase tracking.
+- Placeholder scan: no `TBD`, `TODO`, or unspecified test commands.
+- Type consistency: the artifact path placeholder is consistently `REPLACE_WITH_TIMESTAMP`; replace it with the actual timestamp before running each command.