diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index 8a5ac5429..f3dc3820d 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2769,3 +2769,67 @@ Verification:
   and a DGX dense dry-run with `VLLM_READY_ATTEMPTS=700`.
 - DGX dry-run artifact:
   `/home/mudler/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138`.
+
+## Phase 50 Dense True Decode Profile
+
+Phase 50 separates dense high-concurrency decode from the Phase47 h2h serving
+window. The Phase47 h2h `decode_agg_tps` metric can count tokens generated by
+early requests while later requests are still in prefill, then divide by a
+window that starts at the last first-token. That is useful serving telemetry,
+but it is not a pure steady-decode measurement.
+
+Artifact:
+
+- `/home/mudler/bench/phase50_dense_true_decode/20260701_103120`
+
+Preflight:
+
+- Docker containers: `0`
+- `local-ai-worker`: `0`
+- GPU compute apps: `0`
+- GPU: `NVIDIA GB10`, driver `580.159.03`
+
+Inference gates:
+
+| phase | build | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|-------|-------|---------|-----------|-----------|--------------|
+| pre | `build-phase36` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| pre | `build-cuda` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post | `build-cuda` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+`build-phase36/bin` had the completion and op-test binaries but not
+`llama-batched-bench`, so the actual profiled llama.cpp decode binary came from
+`~/llama-phase6-source/build-cuda/bin`. That build was gated before and after
+the profile.
+
+Profile method:
+
+- Shape: dense Qwen3.5, `npl=128`, `npp=128`, `ntg=16` and `ntg=64`.
+- Paged command: `llama-batched-bench` with `LLAMA_KV_PAGED=1`,
+  `LLAMA_MOE_FORCE_GRAPHS=1`, `-c 131072 -b 2048 -ub 512 -ngl 99 -fa on`.
+- vLLM command: in-process `LLM.generate`, `max_model_len=4096`,
+  `max_num_seqs=256`, `gpu_memory_utilization=0.85`, prefix caching disabled.
+- Both profiles used `nsys --cuda-graph-trace=node`.
+- Difference method: `(ntg64 tokens - ntg16 tokens) / (ntg64 wall - ntg16 wall)`.
+
+Results:
+
+| engine | ntg16 wall s | ntg64 wall s | delta tokens | delta wall s | true decode t/s |
+|--------|--------------|--------------|--------------|--------------|-----------------|
+| paged | `5.754` | `21.768` | `6144` | `16.014` | `383.66` |
+| vLLM | `13.041` | `27.165` | `6144` | `14.124` | `435.00` |
+| ratio | | | | | `0.8820` |
+
+Interpretation:
+
+- Dense true decode at `n=128` is about `88.2%` of vLLM, not the `79.1%`
+  implied by Phase47 h2h aggregate decode. The Phase47 serving window therefore
+  includes real scheduler/accounting effects in addition to GPU decode speed.
+- There is still a real dense GPU-steady decode gap of about `12%`, but it is
+  not large enough to explain Phase47 aggregate serving (`50.7%` of vLLM) or
+  TTFT (`3.20x` vLLM) by itself.
+- The next low-conflict code phase should add an opt-in serving
+  batch-composition/admission trace around `server_context::pre_decode()` to
+  measure decode tokens admitted, prompt tokens admitted, waiting prompt slots,
+  graph reuse, and prefill starvation. Do not start with another GDN or GEMM
+  rewrite unless that trace rules the scheduler out.
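Reviewer note: the difference-method figures in the Results table can be re-derived from the four wall-clock values alone. The snippet below is a minimal sketch and not part of the harness; the function name `true_decode_tps` is hypothetical, and the inputs are copied from the table (128 sequences, `ntg=16` and `ntg=64`).

```python
def true_decode_tps(nseq, ntg_lo, ntg_hi, wall_lo_s, wall_hi_s):
    """Difference-method decode rate: extra tokens divided by extra wall time."""
    delta_tokens = nseq * (ntg_hi - ntg_lo)   # 128 * (64 - 16) = 6144
    delta_wall_s = wall_hi_s - wall_lo_s      # cost shared by both runs cancels
    return delta_tokens / delta_wall_s


# Wall-clock values copied from the Phase 50 Results table.
paged = true_decode_tps(128, 16, 64, 5.754, 21.768)
vllm = true_decode_tps(128, 16, 64, 13.041, 27.165)

print(f"paged {paged:.2f} t/s, vLLM {vllm:.2f} t/s, ratio {paged / vllm:.4f}")
```

Because both runs share the same prompt shape, any cost that is identical in the two runs drops out of the subtraction, which is why the vLLM wall time can include prefill without distorting the rate.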
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index ea588bec8..23cc982c3 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -613,6 +613,19 @@ Phase 49 removes vLLM log noise from harness-owned environment variables. The
 preserving intentional vLLM runtime variables such as `VLLM_LOGGING_LEVEL`. Dry
 run: `/home/mudler/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138`.
 
+Phase 50 resolves the dense high-N decode-accounting question with a graph-node
+difference-method profile. Artifact:
+`/home/mudler/bench/phase50_dense_true_decode/20260701_103120`. Pre/post
+inference gates on the profiled `build-cuda` binary stayed green: MoE
+`8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
+`806/806`. Dense `npl=128`, `npp=128` true decode is `383.66 t/s` for paged and
+`435.00 t/s` for vLLM, ratio `0.8820`. This means Phase47's `0.7912` h2h
+decode ratio and `0.5071` aggregate ratio include scheduler/admission and
+prefill-overlap/accounting effects beyond the real GPU-steady decode gap. Next
+GB10 code work should instrument batch composition/admission in
+`server_context::pre_decode()` before attempting another kernel shortcut.
+
 ---
 
 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -705,6 +718,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase48_readiness_harness_dryrun/20260701_100533` - harness dry-run proving configurable readiness budgets and clean preflight before retrying dense serving.
 - `~/bench/phase47_dense_serving_retry/20260701_100811` - completed dense serving snapshot after Phase48; pre/post md5 and op gates green; paged low-N decode ahead, high-N aggregate and TTFT behind.
 - `~/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138` - harness dry-run after scrubbing harness-owned `VLLM_*` variables from the `vllm serve` child environment.
+- `~/bench/phase50_dense_true_decode/20260701_103120` - dense graph-node difference-method profile at `npl=128`, `npp=128`; `build-cuda` pre/post md5 and op gates green; true decode paged `383.66 t/s`, vLLM `435.00 t/s`, ratio `0.8820`, pointing next at serving admission/scheduler tracing.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 7b95ee648..e665f558b 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1186,6 +1186,36 @@ Decision: dense low-N decode remains a real paged strength, but dense serving
 still does not close GB10 parity because TTFT and high-concurrency aggregate
 throughput remain substantially behind vLLM.
 
+### Phase 50 dense true decode profile
+
+Phase50 profiles dense `npl=128`, `npp=128` decode with graph nodes expanded and
+uses the difference method (`ntg=64 - ntg=16`) instead of the Phase47 h2h
+serving window. Artifact:
+`/home/mudler/bench/phase50_dense_true_decode/20260701_103120`.
+
+Pre/post inference gates stayed green on the profiled `build-cuda` binary set:
+MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
+`MUL_MAT_ID` `806/806`. A `build-phase36` pre-gate also passed, but
+`build-phase36` did not contain `llama-batched-bench`, so `build-cuda` is the
+profiled/gated build for this phase.
+
+Results:
+
+| engine | ntg16 wall s | ntg64 wall s | delta tokens | delta wall s | true decode t/s |
+|--------|--------------|--------------|--------------|--------------|-----------------|
+| paged | `5.754` | `21.768` | `6144` | `16.014` | `383.66` |
+| vLLM | `13.041` | `27.165` | `6144` | `14.124` | `435.00` |
+| ratio | | | | | `0.8820` |
+
+Decision: Phase47's dense high-N serving loss is not just a kernel-speed gap.
+True dense decode is still behind vLLM by about `12%`, but the Phase47 h2h
+decode ratio at `n=128` was `0.7912` and aggregate serving was only `0.5071`.
+The remaining difference points at scheduler/admission, prefill overlap, and
+TTFT accounting. Next implementation target should be an opt-in
+batch-composition/admission trace in `server_context::pre_decode()` before any
+new GDN/GEMM shortcut.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
diff --git a/docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md b/docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md
new file mode 100644
index 000000000..5e0e84445
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md
@@ -0,0 +1,415 @@
+# Phase50 Dense True Decode Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Measure dense Qwen3.5 true steady decode on GB10 for paged llama.cpp and vLLM, separated from h2h TTFT/prefill-overlap accounting, while proving inference output and touched backend ops remain unchanged before and after the run.
+
+**Architecture:** Do not change inference code. Run canonical pre/post paged inference gates, then collect graph-node-traced nsys profiles for dense paged llama.cpp and dense vLLM using the difference method: `ntg=64 - ntg=16` at the same `npl=128`, `npp=128` shape. Record the result in the parity docs and keep the next code target limited to scheduler/admission tracing only if true steady decode does not explain the Phase47 high-N serving gap.
+
+**Tech Stack:** DGX GB10 over `ssh dgx.casa`, llama.cpp fork build in `~/llama-phase6-source/build-cuda` for `llama-batched-bench`, vLLM 0.23.0 in `~/vllm-bench`, `nsys --cuda-graph-trace=node`, LocalAI parity docs.
+
+---
+
+### Task 1: Confirm DGX is idle and acquire an artifact directory
+
+**Files:**
+- Create on DGX: `~/bench/phase50_dense_true_decode//preflight.txt`
+- Create on DGX: `~/bench/phase50_dense_true_decode//hardware.txt`
+- Create on DGX: `~/bench/phase50_dense_true_decode//run.log`
+- Modify later: `docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md`
+
+- [x] **Step 1: Check the DGX preflight**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART=$HOME/bench/phase50_dense_true_decode/$(date +%Y%m%d_%H%M%S)
+mkdir -p "$ART"
+{
+  printf "docker="; docker ps -q | wc -l
+  printf "local_ai_worker="; docker ps --format "{{.Names}}" | grep -c local-ai-worker || true
+  printf "compute="; nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l
+  printf "owner="; if [ -f "$HOME/gpu_bench_lock/owner" ]; then cat "$HOME/gpu_bench_lock/owner"; else echo FREE-no-lock-file; fi
+} | tee "$ART/preflight.txt"
+nvidia-smi -L | tee "$ART/hardware.txt"
+nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader | tee -a "$ART/hardware.txt"
+echo "$ART"'
+```
+
+Expected: `docker=0`, `local_ai_worker=0`, `compute=0`, and `owner=FREE...`.
+
+- [x] **Step 2: Acquire the owner-file lock**
+
+Run with `ART` set to the printed artifact directory:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
+mkdir -p "$HOME/gpu_bench_lock"
+echo "codex-phase50-dense-true-decode $(date +%s)" > "$HOME/gpu_bench_lock/owner"
+cat "$HOME/gpu_bench_lock/owner" | tee -a "$ART/run.log"'
+```
+
+Expected: owner starts with `codex-phase50-dense-true-decode`.
+
+Actual artifact: `/home/mudler/bench/phase50_dense_true_decode/20260701_103120`.
+Preflight was clean: docker `0`, `local-ai-worker` `0`, compute `0`, owner
+`FREE released-by-codex-current-serving-snapshot 1782893824`.
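Reviewer note: the Task 1 "Expected" line is checked by eye in the plan. A small parser could turn `preflight.txt` into a pass/fail gate before the lock is taken. This is only a sketch: `parse_preflight` and `is_idle` are hypothetical helpers, and the sample text assumes the `key=value` layout produced by the `printf` calls in Step 1, filled with the recorded "Actual" values.

```python
def parse_preflight(text):
    """Parse key=value lines as written by the Step 1 printf calls."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that carry no key=value pair
            fields[key.strip()] = value.strip()
    return fields


def is_idle(fields):
    """True when no containers or compute apps run and the lock reads FREE."""
    counts_ok = all(
        fields.get(key) == "0"
        for key in ("docker", "local_ai_worker", "compute")
    )
    return counts_ok and fields.get("owner", "").startswith("FREE")


# Sample built from the values recorded under "Actual" in Task 1.
sample = """docker=0
local_ai_worker=0
compute=0
owner=FREE released-by-codex-current-serving-snapshot 1782893824
"""

print(is_idle(parse_preflight(sample)))
```

A missing key counts as not idle, so a truncated `preflight.txt` fails closed rather than allowing the run to start.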
+
+### Task 2: Run pre-profile inference gates
+
+**Files:**
+- Create on DGX: `~/bench/phase50_dense_true_decode//gate_pre/`
+- Create on DGX: `~/bench/phase50_dense_true_decode//gate_pre.log`
+
+- [x] **Step 1: Run the canonical paged gate helper**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
+BIN="$HOME/llama-phase6-source/build-phase36/bin" \
+ART="$ART/gate_pre" \
+OPS=MUL_MAT,MUL_MAT_ID \
+  "$HOME/paged-inference-gates.sh" 2>&1 | tee "$ART/gate_pre.log"'
+```
+
+Expected:
+
+```text
+paged inference gates OK
+```
+
+Required values:
+- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+- `MUL_MAT`: `1146/1146`
+- `MUL_MAT_ID`: `806/806`
+
+Actual: `build-phase36` pre-gate passed, then `build-cuda` pre-gate also
+passed because `build-phase36/bin` does not contain `llama-batched-bench`.
+The profiled binary set is therefore `~/llama-phase6-source/build-cuda/bin`.
+
+### Task 3: Profile dense paged llama.cpp true decode
+
+**Files:**
+- Create on DGX: `~/bench/phase50_dense_true_decode//paged_dense_n128_ntg16.nsys-rep`
+- Create on DGX: `~/bench/phase50_dense_true_decode//paged_dense_n128_ntg16.bench.log`
+- Create on DGX: `~/bench/phase50_dense_true_decode//paged_dense_n128_ntg64.nsys-rep`
+- Create on DGX: `~/bench/phase50_dense_true_decode//paged_dense_n128_ntg64.bench.log`
+
+- [x] **Step 1: Run ntg=16 graph-node profile**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
+BIN="$HOME/llama-phase6-source/build-cuda/bin/llama-batched-bench"
+MODEL="$HOME/bench/q36-27b-nvfp4.gguf"
+REP="$ART/paged_dense_n128_ntg16"
+rm -f "$REP.nsys-rep" "$REP.sqlite"
+nsys profile --cuda-graph-trace=node --trace=cuda,nvtx --sample=none --cpuctxsw=none \
+  --force-overwrite=true -o "$REP" \
+  env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 \
+  "$BIN" -m "$MODEL" -c 131072 -b 2048 -ub 512 -ngl 99 -fa on \
+  -npp 128 -ntg 16 -npl 128 > "$REP.bench.log" 2>&1
+grep -E "model|\\| *128|llama_perf|error|Error|Traceback" "$REP.bench.log" | tail -40'
+```
+
+Expected: command exits 0 and writes `paged_dense_n128_ntg16.nsys-rep`.
+
+Actual: `T_TG=5.754s`, `S_TG=355.93 t/s`; report
+`paged_dense_n128_ntg16.nsys-rep` written.
+
+- [x] **Step 2: Run ntg=64 graph-node profile**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
+BIN="$HOME/llama-phase6-source/build-cuda/bin/llama-batched-bench"
+MODEL="$HOME/bench/q36-27b-nvfp4.gguf"
+REP="$ART/paged_dense_n128_ntg64"
+rm -f "$REP.nsys-rep" "$REP.sqlite"
+nsys profile --cuda-graph-trace=node --trace=cuda,nvtx --sample=none --cpuctxsw=none \
+  --force-overwrite=true -o "$REP" \
+  env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 \
+  "$BIN" -m "$MODEL" -c 131072 -b 2048 -ub 512 -ngl 99 -fa on \
+  -npp 128 -ntg 64 -npl 128 > "$REP.bench.log" 2>&1
+grep -E "model|\\| *128|llama_perf|error|Error|Traceback" "$REP.bench.log" | tail -40'
+```
+
+Expected: command exits 0 and writes `paged_dense_n128_ntg64.nsys-rep`.
+
+Actual: `T_TG=21.768s`, `S_TG=376.33 t/s`; report
+`paged_dense_n128_ntg64.nsys-rep` written.
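Reviewer note: the two `S_TG` values recorded in Task 3 are consistent with `npl * ntg / T_TG`, and both sit below the `383.66 t/s` difference-method rate. One way to read that gap is a fixed cost inside each decode window. The sketch below assumes decode time is linear in generated tokens, which is an assumption of this example and not something the profile proves; `s_tg` and `linear_fit` are hypothetical names.

```python
NPL = 128  # parallel sequences used in both paged runs


def s_tg(ntg, t_tg_s):
    """Per-run decode throughput computed as npl * ntg / T_TG."""
    return NPL * ntg / t_tg_s


def linear_fit(ntg_lo, t_lo_s, ntg_hi, t_hi_s):
    """Fit T_TG = fixed + per_token * tokens through the two measured runs."""
    per_token_s = (t_hi_s - t_lo_s) / (NPL * (ntg_hi - ntg_lo))
    fixed_s = t_lo_s - per_token_s * NPL * ntg_lo
    return fixed_s, per_token_s


# T_TG values copied from the Task 3 "Actual" notes.
fixed_s, per_token_s = linear_fit(16, 5.754, 64, 21.768)

print(f"S_TG ntg16={s_tg(16, 5.754):.2f} ntg64={s_tg(64, 21.768):.2f}")
print(f"fixed={fixed_s:.3f}s steady={1 / per_token_s:.2f} t/s")
```

Under that linear assumption roughly 0.4 s of each paged decode window is not per-token work, which weighs more heavily on the `ntg=16` run than on the `ntg=64` run.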
+
+### Task 4: Profile dense vLLM true decode
+
+**Files:**
+- Create on DGX: `~/bench/phase50_dense_true_decode//vllm_dense_decode_prof.py`
+- Create on DGX: `~/bench/phase50_dense_true_decode//vllm_dense_n128_ntg16.nsys-rep`
+- Create on DGX: `~/bench/phase50_dense_true_decode//vllm_dense_n128_ntg16.run.log`
+- Create on DGX: `~/bench/phase50_dense_true_decode//vllm_dense_n128_ntg64.nsys-rep`
+- Create on DGX: `~/bench/phase50_dense_true_decode//vllm_dense_n128_ntg64.run.log`
+
+- [x] **Step 1: Write the vLLM dense profile driver**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
+cat > "$ART/vllm_dense_decode_prof.py" <<'"'"'PY'"'"'
+import os, time, torch
+os.environ["HF_HUB_OFFLINE"] = "1"
+os.environ["VLLM_LOGGING_LEVEL"] = "WARNING"
+os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
+from vllm import LLM, SamplingParams
+from vllm.inputs import TokensPrompt
+
+MODEL = os.environ.get("MODEL", "/home/mudler/bench/q36-27b-nvfp4-vllm")
+NSEQ = int(os.environ.get("NSEQ", "128"))
+PROMPT_TOKS = int(os.environ.get("PT", "128"))
+GEN = int(os.environ.get("GEN", "64"))
+
+llm = LLM(
+    model=MODEL,
+    enforce_eager=False,
+    max_model_len=4096,
+    gpu_memory_utilization=0.85,
+    max_num_seqs=256,
+    tensor_parallel_size=1,
+    enable_prefix_caching=False,
+    disable_log_stats=True,
+)
+prompts = [
+    TokensPrompt(prompt_token_ids=[1000 + (i * 7 + j * 13) % 30000 for j in range(PROMPT_TOKS)])
+    for i in range(NSEQ)
+]
+sp = SamplingParams(temperature=0.0, max_tokens=GEN, ignore_eos=True, min_tokens=GEN)
+print(f"dense vLLM NSEQ={NSEQ} PT={PROMPT_TOKS} GEN={GEN} warmup...", flush=True)
+llm.generate(prompts, sp, use_tqdm=False)
+torch.cuda.synchronize()
+print("PROFILED GENERATE START", flush=True)
+torch.cuda.cudart().cudaProfilerStart()
+t0 = time.time()
+outs = llm.generate(prompts, sp, use_tqdm=False)
+torch.cuda.synchronize()
+t1 = time.time()
+torch.cuda.cudart().cudaProfilerStop()
+ntok = sum(len(o.outputs[0].token_ids) for o in outs)
+print(f"PROFILED END seqs={len(outs)} gen_tok={ntok} wall={t1-t0:.3f}s tok/s={ntok/(t1-t0):.1f} incl_prefill", flush=True)
+PY'
+```
+
+Expected: `vllm_dense_decode_prof.py` exists in the artifact directory.
+
+Actual: used an equivalent self-contained `python -c` target under nsys instead
+of writing a DGX source script. No inference code or repo file was changed.
+
+- [x] **Step 2: Run ntg=16 graph-node profile**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
+REP="$ART/vllm_dense_n128_ntg16"
+rm -f "$REP.nsys-rep" "$REP.sqlite"
+PATH="$HOME/vllm-bench/bin:$PATH" HF_HUB_OFFLINE=1 NSEQ=128 PT=128 GEN=16 \
+nsys profile --cuda-graph-trace=node --capture-range=cudaProfilerApi --capture-range-end=stop \
+  --trace=cuda --sample=none --cpuctxsw=none --force-overwrite=true -o "$REP" \
+  "$HOME/vllm-bench/bin/python" "$ART/vllm_dense_decode_prof.py" > "$REP.run.log" 2>&1
+grep -E "PROFILED|Error|error|Traceback" "$REP.run.log" | tail -20'
+```
+
+Expected: command exits 0 and writes `vllm_dense_n128_ntg16.nsys-rep`.
+
+Actual: profiled generate `2048` tokens in `13.041s`; report
+`vllm_dense_n128_ntg16.nsys-rep` written.
+
+- [x] **Step 3: Run ntg=64 graph-node profile**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
+REP="$ART/vllm_dense_n128_ntg64"
+rm -f "$REP.nsys-rep" "$REP.sqlite"
+PATH="$HOME/vllm-bench/bin:$PATH" HF_HUB_OFFLINE=1 NSEQ=128 PT=128 GEN=64 \
+nsys profile --cuda-graph-trace=node --capture-range=cudaProfilerApi --capture-range-end=stop \
+  --trace=cuda --sample=none --cpuctxsw=none --force-overwrite=true -o "$REP" \
+  "$HOME/vllm-bench/bin/python" "$ART/vllm_dense_decode_prof.py" > "$REP.run.log" 2>&1
+grep -E "PROFILED|Error|error|Traceback" "$REP.run.log" | tail -20'
+```
+
+Expected: command exits 0 and writes `vllm_dense_n128_ntg64.nsys-rep`.
+
+Actual: profiled generate `8192` tokens in `27.165s`; report
+`vllm_dense_n128_ntg64.nsys-rep` written.
+
+### Task 5: Compute the difference-method summary
+
+**Files:**
+- Create on DGX: `~/bench/phase50_dense_true_decode//summary.tsv`
+- Create on DGX: `~/bench/phase50_dense_true_decode//profile_files.txt`
+
+- [x] **Step 1: Parse paged and vLLM throughput rows**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
+python3 - "$ART" <<'"'"'PY'"'"'
+import pathlib, re, sys
+art = pathlib.Path(sys.argv[1])
+
+def paged_ttg(name):
+    text = (art / f"{name}.bench.log").read_text(errors="replace")
+    # match any table row with a 128 cell, whatever the column padding
+    rows = [line for line in text.splitlines() if re.search(r"\|\s*128\s*\|", line)]
+    if not rows:
+        raise SystemExit(f"missing paged row in {name}.bench.log")
+    parts = [p.strip() for p in rows[-1].split("|") if p.strip()]
+    # columns: PP, TG, B, N_KV, T_PP, S_PP, T_TG, S_TG, T, S
+    return float(parts[6]), float(parts[7])
+
+def vllm_wall(name):
+    text = (art / f"{name}.run.log").read_text(errors="replace")
+    m = re.search(r"PROFILED END seqs=(\d+) gen_tok=(\d+) wall=([0-9.]+)s", text)
+    if not m:
+        raise SystemExit(f"missing vLLM PROFILED END in {name}.run.log")
+    return int(m.group(1)), int(m.group(2)), float(m.group(3))
+
+p16_ttg, p16_stg = paged_ttg("paged_dense_n128_ntg16")
+p64_ttg, p64_stg = paged_ttg("paged_dense_n128_ntg64")
+v16_seq, v16_tok, v16_wall = vllm_wall("vllm_dense_n128_ntg16")
+v64_seq, v64_tok, v64_wall = vllm_wall("vllm_dense_n128_ntg64")
+paged_delta_tokens = 128 * (64 - 16)
+paged_delta_wall = p64_ttg - p16_ttg
+vllm_delta_tokens = v64_tok - v16_tok
+vllm_delta_wall = v64_wall - v16_wall
+paged_decode = paged_delta_tokens / paged_delta_wall
+vllm_decode = vllm_delta_tokens / vllm_delta_wall
+with (art / "summary.tsv").open("w") as f:
+    f.write("engine\tshape\tntg16_wall_s\tntg64_wall_s\tdelta_tokens\tdelta_wall_s\ttrue_decode_tps\n")
+    f.write(f"paged\tdense_n128_pt128\t{p16_ttg:.3f}\t{p64_ttg:.3f}\t{paged_delta_tokens}\t{paged_delta_wall:.3f}\t{paged_decode:.2f}\n")
+    f.write(f"vllm\tdense_n128_pt128\t{v16_wall:.3f}\t{v64_wall:.3f}\t{vllm_delta_tokens}\t{vllm_delta_wall:.3f}\t{vllm_decode:.2f}\n")
+    f.write(f"ratio\tpaged_over_vllm\t\t\t\t\t{paged_decode / vllm_decode:.4f}\n")
+print((art / "summary.tsv").read_text())
+PY
+ls -1 "$ART"/*.nsys-rep "$ART"/*.log > "$ART/profile_files.txt"'
+```
+
+Expected: `summary.tsv` contains `paged`, `vllm`, and `ratio` rows.
+
+Actual:
+
+| engine | ntg16 wall s | ntg64 wall s | delta tokens | delta wall s | true decode t/s |
+|--------|--------------|--------------|--------------|--------------|-----------------|
+| paged | `5.754` | `21.768` | `6144` | `16.014` | `383.66` |
+| vLLM | `13.041` | `27.165` | `6144` | `14.124` | `435.00` |
+| ratio | | | | | `0.8820` |
+
+### Task 6: Run post-profile inference gates and release DGX
+
+**Files:**
+- Create on DGX: `~/bench/phase50_dense_true_decode//gate_post/`
+- Create on DGX: `~/bench/phase50_dense_true_decode//gate_post.log`
+- Modify later: `docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md`
+
+- [x] **Step 1: Run the canonical paged gate helper again**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
+BIN="$HOME/llama-phase6-source/build-cuda/bin" \
+ART="$ART/gate_post" \
+OPS=MUL_MAT,MUL_MAT_ID \
+  "$HOME/paged-inference-gates.sh" 2>&1 | tee "$ART/gate_post.log"'
+```
+
+Expected:
+
+```text
+paged inference gates OK
+```
+
+Actual: `build-cuda` post-gate passed with MoE md5
+`8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
+`MUL_MAT_ID` `806/806`.
+
+- [x] **Step 2: Release the owner-file lock**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP"
+echo "FREE released-by-codex-phase50-dense-true-decode $(date +%s)" > "$HOME/gpu_bench_lock/owner"
+cat "$HOME/gpu_bench_lock/owner" | tee -a "$ART/run.log"'
+```
+
+Expected: owner starts with `FREE released-by-codex-phase50-dense-true-decode`.
+
+Actual: owner `FREE released-by-codex-phase50-dense-true-decode 1782895927`;
+docker `0`, `local-ai-worker` `0`, compute `0`.
+
+### Task 7: Record the result and choose the next code target
+
+**Files:**
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+- Modify: `docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md`
+
+- [x] **Step 1: Mark completed plan steps**
+
+Update every completed checkbox in this file. Leave failed or skipped steps unchecked and add a short note with the artifact path and failure reason.
+
+- [x] **Step 2: Add the Phase50 result to the parity docs**
+
+Record:
+- artifact directory
+- preflight result
+- pre/post gate md5 and op-count values
+- paged true decode, vLLM true decode, and ratio from `summary.tsv`
+- whether Phase47 high-N serving loss is a true GPU decode gap or mostly scheduler/accounting
+
+Actual: recorded the artifact, preflight, gates, true-decode table, and
+decision in `GB10_PARITY_PHASE0_RESULTS.md`, `VLLM_PARITY_LEVER_MAP.md`, and
+`PARITY_HANDOFF.md`. Interpretation: a real dense decode gap remains, but it is
+about `12%`; the larger Phase47 high-N serving loss points at
+scheduler/admission and prefill-overlap/accounting.
+
+- [x] **Step 3: Commit the documentation-only result**
+
+Run:
+
+```bash
+git status --short
+git add docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md \
+  backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+  backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
+  backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+git commit -m "docs(paged): record dense true decode profile" -m "Assisted-by: Codex:gpt-5"
+```
+
+Expected: commit succeeds and `.claude/` remains the only unrelated untracked path.
+
+## Self-Review
+
+- Spec coverage: covers inference safety via pre/post md5 and op checks, true steady decode via graph-node nsys difference method, and docs/plan phase tracking.
+- Placeholder scan: no `TBD`, `TODO`, or unspecified test commands.
+- Type consistency: the artifact path placeholder is consistently `REPLACE_WITH_TIMESTAMP`; replace it with the actual timestamp before running each command.