docs(paged): profile MTP graph reuse loss

Record Phase 16 nsys evidence that current MTP serving loses paged decode graph reuse and increases GPU work, explaining the Phase 15 serving regression. Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 02:32:49 +00:00
parent 4d171e62bb
commit ae76d42a96
5 changed files with 217 additions and 4 deletions
--- a/docs/superpowers/plans/2026-07-01-mtp-graph-profile-phase16.md
+++ b/docs/superpowers/plans/2026-07-01-mtp-graph-profile-phase16.md
@@ -0,0 +1,154 @@
+# MTP Graph-Reuse Profile Phase 16 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use
+> superpowers:systematic-debugging before proposing source changes. Steps use
+> checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** validate the Phase 15 hypothesis that current MTP serving regresses
+because speculative verification disrupts paged CUDA graph reuse and increases
+GPU work.
+
+**Architecture:** capture a small direct `llama-server` baseline/MTP pair under
+`nsys --cuda-graph-trace=node`, using the same request shape except for
+`--spec-type draft-mtp`. Do not patch code in this phase.
+
+**Tech Stack:** llama.cpp `llama-server`, Nsight Systems, DGX GB10,
+`h2h_cli3.py`.
+
+---
+
+## Task 1: Preflight
+
+- [x] **Step 1: Confirm DGX is free**
+
+  Result:
+
+  ```text
+  docker=0
+  local_ai_worker=0
+  compute=0
+  FREE released-by-codex-phase15-mtp-serving-bench 1782872749
+  ```
+
+- [x] **Step 2: Confirm profiler is available**
+
+  Result:
+
+  ```text
+  /usr/local/bin/nsys
+  ```
+
+## Task 2: Capture Baseline and MTP Profiles
+
+- [x] **Step 1: Run baseline profile**
+
+  Command shape:
+
+  ```bash
+  nsys profile --force-overwrite=true --cuda-graph-trace=node \
+    --trace=cuda,nvtx,osrt --output="$ART/baseline/profile" \
+    ./llama-server -m "$MODEL" -ngl 99 -fa on -c 32768 -b 2048 -ub 512 \
+      --parallel 32 --host 127.0.0.1 --port 8098 --no-webui
+  ```
+
+  Client:
+
+  ```bash
+  python3 ~/bench/h2h_cli3.py --url http://127.0.0.1:8098/v1/completions \
+    --model m -n 8 --ptok 64 --gen 64 --no-cache
+  ```
+
+- [x] **Step 2: Run MTP profile**
+
+  Same as baseline plus:
+
+  ```text
+  --spec-type draft-mtp --spec-draft-n-max 3 --no-spec-draft-backend-sampling
+  ```
+
+- [x] **Step 3: Save artifacts**
+
+  Artifact root:
+
+  - `/home/mudler/bench/phase16_mtp_graph_profile/20260701_043016`
+
+  Files:
+
+  - `baseline/profile.nsys-rep`
+  - `baseline/profile.sqlite`
+  - `baseline/nsys_stats.txt`
+  - `baseline/client.json`
+  - `baseline/key_lines.txt`
+  - `mtp/profile.nsys-rep`
+  - `mtp/profile.sqlite`
+  - `mtp/nsys_stats.txt`
+  - `mtp/client.json`
+  - `mtp/key_lines.txt`
+
+## Task 3: Compare Evidence
+
+- [x] **Step 1: Compare client throughput**
+
+  Result:
+
+  ```text
+  baseline n=8: decode_agg_tps=230.5, decode_perseq_tps=28.07, wall_s=3.523
+  MTP      n=8: decode_agg_tps= 97.7, decode_perseq_tps=12.83, wall_s=7.049
+  ```
+
+- [x] **Step 2: Compare graph reuse**
+
+  Result:
+
+  ```text
+  baseline: graphs reused = 62
+  MTP:      graphs reused = 1
+  ```
+
+- [x] **Step 3: Confirm MTP actually drafted**
+
+  Result:
+
+  ```text
+  common_speculative_impl_draft_mtp: - n_max=3, n_min=0, p_min=0.00
+  draft acceptance = 0.81481 (44 accepted / 54 generated)
+  statistics draft-mtp: #gen tokens = 460, #acc tokens = 346
+  ```
+
+- [x] **Step 4: Compare GPU work**
+
+  `nsys stats` kernel summaries show materially more GPU work for the MTP run:
+
+  - baseline top kernel summary total is about `2.59 s` of GPU kernel time,
+  - MTP top kernel summary total is about `5.89 s` of GPU kernel time.
+
+  This supports the graph/batch-shape hypothesis and rules out a purely
+  host-side or no-draft explanation.
+
+## Task 4: Disposition
+
+- [x] **Step 1: Record root-cause hypothesis as supported**
+
+  Phase 16 supports the Phase 15 root cause: current MTP serving loses the
+  existing paged decode graph-reuse advantage and does substantially more GPU
+  work, so it is not a viable GB10 parity lever as implemented.
+
+- [x] **Step 2: Scope the only plausible code follow-up**
+
+  Do not tune MTP draft parameters first. A source phase would need to inspect
+  `tools/server/server-context.cpp` speculative batch construction and
+  `llama-graph` reuse keys to answer:
+
+  - whether verification batches can be bucketed/reused like pure decode,
+  - whether MTP draft/verify rows force graph rebuilds by changing output rows
+    per sequence,
+  - whether target verification can be separated from normal decode graph reuse
+    without breaking rollback or greedy equivalence.
+
+  If those answers are negative, leave MTP default-off and closed for GB10.
+
+## Self-Review
+
+- No source patch was made.
+- The profile used `--cuda-graph-trace=node`.
+- The result narrows the next work to graph/batch-shape mechanics.