From ae76d42a96383dc784728ce4414bea0fdcdfc06d Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Wed, 1 Jul 2026 02:32:49 +0000
Subject: [PATCH] docs(paged): profile MTP graph reuse loss

Record Phase 16 nsys evidence that current MTP serving loses paged decode
graph reuse and increases GPU work, explaining the Phase 15 serving
regression.

Assisted-by: Codex:gpt-5
---
 .../docs/GB10_PARITY_PHASE0_RESULTS.md        |  36 ++++
 .../docs/PARITY_HANDOFF.md                    |   9 +-
 .../docs/VLLM_PARITY_FINAL.md                 |   2 +-
 .../docs/VLLM_PARITY_LEVER_MAP.md             |  20 +++
 .../2026-07-01-mtp-graph-profile-phase16.md   | 154 ++++++++++++++++++
 5 files changed, 217 insertions(+), 4 deletions(-)
 create mode 100644 docs/superpowers/plans/2026-07-01-mtp-graph-profile-phase16.md

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index ba1adca97..abe98d9fc 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -944,6 +944,42 @@ Likely root cause:
 defeats the paged decode graph-reuse wins, so extra verification dominates
 despite high draft acceptance.
 
+## Phase 16 MTP Graph-Reuse Profile
+
+Phase 16 profiled the Phase 15 hypothesis with
+`nsys --cuda-graph-trace=node` on a smaller direct serving shape:
+
+- server: `-c 32768 -b 2048 -ub 512 --parallel 32`,
+- client: `h2h_cli3.py -n 8 --ptok 64 --gen 64`,
+- arms: baseline vs `--spec-type draft-mtp --spec-draft-n-max 3`.
+
+Artifact:
+
+- `/home/mudler/bench/phase16_mtp_graph_profile/20260701_043016`
+
+Result:
+
+| arm | decode agg t/s | decode per-seq t/s | wall s | graph reuse |
+|---|---:|---:|---:|---:|
+| baseline | 230.5 | 28.07 | 3.523 | `graphs reused = 62` |
+| MTP | 97.7 | 12.83 | 7.049 | `graphs reused = 1` |
+
+MTP drafted and accepted tokens:
+
+- `draft acceptance = 0.81481 (44 accepted / 54 generated)`,
+- `#gen tokens = 460`, `#acc tokens = 346`.
+
+Nsight kernel summaries also show materially more GPU work in the MTP run:
+roughly `5.89 s` top-level GPU kernel time versus `2.59 s` for the baseline
+small profile.
+
+Decision:
+
+- Phase 16 supports the Phase 15 root-cause hypothesis: current MTP serving
+  defeats the paged decode graph-reuse advantage and increases GPU work.
+- A future source phase must start at speculative verification batch shapes and
+  graph-reuse keys, not at MTP draft-length tuning.
+
 ## Phase 10 GDN C32 Slab Baseline and Source Check
 
 Phase 10 starts a separate GDN prefill path; it does not reopen the rejected
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index cffbce06d..d325e238d 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -229,9 +229,12 @@ Phase 14 re-validated the MTP bucket as safe, then Phase 15 rejected it as a
 current GB10 serving-throughput lever. Do not enable it by default and do not
 keep tuning draft length blindly. The only plausible follow-up is a graph-reuse
 and speculative verification batch-shape profile with
-`nsys --cuda-graph-trace=node`. The fixed safety gates stayed green before and
-after the failed serving A/B: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense
-md5 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+`nsys --cuda-graph-trace=node`. Phase 16 ran that profile and supported the
+root cause: small-shape baseline reused graphs (`graphs reused = 62`) while MTP
+did not (`graphs reused = 1`) and did ~2.3x more GPU kernel work. The fixed
+safety gates stayed green before and after the failed serving A/B: MoE md5
+`8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
 
 ---
 
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
index b4d8e55f7..1cdd31ab9 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
@@ -281,7 +281,7 @@ operating point where paged is unambiguously faster than vLLM.
 | S2 double-buffer set_inputs | overlap host input build with GPU | **DROPPED** | `set_inputs` is **~0.05 ms/step** - nothing to recover (the rebuild was the cost) | DSS |
 | whole-step graph / host loop | the host scheduling loop as the serving residual | **CLOSED (~0-1%)** | baseline reuse 0% (agg 757.6) **statistically equal** to S1+S3 reuse 72% (agg 763.3); `hostproc` only ~4-8% of the per-step wall = **measured dead** | DSS |
 | padded / fixed-slot decode | pad decode width to `--parallel` for ~100% reuse | **REJECTED (built, GPU-tested)** | inert (md5 bit-exact) but **regresses at every concurrency**; N=8 burst 28.16 -> 6.05 tok/s/seq (~4.6x slower); serving decode is **GPU-compute-bound**, dummy-row compute > reuse recovered | DSS |
-| speculative decode (MTP) | draft + verify; greedy is bit-exact | **REJECTED for current GB10 serving** | Phase 14 passed safety, but Phase 15 direct serving A/B regressed at every tested concurrency (n128 decode agg 662.4 -> 138.5 tok/s) despite high acceptance; likely breaks paged decode graph reuse (`graphs reused` 361 -> 1). Not a parity lever unless a future graph/batch-shape fix changes this result | LMAP |
+| speculative decode (MTP) | draft + verify; greedy is bit-exact | **REJECTED for current GB10 serving** | Phase 14 passed safety, but Phase 15 direct serving A/B regressed at every tested concurrency (n128 decode agg 662.4 -> 138.5 tok/s) despite high acceptance; Phase 16 profile supports graph-reuse loss as root cause (`graphs reused` 62 -> 1 in the small nsys run). Not a parity lever unless a future graph/batch-shape fix changes this result | LMAP |
 
 The serving regime was the one place the static-bench parity did not carry over
 (paged ~3.7 vs vLLM ~5.9 tok/s/seq, -39%, DSS). S1 made the decode step reusable
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 077fb34a5..d385829af 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -508,6 +508,26 @@ Do not keep tuning MTP draft length blindly. A follow-up must first profile
 speculative verification batch shapes and CUDA graph reuse with
 `nsys --cuda-graph-trace=node`.
 
+### Phase 16 MTP graph-reuse profile
+
+Phase 16 ran that profile on a smaller direct serving shape (`n=8`, `ptok=64`,
+`gen=64`) with `nsys --cuda-graph-trace=node`.
+
+Artifact: `/home/mudler/bench/phase16_mtp_graph_profile/20260701_043016`.
+
+Result:
+
+- baseline: `decode_agg_tps=230.5`, `graphs reused = 62`,
+- MTP: `decode_agg_tps=97.7`, `graphs reused = 1`,
+- MTP drafted (`#gen tokens = 460`, `#acc tokens = 346`),
+- `nsys stats` showed materially more GPU kernel time in MTP (~`5.89 s`) than
+  baseline (~`2.59 s`).
+
+This supports the root-cause hypothesis: current MTP serving disrupts the paged
+decode graph-reuse path and increases GPU work. If MTP is reopened, start at
+`tools/server/server-context.cpp` speculative verification batch construction
+and graph-reuse keys, not draft-length tuning.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
diff --git a/docs/superpowers/plans/2026-07-01-mtp-graph-profile-phase16.md b/docs/superpowers/plans/2026-07-01-mtp-graph-profile-phase16.md
new file mode 100644
index 000000000..fa78005a7
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-mtp-graph-profile-phase16.md
@@ -0,0 +1,154 @@
+# MTP Graph-Reuse Profile Phase 16 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use
+> superpowers:systematic-debugging before proposing source changes. Steps use
+> checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** validate the Phase 15 hypothesis that current MTP serving regresses
+because speculative verification disrupts paged CUDA graph reuse and increases
+GPU work.
+
+**Architecture:** capture a small direct `llama-server` baseline/MTP pair under
+`nsys --cuda-graph-trace=node`, using the same request shape except for
+`--spec-type draft-mtp`. Do not patch code in this phase.
+
+**Tech Stack:** llama.cpp `llama-server`, Nsight Systems, DGX GB10,
+`h2h_cli3.py`.
+
+---
+
+## Task 1: Preflight
+
+- [x] **Step 1: Confirm DGX is free**
+
+  Result:
+
+  ```text
+  docker=0
+  local_ai_worker=0
+  compute=0
+  FREE released-by-codex-phase15-mtp-serving-bench 1782872749
+  ```
+
+- [x] **Step 2: Confirm profiler is available**
+
+  Result:
+
+  ```text
+  /usr/local/bin/nsys
+  ```
+
+## Task 2: Capture Baseline and MTP Profiles
+
+- [x] **Step 1: Run baseline profile**
+
+  Command shape:
+
+  ```bash
+  nsys profile --force-overwrite=true --cuda-graph-trace=node \
+    --trace=cuda,nvtx,osrt --output="$ART/baseline/profile" \
+    ./llama-server -m "$MODEL" -ngl 99 -fa on -c 32768 -b 2048 -ub 512 \
+    --parallel 32 --host 127.0.0.1 --port 8098 --no-webui
+  ```
+
+  Client:
+
+  ```bash
+  python3 ~/bench/h2h_cli3.py --url http://127.0.0.1:8098/v1/completions \
+    --model m -n 8 --ptok 64 --gen 64 --no-cache
+  ```
+
+- [x] **Step 2: Run MTP profile**
+
+  Same as baseline plus:
+
+  ```text
+  --spec-type draft-mtp --spec-draft-n-max 3 --no-spec-draft-backend-sampling
+  ```
+
+- [x] **Step 3: Save artifacts**
+
+  Artifact root:
+
+  - `/home/mudler/bench/phase16_mtp_graph_profile/20260701_043016`
+
+  Files:
+
+  - `baseline/profile.nsys-rep`
+  - `baseline/profile.sqlite`
+  - `baseline/nsys_stats.txt`
+  - `baseline/client.json`
+  - `baseline/key_lines.txt`
+  - `mtp/profile.nsys-rep`
+  - `mtp/profile.sqlite`
+  - `mtp/nsys_stats.txt`
+  - `mtp/client.json`
+  - `mtp/key_lines.txt`
+
+## Task 3: Compare Evidence
+
+- [x] **Step 1: Compare client throughput**
+
+  Result:
+
+  ```text
+  baseline n=8: decode_agg_tps=230.5, decode_perseq_tps=28.07, wall_s=3.523
+  MTP      n=8: decode_agg_tps= 97.7, decode_perseq_tps=12.83, wall_s=7.049
+  ```
+
+- [x] **Step 2: Compare graph reuse**
+
+  Result:
+
+  ```text
+  baseline: graphs reused = 62
+  MTP: graphs reused = 1
+  ```
+
+- [x] **Step 3: Confirm MTP actually drafted**
+
+  Result:
+
+  ```text
+  common_speculative_impl_draft_mtp: - n_max=3, n_min=0, p_min=0.00
+  draft acceptance = 0.81481 (44 accepted / 54 generated)
+  statistics draft-mtp: #gen tokens = 460, #acc tokens = 346
+  ```
+
+- [x] **Step 4: Compare GPU work**
+
+  `nsys stats` kernel summaries show materially more GPU work for the MTP run:
+
+  - baseline top kernel summary total is about `2.59 s` of GPU kernel time,
+  - MTP top kernel summary total is about `5.89 s` of GPU kernel time.
+
+  This supports the graph/batch-shape hypothesis and rules out a purely
+  host-side or no-draft explanation.
+
+## Task 4: Disposition
+
+- [x] **Step 1: Record root-cause hypothesis as supported**
+
+  Phase 16 supports the Phase 15 root cause: current MTP serving loses the
+  existing paged decode graph-reuse advantage and does substantially more GPU
+  work, so it is not a viable GB10 parity lever as implemented.
+
+- [x] **Step 2: Scope the only plausible code follow-up**
+
+  Do not tune MTP draft parameters first. A source phase would need to inspect
+  `tools/server/server-context.cpp` speculative batch construction and
+  `llama-graph` reuse keys to answer:
+
+  - whether verification batches can be bucketed/reused like pure decode,
+  - whether MTP draft/verify rows force graph rebuilds by changing output rows
+    per sequence,
+  - whether target verification can be separated from normal decode graph reuse
+    without breaking rollback or greedy equivalence.
+
+  If those answers are negative, leave MTP default-off and closed for GB10.
+
+## Self-Review
+
+- No source patch was made.
+- The profile used `--cuda-graph-trace=node`.
+- The result narrows the next work to graph/batch-shape mechanics.
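
Review note (not part of the patch): the Task 3 comparison can be re-derived from the recorded numbers. The sketch below is a minimal illustration and makes some assumptions: the log snippets are copied from the results quoted in this patch rather than read from `key_lines.txt`, and the helper names `graphs_reused` and `draft_stats` are hypothetical, not part of any tool in the repository.

```python
import re

# Key lines copied from the Phase 16 results quoted in this patch.
# Assumption: the real key_lines.txt files contain lines of this shape.
BASELINE_LOG = "graphs reused = 62"
MTP_LOG = """graphs reused = 1
draft acceptance = 0.81481 (44 accepted / 54 generated)
statistics draft-mtp: #gen tokens = 460, #acc tokens = 346"""


def graphs_reused(log: str) -> int:
    """Return the integer from the first 'graphs reused = N' line."""
    match = re.search(r"graphs reused\s*=\s*(\d+)", log)
    if match is None:
        raise ValueError("no 'graphs reused' line found")
    return int(match.group(1))


def draft_stats(log: str) -> tuple[int, int]:
    """Return (#gen tokens, #acc tokens) from the draft-mtp statistics line."""
    match = re.search(r"#gen tokens\s*=\s*(\d+),\s*#acc tokens\s*=\s*(\d+)", log)
    if match is None:
        raise ValueError("no draft-mtp statistics line found")
    return int(match.group(1)), int(match.group(2))


# Client and nsys numbers as recorded in the plan (n=8, ptok=64, gen=64).
baseline = {"decode_agg_tps": 230.5, "wall_s": 3.523, "kernel_s": 2.59}
mtp = {"decode_agg_tps": 97.7, "wall_s": 7.049, "kernel_s": 5.89}

kernel_ratio = mtp["kernel_s"] / baseline["kernel_s"]          # GPU kernel time, MTP / baseline
tps_ratio = mtp["decode_agg_tps"] / baseline["decode_agg_tps"]  # aggregate decode throughput
wall_ratio = mtp["wall_s"] / baseline["wall_s"]                 # client wall time
gen_tokens, acc_tokens = draft_stats(MTP_LOG)

print(f"graphs reused: {graphs_reused(BASELINE_LOG)} -> {graphs_reused(MTP_LOG)}")
print(f"GPU kernel time ratio: {kernel_ratio:.2f}x")
print(f"decode agg t/s ratio: {tps_ratio:.2f}x")
print(f"wall time ratio: {wall_ratio:.2f}x")
print(f"draft-mtp token acceptance: {acc_tokens / gen_tokens:.3f}")
```

The kernel-time ratio computed this way matches the "~2.3x more GPU kernel work" wording used in `PARITY_HANDOFF.md`.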