From 6e354763402da9e9b9e949bf76a1e857b9bd4e54 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Wed, 1 Jul 2026 02:37:21 +0000
Subject: [PATCH] docs(paged): scope MTP graph-shape follow-up

Record Phase 17 source inspection: MTP verification rows change hard
graph dimensions, padding is not a safe shortcut, and any future work
should start with shape instrumentation before an opt-in scheduler
experiment.

Assisted-by: Codex:gpt-5
---
 .../docs/PARITY_HANDOFF.md                    |   8 ++
 .../docs/VLLM_PARITY_LEVER_MAP.md             |  26 ++++
 ...-01-mtp-graph-shape-feasibility-phase17.md | 123 ++++++++++++++++++
 3 files changed, 157 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-07-01-mtp-graph-shape-feasibility-phase17.md
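
Reviewer note: the batch-shape arithmetic recorded in these docs can be
sketched as follows. This is a minimal illustration, not the server code.
The `Slot` class, `decode_rows`, `batch_rows`, and `shape_bucket` are
hypothetical names, and the bucket key is only an assumption about what a
shape counter might record, not the actual graph-reuse key. A normal decode
slot is modeled as draft length `K = 0`.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Slot:
    """Hypothetical stand-in for a generating server slot."""
    draft_len: int = 0  # K = spec_draft.size(); 0 models a normal decode slot


def decode_rows(slot: Slot) -> int:
    # Normal decode contributes 1 output row; MTP verification contributes K + 1.
    return 1 + slot.draft_len


def batch_rows(slots: list[Slot], prompt_rows: int = 0) -> int:
    # sum(non_spec_slots * 1) + sum(spec_slots * (1 + K_i)) + prompt rows
    return sum(decode_rows(s) for s in slots) + prompt_rows


def shape_bucket(slots: list[Slot], prompt_rows: int = 0) -> tuple:
    # Illustrative bucket key: total rows plus the sorted per-slot row counts.
    per_slot = tuple(sorted(decode_rows(s) for s in slots))
    return (batch_rows(slots, prompt_rows), per_slot)


# Four made-up decode steps with two active slots each.
steps = [
    [Slot(0), Slot(0)],  # two normal slots: 2 rows
    [Slot(2), Slot(0)],  # one K=2 verification plus one normal slot: 4 rows
    [Slot(2), Slot(3)],  # K=2 and K=3 verification: 7 rows
    [Slot(0), Slot(2)],  # same shape as the second step
]
buckets = Counter(shape_bucket(step) for step in steps)
for key, count in sorted(buckets.items()):
    print(key, count)
```

Few distinct buckets with high counts would suggest shapes repeat often
enough for a grouping experiment to be worth trying; many buckets with low
counts correspond to the "high entropy" stop criterion.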

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index d325e238d..56cfde7da 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -236,6 +236,14 @@ safety gates stayed green before and after the failed serving A/B: MoE md5
 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
 
+Phase 17 source inspection found no tiny additive graph-reuse fix. MTP
+verification rows are real target decode/output rows (`K + 1` per speculative
+slot), so fake padding would touch KV, positions, logits, MTP nextn state, and
+rollback semantics. If reopened, start with a server-only shape counter around
+`server_slot::handle_last_sampled_token()`. Only then consider an opt-in
+group/defer-by-draft-length scheduler experiment, with TTFT/throughput and
+md5/op gates as kill criteria.
+
 ---
 
 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index d385829af..a3b95e7fd 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -528,6 +528,32 @@ decode graph-reuse path and increases GPU work. If MTP is reopened, start at
 `tools/server/server-context.cpp` speculative verification batch construction
 and graph-reuse keys, not draft-length tuning.
 
+### Phase 17 MTP graph-shape feasibility
+
+Phase 17 inspected the source path before any patch. Verdict: no small additive
+graph-reuse shortcut is evident.
+
+Key mechanics:
+
+- normal decode appends one `output=true` row per generating slot;
+- MTP verification appends `K + 1` `output=true` rows per speculative slot,
+  where `K = spec_draft.size()`;
+- total shape is `sum(non_spec_slots * 1) + sum(spec_slots * (1 + K_i)) + prompt rows`;
+- `n_tokens`, `n_seq_tokens`, `n_outputs`, KQ mask rows, position length, and
+  output-id count are hard graph/input dimensions;
+- paged-attention block-table bucketing does not stabilize those verification
+  token/output dimensions.
+
+Rejected shortcut: fake padding rows. They would be real target decode rows with
+KV, position, logits, MTP nextn embedding, sampling-index, and rollback effects,
+and they resemble the already rejected fixed-slot dummy-compute experiment.
+
+Only plausible next step: an instrumentation-only patch around
+`server_slot::handle_last_sampled_token()` to count verification shape buckets.
+Only after that should an opt-in scheduling experiment group/defer MTP
+verification by `1 + spec_draft.size()`. Keep it default-off and kill it if TTFT
+or throughput regresses, graph reuse does not recover, or the md5/op gates drift.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
diff --git a/docs/superpowers/plans/2026-07-01-mtp-graph-shape-feasibility-phase17.md b/docs/superpowers/plans/2026-07-01-mtp-graph-shape-feasibility-phase17.md
new file mode 100644
index 000000000..837e42794
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-mtp-graph-shape-feasibility-phase17.md
@@ -0,0 +1,123 @@
+# MTP Graph-Shape Feasibility Phase 17 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use
+> superpowers:systematic-debugging before proposing source changes. Steps use
+> checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** decide whether Phase 16's MTP graph-reuse loss has a small,
+maintainable source fix.
+
+**Architecture:** use read-only code inspection first. Split the problem into
+server speculative batch construction and graph-reuse keying. Do not patch until
+the shape mechanics are clear.
+
+**Tech Stack:** llama.cpp `tools/server`, `src/llama-graph.*`,
+`ggml-cuda` graph reuse, LocalAI paged docs.
+
+---
+
+## Task 1: Parallel Read-Only Inspection
+
+- [x] **Step 1: Inspect server speculative batch construction**
+
+  Finding:
+
+  - Normal decode appends one `output=true` row per generating slot.
+  - Speculative/MTP verification appends `K + 1` `output=true` rows per slot,
+    where `K = spec_draft.size()`.
+  - `slot.spec_i_batch` stores the absolute logical row indices for those
+    verification rows.
+  - Total batch shape becomes:
+
+    ```text
+    sum(non_spec_slots * 1) + sum(spec_slots * (1 + K_i)) + prompt rows
+    ```
+
+  Key source areas:
+
+  - `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp`
+    around `server_slot::handle_last_sampled_token()`.
+  - `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp`
+    around the `slot.handle_last_sampled_token(batch)` call site.
+  - `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp`
+    `post_decode()` speculative index validation.
+
+- [x] **Step 2: Inspect graph-reuse blockers**
+
+  Finding:
+
+  - MTP changes hard graph dimensions:
+    `n_tokens`, `n_seq_tokens`, `n_outputs`, KQ mask shape, position length, and
+    output-id count.
+  - `llm_graph_params::allow_reuse` rejects changes in these dimensions.
+  - Paged attention bucketing stabilizes block-table view dimensions only; it
+    does not stabilize verification token/output rows.
+  - CUDA graph reuse still requires copied node/source properties (`ne`, `nb`,
+    pointers, node count) to match.
+
+## Task 2: Feasibility Verdict
+
+- [x] **Step 1: Reject dummy-row padding as a shortcut**
+
+  Padding with fake verification rows is not low-risk:
+
+  - rows are real target decode rows,
+  - rows have real output logits,
+  - rows feed MTP nextn embedding/state extraction,
+  - fake rows would mutate KV, positions, sampling indices, and rollback shape.
+
+  This also resembles the previously rejected fixed-slot decode experiment,
+  where dummy compute cost exceeded graph-reuse recovery.
+
+- [x] **Step 2: Identify the only small safe hook**
+
+  A read-only shape counter around `server_slot::handle_last_sampled_token()` is
+  low-conflict and can expose:
+
+  - normal vs speculative rows,
+  - draft length `K`,
+  - output rows per sequence,
+  - `slot.spec_i_batch` range.
+
+  This is useful instrumentation, not a performance fix.
+
+- [x] **Step 3: Identify the only plausible behavior experiment**
+
+  The least invasive performance experiment is server-side scheduling, not graph
+  padding:
+
+  - group or defer speculative verification slots by `1 + spec_draft.size()`,
+  - try to make verification windows repeat shape buckets,
+  - keep it opt-in and default-off,
+  - gate with Phase 14 rollback, Phase 15 serving A/B, and pre/post inference
+    md5/op checks.
+
+  This changes serving scheduling and may regress TTFT or reduce concurrency, so
+  it needs an explicit kill gate.
+
+## Task 3: Phase 18 Scope If Pursued
+
+- [x] **Step 1: Write the source-scope boundary**
+
+  Phase 18 should be split into two incremental patches if it is attempted:
+
+  1. instrumentation-only: log or count verification shape buckets under a
+     disabled-by-default env var, no scheduling change,
+  2. opt-in scheduler experiment: group/defer MTP verification by draft length.
+
+- [x] **Step 2: Define stop criteria**
+
+  Stop and reject the source path if:
+
+  - shape counters show high entropy across draft lengths and active slots,
+  - grouping reduces graph churn but costs more throughput or TTFT than it recovers,
+  - pre/post md5 or `MUL_MAT_ID` gates drift,
+  - MTP rollback or normalized greedy-prefix gates fail.
+
+## Self-Review
+
+- No source patch was made in this phase.
+- The feasibility conclusion is narrower than "optimize MTP": instrument first,
+  then only consider an opt-in scheduler experiment.
+- No default behavior changes are proposed without a separate implementation
+  phase and gates.