docs(paged): scope MTP graph-shape follow-up

Record Phase 17 source inspection: MTP verification rows change hard graph dimensions, padding is not a safe shortcut, and any future work should start with shape instrumentation before an opt-in scheduler experiment.

Assisted-by: Codex:gpt-5
2026-07-02 20:37:03 -04:00 · 2026-07-01 02:37:21 +00:00
parent ae76d42a96
commit 6e35476340
3 changed files with 157 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -236,6 +236,14 @@ safety gates stayed green before and after the failed serving A/B: MoE md5
 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.

+Phase 17 source inspection found no tiny additive graph-reuse fix. MTP
+verification rows are real target decode/output rows (`K + 1` per speculative
+slot), so fake padding would touch KV, positions, logits, MTP nextn state, and
+rollback semantics. If reopened, start with a server-only shape counter around
+`server_slot::handle_last_sampled_token()`. Only then consider an opt-in
+group/defer-by-draft-length scheduler experiment, with TTFT/throughput and
+md5/op gates as kill criteria.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -528,6 +528,32 @@ decode graph-reuse path and increases GPU work. If MTP is reopened, start at
 `tools/server/server-context.cpp` speculative verification batch construction
 and graph-reuse keys, not draft-length tuning.

+### Phase 17 MTP graph-shape feasibility
+
+Phase 17 inspected the source path before any patch. Verdict: no small additive
+graph-reuse shortcut is evident.
+
+Key mechanics:
+
+- normal decode appends one `output=true` row per generating slot;
+- MTP verification appends `K + 1` `output=true` rows per speculative slot,
+  where `K = spec_draft.size()`;
+- total shape is `sum(non_spec * 1) + sum(spec * (1 + K_i)) + prompt rows`;
+- `n_tokens`, `n_seq_tokens`, `n_outputs`, KQ mask rows, position length, and
+  output-id count are hard graph/input dimensions;
+- paged-attention block-table bucketing does not stabilize those verification
+  token/output dimensions.
+
+Rejected shortcut: fake padding rows. They would be real target decode rows with
+KV, position, logits, MTP nextn embedding, sampling-index, and rollback effects,
+and they resemble the already rejected fixed-slot dummy-compute experiment.
+
+Only plausible next step: an instrumentation-only patch around
+`server_slot::handle_last_sampled_token()` to count verification shape buckets.
+Only after that should an opt-in scheduling experiment group/defer MTP
+verification by `1 + spec_draft.size()`. Keep it default-off and kill it if TTFT
+or throughput regresses, graph reuse does not recover, or the md5/op gates drift.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update
--- a/docs/superpowers/plans/2026-07-01-mtp-graph-shape-feasibility-phase17.md
+++ b/docs/superpowers/plans/2026-07-01-mtp-graph-shape-feasibility-phase17.md
@@ -0,0 +1,123 @@
+# MTP Graph-Shape Feasibility Phase 17 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use
+> superpowers:systematic-debugging before proposing source changes. Steps use
+> checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** decide whether Phase 16's MTP graph-reuse loss has a small,
+maintainable source fix.
+
+**Architecture:** use read-only code inspection first. Split the problem into
+server speculative batch construction and graph-reuse keying. Do not patch until
+the shape mechanics are clear.
+
+**Tech Stack:** llama.cpp `tools/server`, `src/llama-graph.*`,
+`ggml-cuda` graph reuse, LocalAI paged docs.
+
+---
+
+## Task 1: Parallel Read-Only Inspection
+
+- [x] **Step 1: Inspect server speculative batch construction**
+
+  Finding:
+
+  - Normal decode appends one `output=true` row per generating slot.
+  - Speculative/MTP verification appends `K + 1` `output=true` rows per
+    speculative slot, where `K = spec_draft.size()`.
+  - `slot.spec_i_batch` stores the absolute logical row indices for those
+    verification rows.
+  - Total batch shape becomes:
+
+    ```text
+    sum(non_spec_slots * 1) + sum(spec_slots * (1 + K_i)) + prompt rows
+    ```
+
+  Key source areas:
+
+  - `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp`
+    around `server_slot::handle_last_sampled_token()`.
+  - `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp`
+    around the `slot.handle_last_sampled_token(batch)` call site.
+  - `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp`
+    `post_decode()` speculative index validation.
+
+- [x] **Step 2: Inspect graph-reuse blockers**
+
+  Finding:
+
+  - MTP changes hard graph dimensions:
+    `n_tokens`, `n_seq_tokens`, `n_outputs`, KQ mask shape, position length, and
+    output-id count.
+  - `llm_graph_params::allow_reuse` rejects changes in these dimensions.
+  - Paged attention bucketing stabilizes block-table view dimensions only; it
+    does not stabilize verification token/output rows.
+  - CUDA graph reuse still requires copied node/source properties (`ne`, `nb`,
+    pointers, node count) to match.
+
+## Task 2: Feasibility Verdict
+
+- [x] **Step 1: Reject dummy-row padding as a shortcut**
+
+  Padding fake verification rows is not low-risk:
+
+  - rows are real target decode rows,
+  - rows have real output logits,
+  - rows feed MTP nextn embedding/state extraction,
+  - fake rows would mutate KV, positions, sampling indices, and rollback shape.
+
+  This also resembles the previously rejected fixed-slot decode experiment,
+  where dummy compute cost exceeded graph-reuse recovery.
+
+- [x] **Step 2: Identify the only small safe hook**
+
+  A read-only shape counter around `server_slot::handle_last_sampled_token()` is
+  low-conflict and can expose:
+
+  - normal vs speculative rows,
+  - draft length `K`,
+  - output rows per sequence,
+  - `slot.spec_i_batch` range.
+
+  This is useful instrumentation, not a performance fix.
+
+- [x] **Step 3: Identify the only plausible behavior experiment**
+
+  The least invasive performance experiment is server-side scheduling, not graph
+  padding:
+
+  - group or defer speculative verification slots by `1 + spec_draft.size()`,
+  - try to make verification windows repeat shape buckets,
+  - keep it opt-in and default-off,
+  - gate with Phase 14 rollback, Phase 15 serving A/B, and pre/post inference
+    md5/op checks.
+
+  This changes serving scheduling and may regress TTFT or reduce concurrency, so
+  it needs an explicit kill gate.
+
+## Task 3: Phase 18 Scope If Pursued
+
+- [x] **Step 1: Write the source-scope boundary**
+
+  Phase 18 should be split into two incremental patches if it is attempted:
+
+  1. instrumentation-only: log or count verification shape buckets under a
+     disabled-by-default env var, no scheduling change,
+  2. opt-in scheduler experiment: group/defer MTP verification by draft length.
+
+- [x] **Step 2: Define stop criteria**
+
+  Stop and reject the source path if:
+
+  - shape counters show high entropy across draft lengths and active slots,
+  - grouping reduces graph churn but costs more throughput/TTFT than it recovers,
+  - pre/post md5 or `MUL_MAT_ID` gates drift,
+  - MTP rollback or normalized greedy-prefix gates fail.
+
+## Self-Review
+
+- No source patch was made in this phase.
+- The feasibility conclusion is narrower than "optimize MTP": instrument first,
+  then only consider an opt-in scheduler experiment.
+- No default behavior changes are proposed without a separate implementation
+  phase and gates.
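
Reviewer sketch (not part of this commit): the row accounting from Task 1 and the shape-bucket counting proposed for Phase 18 can be modeled offline in a few lines of Python. This is a standalone toy, not llama.cpp code. The names `batch_rows`, `bucket_key`, `count_buckets`, and `top_bucket_share` are hypothetical, and the choice of bucket key (normal row count, sorted verification widths `1 + K_i`, prompt row count) is an assumption about what a useful counter might record, not something the source inspection established.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple

# One decode step, described as:
#   (normal decode slots, draft length K_i per speculative slot, prompt rows)
Step = Tuple[int, Sequence[int], int]
BucketKey = Tuple[int, Tuple[int, ...], int]


def batch_rows(n_normal: int, draft_lens: Sequence[int], n_prompt_rows: int = 0) -> int:
    """Row count from the plan: 1 per normal slot, 1 + K_i per speculative slot,
    plus prompt rows."""
    return n_normal + sum(1 + k for k in draft_lens) + n_prompt_rows


def bucket_key(n_normal: int, draft_lens: Sequence[int], n_prompt_rows: int = 0) -> BucketKey:
    """Hypothetical bucket key. Sorting the widths treats slot order as
    irrelevant, which is an assumption of this sketch."""
    widths = tuple(sorted(1 + k for k in draft_lens))
    return (n_normal, widths, n_prompt_rows)


def count_buckets(steps: Iterable[Step]) -> Counter:
    """Count how often each shape bucket repeats across decode steps."""
    counts: Counter = Counter()
    for n_normal, draft_lens, n_prompt_rows in steps:
        counts[bucket_key(n_normal, draft_lens, n_prompt_rows)] += 1
    return counts


def top_bucket_share(counts: Counter) -> float:
    """Fraction of steps that fall in the most common bucket. A low value is
    one way to read the 'high entropy' stop criterion."""
    total = sum(counts.values())
    return max(counts.values()) / total if total else 0.0


if __name__ == "__main__":
    # Made-up trace for illustration only; not measured data.
    trace = [(2, [3, 3], 0), (2, [3, 3], 0), (2, [1, 3], 0), (1, [2], 4)]
    for step in trace:
        print(step, "rows =", batch_rows(*step))
    counts = count_buckets(trace)
    print("distinct buckets:", len(counts), "top share:", top_bucket_share(counts))
```

If a real counter were added, a low top-bucket share on production traffic would be a cheap signal to stop before attempting the scheduler experiment, since grouping by `1 + spec_draft.size()` only helps when a few buckets dominate.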