From 6e354763402da9e9b9e949bf76a1e857b9bd4e54 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Wed, 1 Jul 2026 02:37:21 +0000
Subject: [PATCH] docs(paged): scope MTP graph-shape follow-up

Record Phase 17 source inspection: MTP verification rows change hard graph
dimensions, padding is not a safe shortcut, and any future work should start
with shape instrumentation before an opt-in scheduler experiment.

Assisted-by: Codex:gpt-5
---
 .../docs/PARITY_HANDOFF.md                    |   8 ++
 .../docs/VLLM_PARITY_LEVER_MAP.md             |  26 ++++
 ...-01-mtp-graph-shape-feasibility-phase17.md | 123 ++++++++++++++++++
 3 files changed, 157 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-07-01-mtp-graph-shape-feasibility-phase17.md

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index d325e238d..56cfde7da 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -236,6 +236,14 @@ safety gates stayed green before and after the failed serving A/B:
 MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
 
+Phase 17 source inspection found no tiny additive graph-reuse fix. MTP
+verification rows are real target decode/output rows (`K + 1` per speculative
+slot), so fake padding would touch KV, positions, logits, MTP nextn state, and
+rollback semantics. If reopened, start with a server-only shape counter around
+`server_slot::handle_last_sampled_token()`. Only then consider an opt-in
+group/defer-by-draft-length scheduler experiment, with TTFT/throughput and
+md5/op gates as kill criteria.
+
 ---
 
 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index d385829af..a3b95e7fd 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -528,6 +528,32 @@ decode graph-reuse path and increases GPU work. If MTP is reopened, start at
 `tools/server/server-context.cpp` speculative verification batch construction
 and graph-reuse keys, not draft-length tuning.
 
+### Phase 17 MTP graph-shape feasibility
+
+Phase 17 inspected the source path before any patch. Verdict: no small additive
+graph-reuse shortcut is evident.
+
+Key mechanics:
+
+- normal decode appends one `output=true` row per generating slot;
+- MTP verification appends `K + 1` `output=true` rows per speculative slot,
+  where `K = spec_draft.size()`;
+- total shape is `sum(non_spec * 1) + sum(spec * (1 + K_i)) + prompt rows`;
+- `n_tokens`, `n_seq_tokens`, `n_outputs`, KQ mask rows, position length, and
+  output-id count are hard graph/input dimensions;
+- paged-attention block-table bucketing does not stabilize those verification
+  token/output dimensions.
+
+Rejected shortcut: fake padding rows. They would be real target decode rows with
+KV, position, logits, MTP nextn embedding, sampling-index, and rollback effects,
+and they resemble the already rejected fixed-slot dummy-compute experiment.
+
+Only plausible next step: an instrumentation-only patch around
+`server_slot::handle_last_sampled_token()` to count verification shape buckets.
+Only after that should an opt-in scheduling experiment group/defer MTP
+verification by `1 + spec_draft.size()`. Keep it default-off and kill it if TTFT
+or throughput regresses, graph reuse does not recover, or the md5/op gates drift.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
diff --git a/docs/superpowers/plans/2026-07-01-mtp-graph-shape-feasibility-phase17.md b/docs/superpowers/plans/2026-07-01-mtp-graph-shape-feasibility-phase17.md
new file mode 100644
index 000000000..837e42794
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-mtp-graph-shape-feasibility-phase17.md
@@ -0,0 +1,123 @@
+# MTP Graph-Shape Feasibility Phase 17 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use
+> superpowers:systematic-debugging before proposing source changes. Steps use
+> checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** decide whether Phase 16's MTP graph-reuse loss has a small,
+maintainable source fix.
+
+**Architecture:** use read-only code inspection first. Split the problem into
+server speculative batch construction and graph-reuse keying. Do not patch until
+the shape mechanics are clear.
+
+**Tech Stack:** llama.cpp `tools/server`, `src/llama-graph.*`,
+`ggml-cuda` graph reuse, LocalAI paged docs.
+
+---
+
+## Task 1: Parallel Read-Only Inspection
+
+- [x] **Step 1: Inspect server speculative batch construction**
+
+  Finding:
+
+  - Normal decode appends one `output=true` row per generating slot.
+  - Speculative/MTP verification appends `K + 1` `output=true` rows per slot,
+    where `K = spec_draft.size()`.
+  - `slot.spec_i_batch` stores the absolute logical row indices for those
+    verification rows.
+  - Total batch shape becomes:
+
+    ```text
+    sum(non_spec_slots * 1) + sum(spec_slots * (1 + K_i)) + prompt rows
+    ```
+
+  Key source areas:
+
+  - `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp`
+    around `server_slot::handle_last_sampled_token()`.
+  - `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp`
+    around the `slot.handle_last_sampled_token(batch)` call site.
+  - `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp`
+    `post_decode()` speculative index validation.
+
+- [x] **Step 2: Inspect graph-reuse blockers**
+
+  Finding:
+
+  - MTP changes hard graph dimensions:
+    `n_tokens`, `n_seq_tokens`, `n_outputs`, KQ mask shape, position length, and
+    output-id count.
+  - `llm_graph_params::allow_reuse` rejects changes in these dimensions.
+  - Paged attention bucketing stabilizes block-table view dimensions only; it
+    does not stabilize verification token/output rows.
+  - CUDA graph reuse still requires copied node/source properties (`ne`, `nb`,
+    pointers, node count) to match.
+
+## Task 2: Feasibility Verdict
+
+- [x] **Step 1: Reject dummy-row padding as a shortcut**
+
+  Padding fake verification rows is not low-risk:
+
+  - rows are real target decode rows,
+  - rows have real output logits,
+  - rows feed MTP nextn embedding/state extraction,
+  - fake rows would mutate KV, positions, sampling indices, and rollback shape.
+
+  This also resembles the previously rejected fixed-slot decode experiment,
+  where dummy compute cost exceeded graph-reuse recovery.
+
+- [x] **Step 2: Identify the only small safe hook**
+
+  A read-only shape counter around `server_slot::handle_last_sampled_token()` is
+  low-conflict and can expose:
+
+  - normal vs speculative rows,
+  - draft length `K`,
+  - output rows per sequence,
+  - `slot.spec_i_batch` range.
+
+  This is useful instrumentation, not a performance fix.
+
+- [x] **Step 3: Identify the only plausible behavior experiment**
+
+  The least invasive performance experiment is server-side scheduling, not graph
+  padding:
+
+  - group or defer speculative verification slots by `1 + spec_draft.size()`,
+  - try to make verification windows repeat shape buckets,
+  - keep it opt-in and default-off,
+  - gate with Phase 14 rollback, Phase 15 serving A/B, and pre/post inference
+    md5/op checks.
+
+  This changes serving scheduling and may regress TTFT or reduce concurrency, so
+  it needs an explicit kill gate.
+
+## Task 3: Phase 18 Scope If Pursued
+
+- [x] **Step 1: Write the source-scope boundary**
+
+  Phase 18 should be split into two incremental patches if it is attempted:
+
+  1. instrumentation-only: log or count verification shape buckets under a
+     disabled-by-default env var, no scheduling change,
+  2. opt-in scheduler experiment: group/defer MTP verification by draft length.
+
+- [x] **Step 2: Define stop criteria**
+
+  Stop and reject the source path if:
+
+  - shape counters show high entropy across draft lengths and active slots,
+  - grouping reduces graph churn but loses more throughput/TTFT than it recovers,
+  - pre/post md5 or `MUL_MAT_ID` gates drift,
+  - MTP rollback or normalized greedy-prefix gates fail.
+
+## Self-Review
+
+- No source patch was made in this phase.
+- The feasibility conclusion is narrower than "optimize MTP": instrument first,
+  then only consider an opt-in scheduler experiment.
+- No default behavior changes are proposed without a separate implementation
+  phase and gates.
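
Illustrative note (not part of the patch): the row arithmetic from Task 1 and the shape-bucket counting proposed in Task 2 can be sketched offline in Python. This is a minimal model under stated assumptions, not the llama.cpp implementation. `Slot`, `output_rows`, `batch_rows`, `shape_bucket`, and `count_buckets` are hypothetical names, and the bucket key (total rows plus sorted per-slot rows) is only a rough proxy for the real graph dimensions such as `n_tokens` and `n_outputs`.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Slot:
    """Hypothetical stand-in for a generating slot (not the llama.cpp type)."""
    draft_len: int = 0  # K = spec_draft.size(); 0 models a normal decode slot


def output_rows(slot: Slot) -> int:
    # Normal decode contributes 1 row; MTP verification contributes K + 1 rows.
    return 1 + slot.draft_len


def batch_rows(slots: Iterable[Slot], prompt_rows: int = 0) -> int:
    # sum(non_spec * 1) + sum(spec * (1 + K_i)) + prompt rows
    return sum(output_rows(s) for s in slots) + prompt_rows


def shape_bucket(slots: List[Slot], prompt_rows: int = 0) -> Tuple[int, Tuple[int, ...]]:
    # Assumed proxy for the graph shape: total rows and sorted per-slot row counts.
    per_slot = tuple(sorted(output_rows(s) for s in slots))
    return (batch_rows(slots, prompt_rows), per_slot)


def count_buckets(steps: Iterable[List[Slot]]) -> Counter:
    # Count how often each shape bucket repeats across decode steps.
    return Counter(shape_bucket(step) for step in steps)


# Four decode steps with varying draft lengths (made-up values).
steps = [
    [Slot(0), Slot(3)],
    [Slot(0), Slot(3)],
    [Slot(2), Slot(3)],
    [Slot(0), Slot(1)],
]
buckets = count_buckets(steps)
for shape, hits in sorted(buckets.items()):
    print(shape, hits)
```

In this toy trace only one bucket repeats. Many distinct buckets with few repeats would correspond to the "high entropy" stop criterion in Task 3, since grouping by draft length can only help if shapes recur.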