docs(paged): scope MTP graph-shape follow-up

Record Phase 17 source inspection: MTP verification rows change hard graph dimensions, padding is not a safe shortcut, and any future work should start with shape instrumentation before an opt-in scheduler experiment.

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 02:37:21 +00:00
parent ae76d42a96
commit 6e35476340
3 changed files with 157 additions and 0 deletions


@@ -236,6 +236,14 @@ safety gates stayed green before and after the failed serving A/B: MoE md5
`8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
Phase 17 source inspection found no tiny additive graph-reuse fix. MTP
verification rows are real target decode/output rows (`K + 1` per speculative
slot), so fake padding would touch KV, positions, logits, MTP nextn state, and
rollback semantics. If reopened, start with a server-only shape counter around
`server_slot::handle_last_sampled_token()`. Only then consider an opt-in
group/defer-by-draft-length scheduler experiment, with TTFT/throughput and
md5/op gates as kill criteria.
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)


@@ -528,6 +528,32 @@ decode graph-reuse path and increases GPU work. If MTP is reopened, start at
`tools/server/server-context.cpp` speculative verification batch construction
and graph-reuse keys, not draft-length tuning.
### Phase 17 MTP graph-shape feasibility
Phase 17 inspected the source path before any patch. Verdict: no small additive
graph-reuse shortcut is evident.
Key mechanics:
- normal decode appends one `output=true` row per generating slot;
- MTP verification appends `K + 1` `output=true` rows per speculative slot,
where `K = spec_draft.size()`;
- total batch rows are `sum(non_spec * 1) + sum(spec * (1 + K_i)) + prompt rows`,
where `K_i` is the draft length of speculative slot `i`;
- `n_tokens`, `n_seq_tokens`, `n_outputs`, KQ mask rows, position length, and
output-id count are hard graph/input dimensions;
- paged-attention block-table bucketing does not stabilize those verification
token/output dimensions.
Rejected shortcut: fake padding rows. They would be real target decode rows with
KV, position, logits, MTP nextn embedding, sampling-index, and rollback effects,
and they resemble the already rejected fixed-slot dummy-compute experiment.
Only plausible next step: an instrumentation-only patch around
`server_slot::handle_last_sampled_token()` to count verification shape buckets.
Only after that should an opt-in scheduling experiment group/defer MTP
verification by `1 + spec_draft.size()`. Keep it default-off and kill it if TTFT
or throughput regresses, graph reuse does not recover, or the md5/op gates drift.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update


@@ -0,0 +1,123 @@
# MTP Graph-Shape Feasibility Phase 17 Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use
> superpowers:systematic-debugging before proposing source changes. Steps use
> checkbox (`- [ ]`) syntax for tracking.
**Goal:** decide whether Phase 16's MTP graph-reuse loss has a small,
maintainable source fix.
**Architecture:** use read-only code inspection first. Split the problem into
server speculative batch construction and graph-reuse keying. Do not patch until
the shape mechanics are clear.
**Tech Stack:** llama.cpp `tools/server`, `src/llama-graph.*`,
`ggml-cuda` graph reuse, LocalAI paged docs.
---
## Task 1: Parallel Read-Only Inspection
- [x] **Step 1: Inspect server speculative batch construction**
Finding:
- Normal decode appends one `output=true` row per generating slot.
- Speculative/MTP verification appends `K + 1` `output=true` rows per slot,
where `K = spec_draft.size()`.
- `slot.spec_i_batch` stores the absolute logical row indices for those
verification rows.
- Total batch shape becomes:
```text
sum(non_spec_slots * 1) + sum(spec_slots * (1 + K_i)) + prompt rows
```
Key source areas:
- `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp`
around `server_slot::handle_last_sampled_token()`.
- `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp`
around the `slot.handle_last_sampled_token(batch)` call site.
- `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp`
`post_decode()` speculative index validation.
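The row arithmetic above can be checked with a small model. This is a minimal sketch, not the server code: `Slot`, `verification_rows`, and `batch_rows` are hypothetical names, and a normal decode slot is modelled as a slot with an empty draft.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    """Hypothetical stand-in for a generating server slot."""
    # Drafted tokens awaiting verification; empty means normal decode.
    draft: list = field(default_factory=list)


def verification_rows(slot):
    # One row for the last sampled token plus one row per drafted token.
    return 1 + len(slot.draft)


def batch_rows(slots, prompt_rows=0):
    # Rows contributed by generating slots plus any prompt rows in the batch.
    return sum(verification_rows(s) for s in slots) + prompt_rows


# One normal slot, one slot with K = 3, one slot with K = 1.
slots = [Slot(), Slot(draft=[11, 12, 13]), Slot(draft=[21])]
total = batch_rows(slots)
```

With these inputs the three slots contribute 1, 4, and 2 rows, so the row count moves whenever any `K_i` moves.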
- [x] **Step 2: Inspect graph-reuse blockers**
Finding:
- MTP changes hard graph dimensions:
`n_tokens`, `n_seq_tokens`, `n_outputs`, KQ mask shape, position length, and
output-id count.
- `llm_graph_params::allow_reuse` rejects changes in these dimensions.
- Paged attention bucketing stabilizes block-table view dimensions only; it
does not stabilize verification token/output rows.
- CUDA graph reuse still requires copied node/source properties (`ne`, `nb`,
pointers, node count) to match.
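To illustrate why a varying draft length defeats reuse, the following sketch models reuse as an exact match on a small shape key. It is an assumption-laden toy: `ShapeKey`, `shape_key`, and `can_reuse` are hypothetical, prompt rows are ignored, and the real checks compare more dimensions than the two shown here.

```python
from typing import NamedTuple


class ShapeKey(NamedTuple):
    """Hypothetical subset of the dimensions a reuse check compares."""
    n_tokens: int
    n_outputs: int


def shape_key(draft_lengths):
    # Each generating slot contributes 1 + K rows, all of them output rows.
    rows = sum(1 + k for k in draft_lengths)
    return ShapeKey(n_tokens=rows, n_outputs=rows)


def can_reuse(prev, cur):
    # Toy rule: reuse only when every tracked dimension is unchanged.
    return prev is not None and prev == cur


# Draft lengths per slot over four consecutive decode steps.
steps = [[2, 2], [2, 2], [3, 2], [0, 2]]
prev = None
reused = []
for drafts in steps:
    key = shape_key(drafts)
    reused.append(can_reuse(prev, key))
    prev = key
```

Only the second step matches its predecessor in this toy run; any step where a draft length changes produces a new key.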
## Task 2: Feasibility Verdict
- [x] **Step 1: Reject dummy-row padding as a shortcut**
Padding with fake verification rows is not low-risk:
- the rows are real target decode rows,
- the rows produce real output logits,
- the rows feed MTP nextn embedding/state extraction,
- fake rows would mutate KV, positions, sampling indices, and rollback shape.
This also resembles the previously rejected fixed-slot decode experiment,
where dummy compute cost exceeded graph-reuse recovery.
- [x] **Step 2: Identify the only small safe hook**
A read-only shape counter around `server_slot::handle_last_sampled_token()` is
low-conflict and can expose:
- normal vs speculative rows,
- draft length `K`,
- output rows per sequence,
- `slot.spec_i_batch` range.
This is useful instrumentation, not a performance fix.
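A shape counter of this kind could look roughly like the sketch below. It is illustrative only and written in Python rather than in the server's C++; `ShapeCounter` and its bucket key are hypothetical choices. Each step is described by one entry per generating slot: `None` for a normal decode slot, or the draft length `K` for a speculative slot.

```python
from collections import Counter


class ShapeCounter:
    """Hypothetical read-only counter of verification shape buckets."""

    def __init__(self):
        self.buckets = Counter()

    def record(self, slots):
        # slots: one entry per generating slot, None (normal) or K (speculative).
        n_normal = sum(1 for k in slots if k is None)
        drafts = tuple(sorted(k for k in slots if k is not None))
        # Bucket key: count of normal rows plus the sorted draft lengths.
        self.buckets[(n_normal, drafts)] += 1

    def report(self):
        # Most frequent buckets first.
        return self.buckets.most_common()


counter = ShapeCounter()
counter.record([None, 2, 2])
counter.record([2, None, 2])
counter.record([3, None, 1])
```

Sorting the draft lengths makes the bucket independent of slot order, which is an assumption about what counts as "the same shape"; an order-sensitive key would be stricter.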
- [x] **Step 3: Identify the only plausible behavior experiment**
The least invasive performance experiment is server-side scheduling, not graph
padding:
- group or defer speculative verification slots by `1 + spec_draft.size()`,
- try to make verification windows repeat shape buckets,
- keep it opt-in and default-off,
- gate with Phase 14 rollback, Phase 15 serving A/B, and pre/post inference
md5/op checks.
This changes serving scheduling and may regress TTFT or reduce concurrency, so
it needs an explicit kill gate.
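The grouping idea can be sketched as follows. This is a toy policy under stated assumptions, not a proposed implementation: `group_by_verification_rows` and `pick_window` are hypothetical, and "verify the largest group, defer the rest" is just one possible rule.

```python
from collections import defaultdict


def group_by_verification_rows(pending):
    """pending: dict of slot id -> draft length K (hypothetical input)."""
    groups = defaultdict(list)
    for slot_id, k in pending.items():
        groups[1 + k].append(slot_id)
    return dict(groups)


def pick_window(pending):
    groups = group_by_verification_rows(pending)
    # Prefer the largest group; break ties toward fewer rows per slot.
    rows, chosen = max(groups.items(), key=lambda kv: (len(kv[1]), -kv[0]))
    # Every other slot waits for a later window.
    deferred = [s for s in pending if s not in chosen]
    return rows, chosen, deferred


pending = {0: 2, 1: 3, 2: 2, 3: 1}
rows, chosen, deferred = pick_window(pending)
```

Here slots 0 and 2 share a window of 3 rows each while slots 1 and 3 wait, which makes the latency and concurrency cost of deferral concrete.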
## Task 3: Phase 18 Scope If Pursued
- [x] **Step 1: Write the source-scope boundary**
Phase 18 should be split into two incremental patches if it is attempted:
1. instrumentation-only: log or count verification shape buckets under a
disabled-by-default env var, no scheduling change,
2. opt-in scheduler experiment: group/defer MTP verification by draft length.
- [x] **Step 2: Define stop criteria**
Stop and reject the source path if:
- shape counters show high entropy across draft lengths and active slots,
- grouping reduces graph churn but costs more in throughput or TTFT than it
recovers,
- pre/post md5 or `MUL_MAT_ID` gates drift,
- MTP rollback or normalized greedy-prefix gates fail.
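One way to put a number on the first criterion is Shannon entropy over the observed shape buckets. The sketch below is an assumption about how "high entropy" might be measured; `shape_entropy_bits` is a hypothetical helper, and no threshold is implied by this plan.

```python
import math


def shape_entropy_bits(bucket_counts):
    """Shannon entropy, in bits, of a mapping from shape bucket to count."""
    total = sum(bucket_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in bucket_counts.values():
        if count > 0:
            p = count / total
            entropy += p * math.log2(1 / p)
    return entropy


# A single repeated shape versus four equally likely shapes.
stable = shape_entropy_bits({"a": 8})
churny = shape_entropy_bits({"a": 2, "b": 2, "c": 2, "d": 2})
```

A value near zero means one shape dominates and reuse has a chance; values approaching `log2` of the bucket count mean shapes are spread evenly and grouping has little to work with.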
## Self-Review
- No source patch was made in this phase.
- The feasibility conclusion is narrower than "optimize MTP": instrument first,
then only consider an opt-in scheduler experiment.
- No default behavior changes are proposed without a separate implementation
phase and gates.