mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 20:37:03 -04:00
docs(paged): scope MTP graph-shape follow-up
Record Phase 17 source inspection: MTP verification rows change hard graph dimensions, padding is not a safe shortcut, and any future work should start with shape instrumentation before an opt-in scheduler experiment.

Assisted-by: Codex:gpt-5
@@ -236,6 +236,14 @@ safety gates stayed green before and after the failed serving A/B: MoE md5
`8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.

Phase 17 source inspection found no tiny additive graph-reuse fix. MTP
verification rows are real target decode/output rows (`K + 1` per speculative
slot), so fake padding would touch KV, positions, logits, MTP nextn state, and
rollback semantics. If reopened, start with a server-only shape counter around
`server_slot::handle_last_sampled_token()`. Only then consider an opt-in
group/defer-by-draft-length scheduler experiment, with TTFT/throughput and
md5/op gates as kill criteria.

---

## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -528,6 +528,32 @@ decode graph-reuse path and increases GPU work. If MTP is reopened, start at
`tools/server/server-context.cpp` speculative verification batch construction
and graph-reuse keys, not draft-length tuning.

### Phase 17 MTP graph-shape feasibility

Phase 17 inspected the source path before any patch. Verdict: no small additive
graph-reuse shortcut is evident.

Key mechanics:

- normal decode appends one `output=true` row per generating slot;
- MTP verification appends `K + 1` `output=true` rows per speculative slot,
  where `K = spec_draft.size()`;
- total shape is `sum(non_spec * 1) + sum(spec * (1 + K_i)) + prompt rows`;
- `n_tokens`, `n_seq_tokens`, `n_outputs`, KQ mask rows, position length, and
  output-id count are hard graph/input dimensions;
- paged-attention block-table bucketing does not stabilize those verification
  token/output dimensions.

Rejected shortcut: fake padding rows. They would be real target decode rows with
KV, position, logits, MTP nextn embedding, sampling-index, and rollback effects,
and they resemble the already rejected fixed-slot dummy-compute experiment.

Only plausible next step: an instrumentation-only patch around
`server_slot::handle_last_sampled_token()` to count verification shape buckets.
Only after that should an opt-in scheduling experiment group/defer MTP
verification by `1 + spec_draft.size()`. Keep it default-off and kill it if TTFT
or throughput regresses, graph reuse does not recover, or the md5/op gates drift.
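
The kill conditions in the paragraph above can be written as a single check. This is a minimal Python sketch, not part of the patch set: `RunMetrics`, its field names, and the strict comparisons with no noise tolerance are all assumptions made for illustration, and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Hypothetical summary of one serving run; field names are illustrative."""
    ttft_ms: float
    tokens_per_s: float
    graph_reuse_rate: float  # fraction of decode steps that reused a graph
    md5_ok: bool             # md5 gates matched the recorded hashes
    op_ok: bool              # op gate (for example MUL_MAT_ID) passed


def kill_reasons(base: RunMetrics, exp: RunMetrics) -> list[str]:
    """Return every kill condition the experiment run triggers."""
    reasons = []
    if exp.ttft_ms > base.ttft_ms:
        reasons.append("TTFT regressed")
    if exp.tokens_per_s < base.tokens_per_s:
        reasons.append("throughput regressed")
    if exp.graph_reuse_rate <= base.graph_reuse_rate:
        reasons.append("graph reuse did not recover")
    if not (exp.md5_ok and exp.op_ok):
        reasons.append("md5/op gates drifted")
    return reasons


# Hypothetical numbers, only to exercise the check.
baseline = RunMetrics(120.0, 900.0, 0.40, True, True)
experiment = RunMetrics(150.0, 910.0, 0.40, True, True)
print(kill_reasons(baseline, experiment))
# ['TTFT regressed', 'graph reuse did not recover']
```

A real gate would likely compare against a measured noise band rather than strict inequalities; the source does not specify one.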

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update
@@ -0,0 +1,123 @@
# MTP Graph-Shape Feasibility Phase 17 Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use
> superpowers:systematic-debugging before proposing source changes. Steps use
> checkbox (`- [ ]`) syntax for tracking.

**Goal:** decide whether Phase 16's MTP graph-reuse loss has a small,
maintainable source fix.

**Architecture:** use read-only code inspection first. Split the problem into
server speculative batch construction and graph-reuse keying. Do not patch until
the shape mechanics are clear.

**Tech Stack:** llama.cpp `tools/server`, `src/llama-graph.*`,
`ggml-cuda` graph reuse, LocalAI paged docs.

---

## Task 1: Parallel Read-Only Inspection

- [x] **Step 1: Inspect server speculative batch construction**

Finding:

- Normal decode appends one `output=true` row per generating slot.
- Speculative/MTP verification appends `K + 1` `output=true` rows per slot,
  where `K = spec_draft.size()`.
- `slot.spec_i_batch` stores the absolute logical row indices for those
  verification rows.
- Total batch shape becomes:

```text
sum(non_spec_slots * 1) + sum(spec_slots * (1 + K_i)) + prompt rows
```
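
A worked instance of the formula may help when reading shape logs. The Python sketch below is illustrative only: `Slot`, `output_rows`, and `batch_rows` are hypothetical names, not llama.cpp identifiers, and a draft length of 0 is assumed to model a normal decode slot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    """Hypothetical stand-in for one generating server slot."""
    draft_len: int = 0  # K; 0 models a normal (non-speculative) decode slot


def output_rows(slot: Slot) -> int:
    # Normal decode contributes 1 row; MTP verification contributes K + 1.
    return 1 + slot.draft_len


def batch_rows(slots: list[Slot], prompt_rows: int = 0) -> int:
    # Sum of per-slot rows plus any prompt rows sharing the batch.
    return sum(output_rows(s) for s in slots) + prompt_rows


slots = [Slot(), Slot(), Slot(draft_len=3), Slot(draft_len=1)]
print(batch_rows(slots))                  # 1 + 1 + 4 + 2 = 8
print(batch_rows(slots, prompt_rows=16))  # 24
```

Changing a single `K_i` changes the total, which is why the row count does not settle on its own.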

Key source areas:

- `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp`
  around `server_slot::handle_last_sampled_token()`.
- `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp`
  around the `slot.handle_last_sampled_token(batch)` call site.
- `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp`
  `post_decode()` speculative index validation.

- [x] **Step 2: Inspect graph-reuse blockers**

Finding:

- MTP changes hard graph dimensions:
  `n_tokens`, `n_seq_tokens`, `n_outputs`, KQ mask shape, position length, and
  output-id count.
- `llm_graph_params::allow_reuse` rejects changes in these dimensions.
- Paged attention bucketing stabilizes block-table view dimensions only; it
  does not stabilize verification token/output rows.
- CUDA graph reuse still requires copied node/source properties (`ne`, `nb`,
  pointers, node count) to match.
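
The effect of these blockers can be pictured with a toy reuse key. The Python sketch below is a simplified model, not the logic of `llm_graph_params::allow_reuse`: it assumes reuse needs an exact match on two of the listed dimensions, and `GraphShape`, `can_reuse`, and `count_reuse` are hypothetical names.

```python
from typing import NamedTuple


class GraphShape(NamedTuple):
    """Toy reuse key holding a subset of the dimensions listed above."""
    n_tokens: int
    n_outputs: int


def can_reuse(prev: GraphShape, cur: GraphShape) -> bool:
    # Toy rule: any change in a hard dimension forces a graph rebuild.
    return prev == cur


def count_reuse(shapes: list[GraphShape]) -> int:
    """Count steps whose shape matches the immediately preceding step."""
    return sum(can_reuse(a, b) for a, b in zip(shapes, shapes[1:]))


# Four slots doing plain decode: the shape repeats every step.
plain = [GraphShape(4, 4)] * 5
# Three plain slots plus one slot verifying drafts of length K = 3, 1, 2, 3, 0.
mtp = [GraphShape(3 + 1 + k, 3 + 1 + k) for k in (3, 1, 2, 3, 0)]

print(count_reuse(plain))  # 4
print(count_reuse(mtp))    # 0
```

In this model a varying draft length on one slot is enough to defeat step-to-step reuse for the whole batch.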

## Task 2: Feasibility Verdict

- [x] **Step 1: Reject dummy-row padding as a shortcut**

Padding fake verification rows is not low-risk:

- rows are real target decode rows,
- rows have real output logits,
- rows feed MTP nextn embedding/state extraction,
- fake rows would mutate KV, positions, sampling indices, and rollback shape.

This also resembles the previously rejected fixed-slot decode experiment,
where dummy compute cost exceeded graph-reuse recovery.

- [x] **Step 2: Identify the only small safe hook**

A read-only shape counter around `server_slot::handle_last_sampled_token()` is
low-conflict and can expose:

- normal vs speculative rows,
- draft length `K`,
- output rows per sequence,
- `slot.spec_i_batch` range.

This is useful instrumentation, not a performance fix.
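
What such a counter might record can be modeled offline before touching the server. The Python sketch below is a stand-in, not the proposed C++ hook: `ShapeCounter` and `observe_batch` are hypothetical, and the input is assumed to be the draft length `K` of every generating slot in a batch (0 for normal decode).

```python
from collections import Counter


class ShapeCounter:
    """Hypothetical read-only counter; it never alters the batch."""

    def __init__(self) -> None:
        self.rows_per_slot = Counter()  # 1 + K -> occurrences
        self.batch_rows = Counter()     # total rows per batch -> occurrences

    def observe_batch(self, draft_lens: list[int]) -> None:
        # draft_lens holds K for every generating slot (0 = normal decode).
        for k in draft_lens:
            self.rows_per_slot[1 + k] += 1
        self.batch_rows[sum(1 + k for k in draft_lens)] += 1


counter = ShapeCounter()
for draft_lens in ([0, 0, 3], [0, 0, 3], [0, 1, 2], [0, 0, 0]):
    counter.observe_batch(draft_lens)

print(sorted(counter.rows_per_slot.items()))  # [(1, 8), (2, 1), (3, 1), (4, 2)]
print(sorted(counter.batch_rows.items()))     # [(3, 1), (6, 3)]
```

Note that `[0, 0, 3]` and `[0, 1, 2]` land in the same total-row bucket while differing per slot, so a real counter would probably want both views.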

- [x] **Step 3: Identify the only plausible behavior experiment**

The least invasive performance experiment is server-side scheduling, not graph
padding:

- group or defer speculative verification slots by `1 + spec_draft.size()`,
- try to make verification windows repeat shape buckets,
- keep it opt-in and default-off,
- gate with Phase 14 rollback, Phase 15 serving A/B, and pre/post inference
  md5/op checks.

This changes serving scheduling and may regress TTFT or reduce concurrency, so
it needs an explicit kill gate.
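
One possible reading of "group or defer by `1 + spec_draft.size()`" is sketched below in Python. It is an assumption-laden illustration: the policy (run the largest same-shape group, defer the rest), the tie-break, and the names `group_by_verify_rows` and `pick_window` are all hypothetical and not taken from the server code.

```python
from collections import defaultdict


def group_by_verify_rows(draft_lens: dict[int, int]) -> dict[int, list[int]]:
    """Map verification row count (1 + K) to the slot ids that need it."""
    groups: dict[int, list[int]] = defaultdict(list)
    for slot_id, k in sorted(draft_lens.items()):
        groups[1 + k].append(slot_id)
    return dict(groups)


def pick_window(draft_lens: dict[int, int]) -> tuple[int, list[int], list[int]]:
    """Run the largest same-shape group now and defer the remaining slots."""
    groups = group_by_verify_rows(draft_lens)
    # Prefer the biggest group; break ties toward fewer rows.
    rows, run_now = max(groups.items(), key=lambda kv: (len(kv[1]), -kv[0]))
    deferred = sorted(s for s in draft_lens if s not in run_now)
    return rows, run_now, deferred


# slot id -> draft length K (hypothetical values)
pending = {0: 3, 1: 1, 2: 3, 3: 2, 4: 3}
print(group_by_verify_rows(pending))  # {4: [0, 2, 4], 2: [1], 3: [3]}
print(pick_window(pending))           # (4, [0, 2, 4], [1, 3])
```

The deferred slots are exactly where the TTFT and concurrency cost would come from, which is the reason for the kill gate.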

## Task 3: Phase 18 Scope If Pursued

- [x] **Step 1: Write the source-scope boundary**

Phase 18 should be split into two incremental patches if it is attempted:

1. instrumentation-only: log or count verification shape buckets under a
   disabled-by-default env var, no scheduling change,
2. opt-in scheduler experiment: group/defer MTP verification by draft length.

- [x] **Step 2: Define stop criteria**

Stop and reject the source path if:

- shape counters show high entropy across draft lengths and active slots,
- grouping reduces graph churn but loses more throughput/TTFT than it recovers,
- pre/post md5 or `MUL_MAT_ID` gates drift,
- MTP rollback or normalized greedy-prefix gates fail.
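
The first stop criterion can be made measurable with Shannon entropy over the counted shape buckets. The Python sketch below assumes the counts arrive as a `Counter` keyed by bucket; `shape_entropy_bits` is a hypothetical helper, and the plan does not define what value counts as "high", so no threshold is shown.

```python
import math
from collections import Counter


def shape_entropy_bits(counts: Counter) -> float:
    """Shannon entropy (bits) of the observed shape-bucket distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    probs = [n / total for n in counts.values() if n > 0]
    return sum(p * math.log2(1 / p) for p in probs)


stable = Counter({6: 100})                      # one bucket dominates
churny = Counter({3: 25, 5: 25, 6: 25, 7: 25})  # four equally likely buckets
print(shape_entropy_bits(stable))  # 0.0
print(shape_entropy_bits(churny))  # 2.0
```

Zero bits means every batch had the same shape; each added bit roughly doubles the number of equally likely shapes a scheduler would have to cope with.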

## Self-Review

- No source patch was made in this phase.
- The feasibility conclusion is narrower than "optimize MTP": instrument first,
  then only consider an opt-in scheduler experiment.
- No default behavior changes are proposed without a separate implementation
  phase and gates.