docs(paged): reject MTP draft-shape scheduler

Record the Phase 19 trace-only serving entropy run and close the group/defer-by-draft follow-up based on the measured shape distribution, throughput regression, and green inference gates.

Assisted-by: Codex:gpt-5
2026-07-02 20:37:03 -04:00 · 2026-07-01 03:03:49 +00:00
parent cced07c7fe
commit 310eb3c866
4 changed files with 246 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -1303,3 +1303,59 @@ Conclusion:
  request (`rows=4` and `rows=3`).
 - A follow-up scheduler experiment is not yet justified. First use this trace
  under real serving load to measure draft-length bucket entropy.
+
+## Phase 19 MTP Serving Shape Entropy
+
+Phase 19 ran Phase 18's shape trace under the direct serving harness with
+`LLAMA_SPEC_SHAPE_TRACE=1`, `NPL="8 32 128"`, `GEN=64`, and `PTOK=128`.
+
+Artifact:
+
+- `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`
+
+Pre/post gate result:
+
+- Pre-gate and post-gate both passed.
+- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+- Full `MUL_MAT_ID`: `806/806` on CUDA0.
+
+Serving A/B:
+
+| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms |
+|---|---------------------|----------------|----------------|------------------|-------------|
+| 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 |
+| 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 |
+| 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 |
+
+Shape entropy summaries:
+
+- `shape_entropy_summary.tsv`
+- `step_shape_summary.tsv`
+
+Per-slot draft distribution:
+
+| window | verify slots | draft counts | top draft share | unique `batch_before` |
+|--------|--------------|--------------|-----------------|-----------------------|
+| n8 | 162 | `{1: 4, 2: 2, 3: 156}` | 96.3% | 15 |
+| n32 | 610 | `{1: 8, 2: 11, 3: 591}` | 96.9% | 96 |
+| n128 | 2353 | `{1: 40, 2: 49, 3: 2264}` | 96.2% | 479 |
+
+Per-step aggregate shape:
+
+| window | steps | unique total rows | top full-shape rows |
+|--------|-------|-------------------|---------------------|
+| n8 | 26 | 12 | `32` rows for 14 steps |
+| n32 | 32 | 20 | `128` rows for 13 steps |
+| n128 | 37 | 34 | `512` rows for 4 steps |
+
+Decision:
+
+- Do not implement the Phase 20 group/defer-by-draft scheduler shortcut on this
+  evidence.
+- Draft length is already stable (`draft=3` is >96% of verify slots), yet MTP
+  still cuts decode throughput to 22-39% of baseline and worsens TTFT.
+- The residual shape churn is dominated by active-slot/tail churn and the MTP
+  `K + 1` verification-row expansion, not mixed draft lengths.
+- Any future MTP parity work needs a deeper target-verify graph/state design,
+  not a small server scheduling shortcut.
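
Reviewer note (not part of the patch): the `K + 1` row expansion cited in the decision can be sanity-checked against the per-step table. The sketch below assumes each active verify slot contributes `draft + 1` rows to a step, which is consistent with the reported full-shape totals. `verify_rows` is a hypothetical helper, not a function from llama.cpp or the bench harness, and the tail-churn slot counts are illustrative only.

```python
# Hypothetical helper for illustration; not part of llama.cpp or the harness.
def verify_rows(drafts):
    """Total rows in one verify step, assuming draft + 1 rows per active slot."""
    return sum(k + 1 for k in drafts)


# Full in-flight steps with every slot at draft=3 (K = 3).
full_rows = {n: verify_rows([3] * n) for n in (8, 32, 128)}
print(full_rows)  # {8: 32, 32: 128, 128: 512}

# Tail churn: as slots finish, the row total changes even though every
# remaining slot still drafts 3 tokens. Slot counts here are made up.
tail_rows = [verify_rows([3] * active) for active in (8, 7, 5, 2)]
print(tail_rows)  # [32, 28, 20, 8]
```

Under that assumption the full-shape totals match the `32`, `128`, and `512` rows in the table, and the second list shows why grouping by draft length cannot stabilize the aggregate shape once the active-slot count itself is moving.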
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -262,6 +262,27 @@ grouping. Any scheduler experiment must be opt-in/default-off and killed by
 TTFT/throughput regression, graph-reuse failure, md5/op drift, or MTP
 rollback/prefix gate failure.

+Phase 19 ran that trace-only serving measurement and rejected the scheduler
+shortcut. Artifact:
+`/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`. Pre/post gates
+passed with canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+
+Serving result:
+
+| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms |
+|---|---------------------|----------------|----------------|------------------|-------------|
+| 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 |
+| 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 |
+| 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 |
+
+Shape result: `draft=3` already accounts for 96.2-96.9% of verify slots, so
+group/defer-by-draft has little to recover. Full in-flight steps already mostly
+use all-`draft=3` vectors; the remaining churn is active-slot/tail churn plus
+the real `K + 1` verification-row expansion. Do not build a Phase 20 scheduler
+experiment on this evidence. Future MTP work would need a deeper target-verify
+graph/state design, not another small server scheduling shortcut.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -586,6 +586,36 @@ should an opt-in group/defer-by-draft-length scheduler be built; kill it on
 TTFT/throughput regression, graph-reuse failure, md5/op drift, or MTP
 rollback/prefix gate failure.

+### Phase 19 MTP serving shape entropy
+
+Phase 19 ran the trace-only serving measurement. Artifact:
+`/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`.
+
+Pre/post canonical gates passed: MoE `8cb0ce23777bf55f92f63d0292c756b0`,
+dense `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+
+MTP serving stayed slower:
+
+| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms |
+|---|---------------------|----------------|----------------|------------------|-------------|
+| 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 |
+| 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 |
+| 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 |
+
+The shape trace rejects the small scheduler shortcut:
+
+- per-slot draft length is already stable: `draft=3` is 96.2-96.9% of verify
+  slots across n8/n32/n128;
+- full in-flight steps already mostly use all-`draft=3` vectors;
+- remaining aggregate shape churn is active-slot/tail churn plus MTP's real
+  `K + 1` output-row expansion;
+- group/defer-by-draft would not remove the dominant row expansion and would
+  risk more TTFT loss.
+
+Decision: do not build a Phase 20 group/defer scheduler on current evidence.
+Future MTP work would need a deeper target-verify graph/state design, not
+another small server scheduling shortcut.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update
--- a/docs/superpowers/plans/2026-07-01-mtp-serving-shape-entropy-phase19.md
+++ b/docs/superpowers/plans/2026-07-01-mtp-serving-shape-entropy-phase19.md
@@ -0,0 +1,139 @@
+# MTP Serving Shape Entropy Phase 19 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use
+> superpowers:verification-before-completion before recording the phase result.
+> Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** use Phase 18's `LLAMA_SPEC_SHAPE_TRACE=1` instrumentation under real
+serving load to decide whether a group/defer-by-draft-length scheduler
+experiment is justified.
+
+**Architecture:** trace-only benchmark. Do not change llama.cpp source or
+scheduling policy. Run the existing MTP serving A/B with pre/post canonical
+inference gates.
+
+**Tech Stack:** `paged-mtp-serving-bench.sh`, llama.cpp `llama-server`, DGX
+GB10, LocalAI paged patch stack.
+
+---
+
+## Task 1: Run Trace-Only Serving A/B
+
+- [x] **Step 1: Confirm DGX is free**
+
+  Preflight passed:
+
+  - `docker=0`
+  - `local_ai_worker=0`
+  - `compute=0`
+
+- [x] **Step 2: Run serving harness with shape trace**
+
+  Command shape:
+
+  ```bash
+  LLAMA_SPEC_SHAPE_TRACE=1 \
+    ART=~/bench/phase19_mtp_shape_entropy/20260701_045534 \
+    NPL="8 32 128" GEN=64 PTOK=128 \
+    /tmp/paged-mtp-serving-bench.sh
+  ```
+
+  Artifact:
+
+  - `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`
+
+## Task 2: Verify Inference Gates
+
+- [x] **Step 1: Pre-gate passed**
+
+  Artifact:
+
+  - `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534/gate_pre`
+
+  Result:
+
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT_ID`: `806/806`
+
+- [x] **Step 2: Post-gate passed**
+
+  Artifact:
+
+  - `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534/gate_post`
+
+  Result:
+
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT_ID`: `806/806`
+
+## Task 3: Analyze Serving Result
+
+- [x] **Step 1: Compare baseline vs MTP serving throughput**
+
+  | n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms |
+  |---|---------------------|----------------|----------------|------------------|-------------|
+  | 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 |
+  | 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 |
+  | 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 |
+
+  MTP remained materially slower at every concurrency.
+
+- [x] **Step 2: Parse per-slot draft entropy**
+
+  Artifact:
+
+  - `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534/shape_entropy_summary.tsv`
+
+  Result:
+
+  | window | verify slots | draft counts | top draft share | unique `batch_before` |
+  |--------|--------------|--------------|-----------------|-----------------------|
+  | n8 | 162 | `{1: 4, 2: 2, 3: 156}` | 96.3% | 15 |
+  | n32 | 610 | `{1: 8, 2: 11, 3: 591}` | 96.9% | 96 |
+  | n128 | 2353 | `{1: 40, 2: 49, 3: 2264}` | 96.2% | 479 |
+
+  Draft length is already overwhelmingly `3`. Grouping by draft length has
+  little to recover.
+
+- [x] **Step 3: Parse per-step aggregate shapes**
+
+  Artifact:
+
+  - `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534/step_shape_summary.tsv`
+
+  Result:
+
+  | window | steps | unique total rows | top full-shape rows |
+  |--------|-------|-------------------|---------------------|
+  | n8 | 26 | 12 | `32` rows for 14 steps |
+  | n32 | 32 | 20 | `128` rows for 13 steps |
+  | n128 | 37 | 34 | `512` rows for 4 steps |
+
+  Full in-flight steps already consist mostly of all-`draft=3` vectors. The
+  remaining shape churn is active-slot/tail churn plus the speculative `K + 1`
+  output-row expansion itself, not a draft-length scheduling problem.
+
+## Task 4: Decision
+
+- [x] **Step 1: Reject Phase 20 scheduler experiment for now**
+
+  Do not build the group/defer-by-draft-length scheduler experiment on this
+  evidence:
+
+  - draft length is already stable (`draft=3` is >96% of verify slots),
+  - MTP still regresses decode throughput to 22-39% of baseline,
+  - TTFT gets worse at every concurrency,
+  - per-step shape variation is dominated by active-slot/tail churn and row
+    expansion, not mixed draft lengths.
+
+  The next useful MTP work would need a deeper target-verify graph/state design,
+  not a small server scheduling shortcut.
+
+## Self-Review
+
+- No source behavior changed in this phase.
+- Pre/post md5 and op gates passed.
+- The phase result moves the plan forward by rejecting the scheduler follow-up
+  rather than leaving it as an attractive but unsupported idea.
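
Reviewer note (not part of the patch): the derived columns in the tables above can be recomputed from the raw counts. The sketch below copies the draft counts and `decode_agg` values from the tables; the Shannon entropy figure is an added illustration and is an assumption about what "shape entropy" could mean, since the docs report only the top draft share. `top_share` and `shannon_bits` are hypothetical helper names.

```python
import math

# Draft-length counts per window, copied from the per-slot table.
draft_counts = {
    "n8": {1: 4, 2: 2, 3: 156},
    "n32": {1: 8, 2: 11, 3: 591},
    "n128": {1: 40, 2: 49, 3: 2264},
}

# (baseline decode_agg, MTP decode_agg) per concurrency, from the A/B table.
decode_agg = {8: (245.0, 95.7), 32: (409.2, 110.0), 128: (697.2, 154.0)}


def top_share(counts):
    """Fraction of verify slots that used the most common draft length."""
    return max(counts.values()) / sum(counts.values())


def shannon_bits(counts):
    """Shannon entropy in bits of the draft-length distribution (assumed metric)."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)


for window, counts in draft_counts.items():
    slots = sum(counts.values())
    print(window, slots, f"{top_share(counts):.1%}", f"{shannon_bits(counts):.3f} bits")

# MTP throughput as a fraction of baseline at each concurrency.
ratios = {n: mtp / base for n, (base, mtp) in decode_agg.items()}
for n, ratio in ratios.items():
    print(n, f"{ratio:.1%}")
```

The recomputed shares and ratios agree with the tables (96.3%, 96.9%, 96.2% and 39.1%, 26.9%, 22.1%), and the entropy of each window stays far below the `log2(3)` maximum for three draft lengths, which is the quantitative form of "grouping by draft length has little to recover".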