docs(paged): reject MTP draft-shape scheduler

Record the Phase 19 trace-only serving entropy run and close the group/defer-by-draft follow-up based on measured shape distribution, throughput regression, and green inference gates.

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 03:03:49 +00:00
parent cced07c7fe
commit 310eb3c866
4 changed files with 246 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -1303,3 +1303,59 @@ Conclusion:
  request (`rows=4` and `rows=3`).
 - A follow-up scheduler experiment is not yet justified. First use this trace
  under real serving load to measure draft-length bucket entropy.
+
+## Phase 19 MTP Serving Shape Entropy
+
+Phase 19 ran Phase 18's shape trace under the direct serving harness with
+`LLAMA_SPEC_SHAPE_TRACE=1`, `NPL="8 32 128"`, `GEN=64`, and `PTOK=128`.
+
+Artifact:
+
+- `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`
+
+Pre/post gate result:
+
+- Pre-gate and post-gate both passed.
+- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+- Full `MUL_MAT_ID`: `806/806` on CUDA0.
+
+Serving A/B:
+
+| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms |
+|---|---------------------|----------------|----------------|------------------|-------------|
+| 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 |
+| 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 |
+| 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 |
+
+Shape entropy summaries:
+
+- `shape_entropy_summary.tsv`
+- `step_shape_summary.tsv`
+
+Per-slot draft distribution:
+
+| window | verify slots | draft counts | top draft share | unique `batch_before` |
+|--------|--------------|--------------|-----------------|-----------------------|
+| n8 | 162 | `{1: 4, 2: 2, 3: 156}` | 96.3% | 15 |
+| n32 | 610 | `{1: 8, 2: 11, 3: 591}` | 96.9% | 96 |
+| n128 | 2353 | `{1: 40, 2: 49, 3: 2264}` | 96.2% | 479 |
+
+Per-step aggregate shape:
+
+| window | steps | unique total rows | top full-shape rows |
+|--------|-------|-------------------|---------------------|
+| n8 | 26 | 12 | `32` rows for 14 steps |
+| n32 | 32 | 20 | `128` rows for 13 steps |
+| n128 | 37 | 34 | `512` rows for 4 steps |
+
+Decision:
+
+- Do not implement the Phase 20 group/defer-by-draft scheduler shortcut on this
+  evidence.
+- Draft length is already stable (`draft=3` is >96% of verify slots), yet MTP
+  still cuts decode throughput to 22-39% of baseline and worsens TTFT.
+- The residual shape churn is dominated by active-slot/tail churn and the MTP
+  `K + 1` verification-row expansion, not mixed draft lengths.
+- Any future MTP parity work needs a deeper target-verify graph/state design,
+  not a small server scheduling shortcut.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -262,6 +262,27 @@ grouping. Any scheduler experiment must be opt-in/default-off and killed by
 TTFT/throughput regression, graph-reuse failure, md5/op drift, or MTP
 rollback/prefix gate failure.

+Phase 19 ran that trace-only serving measurement and rejected the scheduler
+shortcut. Artifact:
+`/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`. Pre/post gates
+passed with canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+
+Serving result:
+
+| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms |
+|---|---------------------|----------------|----------------|------------------|-------------|
+| 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 |
+| 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 |
+| 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 |
+
+Shape result: `draft=3` already accounts for 96.2-96.9% of verify slots, so
+group/defer-by-draft has little to recover. Full in-flight steps already mostly
+use all-`draft=3` vectors; the remaining churn is active-slot/tail churn plus
+the real `K + 1` verification-row expansion. Do not build a Phase 20 scheduler
+experiment on this evidence. Future MTP work would need a deeper target-verify
+graph/state design, not another small server scheduling shortcut.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -586,6 +586,36 @@ should an opt-in group/defer-by-draft-length scheduler be built; kill it on
 TTFT/throughput regression, graph-reuse failure, md5/op drift, or MTP
 rollback/prefix gate failure.

+### Phase 19 MTP serving shape entropy
+
+Phase 19 ran the trace-only serving measurement. Artifact:
+`/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`.
+
+Pre/post canonical gates passed: MoE `8cb0ce23777bf55f92f63d0292c756b0`,
+dense `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+
+MTP serving stayed slower:
+
+| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms |
+|---|---------------------|----------------|----------------|------------------|-------------|
+| 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 |
+| 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 |
+| 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 |
+
+The shape trace rejects the small scheduler shortcut:
+
+- per-slot draft length is already stable: `draft=3` is 96.2-96.9% of verify
+  slots across n8/n32/n128;
+- full in-flight steps already mostly use all-`draft=3` vectors;
+- remaining aggregate shape churn is active-slot/tail churn plus MTP's real
+  `K + 1` output-row expansion;
+- group/defer-by-draft would not remove the dominant row expansion and would
+  risk more TTFT loss.
+
+Decision: do not build a Phase 20 group/defer scheduler on current evidence.
+Future MTP work would need a deeper target-verify graph/state design, not
+another small server scheduling shortcut.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update