From 310eb3c8662f7445f3c175c7f1ef5271726aec3c Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Wed, 1 Jul 2026 03:03:49 +0000
Subject: [PATCH] docs(paged): reject MTP draft-shape scheduler

Record the Phase 19 trace-only serving entropy run and close the
group/defer-by-draft follow-up based on measured shape distribution,
throughput regression, and green inference gates.

Assisted-by: Codex:gpt-5
---
 .../docs/GB10_PARITY_PHASE0_RESULTS.md        |  56 +++++++
 .../docs/PARITY_HANDOFF.md                    |  21 +++
 .../docs/VLLM_PARITY_LEVER_MAP.md             |  30 ++++
 ...07-01-mtp-serving-shape-entropy-phase19.md | 139 ++++++++++++++++++
 4 files changed, 246 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-07-01-mtp-serving-shape-entropy-phase19.md

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index 961734ea3..6c34730b3 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -1303,3 +1303,59 @@ Conclusion:
   request (`rows=4` and `rows=3`).
 - A follow-up scheduler experiment is not yet justified. First use this trace
   under real serving load to measure draft-length bucket entropy.
+
+## Phase 19 MTP Serving Shape Entropy
+
+Phase 19 ran Phase 18's shape trace under the direct serving harness with
+`LLAMA_SPEC_SHAPE_TRACE=1`, `NPL="8 32 128"`, `GEN=64`, and `PTOK=128`.
+
+Artifact:
+
+- `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`
+
+Pre/post gate result:
+
+- Pre-gate and post-gate both passed.
+- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+- Full `MUL_MAT_ID`: `806/806` on CUDA0.
+
+Serving A/B:
+
+| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms |
+|---|---------------------|----------------|----------------|------------------|-------------|
+| 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 |
+| 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 |
+| 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 |
+
+Shape entropy summaries:
+
+- `shape_entropy_summary.tsv`
+- `step_shape_summary.tsv`
+
+Per-slot draft distribution:
+
+| window | verify slots | draft counts | top draft share | unique `batch_before` |
+|--------|--------------|--------------|-----------------|-----------------------|
+| n8 | 162 | `{1: 4, 2: 2, 3: 156}` | 96.3% | 15 |
+| n32 | 610 | `{1: 8, 2: 11, 3: 591}` | 96.9% | 96 |
+| n128 | 2353 | `{1: 40, 2: 49, 3: 2264}` | 96.2% | 479 |
+
+Per-step aggregate shape:
+
+| window | steps | unique total rows | top full-shape rows |
+|--------|-------|-------------------|---------------------|
+| n8 | 26 | 12 | `32` rows for 14 steps |
+| n32 | 32 | 20 | `128` rows for 13 steps |
+| n128 | 37 | 34 | `512` rows for 4 steps |
+
+Decision:
+
+- Do not implement the Phase 20 group/defer-by-draft scheduler shortcut on this
+  evidence.
+- Draft length is already stable (`draft=3` is >96% of verify slots), yet MTP
+  still regresses decode throughput hard and worsens TTFT.
+- The residual shape churn is dominated by active-slot/tail churn and the MTP
+  `K + 1` verification-row expansion, not mixed draft lengths.
+- Any future MTP parity work needs a deeper target-verify graph/state design,
+  not a small server scheduling shortcut.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 5516e197e..d3c99c210 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -262,6 +262,27 @@ grouping. Any scheduler experiment must be opt-in/default-off and killed by
 TTFT/throughput regression, graph-reuse failure, md5/op drift, or MTP
 rollback/prefix gate failure.
 
+Phase 19 ran that trace-only serving measurement and rejected the scheduler
+shortcut. Artifact:
+`/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`. Pre/post gates
+passed with canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+
+Serving result:
+
+| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms |
+|---|---------------------|----------------|----------------|------------------|-------------|
+| 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 |
+| 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 |
+| 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 |
+
+Shape result: `draft=3` already accounts for 96.2-96.9% of verify slots, so
+group/defer-by-draft has little to recover. Full in-flight steps already mostly
+use all-`draft=3` vectors; the remaining churn is active-slot/tail churn plus
+the real `K + 1` verification-row expansion. Do not build a Phase 20 scheduler
+experiment on this evidence. Future MTP work would need a deeper target-verify
+graph/state design, not another small server scheduling shortcut.
+
 ---
 
 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 2ed083ee7..b74f943d8 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -586,6 +586,36 @@ should an opt-in group/defer-by-draft-length scheduler be built; kill it on
 TTFT/throughput regression, graph-reuse failure, md5/op drift, or MTP
 rollback/prefix gate failure.
 
+### Phase 19 MTP serving shape entropy
+
+Phase 19 ran the trace-only serving measurement. Artifact:
+`/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`.
+
+Pre/post canonical gates passed: MoE `8cb0ce23777bf55f92f63d0292c756b0`,
+dense `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+
+MTP serving stayed slower:
+
+| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms |
+|---|---------------------|----------------|----------------|------------------|-------------|
+| 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 |
+| 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 |
+| 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 |
+
+The shape trace rejects the small scheduler shortcut:
+
+- per-slot draft length is already stable: `draft=3` is 96.2-96.9% of verify
+  slots across n8/n32/n128;
+- full in-flight steps already mostly use all-`draft=3` vectors;
+- remaining aggregate shape churn is active-slot/tail churn plus MTP's real
+  `K + 1` output-row expansion;
+- group/defer-by-draft would not remove the dominant row expansion and would
+  risk more TTFT loss.
+
+Decision: do not build a Phase 20 group/defer scheduler on current evidence.
+Future MTP work would need a deeper target-verify graph/state design, not
+another small server scheduling shortcut.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
diff --git a/docs/superpowers/plans/2026-07-01-mtp-serving-shape-entropy-phase19.md b/docs/superpowers/plans/2026-07-01-mtp-serving-shape-entropy-phase19.md
new file mode 100644
index 000000000..2bd72509c
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-mtp-serving-shape-entropy-phase19.md
@@ -0,0 +1,139 @@
+# MTP Serving Shape Entropy Phase 19 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use
+> superpowers:verification-before-completion before recording the phase result.
+> Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** use Phase 18's `LLAMA_SPEC_SHAPE_TRACE=1` instrumentation under real
+serving load to decide whether a group/defer-by-draft-length scheduler
+experiment is justified.
+
+**Architecture:** trace-only benchmark. Do not change llama.cpp source or
+scheduling policy. Run the existing MTP serving A/B with pre/post canonical
+inference gates.
+
+**Tech Stack:** `paged-mtp-serving-bench.sh`, llama.cpp `llama-server`, DGX
+GB10, LocalAI paged patch stack.
+
+---
+
+## Task 1: Run Trace-Only Serving A/B
+
+- [x] **Step 1: Confirm DGX is free**
+
+  Preflight passed:
+
+  - `docker=0`
+  - `local_ai_worker=0`
+  - `compute=0`
+
+- [x] **Step 2: Run serving harness with shape trace**
+
+  Command shape:
+
+  ```bash
+  LLAMA_SPEC_SHAPE_TRACE=1 \
+  ART=~/bench/phase19_mtp_shape_entropy/20260701_045534 \
+  NPL="8 32 128" GEN=64 PTOK=128 \
+  /tmp/paged-mtp-serving-bench.sh
+  ```
+
+  Artifact:
+
+  - `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`
+
+## Task 2: Verify Inference Gates
+
+- [x] **Step 1: Pre-gate passed**
+
+  Artifact:
+
+  - `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534/gate_pre`
+
+  Result:
+
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT_ID`: `806/806`
+
+- [x] **Step 2: Post-gate passed**
+
+  Artifact:
+
+  - `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534/gate_post`
+
+  Result:
+
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT_ID`: `806/806`
+
+## Task 3: Analyze Serving Result
+
+- [x] **Step 1: Compare baseline vs MTP serving throughput**
+
+  | n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms |
+  |---|---------------------|----------------|----------------|------------------|-------------|
+  | 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 |
+  | 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 |
+  | 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 |
+
+  MTP remained materially slower at every concurrency.
+
+- [x] **Step 2: Parse per-slot draft entropy**
+
+  Artifact:
+
+  - `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534/shape_entropy_summary.tsv`
+
+  Result:
+
+  | window | verify slots | draft counts | top draft share | unique `batch_before` |
+  |--------|--------------|--------------|-----------------|-----------------------|
+  | n8 | 162 | `{1: 4, 2: 2, 3: 156}` | 96.3% | 15 |
+  | n32 | 610 | `{1: 8, 2: 11, 3: 591}` | 96.9% | 96 |
+  | n128 | 2353 | `{1: 40, 2: 49, 3: 2264}` | 96.2% | 479 |
+
+  Draft length is already overwhelmingly `3`. Grouping by draft length has
+  little to recover.
+
+- [x] **Step 3: Parse per-step aggregate shapes**
+
+  Artifact:
+
+  - `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534/step_shape_summary.tsv`
+
+  Result:
+
+  | window | steps | unique total rows | top full-shape rows |
+  |--------|-------|-------------------|---------------------|
+  | n8 | 26 | 12 | `32` rows for 14 steps |
+  | n32 | 32 | 20 | `128` rows for 13 steps |
+  | n128 | 37 | 34 | `512` rows for 4 steps |
+
+  Full in-flight steps already consist mostly of all-`draft=3` vectors. The
+  remaining shape churn is active-slot/tail churn plus the speculative `K + 1`
+  output-row expansion itself, not a draft-length scheduling problem.
+
+## Task 4: Decision
+
+- [x] **Step 1: Reject Phase 20 scheduler experiment for now**
+
+  Do not build the group/defer-by-draft-length scheduler experiment on this
+  evidence:
+
+  - draft length is already stable (`draft=3` >96% of verify slots),
+  - MTP still regresses decode throughput to 22-39% of baseline,
+  - TTFT gets worse at every concurrency,
+  - per-step shape variation is dominated by active-slot/tail churn and row
+    expansion, not mixed draft lengths.
+
+  The next useful MTP work would need a deeper target-verify graph/state design,
+  not a small server scheduling shortcut.
+
+## Self-Review
+
+- No source behavior changed in this phase.
+- Pre/post md5 and op gates passed.
+- The phase result moves the plan by rejecting the scheduler follow-up rather
+  than leaving it as an attractive but unsupported idea.
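
Reviewer note (not part of the commit): the derived columns in the Phase 19 tables can be re-checked from the raw numbers recorded in this patch. The sketch below is a minimal sanity check, not the harness's own analysis. The helper names are hypothetical, the Shannon-entropy definition is an assumption (the patch does not say how `shape_entropy_summary.tsv` defines entropy), and `full_shape_rows` encodes the reading that a full step carries `K + 1` rows per active slot, which is inferred from the `32`/`128`/`512` row counts rather than stated by the harness.

```python
from math import log2

# Numbers copied from the Phase 19 tables in this patch.
SERVING = {  # n: (baseline decode_agg, MTP decode_agg)
    8: (245.0, 95.7),
    32: (409.2, 110.0),
    128: (697.2, 154.0),
}
DRAFT_COUNTS = {  # window: {draft length: verify slots}
    "n8": {1: 4, 2: 2, 3: 156},
    "n32": {1: 8, 2: 11, 3: 591},
    "n128": {1: 40, 2: 49, 3: 2264},
}


def ratio_pct(baseline, mtp):
    """MTP decode throughput as a percentage of baseline, one decimal."""
    return round(100.0 * mtp / baseline, 1)


def top_share_pct(counts):
    """Share of verify slots held by the most common draft length."""
    return round(100.0 * max(counts.values()) / sum(counts.values()), 1)


def entropy_bits(counts):
    """Shannon entropy of the draft-length distribution (assumed definition)."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values() if c)


def full_shape_rows(slots, draft=3):
    """Rows in a full step if every slot verifies `draft` tokens plus one (K + 1)."""
    return slots * (draft + 1)


if __name__ == "__main__":
    for n, (base, mtp) in SERVING.items():
        print(f"n={n}: MTP/baseline = {ratio_pct(base, mtp)}%")
    for window, counts in DRAFT_COUNTS.items():
        print(f"{window}: top share = {top_share_pct(counts)}%, "
              f"entropy = {entropy_bits(counts):.3f} bits")
```

The entropy figure is bounded by `log2(3)` for three draft lengths, so a value far below that bound is another way of stating the patch's conclusion that draft length is already stable.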