docs(paged): reject MTP draft-shape scheduler

Record the Phase 19 trace-only serving entropy run and close the group/defer-by-draft follow-up based on measured shape distribution, throughput regression, and green inference gates.

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 03:03:49 +00:00
parent cced07c7fe
commit 310eb3c866
4 changed files with 246 additions and 0 deletions


@@ -1303,3 +1303,59 @@ Conclusion:
request (`rows=4` and `rows=3`).
- A follow-up scheduler experiment is not yet justified. First use this trace
under real serving load to measure draft-length bucket entropy.
## Phase 19 MTP Serving Shape Entropy
Phase 19 ran Phase 18's shape trace under the direct serving harness with
`LLAMA_SPEC_SHAPE_TRACE=1`, `NPL="8 32 128"`, `GEN=64`, and `PTOK=128`.
Artifact:
- `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`
Pre/post gate result:
- Pre-gate and post-gate both passed.
- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Full `MUL_MAT_ID`: `806/806` on CUDA0.
Serving A/B:
| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms |
|---|---------------------|----------------|----------------|------------------|-------------|
| 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 |
| 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 |
| 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 |
Shape entropy summaries:
- `shape_entropy_summary.tsv`
- `step_shape_summary.tsv`
Per-slot draft distribution:
| window | verify slots | draft counts | top draft share | unique `batch_before` |
|--------|--------------|--------------|-----------------|-----------------------|
| n8 | 162 | `{1: 4, 2: 2, 3: 156}` | 96.3% | 15 |
| n32 | 610 | `{1: 8, 2: 11, 3: 591}` | 96.9% | 96 |
| n128 | 2353 | `{1: 40, 2: 49, 3: 2264}` | 96.2% | 479 |
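
As a rough cross-check of the table above, the top-draft share and the Shannon entropy of each window's draft-length distribution can be recomputed from the recorded counts. This is a minimal sketch, not the script that produced `shape_entropy_summary.tsv`; the helper name `draft_entropy` and the choice of Shannon entropy in bits are assumptions made for illustration.

```python
import math

def draft_entropy(counts):
    """Return (entropy in bits, share of the most common draft length).

    `counts` maps draft length -> number of verify slots (hypothetical helper).
    """
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    entropy_bits = -sum(p * math.log2(p) for p in probs)
    top_share = max(counts.values()) / total
    return entropy_bits, top_share

# Draft counts copied from the per-slot draft distribution table.
windows = {
    "n8": {1: 4, 2: 2, 3: 156},
    "n32": {1: 8, 2: 11, 3: 591},
    "n128": {1: 40, 2: 49, 3: 2264},
}

for name, counts in windows.items():
    entropy_bits, top_share = draft_entropy(counts)
    print(f"{name}: entropy={entropy_bits:.3f} bits, top share={top_share:.1%}")
```

With three possible draft lengths the maximum entropy is `log2(3)`, about 1.58 bits. All three windows come out below 0.3 bits, which is consistent with the reading that draft length is already close to constant.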
Per-step aggregate shape:
| window | steps | unique total rows | top full-shape rows |
|--------|-------|-------------------|---------------------|
| n8 | 26 | 12 | `32` rows for 14 steps |
| n32 | 32 | 20 | `128` rows for 13 steps |
| n128 | 37 | 34 | `512` rows for 4 steps |
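
The "top full-shape rows" column follows from the `K + 1` expansion: with every slot at `draft=3`, each active slot contributes four verification rows. The sketch below assumes that each verify slot contributes its draft length plus one row; `verify_rows` is a hypothetical helper for illustration, not a llama.cpp function.

```python
def verify_rows(draft_lengths):
    """Total verification rows for one step: each slot adds draft + 1 rows."""
    return sum(k + 1 for k in draft_lengths)

# Full in-flight steps: every slot active, every draft at length 3.
for slots in (8, 32, 128):
    print(f"n{slots}: {verify_rows([3] * slots)} rows")

# Tail churn: only 5 of 8 slots still active, all at draft=3.
print("tail step:", verify_rows([3] * 5), "rows")

# Mixed draft lengths in a 4-slot step.
print("mixed step:", verify_rows([3, 3, 2, 1]), "rows")
```

The tail step keeps every draft at 3 and still changes the total row count. That is the active-slot/tail churn that grouping by draft length would not remove.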
Decision:
- Do not implement the Phase 20 group/defer-by-draft scheduler shortcut on this
evidence.
- Draft length is already stable (`draft=3` is >96% of verify slots), yet MTP
still cuts decode throughput to 22-39% of baseline and worsens TTFT at every
concurrency.
- The residual shape churn is dominated by active-slot/tail churn and the MTP
`K + 1` verification-row expansion, not mixed draft lengths.
- Any future MTP parity work needs a deeper target-verify graph/state design,
not a small server scheduling shortcut.


@@ -262,6 +262,27 @@ grouping. Any scheduler experiment must be opt-in/default-off and killed by
TTFT/throughput regression, graph-reuse failure, md5/op drift, or MTP
rollback/prefix gate failure.
Phase 19 ran that trace-only serving measurement and rejected the scheduler
shortcut. Artifact:
`/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`. Pre/post gates
passed with canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
Serving result:
| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms |
|---|---------------------|----------------|----------------|------------------|-------------|
| 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 |
| 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 |
| 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 |
Shape result: `draft=3` already accounts for 96.2-96.9% of verify slots, so
group/defer-by-draft has little to recover. Full in-flight steps already mostly
use all-`draft=3` vectors; the remaining churn is active-slot/tail churn plus
the real `K + 1` verification-row expansion. Do not build a Phase 20 scheduler
experiment on this evidence. Future MTP work would need a deeper target-verify
graph/state design, not another small server scheduling shortcut.
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)


@@ -586,6 +586,36 @@ should an opt-in group/defer-by-draft-length scheduler be built; kill it on
TTFT/throughput regression, graph-reuse failure, md5/op drift, or MTP
rollback/prefix gate failure.
### Phase 19 MTP serving shape entropy
Phase 19 ran the trace-only serving measurement. Artifact:
`/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`.
Pre/post canonical gates passed: MoE `8cb0ce23777bf55f92f63d0292c756b0`,
dense `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
MTP serving stayed slower at every concurrency (22-39% of baseline decode
throughput):
| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms |
|---|---------------------|----------------|----------------|------------------|-------------|
| 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 |
| 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 |
| 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 |
The shape trace rejects the small scheduler shortcut:
- per-slot draft length is already stable: `draft=3` is 96.2-96.9% of verify
slots across n8/n32/n128;
- full in-flight steps already mostly use all-`draft=3` vectors;
- remaining aggregate shape churn is active-slot/tail churn plus MTP's real
`K + 1` output-row expansion;
- group/defer-by-draft would not remove the dominant row expansion and would
risk more TTFT loss.
Decision: do not build a Phase 20 group/defer scheduler on current evidence.
Future MTP work would need a deeper target-verify graph/state design, not
another small server scheduling shortcut.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update


@@ -0,0 +1,139 @@
# MTP Serving Shape Entropy Phase 19 Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use
> superpowers:verification-before-completion before recording the phase result.
> Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** use Phase 18's `LLAMA_SPEC_SHAPE_TRACE=1` instrumentation under real
serving load to decide whether a group/defer-by-draft-length scheduler
experiment is justified.
**Architecture:** trace-only benchmark. Do not change llama.cpp source or
scheduling policy. Run the existing MTP serving A/B with pre/post canonical
inference gates.
**Tech Stack:** `paged-mtp-serving-bench.sh`, llama.cpp `llama-server`, DGX
GB10, LocalAI paged patch stack.
---
## Task 1: Run Trace-Only Serving A/B
- [x] **Step 1: Confirm DGX is free**
Preflight passed:
- `docker=0`
- `local_ai_worker=0`
- `compute=0`
- [x] **Step 2: Run serving harness with shape trace**
Command shape:
```bash
LLAMA_SPEC_SHAPE_TRACE=1 \
ART=~/bench/phase19_mtp_shape_entropy/20260701_045534 \
NPL="8 32 128" GEN=64 PTOK=128 \
/tmp/paged-mtp-serving-bench.sh
```
Artifact:
- `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`
## Task 2: Verify Inference Gates
- [x] **Step 1: Pre-gate passed**
Artifact:
- `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534/gate_pre`
Result:
- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
- `MUL_MAT_ID`: `806/806`
- [x] **Step 2: Post-gate passed**
Artifact:
- `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534/gate_post`
Result:
- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
- `MUL_MAT_ID`: `806/806`
## Task 3: Analyze Serving Result
- [x] **Step 1: Compare baseline vs MTP serving throughput**
| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms |
|---|---------------------|----------------|----------------|------------------|-------------|
| 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 |
| 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 |
| 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 |
MTP remained materially slower at every concurrency.
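
The `MTP / baseline` column and the TTFT slowdown can be re-derived from the raw numbers in the table. This is a small sketch for checking the arithmetic, not part of `paged-mtp-serving-bench.sh`; the tuple layout and variable names are assumptions for illustration.

```python
# (n, baseline decode_agg, MTP decode_agg, baseline TTFT ms, MTP TTFT ms),
# copied from the serving A/B table.
results = [
    (8, 245.0, 95.7, 1147.2, 1633.4),
    (32, 409.2, 110.0, 2710.0, 4471.5),
    (128, 697.2, 154.0, 7601.5, 20310.4),
]

summary = {}
for n, base_decode, mtp_decode, base_ttft, mtp_ttft in results:
    decode_ratio = mtp_decode / base_decode  # below 1.0 means MTP is slower
    ttft_ratio = mtp_ttft / base_ttft        # above 1.0 means MTP TTFT is worse
    summary[n] = (decode_ratio, ttft_ratio)
    print(f"n={n}: decode {decode_ratio:.1%} of baseline, TTFT x{ttft_ratio:.2f}")
```

Both ratios move in the wrong direction as concurrency grows: the decode ratio falls while the TTFT multiplier rises.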
- [x] **Step 2: Parse per-slot draft entropy**
Artifact:
- `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534/shape_entropy_summary.tsv`
Result:
| window | verify slots | draft counts | top draft share | unique `batch_before` |
|--------|--------------|--------------|-----------------|-----------------------|
| n8 | 162 | `{1: 4, 2: 2, 3: 156}` | 96.3% | 15 |
| n32 | 610 | `{1: 8, 2: 11, 3: 591}` | 96.9% | 96 |
| n128 | 2353 | `{1: 40, 2: 49, 3: 2264}` | 96.2% | 479 |
Draft length is already overwhelmingly `3`. Grouping by draft length has
little to recover.
- [x] **Step 3: Parse per-step aggregate shapes**
Artifact:
- `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534/step_shape_summary.tsv`
Result:
| window | steps | unique total rows | top full-shape rows |
|--------|-------|-------------------|---------------------|
| n8 | 26 | 12 | `32` rows for 14 steps |
| n32 | 32 | 20 | `128` rows for 13 steps |
| n128 | 37 | 34 | `512` rows for 4 steps |
Full in-flight steps already consist mostly of all-`draft=3` vectors. The
remaining shape churn is active-slot/tail churn plus the speculative `K + 1`
output-row expansion itself, not a draft-length scheduling problem.
## Task 4: Decision
- [x] **Step 1: Reject Phase 20 scheduler experiment for now**
Do not build the group/defer-by-draft-length scheduler experiment on this
evidence:
- draft length is already stable (`draft=3` >96% of verify slots),
- MTP still regresses decode throughput to 22-39% of baseline,
- TTFT gets worse at every concurrency,
- per-step shape variation is dominated by active-slot/tail churn and row
expansion, not mixed draft lengths.
The next useful MTP work would need a deeper target-verify graph/state design,
not a small server scheduling shortcut.
## Self-Review
- No source behavior changed in this phase.
- Pre/post md5 and op gates passed.
- The phase result advances the plan by rejecting the scheduler follow-up
rather than leaving it as an attractive but unsupported idea.