docs(paged): reject MTP draft-shape scheduler

Record the Phase 19 trace-only serving entropy run and close the group/defer-by-draft follow-up based on measured shape distribution, throughput regression, and green inference gates.

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 03:03:49 +00:00
parent cced07c7fe
commit 310eb3c866
4 changed files with 246 additions and 0 deletions

@@ -1303,3 +1303,59 @@ Conclusion:
request (`rows=4` and `rows=3`).
- A follow-up scheduler experiment is not yet justified. First use this trace
under real serving load to measure draft-length bucket entropy.
## Phase 19 MTP Serving Shape Entropy
Phase 19 ran Phase 18's shape trace under the direct serving harness with
`LLAMA_SPEC_SHAPE_TRACE=1`, `NPL="8 32 128"`, `GEN=64`, and `PTOK=128`.
Artifact:
- `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`
Pre/post gate result:
- Pre-gate and post-gate both passed.
- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Full `MUL_MAT_ID`: `806/806` on CUDA0.
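The md5 comparison behind the gate can be sketched as follows. This is a minimal illustration, assuming the gate hashes the raw transcript bytes; how the real gate script reads or normalizes the transcript is not stated here, and `transcript_md5` and `gate_passes` are hypothetical helper names.

```python
import hashlib

# Canonical md5 values recorded for the Phase 19 pre/post gates.
EXPECTED_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}


def transcript_md5(data: bytes) -> str:
    """Return the hex md5 digest of the transcript bytes (assumed raw, unnormalized)."""
    return hashlib.md5(data).hexdigest()


def gate_passes(name: str, data: bytes) -> bool:
    """True when the transcript digest matches the canonical value for `name`."""
    return transcript_md5(data) == EXPECTED_MD5[name]
```

A run counts as drift-free only if both the pre-gate and the post-gate transcripts match their canonical digest.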
Serving A/B:
| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms |
|---|---------------------|----------------|----------------|------------------|-------------|
| 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 |
| 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 |
| 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 |
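
The `MTP / baseline` column can be re-derived from the two `decode_agg` columns. The sketch below does that and also computes a TTFT ratio, which is a derived quantity not present in the table. The function and field names are illustrative only.

```python
# Serving A/B numbers copied from the table above.
# Tuple layout: (n, baseline decode_agg, MTP decode_agg, baseline TTFT ms, MTP TTFT ms)
ROWS = [
    (8, 245.0, 95.7, 1147.2, 1633.4),
    (32, 409.2, 110.0, 2710.0, 4471.5),
    (128, 697.2, 154.0, 7601.5, 20310.4),
]


def summarize(rows):
    """Return MTP decode throughput as a percent of baseline, plus the TTFT ratio."""
    out = []
    for n, base_dec, mtp_dec, base_ttft, mtp_ttft in rows:
        out.append({
            "n": n,
            "decode_pct": round(100.0 * mtp_dec / base_dec, 1),
            "ttft_ratio": round(mtp_ttft / base_ttft, 2),
        })
    return out


for row in summarize(ROWS):
    print(row)
```

The decode percentage falls as `n` grows while the TTFT ratio rises, so the regression widens with load rather than amortizing.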
Shape entropy summaries:
- `shape_entropy_summary.tsv`
- `step_shape_summary.tsv`
Per-slot draft distribution:
| window | verify slots | draft counts | top draft share | unique `batch_before` |
|--------|--------------|--------------|-----------------|-----------------------|
| n8 | 162 | `{1: 4, 2: 2, 3: 156}` | 96.3% | 15 |
| n32 | 610 | `{1: 8, 2: 11, 3: 591}` | 96.9% | 96 |
| n128 | 2353 | `{1: 40, 2: 49, 3: 2264}` | 96.2% | 479 |
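
The `top draft share` column follows directly from the draft counts. The sketch below recomputes it and adds a Shannon entropy in bits as one possible way to quantify "bucket entropy". Whether `shape_entropy_summary.tsv` uses the same definition is an assumption.

```python
import math

# Per-window draft-length counts copied from the table above.
DRAFT_COUNTS = {
    "n8": {1: 4, 2: 2, 3: 156},
    "n32": {1: 8, 2: 11, 3: 591},
    "n128": {1: 40, 2: 49, 3: 2264},
}


def top_share_and_entropy(counts):
    """Return (verify slots, share of the most common draft length, entropy in bits)."""
    total = sum(counts.values())
    top_share = max(counts.values()) / total
    entropy_bits = -sum(
        (c / total) * math.log2(c / total) for c in counts.values() if c > 0
    )
    return total, top_share, entropy_bits


for window, counts in DRAFT_COUNTS.items():
    total, share, bits = top_share_and_entropy(counts)
    print(f"{window}: slots={total} top_share={share:.1%} entropy={bits:.3f} bits")
```

With three possible draft lengths the maximum entropy is `log2(3)`, about 1.585 bits. All three windows land far below that, which is the quantitative form of "draft length is already stable".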
Per-step aggregate shape:
| window | steps | unique total rows | top full-shape rows |
|--------|-------|-------------------|---------------------|
| n8 | 26 | 12 | `32` rows for 14 steps |
| n32 | 32 | 20 | `128` rows for 13 steps |
| n128 | 37 | 34 | `512` rows for 4 steps |

Decision:
- Do not implement the Phase 20 group/defer-by-draft scheduler shortcut on this
evidence.
- Draft length is already stable (`draft=3` is >96% of verify slots), yet MTP
decode_agg still drops to 22.1-39.1% of baseline and TTFT grows by roughly
1.4x to 2.7x.
- The residual shape churn is dominated by active-slot/tail churn and the MTP
`K + 1` verification-row expansion, not mixed draft lengths.
- Any future MTP parity work needs a deeper target-verify graph/state design,
not a small server scheduling shortcut.
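
The row arithmetic behind this decision can be sketched as below, assuming each verify slot contributes `draft + 1` rows to the step. That assumption matches the top full-shape rows in the per-step table (8 × 4 = 32, 32 × 4 = 128, 128 × 4 = 512). `verify_rows` is a hypothetical helper, not a function from the server.

```python
def verify_rows(draft_lengths):
    """Total verification rows for one step: every active slot adds K + 1 rows."""
    return sum(k + 1 for k in draft_lengths)


# Full in-flight steps with every slot at draft=3.
print(verify_rows([3] * 8))    # n8
print(verify_rows([3] * 32))   # n32
print(verify_rows([3] * 128))  # n128

# Tail churn: the same draft length with fewer active slots still changes the shape.
print(verify_rows([3] * 7))

# Mixed draft lengths change the total only slightly when draft=3 dominates.
print(verify_rows([3] * 7 + [1]))
```

Grouping slots by draft length would only affect the last case. The row total still moves whenever the active-slot count changes, and every slot still pays the `K + 1` expansion.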

@@ -262,6 +262,27 @@ grouping. Any scheduler experiment must be opt-in/default-off and killed by
TTFT/throughput regression, graph-reuse failure, md5/op drift, or MTP
rollback/prefix gate failure.
Phase 19 ran that trace-only serving measurement and rejected the scheduler
shortcut. Artifact:
`/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`. Pre/post gates
passed with canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
Serving result:
| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms |
|---|---------------------|----------------|----------------|------------------|-------------|
| 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 |
| 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 |
| 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 |

Shape result: `draft=3` already accounts for 96.2-96.9% of verify slots, so
group/defer-by-draft has little to recover. Full in-flight steps already mostly
use all-`draft=3` vectors; the remaining churn is active-slot/tail churn plus
the real `K + 1` verification-row expansion. Do not build a Phase 20 scheduler
experiment on this evidence. Future MTP work would need a deeper target-verify
graph/state design, not another small server scheduling shortcut.

---

## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)

@@ -586,6 +586,36 @@ should an opt-in group/defer-by-draft-length scheduler be built; kill it on
TTFT/throughput regression, graph-reuse failure, md5/op drift, or MTP
rollback/prefix gate failure.
### Phase 19 MTP serving shape entropy
Phase 19 ran the trace-only serving measurement. Artifact:
`/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`.
Pre/post canonical gates passed: MoE `8cb0ce23777bf55f92f63d0292c756b0`,
dense `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
MTP serving stayed slower than baseline at every tested `n`:
| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms |
|---|---------------------|----------------|----------------|------------------|-------------|
| 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 |
| 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 |
| 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 |

The shape trace rejects the small scheduler shortcut:
- per-slot draft length is already stable: `draft=3` is 96.2-96.9% of verify
slots across n8/n32/n128;
- full in-flight steps already mostly use all-`draft=3` vectors;
- remaining aggregate shape churn is active-slot/tail churn plus MTP's real
`K + 1` output-row expansion;
- group/defer-by-draft would not remove the dominant row expansion and would
risk more TTFT loss.
Decision: do not build a Phase 20 group/defer scheduler on current evidence.
Future MTP work would need a deeper target-verify graph/state design, not
another small server scheduling shortcut.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update