docs(paged): reject MTP serving lever

Add the repeatable MTP serving A/B runner and record Phase 15 results showing current llama-server MTP regresses GB10 serving throughput despite passing inference gates.

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 02:29:28 +00:00
parent 70394364a3
commit 4d171e62bb
7 changed files with 497 additions and 12 deletions

@@ -885,6 +885,65 @@ Decision:
- Do not count MTP as a GB10 speed-parity win until serving results show useful
target-verification throughput under the canonical inference gates.
## Phase 15 MTP Serving Throughput Gate
Phase 15 measured the direct `llama-server` serving path after Phase 14 proved
rollback safety. The test compared two same-shape arms:
- baseline: no speculative decoding,
- MTP: `--spec-type draft-mtp --spec-draft-n-max 3
--no-spec-draft-backend-sampling`.
Artifact:
- `/home/mudler/bench/phase15_mtp_serving/20260701_042005`
Harness:
- `backend/cpp/llama-cpp-localai-paged/paged-mtp-serving-bench.sh`
- `NPL="8 32 128" PTOK=128 GEN=128 CTX=131072 PARALLEL=128`
- client: `/home/mudler/bench/h2h_cli3.py` against `/v1/completions`
Result:
| arm | n | agg t/s | decode agg t/s | decode per-seq t/s | TTFT mean ms | wall s |
|---|---:|---:|---:|---:|---:|---:|
| baseline | 8 | 192.5 | 247.8 | 30.70 | 1181.1 | 5.318 |
| MTP | 8 | 92.9 | 109.8 | 14.26 | 1691.5 | 11.017 |
| baseline | 32 | 305.4 | 406.0 | 12.02 | 2762.2 | 13.412 |
| MTP | 32 | 95.8 | 111.7 | 3.61 | 4545.6 | 42.727 |
| baseline | 128 | 429.5 | 662.4 | 4.31 | 7747.2 | 38.144 |
| MTP | 128 | 100.3 | 138.5 | 0.97 | 20385.7 | 163.289 |
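To make the size of the regression explicit, the table can be reduced to MTP-over-baseline ratios. The sketch below copies the rows above verbatim; the helper names `compare` and `max_agg_mismatch` are illustrative and not part of the harness. The cross-check that `agg t/s` is approximately `n * GEN / wall` is an observation from the numbers, assuming `GEN=128` from the harness settings, not a documented definition of the client's metric.

```python
# Rows copied from the Phase 15 result table:
# (arm, n, agg t/s, decode agg t/s, wall s)
ROWS = [
    ("baseline", 8, 192.5, 247.8, 5.318),
    ("MTP", 8, 92.9, 109.8, 11.017),
    ("baseline", 32, 305.4, 406.0, 13.412),
    ("MTP", 32, 95.8, 111.7, 42.727),
    ("baseline", 128, 429.5, 662.4, 38.144),
    ("MTP", 128, 100.3, 138.5, 163.289),
]
GEN = 128  # generated tokens per request, from the harness settings


def compare(rows):
    """Return {n: (decode_agg_ratio, wall_ratio)} with MTP relative to baseline."""
    table = {(arm, n): (dec, wall) for arm, n, _agg, dec, wall in rows}
    result = {}
    for n in sorted({n for _arm, n, *_rest in rows}):
        base_dec, base_wall = table[("baseline", n)]
        mtp_dec, mtp_wall = table[("MTP", n)]
        result[n] = (mtp_dec / base_dec, mtp_wall / base_wall)
    return result


def max_agg_mismatch(rows, gen=GEN):
    """Largest |agg - n*gen/wall| across rows (assumed relation, not documented)."""
    return max(abs(agg - n * gen / wall) for _arm, n, agg, _dec, wall in rows)


if __name__ == "__main__":
    for n, (dec_ratio, wall_ratio) in compare(ROWS).items():
        print(f"n={n}: MTP decode agg = {dec_ratio:.2f}x baseline, "
              f"wall = {wall_ratio:.2f}x baseline")
    print(f"max agg mismatch: {max_agg_mismatch(ROWS):.3f} t/s")
```

On these rows, MTP decode aggregate throughput is roughly 0.44x, 0.28x, and 0.21x of baseline at `n=8`, `32`, and `128`, so the gap widens with concurrency.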
MTP did actually run:
- server initialized `draft-mtp` with bounded partial sequence removal,
- response/server timings included draft counters,
- server log tail included `#gen tokens = 17293`, `#acc tokens = 15493`.
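The two counters give a rough draft acceptance rate. This is a minimal sketch that assumes `#gen tokens` counts drafted tokens and `#acc tokens` counts the drafted tokens the target accepted; `acceptance_rate` is an illustrative helper, not a function from the server or the harness.

```python
def acceptance_rate(gen_tokens: int, acc_tokens: int) -> float:
    """Fraction of drafted tokens accepted (assumed meaning of the counters)."""
    if gen_tokens <= 0:
        raise ValueError("gen_tokens must be positive")
    return acc_tokens / gen_tokens


if __name__ == "__main__":
    # Values quoted from the server log tail.
    print(f"{acceptance_rate(17293, 15493):.1%}")  # prints 89.6%
```

Under that assumption, about nine in ten drafted tokens were accepted, which is why the throughput loss cannot be blamed on rejected drafts.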
Normal inference gates before and after the A/B:
- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- `MUL_MAT_ID`: `806/806`, `Backend CUDA0: OK`.
Decision:
- Reject current `llama-server` MTP as a GB10 serving parity lever.
- Do not enable MTP by default in LocalAI or llama-server.
- Do not tune `spec-draft-n-max` blindly. The regression is large enough that
the next MTP phase, if any, must start with graph/batch-shape profiling.
Likely root cause:
- Baseline serving preserved heavy graph reuse (`graphs reused = 361` in the
`n=128` tail).
- MTP serving showed `graphs reused = 1` and high per-slot eval time at high
concurrency.
- The working hypothesis is that MTP verification/draft batch shape churn
defeats the paged decode graph-reuse wins, so extra verification dominates
despite high draft acceptance.
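The graph-reuse comparison can be repeated on any pair of server logs by pulling the last `graphs reused = N` value from each. The sketch below assumes the counter appears in the log exactly as quoted above; the sample log strings are hypothetical stand-ins, not real server output, and `last_graphs_reused` is an illustrative helper.

```python
import re
from typing import Optional

# Matches the counter as it is quoted in this section, e.g. "graphs reused = 361".
GRAPHS_REUSED = re.compile(r"graphs reused\s*=\s*(\d+)")


def last_graphs_reused(log_text: str) -> Optional[int]:
    """Return the last 'graphs reused' value in log_text, or None if absent."""
    matches = GRAPHS_REUSED.findall(log_text)
    return int(matches[-1]) if matches else None


if __name__ == "__main__":
    # Hypothetical log tails that only mimic the quoted counter format.
    baseline_tail = "slot done\ngraphs reused = 120\ngraphs reused = 361\n"
    mtp_tail = "slot done\ngraphs reused = 1\n"
    print(last_graphs_reused(baseline_tail), last_graphs_reused(mtp_tail))
```

Taking the last match mirrors reading the log tail; a per-slot or per-step breakdown would need the profiling described in the decision above.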
## Phase 10 GDN C32 Slab Baseline and Source Check
Phase 10 starts a separate GDN prefill path; it does not reopen the rejected

@@ -206,7 +206,7 @@ Dense decode is **AHEAD at low N (116.7% @ N=8)** - the one operating point wher
| S2 double-buffer set_inputs | overlap host input build with GPU | DROPPED | `set_inputs` ~0.05 ms/step, nothing to recover |
| whole-step graph / host loop | host loop as serving residual | CLOSED (~0-1%) | reuse 0% (757.6) == S1+S3 72% (763.3); hostproc only ~4-8% of step wall |
| padded / fixed-slot decode | pad decode width to `--parallel` for ~100% reuse | **REJECTED (built, GPU-tested, commit b028c81e)** | inert (BE) but regresses everywhere; N=8 burst 28.16->6.05 tok/s/seq; serving decode is GPU-compute-bound, dummy-row compute > reuse recovered |
| speculative decode (MTP) | draft + verify | SAFETY-GATED, default-off | Phase 14 passed recurrent rollback, partial rejection, normalized greedy-prefix, and canonical inference gates; still needs serving/API throughput proof before any parity claim |
| speculative decode (MTP) | draft + verify | **REJECTED for current GB10 serving** | Phase 14 safety passed, but Phase 15 serving A/B regressed hard: n128 decode agg 662.4 -> 138.5 tok/s; likely graph/batch-shape disruption (`graphs reused` 361 -> 1) |
### 4.5 SHIPPED WINS (all BE / KL-benign) - keep these, do not regress
- **FP4-MMQ MoE/dense GEMM** (native Blackwell FP4-MMA at the FP4 weight-BW floor; reason 4.1 stays default-off).
@@ -225,11 +225,13 @@ Dense decode is **AHEAD at low N (116.7% @ N=8)** - the one operating point wher
The `VLLM_PARITY_LEVER_MAP.md` "pursue list" (A1-A7/B1-B7/C1: graph-safe ragged grouped FP4-MMA MoE kernel, FP8 paged KV, MTP spec-decode, etc.) is the **earlier working brainstorm written before the final profiling**. `VLLM_PARITY_FINAL.md` is the authoritative supersession; treat those buckets as rejected / infeasible / different-hardware unless re-validated on new silicon.
Phase 14 re-validated the MTP bucket as a separate default-off workstream:
rollback and ordinary inference safety are now gated, but speed parity is not
claimed. The serving follow-up must keep the same fixed gates before and after
any benchmark: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
Phase 14 re-validated the MTP bucket as safe, then Phase 15 rejected it as a
current GB10 serving-throughput lever. Do not enable it by default and do not
keep tuning draft length blindly. The only plausible follow-up is a graph-reuse
and speculative verification batch-shape profile with
`nsys --cuda-graph-trace=node`. The fixed safety gates stayed green before and
after the failed serving A/B: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense
md5 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
---

@@ -281,7 +281,7 @@ operating point where paged is unambiguously faster than vLLM.
| S2 double-buffer set_inputs | overlap host input build with GPU | **DROPPED** | `set_inputs` is **~0.05 ms/step** - nothing to recover (the rebuild was the cost) | DSS |
| whole-step graph / host loop | the host scheduling loop as the serving residual | **CLOSED (~0-1%)** | baseline reuse 0% (agg 757.6) **statistically equal** to S1+S3 reuse 72% (agg 763.3); `hostproc` only ~4-8% of the per-step wall = **measured dead** | DSS |
| padded / fixed-slot decode | pad decode width to `--parallel` for ~100% reuse | **REJECTED (built, GPU-tested)** | inert (md5 bit-exact) but **regresses at every concurrency**; N=8 burst 28.16 -> 6.05 tok/s/seq (~4.6x slower); serving decode is **GPU-compute-bound**, dummy-row compute > reuse recovered | DSS |
| speculative decode (MTP) | draft + verify; greedy is bit-exact | **SAFETY-GATED, default-off** | Phase 14 passed recurrent rollback on the actual MoE GGUF, partial rejection, normalized greedy-prefix, and canonical inference gates; still not a paged-specific gap and still needs serving/API throughput proof before any parity claim | LMAP |
| speculative decode (MTP) | draft + verify; greedy is bit-exact | **REJECTED for current GB10 serving** | Phase 14 passed safety, but Phase 15 direct serving A/B regressed at every tested concurrency (n128 decode agg 662.4 -> 138.5 tok/s) despite high acceptance; likely breaks paged decode graph reuse (`graphs reused` 361 -> 1). Not a parity lever unless a future graph/batch-shape fix changes this result | LMAP |
The serving regime was the one place the static-bench parity did not carry over
(paged ~3.7 vs vLLM ~5.9 tok/s/seq, -39%, DSS). S1 made the decode step reusable

@@ -454,9 +454,9 @@ Phase 9 adds a narrow MTP smoke gate instead of production enablement:
MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`.
MTP remains opt-in and, after Phase 14, safety-gated but not throughput-proven.
It does not supersede the next GDN prefill scope until a serving phase proves
target-verification cost.
MTP remains opt-in and, after Phase 15, rejected as a current GB10 serving
throughput lever. It does not supersede the GDN/paged-serving conclusions unless
a future graph/batch-shape fix changes the serving result.
### Phase 14 MTP rollback update
@@ -478,8 +478,35 @@ than exact transcript md5 because `llama-speculative-simple` emits accepted
token groups and can produce a longer completion than `llama-completion -no-cnv`
for the same `-n`. Across `n=8,16,24,32,48`, no first differing token was found.
Next step: Phase 15 may benchmark serving/API throughput with MTP still
default-off and only behind the canonical inference gates.
Phase 15 completed that serving/API benchmark and rejected current MTP serving.
### Phase 15 MTP serving update
Phase 15 ran the direct `llama-server` serving A/B that Phase 14 enabled. It
rejects current MTP serving as a parity lever on GB10:
| arm | n | decode agg t/s | decode per-seq t/s | TTFT mean ms |
|---|---:|---:|---:|---:|
| baseline | 8 | 247.8 | 30.70 | 1181.1 |
| MTP | 8 | 109.8 | 14.26 | 1691.5 |
| baseline | 32 | 406.0 | 12.02 | 2762.2 |
| MTP | 32 | 111.7 | 3.61 | 4545.6 |
| baseline | 128 | 662.4 | 4.31 | 7747.2 |
| MTP | 128 | 138.5 | 0.97 | 20385.7 |
Artifact: `/home/mudler/bench/phase15_mtp_serving/20260701_042005`.
MTP did draft and accept tokens (`#gen tokens = 17293`, `#acc tokens = 15493`),
so this is not a no-draft false negative. The likely culprit is graph/batch
shape disruption: baseline logs show heavy graph reuse (`graphs reused = 361`
in the high-concurrency tail), while MTP logs show `graphs reused = 1` and much
higher per-slot eval time. Pre/post canonical inference gates stayed green:
MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
Do not keep tuning MTP draft length blindly. A follow-up must first profile
speculative verification batch shapes and CUDA graph reuse with
`nsys --cuda-graph-trace=node`.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.