mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): profile MTP graph reuse loss

Record Phase 16 nsys evidence that current MTP serving loses paged decode graph reuse and increases GPU work, explaining the Phase 15 serving regression.

Assisted-by: Codex:gpt-5
@@ -944,6 +944,42 @@ Likely root cause:
 defeats the paged decode graph-reuse wins, so extra verification dominates
 despite high draft acceptance.
 
+## Phase 16 MTP Graph-Reuse Profile
+
+Phase 16 profiled the Phase 15 hypothesis with
+`nsys --cuda-graph-trace=node` on a smaller direct serving shape:
+
+- server: `-c 32768 -b 2048 -ub 512 --parallel 32`,
+- client: `h2h_cli3.py -n 8 --ptok 64 --gen 64`,
+- arms: baseline vs `--spec-type draft-mtp --spec-draft-n-max 3`.
+
+Artifact:
+
+- `/home/mudler/bench/phase16_mtp_graph_profile/20260701_043016`
+
+Result:
+
+| arm | decode agg t/s | decode per-seq t/s | wall s | graph reuse |
+|---|---:|---:|---:|---:|
+| baseline | 230.5 | 28.07 | 3.523 | `graphs reused = 62` |
+| MTP | 97.7 | 12.83 | 7.049 | `graphs reused = 1` |
+
+MTP drafted and accepted tokens:
+
+- `draft acceptance = 0.81481 (44 accepted / 54 generated)`,
+- `#gen tokens = 460`, `#acc tokens = 346`.
+
+Nsight kernel summaries also show materially more GPU work in the MTP run:
+roughly `5.89 s` top-level GPU kernel time versus `2.59 s` for the baseline
+small profile.
+
+Decision:
+
+- Phase 16 supports the Phase 15 root-cause hypothesis: current MTP serving
+defeats the paged decode graph-reuse advantage and increases GPU work.
+- A future source phase must start at speculative verification batch shapes and
+graph-reuse keys, not at MTP draft-length tuning.
+
 ## Phase 10 GDN C32 Slab Baseline and Source Check
 
 Phase 10 starts a separate GDN prefill path; it does not reopen the rejected
@@ -229,9 +229,12 @@ Phase 14 re-validated the MTP bucket as safe, then Phase 15 rejected it as a
 current GB10 serving-throughput lever. Do not enable it by default and do not
 keep tuning draft length blindly. The only plausible follow-up is a graph-reuse
 and speculative verification batch-shape profile with
-`nsys --cuda-graph-trace=node`. The fixed safety gates stayed green before and
-after the failed serving A/B: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense
-md5 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+`nsys --cuda-graph-trace=node`. Phase 16 ran that profile and supported the
+root cause: small-shape baseline reused graphs (`graphs reused = 62`) while MTP
+did not (`graphs reused = 1`) and did ~2.3x more GPU kernel work. The fixed
+safety gates stayed green before and after the failed serving A/B: MoE md5
+`8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
 
 ---
 
@@ -281,7 +281,7 @@ operating point where paged is unambiguously faster than vLLM.
 | S2 double-buffer set_inputs | overlap host input build with GPU | **DROPPED** | `set_inputs` is **~0.05 ms/step** - nothing to recover (the rebuild was the cost) | DSS |
 | whole-step graph / host loop | the host scheduling loop as the serving residual | **CLOSED (~0-1%)** | baseline reuse 0% (agg 757.6) **statistically equal** to S1+S3 reuse 72% (agg 763.3); `hostproc` only ~4-8% of the per-step wall = **measured dead** | DSS |
 | padded / fixed-slot decode | pad decode width to `--parallel` for ~100% reuse | **REJECTED (built, GPU-tested)** | inert (md5 bit-exact) but **regresses at every concurrency**; N=8 burst 28.16 -> 6.05 tok/s/seq (~4.6x slower); serving decode is **GPU-compute-bound**, dummy-row compute > reuse recovered | DSS |
-| speculative decode (MTP) | draft + verify; greedy is bit-exact | **REJECTED for current GB10 serving** | Phase 14 passed safety, but Phase 15 direct serving A/B regressed at every tested concurrency (n128 decode agg 662.4 -> 138.5 tok/s) despite high acceptance; likely breaks paged decode graph reuse (`graphs reused` 361 -> 1). Not a parity lever unless a future graph/batch-shape fix changes this result | LMAP |
+| speculative decode (MTP) | draft + verify; greedy is bit-exact | **REJECTED for current GB10 serving** | Phase 14 passed safety, but Phase 15 direct serving A/B regressed at every tested concurrency (n128 decode agg 662.4 -> 138.5 tok/s) despite high acceptance; Phase 16 profile supports graph-reuse loss as root cause (`graphs reused` 62 -> 1 in the small nsys run). Not a parity lever unless a future graph/batch-shape fix changes this result | LMAP |
 
 The serving regime was the one place the static-bench parity did not carry over
 (paged ~3.7 vs vLLM ~5.9 tok/s/seq, -39%, DSS). S1 made the decode step reusable
@@ -508,6 +508,26 @@ Do not keep tuning MTP draft length blindly. A follow-up must first profile
 speculative verification batch shapes and CUDA graph reuse with
 `nsys --cuda-graph-trace=node`.
 
+### Phase 16 MTP graph-reuse profile
+
+Phase 16 ran that profile on a smaller direct serving shape (`n=8`, `ptok=64`,
+`gen=64`) with `nsys --cuda-graph-trace=node`.
+
+Artifact: `/home/mudler/bench/phase16_mtp_graph_profile/20260701_043016`.
+
+Result:
+
+- baseline: `decode_agg_tps=230.5`, `graphs reused = 62`,
+- MTP: `decode_agg_tps=97.7`, `graphs reused = 1`,
+- MTP drafted (`#gen tokens = 460`, `#acc tokens = 346`),
+- `nsys stats` showed materially more GPU kernel time in MTP (~`5.89 s`) than
+baseline (~`2.59 s`).
+
+This supports the root-cause hypothesis: current MTP serving disrupts the paged
+decode graph-reuse path and increases GPU work. If MTP is reopened, start at
+`tools/server/server-context.cpp` speculative verification batch construction
+and graph-reuse keys, not draft-length tuning.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
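
Reviewer note: the derived figures quoted in this commit (the `~2.3x` GPU kernel work and the `0.81481` draft acceptance) can be recomputed from the raw Phase 16 numbers. The following is a minimal sketch, not part of the commit; the dictionary keys and the `summarize` helper are hypothetical names chosen for illustration, and the values are copied from the result table and bullets in the hunks above.

```python
# Hypothetical helper, not from the repository: recompute the ratios quoted
# in the Phase 16 notes from the raw figures recorded in this commit.

# Values copied from the Phase 16 result table and nsys summary.
baseline = {
    "decode_agg_tps": 230.5,
    "decode_per_seq_tps": 28.07,
    "wall_s": 3.523,
    "gpu_kernel_s": 2.59,
    "graphs_reused": 62,
}
mtp = {
    "decode_agg_tps": 97.7,
    "decode_per_seq_tps": 12.83,
    "wall_s": 7.049,
    "gpu_kernel_s": 5.89,
    "graphs_reused": 1,
}


def summarize(base: dict, arm: dict, accepted: int, generated: int) -> dict:
    """Return slowdown/growth ratios of `arm` relative to `base`."""
    return {
        # > 1 means the MTP arm decodes slower than baseline.
        "agg_slowdown": base["decode_agg_tps"] / arm["decode_agg_tps"],
        "per_seq_slowdown": base["decode_per_seq_tps"] / arm["decode_per_seq_tps"],
        # > 1 means the MTP arm takes longer or does more GPU work.
        "wall_growth": arm["wall_s"] / base["wall_s"],
        "gpu_kernel_growth": arm["gpu_kernel_s"] / base["gpu_kernel_s"],
        # Acceptance as logged: accepted draft tokens / generated draft tokens.
        "draft_acceptance": accepted / generated,
    }


summary = summarize(baseline, mtp, accepted=44, generated=54)

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The kernel-time ratio rounds to the `~2.3x` stated in the second hunk, and the wall-time ratio is close to 2x, which is in line with the commit's reading that the MTP arm does more GPU work rather than merely losing host-side overlap. The `#gen tokens = 460` / `#acc tokens = 346` pair is a separate counter from the `44 / 54` pair and is deliberately left out of the sketch.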