docs(paged): refresh current serving snapshot

Record the Phase 20 same-session MoE paged-vs-vLLM serving snapshot on the current clean DGX mirror, including pre/post inference gates and the resulting GB10 parity decision.

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 03:15:30 +00:00
parent 310eb3c866
commit c99678da42
5 changed files with 236 additions and 0 deletions

@@ -564,6 +564,14 @@ backend-split + gallery plan is in
## 9. vLLM parity - final state (CLOSED)
> 2026-07-01 follow-up: the investigation was reopened for MTP safety,
> MTP-serving, graph-shape tracing, and a current-stack serving snapshot. Phases
> 14-20 are recorded in
> [`docs/GB10_PARITY_PHASE0_RESULTS.md`](docs/GB10_PARITY_PHASE0_RESULTS.md) and
> [`docs/PARITY_HANDOFF.md`](docs/PARITY_HANDOFF.md). They did not change the
> GB10 conclusion: MTP/scheduler shortcuts are rejected, and the latest clean
> stack remains below vLLM serving parity.

The multi-week GB10 (DGX Spark, sm_121) vLLM-parity investigation is **closed**.
The standing, never-re-litigate record - full benchmark, every lever and verdict,
the structural floors, the parity verdict - is
@@ -596,3 +604,13 @@ the structural floors, the parity verdict - is
the vLLM advantages that lose on GB10** (FLA blocked-solve GDN, Marlin/CUTLASS
grouped FP4, HBM-tuned full-cudagraph decode). Re-run the methodology on new
silicon; do not reopen the GB10 levers.

Latest current-stack MoE serving snapshot (`PTOK=128`, `GEN=64`, current clean
DGX mirror `f2521ab12`, artifact
`/home/mudler/bench/phase20_current_snapshot/20260701_050621`):

| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
|---|------------------|-----------------|-------------------|-----------|----------|----------------|
| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |
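
The ratio columns in the table above are plain paged-over-vLLM quotients. The following is a minimal sketch for recomputing them from the raw figures, assuming each ratio is rounded to one decimal place; the names `ROWS`, `ratio_pct`, and `parity_table` are illustrative and do not come from the benchmark harness.

```python
# Throughput figures copied from the snapshot table above.
# Columns: n, paged decode_agg, vLLM decode_agg, paged agg, vLLM agg.
# Units are whatever the benchmark reported; only the ratios matter here.
ROWS = [
    (8, 220.8, 290.5, 164.8, 245.5),
    (32, 411.1, 594.7, 252.1, 456.0),
    (128, 670.0, 1022.7, 322.4, 662.4),
]


def ratio_pct(paged: float, vllm: float) -> float:
    """Paged throughput as a percentage of vLLM throughput, one decimal."""
    return round(100.0 * paged / vllm, 1)


def parity_table(rows=ROWS):
    """Return (n, decode ratio %, agg ratio %) for each concurrency level."""
    return [
        (n, ratio_pct(p_dec, v_dec), ratio_pct(p_agg, v_agg))
        for n, p_dec, v_dec, p_agg, v_agg in rows
    ]


if __name__ == "__main__":
    for n, decode_pct, agg_pct in parity_table():
        print(f"n={n:<4} decode {decode_pct:5.1f}%  agg {agg_pct:5.1f}%")
```

Both ratios fall as `n` grows, which is the trend the parity decision rests on.
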

@@ -1359,3 +1359,49 @@ Decision:
`K + 1` verification-row expansion, not mixed draft lengths.
- Any future MTP parity work needs a deeper target-verify graph/state design,
not a small server scheduling shortcut.

## Phase 20 Current-Stack Serving Snapshot

Phase 20 refreshed the MoE paged-vs-vLLM serving baseline on the current clean
DGX mirror after the MTP investigation.

Artifact:

- `/home/mudler/bench/phase20_current_snapshot/20260701_050621`

Current source:

- `/home/mudler/llama-phase6-source`
- `f2521ab12 feat(server): trace speculative batch shapes`

Pre/post gate result:

- Pre-gate and post-gate both passed.
- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Full `MUL_MAT_ID`: `806/806` on CUDA0.
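
The transcript part of the gate above amounts to comparing md5 digests against the canonical values. The following is a minimal sketch of such a check, not the actual gate script: the function names and the idea of passing transcript file paths are assumptions, and the `MUL_MAT_ID` placement check is not covered.

```python
import hashlib
from pathlib import Path

# Canonical digests as recorded in the gate result above.
CANONICAL_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 and return the hex digest."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gate_passes(transcripts: dict[str, Path]) -> bool:
    """True only if every given transcript matches its canonical digest.

    `transcripts` maps a model key ("moe" or "dense") to a transcript file;
    the file locations are hypothetical and supplied by the caller.
    """
    return all(
        md5_of_file(path) == CANONICAL_MD5[name]
        for name, path in transcripts.items()
    )
```

Running the same check before and after the benchmark session shows whether inference output drifted while the serving numbers were being collected.
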

Serving snapshot:

| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
|---|------------------|-----------------|-------------------|-----------|----------|----------------|
| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |

Latency/prefill snapshot:

| n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps |
|---|---------------|--------------|------------------|--------------------|------------------|
| 8 | 783.6 | 271.8 | 2.88x | 1669.9 | 4371.5 |
| 32 | 2630.6 | 783.8 | 3.36x | 1712.8 | 5358.3 |
| 128 | 7678.7 | 2465.7 | 3.11x | 1660.4 | 5242.9 |

Decision:

- The latest clean stack is still not at vLLM serving parity on GB10.
- The user-visible gap is dominated by prefill/TTFT and e2e serving throughput,
not by any MTP or scheduler shortcut that remains open.
- Keep MTP scheduler work closed. The next credible parity path is either a
datacenter-Blackwell rerun or a larger fused-kernel project outside the
low-conflict GB10 patch stack.

@@ -283,6 +283,27 @@ the real `K + 1` verification-row expansion. Do not build a Phase 20 scheduler
experiment on this evidence. Future MTP work would need a deeper target-verify
graph/state design, not another small server scheduling shortcut.

Phase 20 refreshed the current-stack MoE serving snapshot against vLLM using the
clean `~/llama-phase6-source` mirror (`f2521ab12`) rather than the stale
`llama-paged-dev` benchmark tree. Artifact:
`/home/mudler/bench/phase20_current_snapshot/20260701_050621`. Pre/post gates
passed with canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.

Current MoE serving snapshot (`PTOK=128`, `GEN=64`):

| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
|---|------------------|-----------------|-------------------|-----------|----------|----------------|
| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |

TTFT remains the clearest user-visible gap: paged is 2.88x/3.36x/3.11x slower
than vLLM at n8/n32/n128, and paged prefill_tps is roughly one-third of vLLM.
This keeps the GB10 shortcut closure intact: do not reopen MTP or small
scheduler work. The credible next parity path is a datacenter-Blackwell rerun or
a larger fused-kernel project outside this low-conflict patch stack.

---

## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)

@@ -616,6 +616,34 @@ Decision: do not build a Phase 20 group/defer scheduler on current evidence.
Future MTP work would need a deeper target-verify graph/state design, not
another small server scheduling shortcut.

### Phase 20 current-stack serving snapshot

Phase 20 refreshed the MoE serving baseline using the current clean DGX mirror
(`~/llama-phase6-source`, `f2521ab12`) and the same-session vLLM server. Artifact:
`/home/mudler/bench/phase20_current_snapshot/20260701_050621`.

Pre/post canonical gates passed: MoE `8cb0ce23777bf55f92f63d0292c756b0`,
dense `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.

| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
|---|------------------|-----------------|-------------------|-----------|----------|----------------|
| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |

TTFT/prefill remains the largest user-visible gap:

| n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps |
|---|---------------|--------------|------------------|--------------------|------------------|
| 8 | 783.6 | 271.8 | 2.88x | 1669.9 | 4371.5 |
| 32 | 2630.6 | 783.8 | 3.36x | 1712.8 | 5358.3 |
| 128 | 7678.7 | 2465.7 | 3.11x | 1660.4 | 5242.9 |
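
The TTFT slowdown column and the size of the prefill gap can be recomputed from the table above. This is a small sketch under the assumption that the slowdown is paged TTFT divided by vLLM TTFT, rounded to two decimals; the helper names are illustrative only.

```python
# Latency/prefill figures copied from the table above.
# Columns: n, paged TTFT ms, vLLM TTFT ms, paged prefill_tps, vLLM prefill_tps.
LATENCY_ROWS = [
    (8, 783.6, 271.8, 1669.9, 4371.5),
    (32, 2630.6, 783.8, 1712.8, 5358.3),
    (128, 7678.7, 2465.7, 1660.4, 5242.9),
]


def ttft_slowdown(paged_ms: float, vllm_ms: float) -> float:
    """How many times slower paged TTFT is than vLLM TTFT."""
    return round(paged_ms / vllm_ms, 2)


def prefill_fraction_pct(paged_tps: float, vllm_tps: float) -> float:
    """Paged prefill throughput as a percentage of vLLM prefill throughput."""
    return round(100.0 * paged_tps / vllm_tps, 1)


if __name__ == "__main__":
    for n, p_ttft, v_ttft, p_pre, v_pre in LATENCY_ROWS:
        print(
            f"n={n:<4} TTFT {ttft_slowdown(p_ttft, v_ttft):.2f}x slower, "
            f"prefill at {prefill_fraction_pct(p_pre, v_pre):.1f}% of vLLM"
        )
```

Paged prefill throughput lands between roughly 30% and 40% of vLLM at every concurrency level, which lines up with the roughly threefold TTFT slowdown.
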

Decision: the latest stack is still below vLLM serving parity on GB10. The next
credible parity path is not another MTP/scheduler shortcut; it is either the
documented datacenter-Blackwell rerun or a larger fused-kernel project outside
the low-conflict GB10 patch stack.

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update