docs(paged): refresh current serving snapshot

Record the Phase 20 same-session MoE paged-vs-vLLM serving snapshot on the current clean DGX mirror, including pre/post inference gates and the resulting GB10 parity decision.

Assisted-by: Codex:gpt-5
2026-07-02 20:37:03 -04:00 · 2026-07-01 03:15:30 +00:00
parent 310eb3c866
commit c99678da42
5 changed files with 236 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -564,6 +564,14 @@ backend-split + gallery plan is in

 ## 9. vLLM parity - final state (CLOSED)

+> 2026-07-01 follow-up: the investigation was reopened for MTP safety,
+> MTP-serving, graph-shape tracing, and a current-stack serving snapshot. Phases
+> 14-20 are recorded in
+> [`docs/GB10_PARITY_PHASE0_RESULTS.md`](docs/GB10_PARITY_PHASE0_RESULTS.md) and
+> [`docs/PARITY_HANDOFF.md`](docs/PARITY_HANDOFF.md). They did not change the
+> GB10 conclusion: MTP/scheduler shortcuts are rejected, and the latest clean
+> stack remains below vLLM serving parity.
+
 The multi-week GB10 (DGX Spark, sm_121) vLLM-parity investigation is **closed**.
 The standing, never-re-litigate record - full benchmark, every lever and verdict,
 the structural floors, the parity verdict - is
@@ -596,3 +604,13 @@ the structural floors, the parity verdict - is
  the vLLM advantages that lose on GB10** (FLA blocked-solve GDN, Marlin/CUTLASS
  grouped FP4, HBM-tuned full-cudagraph decode). Re-run the methodology on new
  silicon; do not reopen the GB10 levers.
+
+Latest current-stack MoE serving snapshot (`PTOK=128`, `GEN=64`, current clean
+DGX mirror `f2521ab12`, artifact
+`/home/mudler/bench/phase20_current_snapshot/20260701_050621`):
+
+| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
+|---|------------------|-----------------|-------------------|-----------|----------|----------------|
+| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
+| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
+| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -1359,3 +1359,49 @@ Decision:
  `K + 1` verification-row expansion, not mixed draft lengths.
 - Any future MTP parity work needs a deeper target-verify graph/state design,
  not a small server scheduling shortcut.
+
+## Phase 20 Current-Stack Serving Snapshot
+
+Phase 20 refreshed the MoE paged-vs-vLLM serving baseline on the current clean
+DGX mirror after the MTP investigation.
+
+Artifact:
+
+- `/home/mudler/bench/phase20_current_snapshot/20260701_050621`
+
+Current source:
+
+- `/home/mudler/llama-phase6-source`
+- `f2521ab12 feat(server): trace speculative batch shapes`
+
+Pre/post gate result:
+
+- Pre-gate and post-gate both passed.
+- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+- Full `MUL_MAT_ID`: `806/806` on CUDA0.
+
+Serving snapshot:
+
+| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
+|---|------------------|-----------------|-------------------|-----------|----------|----------------|
+| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
+| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
+| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |
+
+Latency/prefill snapshot:
+
+| n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps |
+|---|---------------|--------------|------------------|--------------------|------------------|
+| 8 | 783.6 | 271.8 | 2.88x | 1669.9 | 4371.5 |
+| 32 | 2630.6 | 783.8 | 3.36x | 1712.8 | 5358.3 |
+| 128 | 7678.7 | 2465.7 | 3.11x | 1660.4 | 5242.9 |
+
+Decision:
+
+- The latest clean stack is still not at vLLM serving parity on GB10.
+- The user-visible gap is dominated by prefill/TTFT and e2e serving throughput,
+  not by anything an MTP or scheduler shortcut could close.
+- Keep MTP scheduler work closed. The next credible parity path is either a
+  datacenter-Blackwell rerun or a larger fused-kernel project outside the
+  low-conflict GB10 patch stack.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -283,6 +283,27 @@ the real `K + 1` verification-row expansion. Do not build a Phase 20 scheduler
 experiment on this evidence. Future MTP work would need a deeper target-verify
 graph/state design, not another small server scheduling shortcut.

+Phase 20 refreshed the current-stack MoE serving snapshot against vLLM using the
+clean `~/llama-phase6-source` mirror (`f2521ab12`) rather than the stale
+`llama-paged-dev` benchmark tree. Artifact:
+`/home/mudler/bench/phase20_current_snapshot/20260701_050621`. Pre/post gates
+passed with canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+
+Current MoE serving snapshot (`PTOK=128`, `GEN=64`):
+
+| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
+|---|------------------|-----------------|-------------------|-----------|----------|----------------|
+| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
+| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
+| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |
+
+TTFT remains the clearest user-visible gap: paged TTFT is 2.88x/3.36x/3.11x
+vLLM's at n8/n32/n128, and paged prefill_tps is only 32-38% of vLLM's.
+This keeps the GB10 shortcut closure intact: do not reopen MTP or small
+scheduler work. The credible next parity path is a datacenter-Blackwell rerun or
+a larger fused-kernel project outside this low-conflict patch stack.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -616,6 +616,34 @@ Decision: do not build a Phase 20 group/defer scheduler on current evidence.
 Future MTP work would need a deeper target-verify graph/state design, not
 another small server scheduling shortcut.

+### Phase 20 current-stack serving snapshot
+
+Phase 20 refreshed the MoE serving baseline using the current clean DGX mirror
+(`~/llama-phase6-source`, `f2521ab12`) and the same-session vLLM server. Artifact:
+`/home/mudler/bench/phase20_current_snapshot/20260701_050621`.
+
+Pre/post canonical gates passed: MoE `8cb0ce23777bf55f92f63d0292c756b0`,
+dense `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+
+| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
+|---|------------------|-----------------|-------------------|-----------|----------|----------------|
+| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
+| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
+| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |
+
+TTFT/prefill remains the largest user-visible gap:
+
+| n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps |
+|---|---------------|--------------|------------------|--------------------|------------------|
+| 8 | 783.6 | 271.8 | 2.88x | 1669.9 | 4371.5 |
+| 32 | 2630.6 | 783.8 | 3.36x | 1712.8 | 5358.3 |
+| 128 | 7678.7 | 2465.7 | 3.11x | 1660.4 | 5242.9 |
+
+Decision: the latest stack is still below vLLM serving parity on GB10. The next
+credible parity path is not another MTP/scheduler shortcut; it is either the
+documented datacenter-Blackwell rerun or a larger fused-kernel project outside
+the low-conflict GB10 patch stack.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update
--- a/docs/superpowers/plans/2026-07-01-current-stack-serving-snapshot-phase20.md
+++ b/docs/superpowers/plans/2026-07-01-current-stack-serving-snapshot-phase20.md
@@ -0,0 +1,123 @@
+# Current Stack Serving Snapshot Phase 20 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use
+> superpowers:verification-before-completion before recording the phase result.
+> Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** refresh the MoE paged-vs-vLLM serving baseline on the current clean
+llama.cpp stack after the MTP investigation.
+
+**Architecture:** benchmark only. Run the current DGX mirror
+`~/llama-phase6-source` against vLLM in the same lock window with the same h2h
+client, then run canonical pre/post inference gates. Do not change source.
+
+**Tech Stack:** llama.cpp `llama-server`, vLLM `0.23.0`, DGX GB10,
+`h2h_cli3.py`, LocalAI paged patch stack.
+
+---
+
+## Task 1: Run Current-Stack Snapshot
+
+- [x] **Step 1: Confirm DGX is free**
+
+  Preflight passed:
+
+  - `docker=0`
+  - `local_ai_worker=0`
+  - `compute=0`
+
+- [x] **Step 2: Build current mirror targets**
+
+  Source:
+
+  - `/home/mudler/llama-phase6-source`
+  - HEAD: `f2521ab12 feat(server): trace speculative batch shapes`
+
+  Build:
+
+  ```bash
+  cmake --build ~/llama-phase6-source/build-cuda \
+    --target llama-server llama-completion test-backend-ops -j8
+  ```
+
+- [x] **Step 3: Run paged and vLLM serving arms**
+
+  Artifact:
+
+  - `/home/mudler/bench/phase20_current_snapshot/20260701_050621`
+
+  Workload:
+
+  - MoE Qwen3.6-35B-A3B-NVFP4
+  - `NPL=8,32,128`
+  - `PTOK=128`
+  - `GEN=64`
+  - h2h OpenAI completions client with fresh nonces
+
+## Task 2: Verify Inference Gates
+
+- [x] **Step 1: Pre-gate passed**
+
+  Artifact:
+
+  - `/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_pre`
+
+  Result:
+
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT_ID`: `806/806`
+
+- [x] **Step 2: Post-gate passed**
+
+  Artifact:
+
+  - `/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_post`
+
+  Result:
+
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT_ID`: `806/806`
+
+## Task 3: Snapshot Result
+
+- [x] **Step 1: Compare serving throughput**
+
+  | n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
+  |---|------------------|-----------------|-------------------|-----------|----------|----------------|
+  | 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
+  | 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
+  | 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |
+
+- [x] **Step 2: Compare latency and prefill**
+
+  | n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps |
+  |---|---------------|--------------|------------------|--------------------|------------------|
+  | 8 | 783.6 | 271.8 | 2.88x | 1669.9 | 4371.5 |
+  | 32 | 2630.6 | 783.8 | 3.36x | 1712.8 | 5358.3 |
+  | 128 | 7678.7 | 2465.7 | 3.11x | 1660.4 | 5242.9 |
+
+  The current stack remains far from vLLM serving parity in e2e/TTFT because
+  prefill is still much slower.
+
+## Task 4: Decision
+
+- [x] **Step 1: Keep GB10 shortcut closure**
+
+  This snapshot confirms the Phase 19 direction:
+
+  - MTP and scheduling shortcuts should stay closed.
+  - Current paged serving is still below vLLM on MoE serving throughput.
+  - The largest user-visible gap is prefill/TTFT: on this short serving snapshot
+    vLLM is roughly 2.6-3.2x faster on prefill_tps and 2.9-3.4x faster on TTFT.
+  - The next credible parity path is not another small GB10 server shortcut; it
+    is either a new-silicon rerun on datacenter Blackwell or a larger fused
+    kernel project outside the low-conflict patch stack.
+
+## Self-Review
+
+- No source behavior changed.
+- Pre/post inference gates passed.
+- The result uses the current clean mirror, not the stale `llama-paged-dev`
+  benchmark tree.
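
As a cross-check, the derived columns in the snapshot tables (paged/vLLM decode, paged/vLLM agg, paged/vLLM TTFT) and the prefill speedup quoted in the decision can be recomputed from the raw figures. The following is a minimal Python sketch, assuming the values are transcribed by hand from the tables in this commit. The names `SNAPSHOT` and `ratios` are hypothetical and are not part of `h2h_cli3.py` or the benchmark artifact.

```python
# Raw figures transcribed from the Phase 20 tables in this commit, keyed by n.
# Tuple order (an assumption of this sketch, not an artifact format):
#   paged decode_agg, vLLM decode_agg, paged agg, vLLM agg,
#   paged TTFT ms, vLLM TTFT ms, paged prefill_tps, vLLM prefill_tps
SNAPSHOT = {
    8:   (220.8, 290.5, 164.8, 245.5, 783.6, 271.8, 1669.9, 4371.5),
    32:  (411.1, 594.7, 252.1, 456.0, 2630.6, 783.8, 1712.8, 5358.3),
    128: (670.0, 1022.7, 322.4, 662.4, 7678.7, 2465.7, 1660.4, 5242.9),
}


def ratios(row):
    """Recompute the derived columns for one table row."""
    p_dec, v_dec, p_agg, v_agg, p_ttft, v_ttft, p_pre, v_pre = row
    return {
        # Throughput: paged as a percentage of vLLM (higher is better).
        "decode_pct": round(100 * p_dec / v_dec, 1),
        "agg_pct": round(100 * p_agg / v_agg, 1),
        # Latency: paged TTFT as a multiple of vLLM TTFT (lower is better).
        "ttft_x": round(p_ttft / v_ttft, 2),
        # Prefill: vLLM prefill_tps as a multiple of paged prefill_tps.
        "prefill_x": round(v_pre / p_pre, 2),
    }


if __name__ == "__main__":
    for n, row in SNAPSHOT.items():
        print(n, ratios(row))
```

The prefill speedup is expressed as vLLM over paged so that it reads the same way as the TTFT multiple: in both columns a larger number means a larger vLLM advantage.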