diff --git a/backend/cpp/llama-cpp-localai-paged/README.md b/backend/cpp/llama-cpp-localai-paged/README.md index 18875a98c..426176294 100644 --- a/backend/cpp/llama-cpp-localai-paged/README.md +++ b/backend/cpp/llama-cpp-localai-paged/README.md @@ -564,6 +564,14 @@ backend-split + gallery plan is in ## 9. vLLM parity - final state (CLOSED) +> 2026-07-01 follow-up: the investigation was reopened for MTP safety, +> MTP-serving, graph-shape tracing, and a current-stack serving snapshot. Phases +> 14-20 are recorded in +> [`docs/GB10_PARITY_PHASE0_RESULTS.md`](docs/GB10_PARITY_PHASE0_RESULTS.md) and +> [`docs/PARITY_HANDOFF.md`](docs/PARITY_HANDOFF.md). They did not change the +> GB10 conclusion: MTP/scheduler shortcuts are rejected, and the latest clean +> stack remains below vLLM serving parity. + The multi-week GB10 (DGX Spark, sm_121) vLLM-parity investigation is **closed**. The standing, never-re-litigate record - full benchmark, every lever and verdict, the structural floors, the parity verdict - is @@ -596,3 +604,13 @@ the structural floors, the parity verdict - is the vLLM advantages that lose on GB10** (FLA blocked-solve GDN, Marlin/CUTLASS grouped FP4, HBM-tuned full-cudagraph decode). Re-run the methodology on new silicon; do not reopen the GB10 levers. 
+ +Latest current-stack MoE serving snapshot (`PTOK=128`, `GEN=64`, current clean +DGX mirror `f2521ab12`, artifact +`/home/mudler/bench/phase20_current_snapshot/20260701_050621`): + +| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg | +|---|------------------|-----------------|-------------------|-----------|----------|----------------| +| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% | +| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% | +| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% | diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md index 6c34730b3..65fb1e00b 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md @@ -1359,3 +1359,49 @@ Decision: `K + 1` verification-row expansion, not mixed draft lengths. - Any future MTP parity work needs a deeper target-verify graph/state design, not a small server scheduling shortcut. + +## Phase 20 Current-Stack Serving Snapshot + +Phase 20 refreshed the MoE paged-vs-vLLM serving baseline on the current clean +DGX mirror after the MTP investigation. + +Artifact: + +- `/home/mudler/bench/phase20_current_snapshot/20260701_050621` + +Current source: + +- `/home/mudler/llama-phase6-source` +- `f2521ab12 feat(server): trace speculative batch shapes` + +Pre/post gate result: + +- Pre-gate and post-gate both passed. +- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`. +- Full `MUL_MAT_ID`: `806/806` on CUDA0. 
+ +Serving snapshot: + +| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg | +|---|------------------|-----------------|-------------------|-----------|----------|----------------| +| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% | +| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% | +| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% | + +Latency/prefill snapshot: + +| n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps | +|---|---------------|--------------|------------------|--------------------|------------------| +| 8 | 783.6 | 271.8 | 2.88x | 1669.9 | 4371.5 | +| 32 | 2630.6 | 783.8 | 3.36x | 1712.8 | 5358.3 | +| 128 | 7678.7 | 2465.7 | 3.11x | 1660.4 | 5242.9 | + +Decision: + +- The latest clean stack is still not at vLLM serving parity on GB10. +- The user-visible gap is dominated by prefill/TTFT and e2e serving throughput, + not by a now-open MTP or scheduler shortcut. +- Keep MTP scheduler work closed. The next credible parity path is either a + datacenter-Blackwell rerun or a larger fused-kernel project outside the + low-conflict GB10 patch stack. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index d3c99c210..7167e39c2 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -283,6 +283,27 @@ the real `K + 1` verification-row expansion. Do not build a Phase 20 scheduler experiment on this evidence. Future MTP work would need a deeper target-verify graph/state design, not another small server scheduling shortcut. +Phase 20 refreshed the current-stack MoE serving snapshot against vLLM using the +clean `~/llama-phase6-source` mirror (`f2521ab12`) rather than the stale +`llama-paged-dev` benchmark tree. Artifact: +`/home/mudler/bench/phase20_current_snapshot/20260701_050621`. 
Pre/post gates +passed with canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. + +Current MoE serving snapshot (`PTOK=128`, `GEN=64`): + +| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg | +|---|------------------|-----------------|-------------------|-----------|----------|----------------| +| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% | +| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% | +| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% | + +TTFT remains the clearest user-visible gap: paged is 2.88x/3.36x/3.11x slower +than vLLM at n8/n32/n128, and paged prefill_tps is roughly one-third of vLLM. +This keeps the GB10 shortcut closure intact: do not reopen MTP or small +scheduler work. The credible next parity path is a datacenter-Blackwell rerun or +a larger fused-kernel project outside this low-conflict patch stack. + --- ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes) diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md index b74f943d8..391356a11 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md @@ -616,6 +616,34 @@ Decision: do not build a Phase 20 group/defer scheduler on current evidence. Future MTP work would need a deeper target-verify graph/state design, not another small server scheduling shortcut. +### Phase 20 current-stack serving snapshot + +Phase 20 refreshed the MoE serving baseline using the current clean DGX mirror +(`~/llama-phase6-source`, `f2521ab12`) and the same-session vLLM server. Artifact: +`/home/mudler/bench/phase20_current_snapshot/20260701_050621`. 
+ +Pre/post canonical gates passed: MoE `8cb0ce23777bf55f92f63d0292c756b0`, +dense `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. + +| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg | +|---|------------------|-----------------|-------------------|-----------|----------|----------------| +| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% | +| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% | +| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% | + +TTFT/prefill remains the largest user-visible gap: + +| n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps | +|---|---------------|--------------|------------------|--------------------|------------------| +| 8 | 783.6 | 271.8 | 2.88x | 1669.9 | 4371.5 | +| 32 | 2630.6 | 783.8 | 3.36x | 1712.8 | 5358.3 | +| 128 | 7678.7 | 2465.7 | 3.11x | 1660.4 | 5242.9 | + +Decision: the latest stack is still below vLLM serving parity on GB10. The next +credible parity path is not another MTP/scheduler shortcut; it is either the +documented datacenter-Blackwell rerun or a larger fused-kernel project outside +the low-conflict GB10 patch stack. + Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`. 
### Phase 10 GDN C32 slab update diff --git a/docs/superpowers/plans/2026-07-01-current-stack-serving-snapshot-phase20.md b/docs/superpowers/plans/2026-07-01-current-stack-serving-snapshot-phase20.md new file mode 100644 index 000000000..20598110b --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-current-stack-serving-snapshot-phase20.md @@ -0,0 +1,123 @@ +# Current Stack Serving Snapshot Phase 20 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use +> superpowers:verification-before-completion before recording the phase result. +> Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** refresh the MoE paged-vs-vLLM serving baseline on the current clean +llama.cpp stack after the MTP investigation. + +**Architecture:** benchmark only. Run the current DGX mirror +`~/llama-phase6-source` against vLLM in the same lock window with the same h2h +client, then run canonical pre/post inference gates. Do not change source. + +**Tech Stack:** llama.cpp `llama-server`, vLLM `0.23.0`, DGX GB10, +`h2h_cli3.py`, LocalAI paged patch stack. 
+ +--- + +## Task 1: Run Current-Stack Snapshot + +- [x] **Step 1: Confirm DGX is free** + + Preflight passed: + + - `docker=0` + - `local_ai_worker=0` + - `compute=0` + +- [x] **Step 2: Build current mirror targets** + + Source: + + - `/home/mudler/llama-phase6-source` + - HEAD: `f2521ab12 feat(server): trace speculative batch shapes` + + Build: + + ```bash + cmake --build ~/llama-phase6-source/build-cuda \ + --target llama-server llama-completion test-backend-ops -j8 + ``` + +- [x] **Step 3: Run paged and vLLM serving arms** + + Artifact: + + - `/home/mudler/bench/phase20_current_snapshot/20260701_050621` + + Workload: + + - MoE Qwen3.6-35B-A3B-NVFP4 + - `NPL=8,32,128` + - `PTOK=128` + - `GEN=64` + - h2h OpenAI completions client with fresh nonces + +## Task 2: Verify Inference Gates + +- [x] **Step 1: Pre-gate passed** + + Artifact: + + - `/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_pre` + + Result: + + - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0` + - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439` + - `MUL_MAT_ID`: `806/806` + +- [x] **Step 2: Post-gate passed** + + Artifact: + + - `/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_post` + + Result: + + - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0` + - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439` + - `MUL_MAT_ID`: `806/806` + +## Task 3: Snapshot Result + +- [x] **Step 1: Compare serving throughput** + + | n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg | + |---|------------------|-----------------|-------------------|-----------|----------|----------------| + | 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% | + | 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% | + | 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% | + +- [x] **Step 2: Compare latency and prefill** + + | n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps | + 
|---|---------------|--------------|------------------|--------------------|------------------| + | 8 | 783.6 | 271.8 | 2.88x | 1669.9 | 4371.5 | + | 32 | 2630.6 | 783.8 | 3.36x | 1712.8 | 5358.3 | + | 128 | 7678.7 | 2465.7 | 3.11x | 1660.4 | 5242.9 | + + The current stack remains far from vLLM serving parity in e2e/TTFT because + prefill is still much slower. + +## Task 4: Decision + +- [x] **Step 1: Keep GB10 shortcut closure** + + This snapshot confirms the Phase 19 direction: + + - MTP and scheduling shortcuts should stay closed. + - Current paged serving is still below vLLM on MoE serving throughput. + - The largest user-visible gap is prefill/TTFT, where vLLM is roughly 2.6-3.4x + faster on this short serving snapshot. + - The next credible parity path is not another small GB10 server shortcut; it + is either a new-silicon rerun on datacenter Blackwell or a larger fused + kernel project outside the low-conflict patch stack. + +## Self-Review + +- No source behavior changed. +- Pre/post inference gates passed. +- The result uses the current clean mirror, not the stale `llama-paged-dev` + benchmark tree.
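
The ratio columns in the Phase 20 tables can be re-derived from the raw throughput and latency figures. The sketch below is one way to do that check, assuming the numbers are copied by hand from the tables. The names `THROUGHPUT`, `LATENCY`, `pct`, `times`, and `derived_rows` are hypothetical helpers for illustration and are not part of `h2h_cli3.py` or the benchmark artifact.

```python
# Raw figures copied from the Phase 20 snapshot tables (PTOK=128, GEN=64).
# All names in this block are illustrative, not part of the benchmark tooling.
THROUGHPUT = {
    # n: (paged decode_agg, vLLM decode_agg, paged agg, vLLM agg)
    8: (220.8, 290.5, 164.8, 245.5),
    32: (411.1, 594.7, 252.1, 456.0),
    128: (670.0, 1022.7, 322.4, 662.4),
}
LATENCY = {
    # n: (paged TTFT ms, vLLM TTFT ms, paged prefill_tps, vLLM prefill_tps)
    8: (783.6, 271.8, 1669.9, 4371.5),
    32: (2630.6, 783.8, 1712.8, 5358.3),
    128: (7678.7, 2465.7, 1660.4, 5242.9),
}


def pct(paged, vllm):
    """Paged throughput as a percentage of vLLM throughput."""
    return f"{100.0 * paged / vllm:.1f}%"


def times(larger, smaller):
    """Multiplicative gap between two positive measurements."""
    return f"{larger / smaller:.2f}x"


def derived_rows():
    """Recompute the ratio columns for every concurrency level n."""
    rows = {}
    for n, (p_dec, v_dec, p_agg, v_agg) in THROUGHPUT.items():
        p_ttft, v_ttft, p_pre, v_pre = LATENCY[n]
        rows[n] = {
            "decode": pct(p_dec, v_dec),      # paged/vLLM decode
            "agg": pct(p_agg, v_agg),         # paged/vLLM agg
            "ttft": times(p_ttft, v_ttft),    # paged TTFT relative to vLLM
            "prefill": times(v_pre, p_pre),   # vLLM prefill_tps relative to paged
        }
    return rows


if __name__ == "__main__":
    for n, row in derived_rows().items():
        print(n, row)
```

The `prefill` entry is not a column in the tables. It is derived here as vLLM prefill_tps divided by paged prefill_tps, so it can be read next to the TTFT ratio when comparing the two gaps.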