mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 20:37:03 -04:00
docs(paged): refresh current serving snapshot
Record the Phase 20 same-session MoE paged-vs-vLLM serving snapshot on the current clean DGX mirror, including pre/post inference gates and the resulting GB10 parity decision. Assisted-by: Codex:gpt-5
@@ -564,6 +564,14 @@ backend-split + gallery plan is in
|
||||
|
||||
## 9. vLLM parity - final state (CLOSED)
|
||||
|
||||
> 2026-07-01 follow-up: the investigation was reopened for MTP safety,
> MTP-serving, graph-shape tracing, and a current-stack serving snapshot. Phases
> 14-20 are recorded in
> [`docs/GB10_PARITY_PHASE0_RESULTS.md`](docs/GB10_PARITY_PHASE0_RESULTS.md) and
> [`docs/PARITY_HANDOFF.md`](docs/PARITY_HANDOFF.md). They did not change the
> GB10 conclusion: MTP/scheduler shortcuts are rejected, and the latest clean
> stack remains below vLLM serving parity.
|
||||
|
||||
The multi-week GB10 (DGX Spark, sm_121) vLLM-parity investigation is **closed**.
|
||||
The standing, never-re-litigate record - full benchmark, every lever and verdict,
|
||||
the structural floors, the parity verdict - is
|
||||
@@ -596,3 +604,13 @@ the structural floors, the parity verdict - is
|
||||
the vLLM advantages that lose on GB10** (FLA blocked-solve GDN, Marlin/CUTLASS
|
||||
grouped FP4, HBM-tuned full-cudagraph decode). Re-run the methodology on new
|
||||
silicon; do not reopen the GB10 levers.
|
||||
|
||||
Latest current-stack MoE serving snapshot (`PTOK=128`, `GEN=64`, current clean
|
||||
DGX mirror `f2521ab12`, artifact
|
||||
`/home/mudler/bench/phase20_current_snapshot/20260701_050621`):
|
||||
|
||||
| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
|---|------------------|-----------------|-------------------|-----------|----------|----------------|
| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |
|
||||
|
||||
@@ -1359,3 +1359,49 @@ Decision:
|
||||
`K + 1` verification-row expansion, not mixed draft lengths.
|
||||
- Any future MTP parity work needs a deeper target-verify graph/state design,
|
||||
not a small server scheduling shortcut.
|
||||
|
||||
## Phase 20 Current-Stack Serving Snapshot
|
||||
|
||||
Phase 20 refreshed the MoE paged-vs-vLLM serving baseline on the current clean
|
||||
DGX mirror after the MTP investigation.
|
||||
|
||||
Artifact:
|
||||
|
||||
- `/home/mudler/bench/phase20_current_snapshot/20260701_050621`
|
||||
|
||||
Current source:
|
||||
|
||||
- `/home/mudler/llama-phase6-source`
|
||||
- `f2521ab12 feat(server): trace speculative batch shapes`
|
||||
|
||||
Pre/post gate result:
|
||||
|
||||
- Pre-gate and post-gate both passed.
|
||||
- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`.
|
||||
- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`.
|
||||
- Full `MUL_MAT_ID`: `806/806` on CUDA0.
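
The transcript part of the pre/post gate amounts to hashing a canonical transcript before and after the benchmark and requiring both digests to match the recorded value. A minimal Python sketch of that comparison follows; the function names are hypothetical, and how the real gate script produces its transcripts is not shown in this document.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the lowercase hex MD5 digest of a transcript's raw bytes."""
    return hashlib.md5(data).hexdigest()


def gate_passes(pre: bytes, post: bytes, expected_md5: str) -> bool:
    """True only if both the pre and post transcripts hash to the expected digest.

    Hypothetical helper: the actual gate tooling is not part of this document.
    """
    return md5_hex(pre) == expected_md5 and md5_hex(post) == expected_md5
```

In this framing, a passing gate means the benchmark run did not perturb canonical inference output: the pre and post digests are byte-for-byte identical to the recorded md5.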
|
||||
|
||||
Serving snapshot:
|
||||
|
||||
| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
|---|------------------|-----------------|-------------------|-----------|----------|----------------|
| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |
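
The percentage columns in the serving snapshot are plain paged-over-vLLM ratios. A short Python sketch that recomputes them from the raw table values is given here as a cross-check; the dictionary and function names are illustrative and are not part of the benchmark tooling.

```python
# Values copied from the serving snapshot table: n -> (paged, vLLM).
DECODE_AGG = {8: (220.8, 290.5), 32: (411.1, 594.7), 128: (670.0, 1022.7)}
AGG = {8: (164.8, 245.5), 32: (252.1, 456.0), 128: (322.4, 662.4)}


def paged_over_vllm_pct(paged: float, vllm: float) -> float:
    """Paged throughput as a percentage of vLLM, rounded to one decimal."""
    return round(100.0 * paged / vllm, 1)


def ratio_table(rows: dict) -> dict:
    """Map each concurrency level n to its paged/vLLM percentage."""
    return {n: paged_over_vllm_pct(p, v) for n, (p, v) in rows.items()}


if __name__ == "__main__":
    for n in DECODE_AGG:
        print(n, ratio_table(DECODE_AGG)[n], ratio_table(AGG)[n])
```

Both ratios fall as `n` grows, and the end-to-end `agg` ratio falls faster than the decode-only ratio.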
|
||||
|
||||
Latency/prefill snapshot:
|
||||
|
||||
| n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps |
|---|---------------|--------------|------------------|--------------------|------------------|
| 8 | 783.6 | 271.8 | 2.88x | 1669.9 | 4371.5 |
| 32 | 2630.6 | 783.8 | 3.36x | 1712.8 | 5358.3 |
| 128 | 7678.7 | 2465.7 | 3.11x | 1660.4 | 5242.9 |
|
||||
|
||||
Decision:
|
||||
|
||||
- The latest clean stack is still not at vLLM serving parity on GB10.
|
||||
- The user-visible gap is dominated by prefill/TTFT and e2e serving throughput,
  not by anything an MTP or scheduler shortcut could still address.
|
||||
- Keep MTP scheduler work closed. The next credible parity path is either a
|
||||
datacenter-Blackwell rerun or a larger fused-kernel project outside the
|
||||
low-conflict GB10 patch stack.
|
||||
|
||||
@@ -283,6 +283,27 @@ the real `K + 1` verification-row expansion. Do not build a Phase 20 scheduler
|
||||
experiment on this evidence. Future MTP work would need a deeper target-verify
|
||||
graph/state design, not another small server scheduling shortcut.
|
||||
|
||||
Phase 20 refreshed the current-stack MoE serving snapshot against vLLM using the
clean `~/llama-phase6-source` mirror (`f2521ab12`) rather than the stale
`llama-paged-dev` benchmark tree. Artifact:
`/home/mudler/bench/phase20_current_snapshot/20260701_050621`. Pre/post gates
passed with canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
|
||||
|
||||
Current MoE serving snapshot (`PTOK=128`, `GEN=64`):
|
||||
|
||||
| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
|---|------------------|-----------------|-------------------|-----------|----------|----------------|
| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |
|
||||
|
||||
TTFT remains the clearest user-visible gap: paged is 2.88x/3.36x/3.11x slower
than vLLM at n8/n32/n128, and paged prefill_tps is roughly one-third of vLLM.
This keeps the GB10 shortcut closure intact: do not reopen MTP or small
scheduler work. The credible next parity path is a datacenter-Blackwell rerun or
a larger fused-kernel project outside this low-conflict patch stack.
|
||||
|
||||
---
|
||||
|
||||
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
|
||||
|
||||
@@ -616,6 +616,34 @@ Decision: do not build a Phase 20 group/defer scheduler on current evidence.
|
||||
Future MTP work would need a deeper target-verify graph/state design, not
|
||||
another small server scheduling shortcut.
|
||||
|
||||
### Phase 20 current-stack serving snapshot
|
||||
|
||||
Phase 20 refreshed the MoE serving baseline using the current clean DGX mirror
|
||||
(`~/llama-phase6-source`, `f2521ab12`) and the same-session vLLM server. Artifact:
|
||||
`/home/mudler/bench/phase20_current_snapshot/20260701_050621`.
|
||||
|
||||
Pre/post canonical gates passed: MoE `8cb0ce23777bf55f92f63d0292c756b0`,
|
||||
dense `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
|
||||
|
||||
| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
|---|------------------|-----------------|-------------------|-----------|----------|----------------|
| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |
|
||||
|
||||
TTFT/prefill remains the largest user-visible gap:
|
||||
|
||||
| n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps |
|---|---------------|--------------|------------------|--------------------|------------------|
| 8 | 783.6 | 271.8 | 2.88x | 1669.9 | 4371.5 |
| 32 | 2630.6 | 783.8 | 3.36x | 1712.8 | 5358.3 |
| 128 | 7678.7 | 2465.7 | 3.11x | 1660.4 | 5242.9 |
|
||||
|
||||
Decision: the latest stack is still below vLLM serving parity on GB10. The next
credible parity path is not another MTP/scheduler shortcut; it is either the
documented datacenter-Blackwell rerun or a larger fused-kernel project outside
the low-conflict GB10 patch stack.
|
||||
|
||||
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
|
||||
|
||||
### Phase 10 GDN C32 slab update
|
||||
|
||||
@@ -0,0 +1,123 @@
|
||||
# Current Stack Serving Snapshot Phase 20 Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use
|
||||
> superpowers:verification-before-completion before recording the phase result.
|
||||
> Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||
|
||||
**Goal:** refresh the MoE paged-vs-vLLM serving baseline on the current clean
|
||||
llama.cpp stack after the MTP investigation.
|
||||
|
||||
**Architecture:** benchmark only. Run the current DGX mirror
|
||||
`~/llama-phase6-source` against vLLM in the same lock window with the same h2h
|
||||
client, then run canonical pre/post inference gates. Do not change source.
|
||||
|
||||
**Tech Stack:** llama.cpp `llama-server`, vLLM `0.23.0`, DGX GB10,
|
||||
`h2h_cli3.py`, LocalAI paged patch stack.
|
||||
|
||||
---
|
||||
|
||||
## Task 1: Run Current-Stack Snapshot
|
||||
|
||||
- [x] **Step 1: Confirm DGX is free**
|
||||
|
||||
Preflight passed:
|
||||
|
||||
- `docker=0`
|
||||
- `local_ai_worker=0`
|
||||
- `compute=0`
|
||||
|
||||
- [x] **Step 2: Build current mirror targets**
|
||||
|
||||
Source:
|
||||
|
||||
- `/home/mudler/llama-phase6-source`
|
||||
- HEAD: `f2521ab12 feat(server): trace speculative batch shapes`
|
||||
|
||||
Build:
|
||||
|
||||
```bash
cmake --build ~/llama-phase6-source/build-cuda \
  --target llama-server llama-completion test-backend-ops -j8
```
|
||||
|
||||
- [x] **Step 3: Run paged and vLLM serving arms**
|
||||
|
||||
Artifact:
|
||||
|
||||
- `/home/mudler/bench/phase20_current_snapshot/20260701_050621`
|
||||
|
||||
Workload:
|
||||
|
||||
- MoE Qwen3.6-35B-A3B-NVFP4
|
||||
- `NPL=8,32,128`
|
||||
- `PTOK=128`
|
||||
- `GEN=64`
|
||||
- h2h OpenAI completions client with fresh nonces
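
A fresh nonce per request keeps any two prompts from sharing a prefix, so neither server can answer from a prefix cache. The Python sketch below shows one way such a request body could be built. It is an assumption-laden illustration, not the contents of `h2h_cli3.py`: the function name, the filler text, and the model name are hypothetical, and the word count only approximates the `PTOK` token count.

```python
import uuid


def build_completion_payload(ptok_words: int, gen: int,
                             model: str = "hypothetical-model") -> dict:
    """Build an OpenAI-style /v1/completions request body with a unique prefix.

    Assumption: a random hex nonce at the very start of the prompt is enough
    to make every prompt prefix distinct.
    """
    nonce = uuid.uuid4().hex  # fresh per request
    filler = " ".join(["token"] * ptok_words)  # rough stand-in for PTOK tokens
    return {
        "model": model,
        "prompt": f"[{nonce}] {filler}",
        "max_tokens": gen,  # corresponds to GEN in the workload above
        "stream": True,     # streaming lets a client timestamp the first token
    }
```

Calling `build_completion_payload(128, 64)` twice yields two bodies that differ only in the nonce at the start of `prompt`.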
|
||||
|
||||
## Task 2: Verify Inference Gates
|
||||
|
||||
- [x] **Step 1: Pre-gate passed**
|
||||
|
||||
Artifact:
|
||||
|
||||
- `/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_pre`
|
||||
|
||||
Result:
|
||||
|
||||
- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
|
||||
- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
|
||||
- `MUL_MAT_ID`: `806/806`
|
||||
|
||||
- [x] **Step 2: Post-gate passed**
|
||||
|
||||
Artifact:
|
||||
|
||||
- `/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_post`
|
||||
|
||||
Result:
|
||||
|
||||
- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
|
||||
- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
|
||||
- `MUL_MAT_ID`: `806/806`
|
||||
|
||||
## Task 3: Snapshot Result
|
||||
|
||||
- [x] **Step 1: Compare serving throughput**
|
||||
|
||||
| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
|---|------------------|-----------------|-------------------|-----------|----------|----------------|
| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |
|
||||
|
||||
- [x] **Step 2: Compare latency and prefill**
|
||||
|
||||
| n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps |
|---|---------------|--------------|------------------|--------------------|------------------|
| 8 | 783.6 | 271.8 | 2.88x | 1669.9 | 4371.5 |
| 32 | 2630.6 | 783.8 | 3.36x | 1712.8 | 5358.3 |
| 128 | 7678.7 | 2465.7 | 3.11x | 1660.4 | 5242.9 |
|
||||
|
||||
The current stack remains far from vLLM serving parity in e2e/TTFT because
|
||||
prefill is still much slower.
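
To put numbers on "much slower", the two latency ratios can be recomputed from the table: the TTFT slowdown (paged over vLLM) and the prefill speedup (vLLM over paged). The Python sketch below is a cross-check only; its names are illustrative and it is not part of the benchmark client.

```python
# Values copied from the latency/prefill table:
# n -> (paged TTFT ms, vLLM TTFT ms, paged prefill_tps, vLLM prefill_tps)
LATENCY = {
    8: (783.6, 271.8, 1669.9, 4371.5),
    32: (2630.6, 783.8, 1712.8, 5358.3),
    128: (7678.7, 2465.7, 1660.4, 5242.9),
}


def ttft_slowdown(paged_ms: float, vllm_ms: float) -> float:
    """How many times longer paged takes to reach the first token."""
    return round(paged_ms / vllm_ms, 2)


def prefill_speedup(paged_tps: float, vllm_tps: float) -> float:
    """How many times faster vLLM prefills, by prefill_tps."""
    return round(vllm_tps / paged_tps, 2)


def summarize(rows: dict) -> dict:
    """Map n -> (TTFT slowdown, prefill speedup)."""
    return {n: (ttft_slowdown(pt, vt), prefill_speedup(pp, vp))
            for n, (pt, vt, pp, vp) in rows.items()}
```

The TTFT slowdown is a little larger than the prefill_tps speedup at n=8 and n=32 and slightly smaller at n=128.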
|
||||
|
||||
## Task 4: Decision
|
||||
|
||||
- [x] **Step 1: Keep GB10 shortcut closure**
|
||||
|
||||
This snapshot confirms the Phase 19 direction:
|
||||
|
||||
- MTP and scheduling shortcuts should stay closed.
|
||||
- Current paged serving is still below vLLM on MoE serving throughput.
|
||||
- The largest user-visible gap is prefill/TTFT: vLLM is roughly 2.6-3.2x faster
  on prefill_tps and 2.9-3.4x faster on TTFT in this short serving snapshot.
|
||||
- The next credible parity path is not another small GB10 server shortcut; it
|
||||
is either a new-silicon rerun on datacenter Blackwell or a larger fused
|
||||
kernel project outside the low-conflict patch stack.
|
||||
|
||||
## Self-Review
|
||||
|
||||
- No source behavior changed.
|
||||
- Pre/post inference gates passed.
|
||||
- The result uses the current clean mirror, not the stale `llama-paged-dev`
|
||||
benchmark tree.
|
||||