mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): refresh current serving snapshot
Record the Phase 20 same-session MoE paged-vs-vLLM serving snapshot on the current clean DGX mirror, including pre/post inference gates and the resulting GB10 parity decision.

Assisted-by: Codex:gpt-5
@@ -564,6 +564,14 @@ backend-split + gallery plan is in

## 9. vLLM parity - final state (CLOSED)

> 2026-07-01 follow-up: the investigation was reopened for MTP safety,
> MTP-serving, graph-shape tracing, and a current-stack serving snapshot. Phases
> 14-20 are recorded in
> [`docs/GB10_PARITY_PHASE0_RESULTS.md`](docs/GB10_PARITY_PHASE0_RESULTS.md) and
> [`docs/PARITY_HANDOFF.md`](docs/PARITY_HANDOFF.md). They did not change the
> GB10 conclusion: MTP/scheduler shortcuts are rejected, and the latest clean
> stack remains below vLLM serving parity.

The multi-week GB10 (DGX Spark, sm_121) vLLM-parity investigation is **closed**.
The standing, never-re-litigate record - full benchmark, every lever and verdict,
the structural floors, the parity verdict - is
@@ -596,3 +604,13 @@ the structural floors, the parity verdict - is
the vLLM advantages that lose on GB10** (FLA blocked-solve GDN, Marlin/CUTLASS
grouped FP4, HBM-tuned full-cudagraph decode). Re-run the methodology on new
silicon; do not reopen the GB10 levers.

Latest current-stack MoE serving snapshot (`PTOK=128`, `GEN=64`, current clean
DGX mirror `f2521ab12`, artifact
`/home/mudler/bench/phase20_current_snapshot/20260701_050621`):

| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
|---|------------------|-----------------|-------------------|-----------|----------|----------------|
| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |

@@ -1359,3 +1359,49 @@ Decision:
  `K + 1` verification-row expansion, not mixed draft lengths.
- Any future MTP parity work needs a deeper target-verify graph/state design,
  not a small server scheduling shortcut.

## Phase 20 Current-Stack Serving Snapshot

Phase 20 refreshed the MoE paged-vs-vLLM serving baseline on the current clean
DGX mirror after the MTP investigation.

Artifact:

- `/home/mudler/bench/phase20_current_snapshot/20260701_050621`

Current source:

- `/home/mudler/llama-phase6-source`
- `f2521ab12 feat(server): trace speculative batch shapes`

Pre/post gate result:

- Pre-gate and post-gate both passed.
- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Full `MUL_MAT_ID`: `806/806` on CUDA0.

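The md5 gate above amounts to hashing each transcript and comparing it with the canonical digest. A minimal sketch of that comparison follows, assuming the transcripts are plain files on disk. The function names and the `transcripts` mapping are hypothetical and are not taken from the benchmark harness; only the two digests come from this document.

```python
import hashlib
from pathlib import Path

# Canonical digests as recorded in this document.
CANONICAL_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gate_passes(transcripts: dict[str, Path]) -> bool:
    """True only if every named transcript matches its canonical digest.

    `transcripts` maps a model kind ("moe" or "dense") to a transcript
    file; this layout is an assumption made for illustration.
    """
    return all(
        md5_of_file(path) == CANONICAL_MD5[kind]
        for kind, path in transcripts.items()
    )
```

Running the same check before and after the benchmark, as the pre/post gates do, guards against the serving run having changed inference output.
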
Serving snapshot:

| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
|---|------------------|-----------------|-------------------|-----------|----------|----------------|
| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |

Latency/prefill snapshot:

| n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps |
|---|---------------|--------------|------------------|--------------------|------------------|
| 8 | 783.6 | 271.8 | 2.88x | 1669.9 | 4371.5 |
| 32 | 2630.6 | 783.8 | 3.36x | 1712.8 | 5358.3 |
| 128 | 7678.7 | 2465.7 | 3.11x | 1660.4 | 5242.9 |

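The ratio columns in the two tables above follow directly from the raw columns. As a sanity check, the short script below recomputes them from the values recorded here. The constant and function names are illustrative only and do not come from the benchmark tooling.

```python
# (n, paged decode_agg, vLLM decode_agg, paged agg, vLLM agg)
THROUGHPUT = [
    (8, 220.8, 290.5, 164.8, 245.5),
    (32, 411.1, 594.7, 252.1, 456.0),
    (128, 670.0, 1022.7, 322.4, 662.4),
]

# (n, paged TTFT ms, vLLM TTFT ms)
TTFT_MS = [
    (8, 783.6, 271.8),
    (32, 2630.6, 783.8),
    (128, 7678.7, 2465.7),
]


def percent(paged: float, vllm: float) -> str:
    """Paged throughput as a percentage of vLLM throughput."""
    return f"{100.0 * paged / vllm:.1f}%"


def slowdown(paged_ms: float, vllm_ms: float) -> str:
    """Paged latency as a multiple of vLLM latency."""
    return f"{paged_ms / vllm_ms:.2f}x"


if __name__ == "__main__":
    for n, p_dec, v_dec, p_agg, v_agg in THROUGHPUT:
        print(f"n={n}: decode {percent(p_dec, v_dec)}, agg {percent(p_agg, v_agg)}")
    for n, p_ttft, v_ttft in TTFT_MS:
        print(f"n={n}: TTFT {slowdown(p_ttft, v_ttft)}")
```

The recomputed values match the `paged/vLLM decode`, `paged/vLLM agg`, and `paged/vLLM TTFT` columns as recorded.
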
Decision:

- The latest clean stack is still not at vLLM serving parity on GB10.
- The user-visible gap is dominated by prefill/TTFT and end-to-end serving
  throughput, not by any still-open MTP or scheduler shortcut.
- Keep MTP scheduler work closed. The next credible parity path is either a
  datacenter-Blackwell rerun or a larger fused-kernel project outside the
  low-conflict GB10 patch stack.

@@ -283,6 +283,27 @@ the real `K + 1` verification-row expansion. Do not build a Phase 20 scheduler
experiment on this evidence. Future MTP work would need a deeper target-verify
graph/state design, not another small server scheduling shortcut.

Phase 20 refreshed the current-stack MoE serving snapshot against vLLM using the
clean `~/llama-phase6-source` mirror (`f2521ab12`) rather than the stale
`llama-paged-dev` benchmark tree. Artifact:
`/home/mudler/bench/phase20_current_snapshot/20260701_050621`. Pre/post gates
passed with canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.

Current MoE serving snapshot (`PTOK=128`, `GEN=64`):

| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
|---|------------------|-----------------|-------------------|-----------|----------|----------------|
| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |

TTFT remains the clearest user-visible gap: paged TTFT is 2.88x/3.36x/3.11x
that of vLLM at n8/n32/n128, and paged prefill_tps is roughly one-third of vLLM
(about 38%/32%/32%). This keeps the GB10 shortcut closure intact: do not reopen
MTP or small scheduler work. The credible next parity path is a
datacenter-Blackwell rerun or a larger fused-kernel project outside this
low-conflict patch stack.

---

## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)

@@ -616,6 +616,34 @@ Decision: do not build a Phase 20 group/defer scheduler on current evidence.
Future MTP work would need a deeper target-verify graph/state design, not
another small server scheduling shortcut.

### Phase 20 current-stack serving snapshot

Phase 20 refreshed the MoE serving baseline using the current clean DGX mirror
(`~/llama-phase6-source`, `f2521ab12`) and the same-session vLLM server. Artifact:
`/home/mudler/bench/phase20_current_snapshot/20260701_050621`.

Pre/post canonical gates passed: MoE `8cb0ce23777bf55f92f63d0292c756b0`,
dense `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.

| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
|---|------------------|-----------------|-------------------|-----------|----------|----------------|
| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |

TTFT/prefill remains the largest user-visible gap:

| n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps |
|---|---------------|--------------|------------------|--------------------|------------------|
| 8 | 783.6 | 271.8 | 2.88x | 1669.9 | 4371.5 |
| 32 | 2630.6 | 783.8 | 3.36x | 1712.8 | 5358.3 |
| 128 | 7678.7 | 2465.7 | 3.11x | 1660.4 | 5242.9 |

Decision: the latest stack is still below vLLM serving parity on GB10. The next
credible parity path is not another MTP/scheduler shortcut; it is either the
documented datacenter-Blackwell rerun or a larger fused-kernel project outside
the low-conflict GB10 patch stack.

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update
