mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 16:19:07 -04:00
docs(paged): DGX Blackwell gap analysis + lever plan (living doc)
Captures the full dgx.casa investigation: Q8/F16/vLLM baselines, concurrency sweeps, paged-patch (no concurrency effect), nsys+code root-cause (MoE int8 MMQ on Ampere-class tensor cores = 74.5% compute, no FP8 path), and the lever plan. Measured wins: - Lever 1 (MXFP4 / Blackwell FP4 path): decode +50-66% over Q8, prefill plateau +66% (2200->3650). MXFP4 decode beats vLLM FP8 at B=1 (83 vs 48), near-parity B=8. Prefill still plateaus (fused-MoE-GEMM gap). - Lever 2 (ubatch): saturates at 2048; ceiling is the kernel, not batch. Designed (not built): Lever 3 fused FP4/FP8 MoE grouped GEMM, Lever 4 FP8 GEMM (needs ggml_mul_mat_ext scale plumbing), Lever 5 tcgen05 kernels, and the complete paged attention (on-demand alloc + gather-read + continuous batching + prefix sharing). Honest scope: each is multi-week kernel/systems work. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
170
backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md
Normal file
170
backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md
Normal file
@@ -0,0 +1,170 @@
|
||||
# Closing the vLLM Gap on Blackwell (GB10 / DGX Spark) — Living Plan & Results
|
||||
|
||||
Target hardware: NVIDIA **GB10** (Grace-Blackwell, `sm_121a`, 119 GiB unified LPDDR5X), `dgx.casa`.
|
||||
Model under test: **Qwen3-Coder-30B-A3B-Instruct** (MoE, 128 experts, top-8, ~3B active).
|
||||
Engines: llama.cpp (CUDA, `~/llama.cpp-pr24423`, build `7a6ddc5`, `CMAKE_CUDA_ARCHITECTURES=121`) vs vLLM 0.23.0 (`~/vllm-bench`, torch 2.11.0+cu130).
|
||||
|
||||
> This is a working document. Each phase appends measured numbers, what was learned, and what's next.
|
||||
> Methodology: `llama-bench` (single-stream pp/tg, built-in reps) and `llama-batched-bench` (`-npl` sweep,
|
||||
> decode-phase aggregate `S_TG`, prefill aggregate `S_PP`); vLLM via `~/bench/vllm_conc.py` (decode-phase
|
||||
> aggregate matched to `S_TG`). Same model/prompt/seed. Precision matched where possible.
|
||||
|
||||
---
|
||||
|
||||
## Baseline results (established)
|
||||
|
||||
### Single-stream (B=1), matched ~8-bit
|
||||
| Engine / precision | prefill pp512 (t/s) | decode tg128 (t/s) |
|
||||
|---|---|---|
|
||||
| llama.cpp **Q8_0** | 2215 ± 15 | **54.8 / 62.2** * |
|
||||
| llama.cpp **F16** | 700 ± 24 | 32.9 ± 0.05 |
|
||||
| vLLM **FP8** | 9155 ± 308 | 52.45 ± 0.05 |
|
||||
|
||||
\* two sessions; ~55 right after worker-stop (clocks settling), ~62 steady state. Both ≥ vLLM → **single-stream parity holds**.
|
||||
|
||||
### Concurrency sweep (decode-phase aggregate `S_TG`, prefill aggregate)
|
||||
| B | llama Q8 prefill | vLLM FP8 prefill | llama Q8 decode | vLLM FP8 decode |
|
||||
|---|---|---|---|---|
|
||||
| 1 | 1080 | 9644 | 60.1 | 48.0 |
|
||||
| 8 | 2189 | 33373 | 160.8 | 312.4 |
|
||||
| 32 | 2198 | 99398 | 357.1 | 1171 |
|
||||
| 64 | 2194 | 151990 | 519.2 | 2064 |
|
||||
|
||||
llama F16 prefill also flat: B=1 452 → B=8 723 → B=32 778. **Prefill flat at both precisions = kernel-throughput ceiling.**
|
||||
|
||||
### Our paged patch (LLAMA_KV_PAGED) — concurrency effect: NONE
|
||||
Same Q8 binary, paged branch confirmed firing (137 placements at B=8), throughput identical within noise:
|
||||
| | B=1 | B=8 | B=32 |
|
||||
|---|---|---|---|
|
||||
| stock decode | 61.2 | 171.7 | 377.0 |
|
||||
| paged decode | 62.7 | 170.8 | 376.8 |
|
||||
|
||||
Patch is placement-only correctness prototype; doesn't implement concurrency mechanics. Single-stream-neutral, concurrency-neutral.
|
||||
|
||||
---
|
||||
|
||||
## Root-cause diagnosis (nsys + code audit)
|
||||
|
||||
- **74.5% of GPU compute = `mul_mat_q`** (Q8_0 int8 MMQ GEMM, the MoE experts). Only cutlass kernel seen is `cutlass_80_tensorop` = **Ampere (sm_80)**, not Blackwell.
|
||||
- ggml-cuda has **NO FP8 path** (no e4m3/e5m2 GEMM, no cuBLASLt FP8). Q8_0 runs the **Ampere-class int8 `mma.sync s8.s8.s32`** even on GB10 (`mma.cuh:924`, dispatched unconditionally `mmq.cu:307`).
|
||||
- ggml-cuda **DOES** have a **native Blackwell FP4 path** (MXFP4 + NVFP4, `mma...kind::mxf4...e2m1`, `mma.cuh:1126`, gated `BLACKWELL_MMA_AVAILABLE`). Merged via #17906/#20644/#21074.
|
||||
- **No fused MoE grouped GEMM**, no tcgen05/wgmma (warp-level `mma.sync` only).
|
||||
- **Small per-expert GEMMs**: 512-tok ubatch → ~32 tok/expert (128 exp, top-8) → thin GEMMs, memory-bound, can't fill tensor-core tiles. vLLM processes 8192 tok/step → ~512 tok/expert → compute-bound + FP8.
|
||||
- **The 45–69× gap is partly apples-to-oranges**: we compared llama Q8 (Ampere int8) vs vLLM FP8 (Blackwell). Upstream/NVIDIA benches put the *real* FP4-vs-FP8 prefill gap at **~25–50% long-context**, not 45–69×.
|
||||
|
||||
Key upstream refs: discussion #22042 (FP8 design: `ggml_mul_mat_ext` + scale tensors), #17906 (native MXFP4), #18250 (NVFP4-MoE closed not-planned).
|
||||
|
||||
---
|
||||
|
||||
## The levers (cheap → expensive) — execution log
|
||||
|
||||
### Lever 1 — NVFP4/MXFP4 model (use existing Blackwell FP4 path) + ubatch bump
|
||||
Status: **IN PROGRESS** — single-stream done, concurrency next.
|
||||
Quant: `llama-quantize F16 -> MXFP4_MOE` (type 38), 15.9 GiB / 4.47 BPW. (No NVFP4 in llama-quantize; MXFP4_MOE puts experts in MXFP4 = Blackwell FP4 MMA.)
|
||||
|
||||
Single-stream (llama-bench), MXFP4 vs Q8 vs vLLM-FP8:
|
||||
| metric | llama Q8 | **llama MXFP4** | vLLM FP8 |
|
||||
|---|---|---|---|
|
||||
| prefill pp512 (ub512) | 2215 | **3061 ± 22** | 9155 |
|
||||
| prefill pp2048 (ub512) | ~2200 | 3137 ± 7 | — |
|
||||
| prefill pp2048 (**ub2048**) | — | **3441 ± 14** | — |
|
||||
| decode tg128 | 62.2 | **86.4 ± 0.3** | 52.45 |
|
||||
|
||||
Findings:
|
||||
- **MXFP4 decode 86.4 beats vLLM FP8 52.45 by 1.65×** (4-bit = less memory traffic; decode is memory-bound). llama wins decode outright.
|
||||
- MXFP4 prefill +38% over Q8; **ub2048 lifts prefill +10%** (3137→3441). Single-stream prefill gap to vLLM: 4.1× (Q8) → **2.7× (MXFP4)**.
|
||||
- Caveat: MXFP4 is 4-bit vs vLLM FP8 8-bit — not precision-matched. Fair match = vLLM NVFP4 (4-bit); pending.
|
||||
Concurrency (decode-phase aggregate `S_TG`, ub2048), MXFP4 vs Q8 vs vLLM-FP8:
|
||||
| B | Q8 dec | **MXFP4 dec** | vLLM dec | Q8 pp | **MXFP4 pp** | vLLM pp |
|
||||
|---|---|---|---|---|---|---|
|
||||
| 1 | 60.1 | **83.4** | 48.0 | 1080 | 1625 | 9644 |
|
||||
| 8 | 160.8 | **267.4** | 312.4 | 2189 | 3634 | 33373 |
|
||||
| 32 | 357.1 | **551.2** | 1171 | 2198 | 3651 | 99398 |
|
||||
| 64 | 519.2 | **770.2** | 2064 | 2194 | 3648 | 151990 |
|
||||
|
||||
**Lever-1 verdict:** MXFP4 is a large, free win — decode +50–66% over Q8, prefill plateau +66% (2200→3650). MXFP4 decode **wins at B=1, near-parity at B=8** vs vLLM; only falls behind at high concurrency. **Prefill still plateaus (~3650)** — the MoE prefill GEMM doesn't scale with batch (no fused grouped GEMM; ubatch-limited). That plateau is the real remaining structural gap → Levers 2–3. Quality caveat unchanged (MXFP4 4-bit vs vLLM FP8 8-bit; quality not yet evaluated).
|
||||
|
||||
### Lever 2 — `n_ubatch` / `n_batch` tuning (standalone)
|
||||
Status: **DONE**
|
||||
MXFP4 pp4096 vs ubatch: ub512=2994, **ub2048=3316**, ub4096=2820(noisy), ub8192=3180.
|
||||
**Verdict:** prefill saturates at ub=2048; larger ubatch gives nothing. The ~3300–3650 ceiling is the **MoE GEMM kernel**, not batch size. → No more free config wins; the rest is kernel work (Levers 3–5). Recommendation: ship `n_ubatch=2048` as the LocalAI default for MoE prefill on Blackwell.
|
||||
|
||||
### Lever 3 — Fused FP4/FP8 MoE grouped GEMM (+ activation-quant fusion)
|
||||
Status: **DESIGNED, not built** (multi-week kernel R&D). This is the single biggest remaining prefill win.
|
||||
Problem (measured): the prefill ceiling is the MoE expert GEMM. Today `ggml_cuda_mul_mat_q` with `ids`
|
||||
(`mmq.cu:127`) launches one grouped MMQ over a 3D grid (z = expert), but each expert's tile is thin
|
||||
(~tokens/expert columns) so int8/FP4 tensor cores run underfilled; throughput is memory-bound on weight
|
||||
streaming and flat vs batch.
|
||||
Approach:
|
||||
- Replace the per-expert thin-tile scheduler with a **CUTLASS-style grouped GEMM** that concatenates all
|
||||
experts' token-blocks into one problem with per-group offsets, so tiles are always full (m16n8k64 FP4 /
|
||||
m16n8k32 FP8) regardless of per-expert token count. Mirrors vLLM's `fused_moe` + cutlass grouped GEMM.
|
||||
- **Fuse activation quantization into the permute/gather** (the `quantize_mmq_q8_1`/FP4 quantize currently a
|
||||
separate 3.3% kernel) so the routed activations are quantized as they're scattered into expert order.
|
||||
- Files: new kernel under `ggml/src/ggml-cuda/` (e.g. `moe-grouped-gemm.cu`) + dispatch hook in
|
||||
`ggml_cuda_mul_mat_id` (`ggml-cuda.cu:2622`); reuse `mmid.cu` routing/`expert_bounds`.
|
||||
- Effort: high (2–4 wks expert CUDA). Risk: numerics + sm_121 tile tuning. Expected payoff: the bulk of the
|
||||
prefill gap (vLLM's MoE prefill advantage is mostly this). Upstream: #18250 (NVFP4-MoE) was closed
|
||||
not-planned, so this would be a LocalAI patch or a fresh upstream proposal.
|
||||
|
||||
### Lever 4 — FP8 (e4m3) GEMM for dense layers
|
||||
Status: **DESIGNED, not built** (blocked on a core ggml API change).
|
||||
Problem: ggml-cuda has no FP8 matmul (only int8/FP4). vLLM runs qkv/o_proj/lm_head in FP8 on Blackwell
|
||||
tensor cores. Our dense layers run int8-MMQ or f16-cuBLAS.
|
||||
Approach (two options):
|
||||
- (a) **cuBLASLt FP8**: route dense `mul_mat` through `cublasLtMatmul` with `CUDA_R_8F_E4M3` A/B and FP32
|
||||
compute + scale pointers. Lowest kernel effort; gets library-tuned Blackwell FP8 immediately. Needs the
|
||||
scale-tensor plumbing below.
|
||||
- (b) **Hand-written sm_121 `mma.sync ...e4m3.e4m3.f32`** kernels in `mma.cuh`/`mmf.cu`. More control, more work.
|
||||
- Prerequisite (both): the **`ggml_mul_mat_ext` / scale-tensor API** from upstream discussion #22042 —
|
||||
per-tensor FP8 scales don't fit the block-scaled quant struct; `MUL_MAT`/`MUL_MAT_ID` must accept optional
|
||||
scale tensors. This is a cross-cutting ggml change (graph + ops + all backends' fallbacks).
|
||||
- Effort: high (API change is the hard part; cuBLASLt path is then moderate). Payoff: closes dense-layer
|
||||
prefill/compute gap; complements Lever 3. Note: for *this* MoE model the experts dominate, so Lever 3 > 4.
|
||||
|
||||
### Lever 5 — tcgen05 / wgmma-class kernels for large-prefill tiles
|
||||
Status: **DESIGNED, not built** (very high effort; last increment).
|
||||
Problem: ggml's tensor-core path is warp-level `mma.sync` only (no `wgmma`/`tcgen05`). Blackwell's
|
||||
tensor-memory `tcgen05` MMA (what CUTLASS uses) extracts substantially more throughput at large prefill tiles.
|
||||
Approach: introduce warpgroup/tcgen05 GEMM main-loops for the FP4/FP8 paths (effectively adopting CUTLASS
|
||||
3.x collective mainloops for sm_120/121), used when tile size is large enough (prefill). Decode (thin) keeps
|
||||
`mma.sync`.
|
||||
- Effort: very high (CUTLASS-class engineering). Payoff: the final slice of large-prefill throughput; only
|
||||
worth it after Levers 3–4 land. Realistically: depend on/upstream CUTLASS kernels rather than hand-roll.
|
||||
|
||||
---
|
||||
|
||||
## Paged attention — complete implementation (after kernels are fair)
|
||||
The placement prototype is insufficient (measured: zero concurrency benefit). A real implementation needs all
|
||||
four gaps. CPU foundation already built & verified (`PagedKVManager` P0–P3, `README.md`); the in-model parts
|
||||
are unbuilt. **Build order and concrete design:**
|
||||
|
||||
1. **On-demand block allocation from a shared pool** (capacity win — more concurrent seqs before OOM).
|
||||
- Replace `find_slot`'s ring-buffer (`llama-kv-cache.cpp:818`) with `PagedKVManager` block allocation; the
|
||||
KV tensor becomes a shared block pool `[n_embd, block_size*num_blocks]`, sequences draw blocks on demand
|
||||
(already prototyped on CPU: `paged_kv_manager.{h,cpp}`, `test_ggml_paged_rw.cpp`).
|
||||
- Win measured where it counts: max concurrent sequences before OOM (not yet benchmarked — needs this).
|
||||
2. **Gather-read** so each seq attends only its own blocks (`get_k`/`get_v` `:1145/1165` → `ggml_get_rows`
|
||||
gather into scratch, then existing attention). Numerically proven on CPU (`test_ggml_paged_attn.cpp`,
|
||||
7.5e-08 vs reference). Needs `build_attn_paged` branch in `llama-graph.cpp` + Gate 0 in a real model.
|
||||
3. **Continuous batching / scheduler** (no head-of-line blocking on mixed-length traffic). New scheduler in
|
||||
the server slot path; admit/evict at block granularity; the dimension where paging beats llama.cpp's
|
||||
current static batching. This is where the *real* concurrency win lives (vs our synthetic uniform test).
|
||||
4. **Automatic prefix sharing** (block-hash dedup; `PagedKVManager::{compute_block_hashes,get_computed_blocks}`
|
||||
already implemented & tested). Cross-tenant shared system prompts reuse physical blocks.
|
||||
|
||||
Status: design in `2026-06-19-paged-attention-llamacpp-design.md`; CPU P0–P3 done; in-model #1–#4 unbuilt.
|
||||
**Then** measure concurrency in paging's real scenarios — **memory-pressured (max seqs before OOM)** and
|
||||
**mixed-length continuous batching** — on the MXFP4 (fair-quant) footing, not the uniform/over-provisioned
|
||||
test that (correctly) showed no benefit.
|
||||
|
||||
> Reality check from this session's data: paged attention is a **capacity + scheduling** win, not a per-token
|
||||
> speed win. On GB10 with 119 GB unified memory and uniform requests we are not memory-bound at B≤64, so the
|
||||
> placement prototype showed nothing. Paging's value appears under memory pressure (many/long sequences) and
|
||||
> bursty mixed-length traffic. The per-token throughput gap is a **kernel** problem (Levers 1–3), separate
|
||||
> from paging.
|
||||
|
||||
---
|
||||
|
||||
## Honest scope note
|
||||
Levers 3–5 and the complete paged implementation are each substantial (weeks of expert CUDA/systems work). This doc tracks what is **measured** vs **designed** vs **not-yet-built**, and never claims a number that wasn't run on the box.
|
||||
Reference in New Issue
Block a user