mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 16:19:07 -04:00
docs(paged): Lever-3 + paged-attention implementation plans + upstream ggml issue draft
Plan A (Lever 3): phased path to FP4 MoE GEMM parity — cheap tweaks, act-quant fusion, then the real lever (tcgen05/CUTLASS grouped GEMM), full-model FP4. Plan B (paged attention): on-demand pool, gather-read + Gate 0, continuous batching, prefix sharing; benchmark in memory-pressured/mixed-length regimes. Upstream issue draft: GB10 numbers, nsys profile, ruled-out config knobs, tcgen05 proposal. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
@@ -194,5 +194,46 @@ test that (correctly) showed no benefit.
|
||||
|
||||
---
|
||||
|
||||
## Implementation plan A — Lever 3: FP4 MoE GEMM to vLLM parity
|
||||
|
||||
Goal: lift batched MoE prefill from ~3.65k t/s (B=32) toward vLLM's ~99k. Root cause (profiled):
|
||||
`mul_mat_q<MXFP4>` runs at ~22 effective TFLOP/s — warp-level `mma.sync`, not Blackwell tcgen05.
|
||||
Cheap knobs are exhausted (ubatch saturates at 2048; `GGML_CUDA_FORCE_CUBLAS` is a no-op 3419↔3423;
|
||||
tile width already full at mmq_x=128). So parity needs kernel work, done iteratively on the DGX
|
||||
(`~/llama.cpp-pr24423`, editable + rebuildable; diffs captured as `patches/`).
|
||||
|
||||
Phases (each: hypothesis → edit `ggml/src/ggml-cuda/` → `cmake --build build --target llama-bench` →
|
||||
`llama-bench` MXFP4 pp/concurrency → record):
|
||||
1. **Cheap kernel tweaks (low confidence, fast).** nwarps (occupancy), `mmq_y` tile, stream-k on/off,
|
||||
FP4 load-tile path. Measure each. Likely small (<1.3x) — these don't change the warp-MMA ceiling.
|
||||
2. **Fuse activation quant** (`quantize_mmq_mxfp4`, 8%) into the permute/gather. Removes a kernel +
|
||||
a global round-trip. Tractable, ~1.1x.
|
||||
3. **The real lever — tcgen05 / CUTLASS FP4 grouped GEMM.** Replace the per-expert MMQ scheduler with a
|
||||
CUTLASS 3.x collective-mainloop grouped GEMM (sm_120a, `e2m1` block-scaled, tcgen05 tensor-memory MMA),
|
||||
one problem over all experts with per-group offsets, fused act-quant. This is what vLLM/FlashInfer use.
|
||||
Multi-week; the honest path to parity. Prefer **upstream ggml** (issue drafted) over a private patch.
|
||||
4. **Full-model low precision.** Quantize dense layers (qkv/o_proj/lm_head, the 10% Q8) to FP4/FP8 too so
|
||||
the whole prefill runs on FP4 tensor cores, not int8-MMQ.
|
||||
Exit per phase: measured t/s recorded here; stop a phase when it's a dead end (recorded as such).
|
||||
Matching vLLM realistically requires phase 3; phases 1–2 are the warm-up + de-risking.
|
||||
|
||||
## Implementation plan B — Complete paged attention (the pivot)
|
||||
|
||||
CPU foundation done (P0–P3, `README.md`): vLLM-parity block manager + ggml write/gather + attention
|
||||
numerics + placement Gate 0 (token-identical in-model). Remaining = make it deliver the multi-tenant wins.
|
||||
Phases:
|
||||
1. **On-demand shared-block pool** — replace `find_slot` ring buffer (`llama-kv-cache.cpp:818`) with
|
||||
`PagedKVManager` block allocation; KV tensor = `[n_embd, block_size*num_blocks]` shared pool. Win:
|
||||
fit more concurrent seqs before OOM. Test: max concurrent seqs at fixed budget vs contiguous.
|
||||
2. **Gather-read** (`get_k/get_v` `:1145/1165` → `ggml_get_rows` into scratch) + `build_attn_paged` branch
|
||||
in `llama-graph.cpp`. Numerically proven on CPU (7.5e-08). Gate 0: token-identical multi-seq.
|
||||
3. **Continuous batching / scheduler** — admit/evict at block granularity in the server slot path. The
|
||||
real concurrency win on mixed-length traffic (where the placement prototype showed nothing).
|
||||
4. **Automatic prefix sharing** — block-hash dedup (`PagedKVManager::{compute_block_hashes,get_computed_blocks}`
|
||||
already implemented + tested). Cross-tenant shared system prompts reuse physical blocks.
|
||||
Then benchmark in paging's real regimes — **memory-pressured** + **mixed-length continuous batching** — on
|
||||
the MXFP4 (fair-quant) footing. Note: GB10's 119 GB unified memory means win-1 needs genuine pressure
|
||||
(long/many seqs) to show; the win is capacity + scheduling, not per-token speed.
|
||||
|
||||
## Honest scope note
|
||||
Levers 3–5 and the complete paged implementation are each substantial (weeks of expert CUDA/systems work). This doc tracks what is **measured** vs **designed** vs **not-yet-built**, and never claims a number that wasn't run on the box.
|
||||
|
||||
78
backend/cpp/llama-cpp/paged/UPSTREAM_GGML_ISSUE.md
Normal file
78
backend/cpp/llama-cpp/paged/UPSTREAM_GGML_ISSUE.md
Normal file
@@ -0,0 +1,78 @@
|
||||
# Upstream ggml issue draft: MXFP4 MoE prefill underutilizes Blackwell (GB10) — ~22 TFLOP/s, ~27× behind vLLM
|
||||
|
||||
**Title:** CUDA: MXFP4 MoE prefill runs the Ampere-class warp `mma.sync`, far below Blackwell FP4 peak (GB10 / sm_121)
|
||||
|
||||
## Summary
|
||||
|
||||
On a GB10 (DGX Spark, sm_121), MXFP4 MoE prefill for Qwen3-Coder-30B-A3B is bottlenecked by
|
||||
`mul_mat_q<MXFP4>` (the per-expert grouped MMQ), which runs at only **~22 effective TFLOP/s** — a small
|
||||
fraction of the GPU's FP4 capability. Batched prefill plateaus at ~3.65k tok/s (B=32) vs vLLM FP8 ~99k
|
||||
on the same box (~27×). The native FP4 block-scaled `mma.sync` path (PR #17906 et al.) *is* engaged — the
|
||||
limit is that it's a warp-level MMA kernel, not a tcgen05/CUTLASS-class grouped GEMM.
|
||||
|
||||
## Hardware / build
|
||||
|
||||
- NVIDIA GB10, compute capability 12.1, 119 GiB unified LPDDR5X.
|
||||
- llama.cpp built `-DCMAKE_CUDA_ARCHITECTURES=121` (sm_121a/compute_121a confirmed in cubins).
|
||||
- Model: Qwen3-Coder-30B-A3B-Instruct, `MXFP4_MOE` (15.9 GiB, 4.47 BPW).
|
||||
|
||||
## Measurements
|
||||
|
||||
Single-stream (`llama-bench`, ub2048):
|
||||
|
||||
| metric | Q8_0 | MXFP4 | vLLM FP8 |
|
||||
|---|---|---|---|
|
||||
| prefill pp2048 | ~2200 | 3441 | — |
|
||||
| decode tg128 | 62 | 86 | 52 |
|
||||
|
||||
Batched (decode-phase aggregate `S_TG`; prefill aggregate `S_PP`):
|
||||
|
||||
| B | llama MXFP4 prefill | vLLM FP8 prefill | llama MXFP4 decode | vLLM FP8 decode |
|
||||
|---|---|---|---|---|
|
||||
| 1 | 1625 | 9644 | 83 | 48 |
|
||||
| 8 | 3634 | 33373 | 267 | 312 |
|
||||
| 32 | 3651 | 99398 | 551 | 1171 |
|
||||
| 64 | 3648 | 151990 | 770 | 2064 |
|
||||
|
||||
Decode is competitive (we win at B=1). **Prefill plateaus and is the gap.**
|
||||
|
||||
## Profiling (nsys, MXFP4 pp2048 kernel time)
|
||||
|
||||
| kernel | % |
|
||||
|---|---|
|
||||
| `mul_mat_q<(ggml_type)39>` (MXFP4 MoE GEMM) | **37.2** |
|
||||
| `mul_mat_q<(ggml_type)8>` (dense/attn, still Q8) | 10.1 |
|
||||
| `flash_attn_ext_f16` | 8.8 |
|
||||
| `quantize_mmq_mxfp4` (activation quant) | 8.0 |
|
||||
|
||||
Only cutlass kernel present is `cutlass_80_tensorop` (Ampere). No tcgen05 / wgmma anywhere.
|
||||
|
||||
## What we ruled out (so it's the kernel, not config)
|
||||
|
||||
- **ubatch**: saturates at 2048 (pp4096: ub512 2994 → ub2048 3316 → ub8192 3180).
|
||||
- **tile width**: `mmq_x` already selects the full 128-wide tile at ub2048 (~128 tokens/expert).
|
||||
- **cuBLAS fallback**: `GGML_CUDA_FORCE_CUBLAS` is a no-op (3419 ↔ 3423 t/s) — dequant→cuBLAS-FP16 neither
|
||||
helps nor hurts, i.e. the FP4 MMQ kernel isn't worse than FP16 cuBLAS, both hit a common ceiling.
|
||||
- prefill does **not** scale with bigger single prompts (attention O(N²) confounds): pp2048 3295, pp8192
|
||||
1524, pp16384 2051 — so it's the many-sequence batched MoE GEMM, not batch size.
|
||||
|
||||
## Proposal
|
||||
|
||||
A tcgen05 / CUTLASS-3.x grouped-GEMM path for FP4 (MXFP4 + NVFP4) MoE on sm_120/121:
|
||||
- One grouped GEMM over all experts with per-group token offsets (full tiles regardless of tokens/expert),
|
||||
vs today's per-expert MMQ scheduler.
|
||||
- Block-scaled `e2m1` operands via tcgen05 tensor-memory MMA (`mma.sync.aligned.kind::mxf4…` is the
|
||||
warp-level form; the collective-mainloop/tcgen05 form is what extracts Blackwell throughput at prefill
|
||||
tile sizes).
|
||||
- Fuse activation quantization (`quantize_mmq_mxfp4`, ~8%) into the permute/gather.
|
||||
- Optionally extend to dense layers (qkv/o_proj/lm_head) so full-model prefill is FP4/FP8.
|
||||
|
||||
This mirrors what vLLM/FlashInfer/TensorRT-LLM do for Blackwell MoE. Happy to test iterations on the GB10.
|
||||
|
||||
## Repro
|
||||
|
||||
```sh
|
||||
llama-quantize qwen3coder-f16.gguf qwen3coder-mxfp4.gguf MXFP4_MOE
|
||||
llama-bench -m qwen3coder-mxfp4.gguf -ngl 99 -p 2048 -n 0 -ub 2048
|
||||
llama-batched-bench -m qwen3coder-mxfp4.gguf -ngl 99 -c 45056 -b 2048 -ub 2048 -npp 512 -ntg 128 -npl 1,8,32,64
|
||||
```
|
||||
Reference in New Issue
Block a user