mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 08:08:52 -04:00
docs(paged): stock GPU batch-shape determinism + vLLM shared-prefix comparison
Two closing measurements on DGX Spark (GB10, sm_121):

1. Stock GPU determinism (no paging): with LLAMA_KV_PAGED unset, stock
   llama.cpp produces a different greedy token stream when the same prompt is
   decoded in a full-prefill batch vs a split (prefix-then-suffix) batch. At
   G=24 the generated stream diverges 1/5 prompts on CPU and 2/5 on CUDA (and
   earlier on CUDA). This confirms the patch-0007 GPU byte-identity failure is
   stock floating-point batch-shape non-determinism, not a paged bug. CPU
   exhibits it too, just less often, which is why 0007's short CPU scenarios
   passed 16/16 while the CUDA run flipped.

2. vLLM vs llama.cpp+paged on a shared-prefix fan-out (K reqs share a 1024-tok
   prefix + unique 32-tok suffix, gen 64). llama.cpp+paged prefix cache gives
   7.15x (K=16) / 10.3x (K=32) prefill reduction vs its no-share baseline - the
   same cross-request prefix-skip vLLM's APC provides (97% hit rate confirmed).
   Head-to-head on cached prefill vLLM is ~5x faster (Q4_K_M vs nvfp4a16 quant,
   vLLM on FP4 emulation + eager), and wider end-to-end due to continuous
   batched decode. Competitive in kind, behind in absolute terms on this
   hardware.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
backend/cpp/llama-cpp/patches/paged/PAGED_VLLM_COMPARE.md
Normal file
@@ -0,0 +1,165 @@
# Paged-attention closing measurements: stock GPU determinism + vLLM comparison

Two closing measurements for the paged-attention series, run on a DGX Spark
(NVIDIA GB10, compute capability 12.1 / sm_121), CUDA 13. Dev tree
`~/llama-paged-dev` branch `paged`, paged engine gated by env `LLAMA_KV_PAGED`
(default-off = stock). Models: `Qwen3-0.6B-Q8_0.gguf` and
`Qwen3-32B-Q4_K_M.gguf` (llama.cpp), `Qwen3-32B` nvfp4a16 / W4A16 HF safetensors
(vLLM 0.23.0). All dev drivers are dev-tree-only and not shipped.

## Deliverable 1: stock GPU determinism across batch shapes (no paging)

Question: is the patch-0007 GPU byte-identity "failure" (a near-tie greedy token
flips on CUDA, e.g. 17971 vs 5671) caused by paging, or is it inherent stock
CUDA non-determinism from running the same tokens in a different batch shape?

Method: a new dev-only driver `llama-paged-batchshape` (paging explicitly OFF:
`unsetenv("LLAMA_KV_PAGED")`). For a prompt `[P+S]` it greedy-decodes two ways,
both stock contiguous KV:

- (a) `full` - prefill the whole `[P+S]` in ONE `llama_decode`.
- (b) `split` - prefill `P` in one `llama_decode`, then `S` in a second.

The two paths submit byte-for-byte identical token ids; the only difference is
the batch shape submitted to the kernels (full prefill vs P-then-S), which
changes the float reduction order in the GEMMs and therefore the KV values by
tiny amounts. 5 distinct prompts, suffix S=16.
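
The reduction-order effect can be illustrated without a GPU. The following is a minimal sketch in plain Python, not the driver's code: `chunked_sum` is a hypothetical helper that sums values in fixed-size chunks and then sums the partials, a loose analogy for how a different batch shape changes the accumulation order inside a GEMM. The input values are contrived so that the rounding difference is visible.

```python
def chunked_sum(values, chunk):
    """Sum `values` chunk by chunk, then sum the partial results.

    Hypothetical helper: a toy stand-in for a kernel whose accumulation
    order depends on how the work is tiled.
    """
    partials = []
    for start in range(0, len(values), chunk):
        acc = 0.0
        for v in values[start:start + chunk]:
            acc += v
        partials.append(acc)
    total = 0.0
    for p in partials:
        total += p
    return total


# Contrived inputs (an assumption for illustration, not measured data).
values = [1e16, 1.0, -1e16, 1.0]

full = chunked_sum(values, chunk=len(values))  # one pass over everything
split = chunked_sum(values, chunk=2)           # two partial sums, then combined

print("full :", full)
print("split:", split)
print("same inputs, different result:", full != split)
```

The inputs are identical in both calls; only the grouping differs, which is the same situation as `full` vs `split` prefill, at a much larger magnitude than real kernels show.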

### Single next token (the literal T_full vs T_split)

Both CPU and CUDA returned the SAME greedy next token for all 5 prompts
(0/5 flips). BUT the top-2 logit gap measurably changes with the batch shape on
CUDA, proving the float order does differ:

```
CUDA, S=8: prompt 1 T_full=1896 (gap 0.07072) T_split=1896 (gap 0.17986)
CUDA, S=8: prompt 4 T_full=49584 (gap 0.93304) T_split=49584 (gap 0.85785)
```

The argmax simply did not flip on the immediate next token for these prompts -
the gaps, while shifting, stayed wide enough.
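
For readers who want to reproduce the "gap" column, here is a small sketch of how a greedy token and its top-2 logit gap can be computed from a logits vector. `greedy_token_and_gap` is a hypothetical helper and the logits are made-up values for a 4-token vocabulary, not measurements from the driver.

```python
def greedy_token_and_gap(logits):
    """Return (argmax index, top-1 logit minus top-2 logit)."""
    if len(logits) < 2:
        raise ValueError("need at least two logits")
    # Indices sorted by logit, largest first.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    best, runner_up = order[0], order[1]
    return best, logits[best] - logits[runner_up]


# Made-up logits: the same argmax, but a different margin per batch shape.
logits_full = [0.10, 2.50, 2.43, -1.00]
logits_split = [0.10, 2.50, 2.32, -1.00]

tok_full, gap_full = greedy_token_and_gap(logits_full)
tok_split, gap_split = greedy_token_and_gap(logits_split)

print(f"T_full={tok_full} (gap {gap_full:.5f})")
print(f"T_split={tok_split} (gap {gap_split:.5f})")
```

A greedy flip requires the perturbation to exceed the gap, so a shifting but still positive gap leaves the next token unchanged.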

### Generated stream (what 0007 actually byte-asserts)

0007 asserts byte-identity over a *generated* token stream, where the tiny
prefill-shape KV perturbation accumulates and eventually crosses a near-tie.
Generating G tokens greedily from `full` vs `split` and reporting the first
divergence per prompt:

| gen length | CPU diverged | CUDA diverged |
|------------|--------------|---------------|
| G=24 (0007 default) | 1/5 (prompt 0 @ step 5) | 2/5 (prompt 1 @ step 3, prompt 4 @ step 6) |
| G=64 | 2/5 (steps 5, 42) | 3/5 (steps 3, 6, 30) |

Example CUDA divergence, pure stock, zero paging:
`prompt 1: DIVERGES at gen step 3: full=1260 split=576`.
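
The first-divergence check reported above amounts to a position-by-position comparison of two token streams. A minimal sketch follows; `first_divergence` and `diverged_count` are hypothetical names, the step index is assumed to be 0-based, and all token ids other than 1260 and 576 are placeholders.

```python
def first_divergence(full, split):
    """Return the first step where two token streams differ, or None."""
    for step, (a, b) in enumerate(zip(full, split)):
        if a != b:
            return step
    if len(full) != len(split):
        # One stream is a strict prefix of the other.
        return min(len(full), len(split))
    return None


def diverged_count(pairs):
    """Count how many (full, split) stream pairs diverge at all."""
    return sum(1 for f, s in pairs if first_divergence(f, s) is not None)


# Placeholder streams; only the ids at step 3 come from the example log line.
full_stream = [11, 22, 33, 1260, 7]
split_stream = [11, 22, 33, 576, 9]

step = first_divergence(full_stream, split_stream)
print(f"DIVERGES at gen step {step}: "
      f"full={full_stream[step]} split={split_stream[step]}")
```

Once the streams differ at one step, every later token is conditioned on a different history, so only the first divergence is meaningful.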

### Verdict (Deliverable 1): HYPOTHESIS HELD

The 0007 GPU byte-identity failure is **stock batch-shape non-determinism, not a
paged bug**. With paging entirely OFF, stock llama.cpp produces a different
greedy token stream when the same prompt is processed in a full-prefill batch vs
a split (prefix-then-suffix) batch - exactly the shape difference that 0007's
prefix-share path introduces (full B-from-scratch vs prefix-cached + suffix-only).

Refinement (reported honestly): it is **not strictly CUDA-only**. CPU exhibits
the same divergence, just less often and starting later (1/5 vs 2/5 at G=24; the
first CPU flip lands at step 5 vs step 3 on CUDA). This is exactly why 0007's
small, short CPU scenarios happened to pass 16/16 while the CUDA run flipped:
CUDA's larger parallel reductions reorder more aggressively, so a near-tie
crosses earlier and more frequently. The phenomenon is floating-point
GEMM-batching non-determinism, inherent to both backends; paging is not the
cause.

## Deliverable 2: vLLM vs llama.cpp+paged on a shared-prefix fan-out

Workload: K requests share a 1024-token system prefix, each with a unique
32-token suffix, then generate 64 tokens. Both engines cache the shared prefix
(vLLM automatic prefix caching ON by default; llama.cpp via the paged
cross-request prefix cache, `LLAMA_KV_PAGED=1`).

The quantization differs between the engines; this is the realistic
apples-to-oranges setup, reported honestly:

- llama.cpp: Qwen3-32B **Q4_K_M** (GGUF), `-ngl 99`, CUDA dequant kernels.
- vLLM: Qwen3-32B **nvfp4a16 (W4A16)**, served via the **Marlin FP4
  weight-only** kernel because GB10 (sm_121) has **no native FP4 compute** -
  i.e. vLLM is on a slower-than-ideal kernel path here. vLLM also ran
  `enforce_eager=True` (no CUDA graphs / torch.compile; the env lacked a working
  inductor/ninja toolchain), so the vLLM numbers are if anything **conservative**.
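
To make the fan-out workload described at the top of this section concrete, the sketch below builds K token-id prompts that share one prefix and differ only in the suffix. This is not the bench driver's code: `build_fanout` and `shared_prefix_len` are hypothetical helpers, and the token ids are placeholders rather than real vocabulary ids.

```python
def build_fanout(k, prefix_len=1024, suffix_len=32):
    """Build k prompts sharing one prefix, each with a unique suffix."""
    prefix = list(range(prefix_len))  # placeholder token ids
    prompts = []
    for req in range(k):
        base = prefix_len + req * suffix_len
        suffix = list(range(base, base + suffix_len))  # unique per request
        prompts.append(prefix + suffix)
    return prompts


def shared_prefix_len(prompts):
    """Length of the longest prefix common to all prompts."""
    n = 0
    for column in zip(*prompts):
        if len(set(column)) != 1:
            break
        n += 1
    return n


prompts = build_fanout(16)
print("requests:", len(prompts))
print("tokens per request:", len(prompts[0]))
print("shared prefix:", shared_prefix_len(prompts))
```

A prefix cache can skip at most `shared_prefix_len(prompts)` tokens for every request after the first, which is the saving both engines are measured on.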

### vLLM (automatic prefix caching), end-to-end

APC hits confirmed in the engine log: **"Prefix cache hit rate: 97.0%"**,
`prefix_cache_hits 33040/34848` (K=16) and `99344/102432` (K=32). The quoted
97.0% matches the K=32 counters (99344/102432); the K=16 counters work out to
about 94.8%.

| K | APC | prefill wall (G=1) | total wall (G=64) | throughput |
|---|-----|--------------------|-------------------|------------|
| 16 | ON | 0.749 s | 6.63 s | 2.41 req/s |
| 16 | OFF | 20.19 s | 27.21 s | 0.59 req/s |
| 32 | ON | 1.13 s | 7.56 s | 4.23 req/s |
| 32 | OFF | 40.19 s | 48.71 s | 0.66 req/s |

vLLM's APC cuts the fan-out prefill ~27x (K=16) to ~36x (K=32) vs APC-off; the
huge ratio reflects how slow the FP4-emulation prefill is when forced to
recompute all K prefixes.
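
The throughput column and the APC ratios follow directly from the wall times in the table. A short sketch of that arithmetic, with the table values copied in; the dictionary layout and function names are illustrative only:

```python
# (K, APC on?) -> (prefill wall at G=1, total wall at G=64), in seconds.
VLLM = {
    (16, True): (0.749, 6.63),
    (16, False): (20.19, 27.21),
    (32, True): (1.13, 7.56),
    (32, False): (40.19, 48.71),
}


def throughput(k, apc):
    """Requests per second over the full G=64 run."""
    return k / VLLM[(k, apc)][1]


def apc_prefill_speedup(k):
    """APC-off prefill wall divided by APC-on prefill wall."""
    return VLLM[(k, False)][0] / VLLM[(k, True)][0]


for k in (16, 32):
    print(f"K={k}: {throughput(k, True):.2f} req/s with APC, "
          f"{throughput(k, False):.2f} req/s without, "
          f"prefill speedup ~{apc_prefill_speedup(k):.0f}x")
```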

### llama.cpp + paged prefix cache (prefill phase)

The paged shared-prefix bench (`llama-paged-prefix-bench`, `BENCH_GEN=0`,
`PAGED_NGL=99`). Reuse confirmed: `kshare(seq1)=1024`, shared-block
`ref_cnt = K` (all sequences hold the one prefix), 15360 (K=16) / 31744 (K=32)
prefix tokens skipped.

| K | mode | prefill tokens submitted | prefill wall | vs no-share |
|---|------|--------------------------|--------------|-------------|
| 16 | PAGED-SHARE | 1536 | 3.66 s | 7.15x |
| 16 | NO-SHARE | 16896 | 26.17 s | 1.0x |
| 32 | PAGED-SHARE | 2048 | 6.04 s | 10.3x |
| 32 | NO-SHARE | 33792 | 62.17 s | 1.0x |

The paged prefix cache delivers the expected **7.15x (K=16) / 10.3x (K=32)**
prefill wall-time reduction - the headline cross-request prefix-skip win, on a
real 32B model on GPU.
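
The token counts in the table can be checked from the workload shape alone (1024-token prefix, 32-token suffix). The sketch below does that and compares the token-count ratio with the measured wall-time ratio; the function names are illustrative, and the wall times are copied from the table.

```python
def prefill_tokens(k, prefix=1024, suffix=32, share=True):
    """Tokens submitted for prefill across k requests."""
    if share:
        return prefix + k * suffix   # prefix once, then k suffixes
    return k * (prefix + suffix)     # every request pays the full prompt


def skipped_prefix_tokens(k, prefix=1024):
    """Prefix tokens not recomputed when the prefix is shared."""
    return (k - 1) * prefix


# K -> (PAGED-SHARE wall, NO-SHARE wall), in seconds, from the table.
WALL = {16: (3.66, 26.17), 32: (6.04, 62.17)}

for k, (share_s, noshare_s) in WALL.items():
    token_ratio = prefill_tokens(k, share=False) / prefill_tokens(k)
    wall_ratio = noshare_s / share_s
    print(f"K={k}: skipped={skipped_prefix_tokens(k)} "
          f"token ratio={token_ratio:.1f}x wall ratio={wall_ratio:.2f}x")
```

The wall-time gain (7.15x, 10.3x) is smaller than the reduction in submitted tokens (11x, 16.5x), so prefill time on this run does not scale purely with token count.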

### Head-to-head, both engines caching the shared prefix

Prefill of the cached fan-out (vLLM at G=1, which approximates prefill;
llama.cpp at G=0, pure prefill):

| K | llama.cpp+paged prefill | vLLM APC prefill | vLLM faster by |
|---|-------------------------|------------------|----------------|
| 16 | 3.66 s | 0.749 s | ~4.9x |
| 32 | 6.04 s | 1.13 s | ~5.3x |

### Verdict (Deliverable 2): competitive in kind, behind in absolute terms

With both engines caching the shared prefix, **llama.cpp+paged is qualitatively
competitive but absolutely behind vLLM on this GB10 box**:

- **Same optimization, same order of magnitude.** llama.cpp's paged prefix cache
  reproduces exactly the win vLLM's APC gives (skipping the shared-prefix
  recompute) and yields a 7-10x prefill reduction vs its own no-share baseline.
  On the RAG/system-prompt fan-out the algorithmic gap is closed: llama.cpp no
  longer pays K x prefix.

- **vLLM still wins head-to-head by ~5x on the cached prefill** (0.75s vs 3.66s
  at K=16; 1.13s vs 6.04s at K=32), and by more end-to-end because it does
  **continuous batched decode** (all K sequences decoded in one fused step)
  while the llama.cpp paged *dev driver* decodes each sequence serially. That
  decode-batching gap is a property of the serving stack, not of the paged
  prefix cache. Notably vLLM wins here while handicapped (eager mode, FP4
  weight-only emulation with no native FP4 on GB10); a tuned vLLM would likely
  lead by more.

- **Honest caveats / blockers.** (1) Quant differs (Q4_K_M vs nvfp4a16). (2) The
  comparison is prefill-vs-prefill plus vLLM end-to-end; a clean llama.cpp
  end-to-end on this driver is blocked because its generation phase has a
  stale-logits bug (`get_logits_ith` reads seq 0's prefill index after later
  sequences' prefills overwrote the logits buffer -> segfault), and even once
  that is fixed its decode is serial, so it would not be apples-to-apples vs
  vLLM's batched decode. The fair end-to-end llama.cpp number needs the grpc /
  llama-server continuous-batching path, not this dev scaffold. (3) vLLM ran
  eager + FP4 emulation, making its numbers conservative.

Bottom line: paged gives llama.cpp the cross-request prefix-skip that vLLM's APC
provides, which is the categorical win and removes the K x prefix penalty on
RAG/system-prompt fan-out. On absolute wall-time on this hardware vLLM retains a
~5x prefill lead and a larger end-to-end lead from continuous batched decode and
a more optimized serving stack.