From 2975a74fb4dc3e4b741c0711f724dd798f3e4bb7 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Tue, 23 Jun 2026 20:21:50 +0000
Subject: [PATCH] docs(paged): Qwen3.6 NVFP4 apples-to-apples scorecard (llama vs vLLM, dense + MoE)

Full 4-way sweep (npl 8/32/64/128): dense Qwen3.6-27B (clean W4A4) + MoE
Qwen3.6-35B-A3B (vLLM Marlin NvFp4). Parity at npl8; vLLM scales ~2.8-2.9x
ahead on decode at npl128. llama TTFT explodes at high concurrency - run
WITHOUT max_prefill_tokens (0013), the prefill-starvation also drags
decode_agg; fair re-run with the QoS budget pending. llama wins on
on-demand memory (paged).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto
---
 .../patches/paged/QWEN36_NVFP4_BENCH.md | 90 +++++++++++++++----
 1 file changed, 75 insertions(+), 15 deletions(-)

diff --git a/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md b/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md
index 86e0490a9..6b45f2e17 100644
--- a/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md
+++ b/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md
@@ -12,13 +12,13 @@ ahead of / behind vLLM?"
   unified-memory used GB (`MemTotal-MemAvailable`), so they cover weights + KV + runtime.
 - **llama.cpp**: dev tree `~/llama-paged-dev` branch `paged` HEAD `151343b` (patch 0015),
   `build-cuda` sm_121, `LLAMA_KV_PAGED=1`, `llama-server -c 131072 --parallel 128 -b 2048
-  -ub 512 -ngl 99 -fa on`.
+  -ub 512 -ngl 99 -fa on`. **NOTE: run WITHOUT `max_prefill_tokens` (patch 0013) - see the
+  TTFT caveat in the verdict.**
 - **vLLM**: 0.23.0, `--enforce-eager --gpu-memory-utilization 0.85 --max-model-len 4096
   --max-num-seqs 256 -tp 1`.
-- **Client**: identical async client (`h2h_cli.py`) for both engines. Per request:
-  512-token unique prompt (unique leading tokens defeat cross-request prefix caching),
-  `max_tokens=256`, `temperature=0`, `ignore_eos=True`, streaming with usage. Concurrency
-  (npl) swept at 8 / 32 / 64 / 128.
+- **Client**: identical async client for both engines. Per request: 512-token unique prompt
+  (unique leading tokens defeat cross-request prefix caching), `max_tokens=256`,
+  `temperature=0`, `ignore_eos=True`, streaming with usage. Concurrency (npl) swept 8/32/64/128.
 - **Metrics** (localmaxxing.com schema): `decode_agg_tps` (aggregate decode tok/s across all
   live seqs), `decode_perseq_tps` (mean per-sequence decode), `prefill_tps`, `ttft_mean_ms`,
   `PEAK_GB` (unified-memory peak).
@@ -32,17 +32,77 @@ ahead of / behind vLLM?"
 
 ---
 
-## Results
+## Results (decode aggregate tok/s, per-seq, prefill, TTFT, peak GB)
 
-### MoE Qwen3.6-35B-A3B (~3B active) - llama.cpp (paged, patch 0015)
+### MoE Qwen3.6-35B-A3B (~3B active)
 
-| npl | decode agg tok/s | decode per-seq tok/s | prefill tok/s | TTFT mean ms | peak GB |
-|----:|-----------------:|---------------------:|--------------:|-------------:|--------:|
-|   8 |            170.2 |                20.27 |        2813.4 |        855.0 |   38.98 |
-|  32 |            235.4 |                 6.77 |        2004.5 |       4970.5 |   43.06 |
-|  64 |            271.7 |                 3.88 |        2388.7 |       7205.0 |   52.53 |
-| 128 |            292.2 |                 2.05 |         656.5 |      84799.7 |   61.42 |
+| npl | engine | decode agg | decode/seq | prefill | TTFT mean ms | peak GB |
+|----:|--------|-----------:|-----------:|--------:|-------------:|--------:|
+|   8 | llama  |      170.2 |      20.27 |    2813 |          855 |   38.98 |
+|   8 | vLLM   |      202.0 |      24.92 |    4648 |          799 |  111.49 |
+|  32 | llama  |      235.4 |       6.77 |    2005 |         4970 |   43.06 |
+|  32 | vLLM   |      462.0 |      13.59 |    4755 |         2308 |  111.26 |
+|  64 | llama  |      271.7 |       3.88 |    2389 |         7205 |   52.53 |
+|  64 | vLLM   |      624.5 |       8.90 |    4784 |         4072 |  111.46 |
+| 128 | llama  |      292.2 |       2.05 |     657 |        84800 |   61.42 |
+| 128 | vLLM   |      811.1 |       5.46 |    4263 |         7980 |  111.61 |
 
-Baseline (weights loaded, idle): 37.67 GB.
+llama decode as % of vLLM: **84 / 51 / 44 / 36** at npl 8/32/64/128.
 
-
+### DENSE Qwen3.6-27B
+
+| npl | engine | decode agg | decode/seq | prefill | TTFT mean ms | peak GB |
+|----:|--------|-----------:|-----------:|--------:|-------------:|--------:|
+|   8 | llama  |       63.8 |       7.60 |    1117 |         2029 |   51.72 |
+|   8 | vLLM   |       64.3 |       7.98 |    1514 |         2593 |  112.07 |
+|  32 | llama  |      108.9 |       3.08 |     752 |        13212 |   61.48 |
+|  32 | vLLM   |      189.8 |       5.57 |    1555 |         7477 |  112.09 |
+|  64 | llama  |      126.2 |       1.78 |     465 |        53818 |   74.90 |
+|  64 | vLLM   |      284.2 |       3.92 |    1526 |        12942 |  112.11 |
+| 128 | llama  |      134.6 |       0.93 |     125 |       491195 |   94.03 |
+| 128 | vLLM   |      390.7 |       2.50 |    1420 |        24806 |  112.12 |
+
+llama decode as % of vLLM: **99 / 57 / 44 / 34** at npl 8/32/64/128.
+
+---
+
+## Verdict
+
+**At matched NVFP4 on one GB10 box: llama.cpp is at parity only at low concurrency; vLLM
+scales substantially better as concurrency rises.**
+
+1. **npl=8 (low concurrency): near parity.** Dense 99%, MoE 84% of vLLM decode. The MoE's
+   ~3B active shows: per-seq decode 20-25 tok/s (MoE) vs 8 tok/s (dense) on both engines.
+
+2. **npl>=32 (high concurrency): vLLM pulls decisively ahead** - decode ~2x (npl32) rising to
+   ~2.8-2.9x (npl128) on both models. vLLM scales monotonically (dense 64->391, MoE 202->811);
+   llama plateaus (dense 64->135, MoE 170->292).
+
+3. **TTFT is the clearest gap, and it is largely self-inflicted here.** llama's TTFT explodes
+   at high concurrency (dense **491 s**, MoE **85 s** at npl128) while vLLM stays bounded (25 s,
+   8 s). **This run used llama WITHOUT `max_prefill_tokens` (patch 0013)** - so 128 concurrent
+   512-token prefills starve each other and the decode. Crucially, that starvation also drags
+   `decode_agg` down: while many slots are stuck prefilling, fewer are actually decoding, so the
+   measured aggregate understates llama's steady-state decode. A re-run with `max_prefill_tokens`
+   (the QoS budget this PR already ships) is expected to bound TTFT AND raise high-concurrency
+   decode by keeping all slots live.
+
+4. **Memory: llama wins on efficiency.** vLLM pre-reserves the whole pool (~112 GB at
+   gpu-mem-util 0.85); llama grows on demand (MoE 38->61 GB, dense 52->94 GB). The paged
+   on-demand KV is materially more memory-efficient / multi-tenant-friendly.
+
+5. **vs the localmaxxing reference (259.5 MoE / 254.8 dense top-speed):** those are single-stream
+   on fast datacenter HW. GB10 per-seq decode tops out far lower (MoE ~25, dense ~8 tok/s at
+   npl8) - the LPDDR5x ~273 GB/s bandwidth floor, as expected. The reference is a ceiling, not a
+   GB10 target.
+
+### Honest bottom line
+
+The "par-or-beat vLLM" goal is **met at low concurrency but NOT at high concurrency** on these
+NVFP4 models. vLLM's continuous-batched decode + bounded prefill scheduling scale better on a
+bandwidth-limited box. Two of the three gap drivers are addressable on our side: (a) **prefill
+starvation** - re-run with `max_prefill_tokens` (patch 0013), which this PR ships; (b) **decode
+batching efficiency at high concurrency** - the runtime/scheduler lever (the small/unsaturated
+regime). The kernel itself is at parity (npl8). Next step: a fair re-run with the prefill budget
+on, plus decode-batch tuning, to get llama's true high-concurrency numbers before concluding the
+absolute gap.
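
The "llama decode as % of vLLM" figures quoted under each table can be re-derived from the `decode agg` columns. A minimal sketch in Python, assuming the values are copied by hand from the two tables; the `DECODE_AGG` layout and the helper name `decode_pct_of_vllm` are illustrative only and are not part of the benchmark client or of this patch:

```python
# Aggregate decode tok/s copied from the scorecard tables, keyed by
# concurrency (npl). The dict layout is an assumption made for illustration.
DECODE_AGG = {
    "moe": {
        "llama": {8: 170.2, 32: 235.4, 64: 271.7, 128: 292.2},
        "vllm": {8: 202.0, 32: 462.0, 64: 624.5, 128: 811.1},
    },
    "dense": {
        "llama": {8: 63.8, 32: 108.9, 64: 126.2, 128: 134.6},
        "vllm": {8: 64.3, 32: 189.8, 64: 284.2, 128: 390.7},
    },
}


def decode_pct_of_vllm(model: str) -> list[int]:
    """Return llama decode throughput as a rounded % of vLLM, in npl order."""
    llama = DECODE_AGG[model]["llama"]
    vllm = DECODE_AGG[model]["vllm"]
    return [round(100 * llama[npl] / vllm[npl]) for npl in sorted(llama)]


if __name__ == "__main__":
    for model in ("moe", "dense"):
        print(model, decode_pct_of_vllm(model))
```

Inverting the npl=128 ratios (811.1 / 292.2 and 390.7 / 134.6) gives roughly 2.8x and 2.9x, which is where the "~2.8-2.9x" figure in the verdict comes from.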