docs(paged): Qwen3.6 NVFP4 apples-to-apples scorecard (llama vs vLLM, dense + MoE)

Full 4-way sweep (npl 8/32/64/128): dense Qwen3.6-27B (clean W4A4) + MoE Qwen3.6-35B-A3B (vLLM Marlin NvFp4). Near parity at npl8; vLLM is ~2.8-2.9x ahead on decode at npl128. llama TTFT explodes at high concurrency because this run was made WITHOUT max_prefill_tokens (0013); the resulting prefill starvation also drags down decode_agg. A fair re-run with the QoS budget is pending. llama wins on on-demand memory (paged).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-24 08:38:51 -04:00 · 2026-06-23 20:21:50 +00:00
parent ee78ae4a11
commit 2975a74fb4
1 changed files with 75 additions and 15 deletions
--- a/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md
+++ b/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md
@@ -12,13 +12,13 @@ ahead of / behind vLLM?"
  unified-memory used GB (`MemTotal-MemAvailable`), so they cover weights + KV + runtime.
 - **llama.cpp**: dev tree `~/llama-paged-dev` branch `paged` HEAD `151343b` (patch 0015),
  `build-cuda` sm_121, `LLAMA_KV_PAGED=1`, `llama-server -c 131072 --parallel 128 -b 2048
-  -ub 512 -ngl 99 -fa on`.
+  -ub 512 -ngl 99 -fa on`. **NOTE: run WITHOUT `max_prefill_tokens` (patch 0013) - see the
+  TTFT caveat in the verdict.**
 - **vLLM**: 0.23.0, `--enforce-eager --gpu-memory-utilization 0.85 --max-model-len 4096
  --max-num-seqs 256 -tp 1`.
- **Client**: identical async client (`h2h_cli.py`) for both engines. Per request:
-  512-token unique prompt (unique leading tokens defeat cross-request prefix caching),
-  `max_tokens=256`, `temperature=0`, `ignore_eos=True`, streaming with usage. Concurrency
-  (npl) swept at 8 / 32 / 64 / 128.
+- **Client**: identical async client for both engines. Per request: 512-token unique prompt
+  (unique leading tokens defeat cross-request prefix caching), `max_tokens=256`,
+  `temperature=0`, `ignore_eos=True`, streaming with usage. Concurrency (npl) swept 8/32/64/128.
 - **Metrics** (localmaxxing.com schema): `decode_agg_tps` (aggregate decode tok/s across all
  live seqs), `decode_perseq_tps` (mean per-sequence decode), `prefill_tps`, `ttft_mean_ms`,
  `PEAK_GB` (unified-memory peak).
@@ -32,17 +32,77 @@ ahead of / behind vLLM?"

 ---

-## Results
+## Results (decode aggregate tok/s, per-seq, prefill, TTFT, peak GB)

-### MoE Qwen3.6-35B-A3B (~3B active) - llama.cpp (paged, patch 0015)
+### MoE Qwen3.6-35B-A3B (~3B active)

-| npl | decode agg tok/s | decode per-seq tok/s | prefill tok/s | TTFT mean ms | peak GB |
-|----:|-----------------:|---------------------:|--------------:|-------------:|--------:|
-| 8   | 170.2 | 20.27 | 2813.4 | 855.0   | 38.98 |
-| 32  | 235.4 | 6.77  | 2004.5 | 4970.5  | 43.06 |
-| 64  | 271.7 | 3.88  | 2388.7 | 7205.0  | 52.53 |
-| 128 | 292.2 | 2.05  | 656.5  | 84799.7 | 61.42 |
+| npl | engine | decode agg | decode/seq | prefill | TTFT mean ms | peak GB |
+|----:|--------|-----------:|-----------:|--------:|-------------:|--------:|
+| 8   | llama  | 170.2 | 20.27 | 2813 | 855     | 38.98 |
+| 8   | vLLM   | 202.0 | 24.92 | 4648 | 799     | 111.49 |
+| 32  | llama  | 235.4 | 6.77  | 2005 | 4970    | 43.06 |
+| 32  | vLLM   | 462.0 | 13.59 | 4755 | 2308    | 111.26 |
+| 64  | llama  | 271.7 | 3.88  | 2389 | 7205    | 52.53 |
+| 64  | vLLM   | 624.5 | 8.90  | 4784 | 4072    | 111.46 |
+| 128 | llama  | 292.2 | 2.05  | 657  | 84800   | 61.42 |
+| 128 | vLLM   | 811.1 | 5.46  | 4263 | 7980    | 111.61 |

-Baseline (weights loaded, idle): 37.67 GB.
+llama decode as % of vLLM: **84 / 51 / 44 / 36** at npl 8/32/64/128.

-<!-- MoE vLLM, DENSE llama, DENSE vLLM tables appended by orchestrator phases below -->
+### DENSE Qwen3.6-27B
+
+| npl | engine | decode agg | decode/seq | prefill | TTFT mean ms | peak GB |
+|----:|--------|-----------:|-----------:|--------:|-------------:|--------:|
+| 8   | llama  | 63.8  | 7.60 | 1117 | 2029    | 51.72 |
+| 8   | vLLM   | 64.3  | 7.98 | 1514 | 2593    | 112.07 |
+| 32  | llama  | 108.9 | 3.08 | 752  | 13212   | 61.48 |
+| 32  | vLLM   | 189.8 | 5.57 | 1555 | 7477    | 112.09 |
+| 64  | llama  | 126.2 | 1.78 | 465  | 53818   | 74.90 |
+| 64  | vLLM   | 284.2 | 3.92 | 1526 | 12942   | 112.11 |
+| 128 | llama  | 134.6 | 0.93 | 125  | 491195  | 94.03 |
+| 128 | vLLM   | 390.7 | 2.50 | 1420 | 24806   | 112.12 |
+
+llama decode as % of vLLM: **99 / 57 / 44 / 34** at npl 8/32/64/128.
+
+---
+
+## Verdict
+
+**At matched NVFP4 on one GB10 box: llama.cpp is at parity only at low concurrency; vLLM
+scales substantially better as concurrency rises.**
+
+1. **npl=8 (low concurrency): near parity.** Dense 99%, MoE 84% of vLLM decode. The MoE's
+   ~3B active shows: per-seq decode 20-25 tok/s (MoE) vs 8 tok/s (dense) on both engines.
+
+2. **npl>=32 (high concurrency): vLLM pulls decisively ahead** - decode ~1.7-2.0x (npl32) rising
+   to ~2.8-2.9x (npl128) across the two models. vLLM scales monotonically (dense 64->391, MoE
+   202->811); llama plateaus (dense 64->135, MoE 170->292).
+
+3. **TTFT is the clearest gap, and it is largely self-inflicted here.** llama's TTFT explodes
+   at high concurrency (dense **491 s**, MoE **85 s** at npl128) while vLLM stays bounded (25 s,
+   8 s). **This run used llama WITHOUT `max_prefill_tokens` (patch 0013)** - so 128 concurrent
+   512-token prefills starve each other and the decode. Crucially, that starvation also drags
+   `decode_agg` down: while many slots are stuck prefilling, fewer are actually decoding, so the
+   measured aggregate understates llama's steady-state decode. A re-run with `max_prefill_tokens`
+   (the QoS budget this PR already ships) is expected to bound TTFT AND raise high-concurrency
+   decode by keeping all slots live.
+
+4. **Memory: llama wins on efficiency.** vLLM pre-reserves the whole pool (~112 GB at
+   gpu-mem-util 0.85); llama grows on demand (MoE 39->61 GB, dense 52->94 GB). The paged
+   on-demand KV is materially more memory-efficient / multi-tenant-friendly.
+
+5. **vs the localmaxxing reference (259.5 MoE / 254.8 dense top-speed):** those are single-stream
+   on fast datacenter HW. GB10 per-seq decode tops out far lower (MoE ~25, dense ~8 tok/s at
+   npl8) - the LPDDR5x ~273 GB/s bandwidth floor, as expected. The reference is a ceiling, not a
+   GB10 target.
+
+### Honest bottom line
+
+The "par-or-beat vLLM" goal is **met at low concurrency but NOT at high concurrency** on these
+NVFP4 models. vLLM's continuous-batched decode + bounded prefill scheduling scale better on a
+bandwidth-limited box. Two of the three gap drivers are addressable on our side: (a) **prefill
+starvation** - re-run with `max_prefill_tokens` (patch 0013), which this PR ships; (b) **decode
+batching efficiency at high concurrency** - the runtime/scheduler lever (the small/unsaturated
+regime). The kernel itself is at parity (npl8). Next step: a fair re-run with the prefill budget
+on, plus decode-batch tuning, to get llama's true high-concurrency numbers before drawing
+conclusions about the absolute gap.
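
As a cross-check on the scorecard, the "llama decode as % of vLLM" lines and the ~2.8-2.9x npl128 figure can be recomputed from the decode-aggregate columns. The following is a minimal Python sketch with the numbers copied from the two tables; the dictionary layout and function names are illustrative assumptions and are not part of the benchmark client or the patch.

```python
# Decode-aggregate tok/s copied from the scorecard tables: {npl: (llama, vLLM)}.
# DECODE_AGG, llama_pct_of_vllm and vllm_speedup are hypothetical names for this sketch.
DECODE_AGG = {
    "moe": {
        8: (170.2, 202.0),
        32: (235.4, 462.0),
        64: (271.7, 624.5),
        128: (292.2, 811.1),
    },
    "dense": {
        8: (63.8, 64.3),
        32: (108.9, 189.8),
        64: (126.2, 284.2),
        128: (134.6, 390.7),
    },
}


def llama_pct_of_vllm(model):
    """llama decode_agg as a rounded percentage of vLLM, in ascending npl order."""
    rows = sorted(DECODE_AGG[model].items())
    return [round(100 * llama / vllm) for _, (llama, vllm) in rows]


def vllm_speedup(model, npl):
    """How many times faster vLLM's aggregate decode is than llama's at this npl."""
    llama, vllm = DECODE_AGG[model][npl]
    return vllm / llama


if __name__ == "__main__":
    for model in DECODE_AGG:
        pct = " / ".join(str(p) for p in llama_pct_of_vllm(model))
        print(f"{model}: {pct} % of vLLM, vLLM {vllm_speedup(model, 128):.2f}x at npl128")
```

The same ratio at npl32 gives roughly 1.96x for the MoE model and 1.74x for the dense one, so the gap opens more slowly on the dense model before both converge near 2.8-2.9x at npl128.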