docs(paged): Qwen3.6 NVFP4 h2h bench doc - MoE llama.cpp table

First crash-resilient slab of the apples-to-apples NVFP4-vs-NVFP4
llama.cpp-vs-vLLM benchmark on GB10. MoE Qwen3.6-35B-A3B paged
llama.cpp (patch 0015) decode/prefill/TTFT/VRAM at npl 8/32/64/128.
vLLM and dense tables will be appended as the sweeps land.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-23 19:43:55 +00:00
parent acb22a66ed
commit ee78ae4a11

@@ -0,0 +1,48 @@
# Qwen3.6 NVFP4-vs-NVFP4: llama.cpp vs vLLM on GB10 (DGX Spark)
Apples-to-apples benchmark. Both engines run the **same NVFP4 weights** on the **same box**
(GB10, sm_121, LPDDR5x unified memory ~273 GB/s). The question is not "who wins the HW
lottery" but "at matched NVFP4, on one bandwidth-limited box, does our paged llama.cpp
(patch 0015, expert-density-aware MoE token-tile auto-select, default-on) sit at par with /
ahead of / behind vLLM?"
## Setup
- **Box**: GB10 / DGX Spark, sm_121, unified LPDDR5x (~273 GB/s). Memory figures are
used unified memory in GB (`MemTotal - MemAvailable`), so they cover weights + KV + runtime.
- **llama.cpp**: dev tree `~/llama-paged-dev` branch `paged` HEAD `151343b` (patch 0015),
`build-cuda` sm_121, `LLAMA_KV_PAGED=1`, `llama-server -c 131072 --parallel 128 -b 2048
-ub 512 -ngl 99 -fa on`.
- **vLLM**: 0.23.0, `--enforce-eager --gpu-memory-utilization 0.85 --max-model-len 4096
--max-num-seqs 256 -tp 1`.
- **Client**: identical async client (`h2h_cli.py`) for both engines. Per request:
512-token unique prompt (unique leading tokens defeat cross-request prefix caching),
`max_tokens=256`, `temperature=0`, `ignore_eos=True`, streaming with usage. Concurrency
(npl) swept at 8 / 32 / 64 / 128.
- **Metrics** (localmaxxing.com schema): `decode_agg_tps` (aggregate decode tok/s across all
live seqs), `decode_perseq_tps` (mean per-sequence decode), `prefill_tps`, `ttft_mean_ms`,
`PEAK_GB` (unified-memory peak).
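
For orientation, the metric names above could be derived from per-request timestamps roughly as follows. This is a minimal sketch, not the actual `h2h_cli.py` logic: the `RequestTiming` record, the `summarize` helper, and the exact measurement windows (which timestamps bound "decode" and "prefill") are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTiming:
    """Hypothetical per-request record; field names are illustrative only."""
    t_start: float          # seconds, request sent
    t_first_token: float    # seconds, first streamed token received
    t_end: float            # seconds, last streamed token received
    prompt_tokens: int
    completion_tokens: int


def summarize(reqs):
    """Aggregate one concurrency level (one npl) into the four rate/latency metrics.

    Assumed definitions (not confirmed by the benchmark source):
      - decode tokens per request = completion_tokens - 1 (first token belongs to prefill)
      - aggregate decode window   = earliest first token .. latest end
      - prefill window            = earliest start .. latest first token
    """
    decode_tokens = [r.completion_tokens - 1 for r in reqs]

    # Mean of each sequence's own decode rate.
    perseq = mean(
        n / (r.t_end - r.t_first_token) for n, r in zip(decode_tokens, reqs)
    )

    # All decoded tokens over the shared wall-clock decode window.
    decode_window = max(r.t_end for r in reqs) - min(r.t_first_token for r in reqs)
    agg = sum(decode_tokens) / decode_window

    # All prompt tokens over the time it took every request to reach its first token.
    prefill_window = max(r.t_first_token for r in reqs) - min(r.t_start for r in reqs)
    prefill = sum(r.prompt_tokens for r in reqs) / prefill_window

    ttft_ms = 1000.0 * mean(r.t_first_token - r.t_start for r in reqs)

    return {
        "decode_agg_tps": agg,
        "decode_perseq_tps": perseq,
        "prefill_tps": prefill,
        "ttft_mean_ms": ttft_ms,
    }


if __name__ == "__main__":
    demo = [
        RequestTiming(0.0, 1.0, 3.0, prompt_tokens=512, completion_tokens=256),
        RequestTiming(0.0, 2.0, 4.0, prompt_tokens=512, completion_tokens=256),
    ]
    print(summarize(demo))
```

Under these assumed windows the aggregate rate is not simply `npl` times the per-sequence mean, because sequences that start decoding late shorten their own window but not the shared one.
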
## The 2 models, 4 checkpoints (NVFP4, matched weights)
| Model | llama.cpp GGUF | vLLM checkpoint | Match |
|-------|----------------|-----------------|-------|
| DENSE Qwen3.6-27B (28B dense) | `q36-27b-nvfp4.gguf` (native Blackwell FP4) | `q36-27b-nvfp4-vllm/` (unsloth TRUE W4A4) | clean W4A4 both sides |
| MoE Qwen3.6-35B-A3B (36B total, ~3B active) | `q36-35b-a3b-nvfp4.gguf` (241 NVFP4 tensors, nvidia weights) | `q36-35b-a3b-nvfp4-vllm/` (nvidia modelopt; vLLM picks Marlin NvFp4 MoE + FA2) | NVFP4 weight-only, identical nvidia weights |
---
## Results
### MoE Qwen3.6-35B-A3B (~3B active) - llama.cpp (paged, patch 0015)
| npl | decode agg tok/s | decode per-seq tok/s | prefill tok/s | TTFT mean ms | peak GB |
|----:|-----------------:|---------------------:|--------------:|-------------:|--------:|
| 8 | 170.2 | 20.27 | 2813.4 | 855.0 | 38.98 |
| 32 | 235.4 | 6.77 | 2004.5 | 4970.5 | 43.06 |
| 64 | 271.7 | 3.88 | 2388.7 | 7205.0 | 52.53 |
| 128 | 292.2 | 2.05 | 656.5 | 84799.7 | 61.42 |

Baseline (weights loaded, idle): 37.67 GB.
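
To read the table against the idle baseline, the arithmetic below subtracts 37.67 GB from each peak and normalizes aggregate decode to the lowest concurrency. It is only a sketch over the numbers already reported above; the `derive` helper and its output field names are hypothetical, and the memory delta lumps together KV cache and any other runtime growth.

```python
# Values copied from the llama.cpp MoE table: npl -> (decode agg tok/s, peak GB).
BASELINE_GB = 37.67
ROWS = {
    8: (170.2, 38.98),
    32: (235.4, 43.06),
    64: (271.7, 52.53),
    128: (292.2, 61.42),
}


def derive(rows, baseline_gb):
    """Return, per npl, memory above baseline and decode speedup vs the smallest npl."""
    ref_tps = rows[min(rows)][0]
    out = {}
    for npl, (agg_tps, peak_gb) in sorted(rows.items()):
        out[npl] = {
            # Peak unified memory in excess of the weights-loaded idle state.
            "delta_gb": round(peak_gb - baseline_gb, 2),
            # Aggregate decode throughput relative to the lowest concurrency.
            "speedup_vs_min_npl": round(agg_tps / ref_tps, 2),
        }
    return out


if __name__ == "__main__":
    for npl, d in derive(ROWS, BASELINE_GB).items():
        print(npl, d)
```

With the reported values, going from npl 8 to npl 128 (16x the concurrency) works out to about 1.72x aggregate decode and 23.75 GB above the idle baseline.
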
<!-- MoE vLLM, DENSE llama, DENSE vLLM tables appended by orchestrator phases below -->