LocalAI/backend/cpp/llama-cpp/patches/paged/final_benchmark.csv
Ettore Di Giacinto aaaa90ae4b bench(paged): final apples-to-apples NVFP4 decode benchmark (0023 vs vLLM 0.23.0, GB10)
Publishable, plot-ready head-to-head on GB10 / DGX Spark with matched NVFP4 weights,
both engines at their best realistic config (CUDA graphs ON both sides; vLLM util 0.85
max-model-len 4096 max-num-seqs 256; llama -c 131072 --parallel 128 LLAMA_KV_PAGED=1
LLAMA_MAX_BATCH_TOKENS=512). Identical async client: 512-tok unique-nonce prompt
(fresh full prefill), max_tokens=256, temp 0, ignore_eos, stream+usage; npl 8/32/64/128.

llama = clean patch 0023 (dev tree f7409c2, bf16 GDN-state work reverted, build-cuda
rebuilt). llama runs at HIGHER precision (f32 GDN state + q8 act) than vLLM (bf16 + w4a4).

decode_agg t/s, llama as % of vLLM:
  DENSE q36-27b-nvfp4:  npl8 117%  npl32 91%  npl64 90%  npl128 92%
  MoE   q36-35b-a3b:    npl8  83%  npl32 78%  npl64 77%  npl128 82%
memory: llama on-demand paged KV 50-90 GB (dense) / 36-58 GB (MoE) vs vLLM fixed ~107 GB
pool at all npl (~1.2-3x lower). TTFT: vLLM wins under synchronized burst (llama
decode-first budget trades burst-prefill for decode; decode + memory unaffected).

Outputs: final_benchmark.csv (16 rows, 5 metrics each), refreshed QWEN36_NVFP4_BENCH.md
(FINAL section), BENCHMARK_PROGRESS.md (per-row checkpoint log). Methodology notes:
per-npl llama server restart (paged-pool degrades after high-npl bursts; decode robust),
vLLM npl8 re-check confirms no degradation; clean env (service containers stopped for the
run, restored after).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 03:47:24 +00:00

1.2 KiB

model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb,peak_engine_gb,llama_decode_pct_of_vllm
q36-27b-nvfp4,llama,8,82.5,9.57,507.3,6038.1,53.51,50.22,117.2
q36-27b-nvfp4,llama,32,192.6,4.79,115.0,133551.7,69.63,66.32,90.9
q36-27b-nvfp4,llama,64,277.8,3.09,95.9,321618.8,83.96,80.64,89.9
q36-27b-nvfp4,llama,128,384.6,1.86,69.7,902762.7,93.82,90.52,91.8
q36-27b-nvfp4,vllm,8,70.4,8.76,2096.2,1861.1,110.92,107.61,100.0
q36-27b-nvfp4,vllm,32,211.8,6.28,2182.6,5353.2,110.87,107.56,100.0
q36-27b-nvfp4,vllm,64,309.1,4.38,2088.9,9512.4,110.88,107.57,100.0
q36-27b-nvfp4,vllm,128,418.8,2.79,1929.1,18449.5,110.95,107.64,100.0
q36-35b-a3b-nvfp4,llama,8,211.8,24.45,1236.4,2477.1,39.66,36.13,82.6
q36-35b-a3b-nvfp4,llama,32,393.0,10.02,1213.9,8225.2,47.11,43.77,78.5
q36-35b-a3b-nvfp4,llama,64,527.0,6.15,1152.3,15849.5,57.13,53.83,76.8
q36-35b-a3b-nvfp4,llama,128,726.4,3.73,276.8,213017.2,61.51,58.23,82.3
q36-35b-a3b-nvfp4,vllm,8,256.5,31.84,5186.5,768.8,109.62,106.34,100.0
q36-35b-a3b-nvfp4,vllm,32,500.8,14.90,6223.4,1830.4,109.63,106.35,100.0
q36-35b-a3b-nvfp4,vllm,64,686.1,9.83,5926.5,3224.4,109.63,106.35,100.0
q36-35b-a3b-nvfp4,vllm,128,882.2,6.05,5300.5,6487.7,109.64,106.36,100.0
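
The llama_decode_pct_of_vllm column can be re-derived from decode_agg_tps alone, which is a quick sanity check before plotting. The sketch below is not part of the benchmark tooling: the function name `decode_pct_of_vllm` is hypothetical, and it assumes the percentage is llama decode_agg_tps divided by vLLM decode_agg_tps at the same model and npl, rounded to one decimal. Only the columns it needs are embedded.

```python
import csv
import io

# Subset of final_benchmark.csv: only the columns needed for the check.
CSV_TEXT = """model,engine,npl,decode_agg_tps
q36-27b-nvfp4,llama,8,82.5
q36-27b-nvfp4,llama,32,192.6
q36-27b-nvfp4,llama,64,277.8
q36-27b-nvfp4,llama,128,384.6
q36-27b-nvfp4,vllm,8,70.4
q36-27b-nvfp4,vllm,32,211.8
q36-27b-nvfp4,vllm,64,309.1
q36-27b-nvfp4,vllm,128,418.8
q36-35b-a3b-nvfp4,llama,8,211.8
q36-35b-a3b-nvfp4,llama,32,393.0
q36-35b-a3b-nvfp4,llama,64,527.0
q36-35b-a3b-nvfp4,llama,128,726.4
q36-35b-a3b-nvfp4,vllm,8,256.5
q36-35b-a3b-nvfp4,vllm,32,500.8
q36-35b-a3b-nvfp4,vllm,64,686.1
q36-35b-a3b-nvfp4,vllm,128,882.2
"""


def decode_pct_of_vllm(csv_text):
    """Return {(model, npl): llama decode_agg_tps as a percentage of vLLM's}."""
    # Index aggregate decode throughput by (model, engine, npl).
    tps = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["model"], row["engine"], int(row["npl"]))
        tps[key] = float(row["decode_agg_tps"])

    # Assumed definition: llama / vLLM at matching model and npl, one decimal.
    pct = {}
    for (model, engine, npl), value in tps.items():
        if engine == "llama":
            pct[(model, npl)] = round(100.0 * value / tps[(model, "vllm", npl)], 1)
    return pct


if __name__ == "__main__":
    for (model, npl), value in sorted(decode_pct_of_vllm(CSV_TEXT).items()):
        print(f"{model} npl={npl}: {value}%")
```

To run the same check on the real file, read it with `open("final_benchmark.csv").read()` and pass the text in; the extra columns are ignored by the lookup.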