LocalAI/backend/cpp/llama-cpp/patches/paged/final_benchmark.csv
Ettore Di Giacinto ae0042f214 docs(paged): publish NVFP4 decode benchmark - plot-ready CSV + decode-vs-npl plots
Public deliverable for the patch-0018..0023 f32 bit-exact paged-attention ship:
the apples-to-apples NVFP4 decode benchmark (llama.cpp paged 0023 vs vLLM 0.23.0
on GB10 / DGX Spark, matched weights, CUDA graphs ON both sides).

- final_benchmark.csv: clean 8-column plot-ready schema
  (model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb),
  16 rows (2 models x 2 engines x npl 8/32/64/128).
- QWEN36_NVFP4_BENCH.md: embed the two decode-vs-npl plots; add the
  internal-consistency note (decode_agg vs perseq*npl is TTFT-governed, holds on
  both engines, no stale-baseline carry-over).
- decode-vs-npl PNGs (one per model), llama vs vLLM, per-point llama-%-of-vLLM labels.

Headline (measured, nothing pre-assumed): dense llama 90-117% of vLLM decode
(ahead at npl8), MoE 77-83%, at higher precision (f32 GDN state + q8 act vs vLLM
bf16 GDN + w4a4) and 1.5-3x lower unified memory (on-demand paged KV vs vLLM's
flat ~107 GB pool).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 03:51:35 +00:00

983 B

model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb
q36-27b-nvfp4,llama,8,82.5,9.57,507.3,6038.1,53.51
q36-27b-nvfp4,llama,32,192.6,4.79,115.0,133551.7,69.63
q36-27b-nvfp4,llama,64,277.8,3.09,95.9,321618.8,83.96
q36-27b-nvfp4,llama,128,384.6,1.86,69.7,902762.7,93.82
q36-27b-nvfp4,vllm,8,70.4,8.76,2096.2,1861.1,110.92
q36-27b-nvfp4,vllm,32,211.8,6.28,2182.6,5353.2,110.87
q36-27b-nvfp4,vllm,64,309.1,4.38,2088.9,9512.4,110.88
q36-27b-nvfp4,vllm,128,418.8,2.79,1929.1,18449.5,110.95
q36-35b-a3b-nvfp4,llama,8,211.8,24.45,1236.4,2477.1,39.66
q36-35b-a3b-nvfp4,llama,32,393.0,10.02,1213.9,8225.2,47.11
q36-35b-a3b-nvfp4,llama,64,527.0,6.15,1152.3,15849.5,57.13
q36-35b-a3b-nvfp4,llama,128,726.4,3.73,276.8,213017.2,61.51
q36-35b-a3b-nvfp4,vllm,8,256.5,31.84,5186.5,768.8,109.62
q36-35b-a3b-nvfp4,vllm,32,500.8,14.90,6223.4,1830.4,109.63
q36-35b-a3b-nvfp4,vllm,64,686.1,9.83,5926.5,3224.4,109.63
q36-35b-a3b-nvfp4,vllm,128,882.2,6.05,5300.5,6487.7,109.64
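
The per-point "llama % of vLLM" labels and the memory ratio quoted in the commit message can be recomputed from the file. The following is a minimal sketch, not the script used to produce the plots: `DATA` and `headline_ratios` are hypothetical names, and `DATA` is a subset of the CSV carrying only the `decode_agg_tps` and `peak_gb` columns.

```python
import csv
import io

# Subset of final_benchmark.csv: only the columns needed for the two ratios.
DATA = """model,engine,npl,decode_agg_tps,peak_gb
q36-27b-nvfp4,llama,8,82.5,53.51
q36-27b-nvfp4,llama,32,192.6,69.63
q36-27b-nvfp4,llama,64,277.8,83.96
q36-27b-nvfp4,llama,128,384.6,93.82
q36-27b-nvfp4,vllm,8,70.4,110.92
q36-27b-nvfp4,vllm,32,211.8,110.87
q36-27b-nvfp4,vllm,64,309.1,110.88
q36-27b-nvfp4,vllm,128,418.8,110.95
q36-35b-a3b-nvfp4,llama,8,211.8,39.66
q36-35b-a3b-nvfp4,llama,32,393.0,47.11
q36-35b-a3b-nvfp4,llama,64,527.0,57.13
q36-35b-a3b-nvfp4,llama,128,726.4,61.51
q36-35b-a3b-nvfp4,vllm,8,256.5,109.62
q36-35b-a3b-nvfp4,vllm,32,500.8,109.63
q36-35b-a3b-nvfp4,vllm,64,686.1,109.63
q36-35b-a3b-nvfp4,vllm,128,882.2,109.64
"""


def headline_ratios(csv_text):
    """Map (model, npl) to (llama decode as % of vLLM, vLLM/llama peak memory)."""
    by_point = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["model"], int(row["npl"]))
        by_point.setdefault(key, {})[row["engine"]] = (
            float(row["decode_agg_tps"]),
            float(row["peak_gb"]),
        )

    ratios = {}
    for key, engines in sorted(by_point.items()):
        llama_tps, llama_gb = engines["llama"]
        vllm_tps, vllm_gb = engines["vllm"]
        ratios[key] = (100.0 * llama_tps / vllm_tps, vllm_gb / llama_gb)
    return ratios


ratios = headline_ratios(DATA)
for (model, npl), (pct, mem_ratio) in ratios.items():
    print(f"{model} npl={npl}: llama {pct:.1f}% of vLLM decode, "
          f"vLLM uses {mem_ratio:.2f}x the peak memory")
```

Pairing rows by `(model, npl)` rather than by position keeps the calculation correct if the file is re-sorted or a row is added for another engine.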