Public deliverable for the patch-0018..0023 f32 bit-exact paged-attention ship:
the apples-to-apples NVFP4 decode benchmark (llama.cpp paged 0023 vs vLLM 0.23.0
on GB10 / DGX Spark, matched weights, CUDA graphs ON both sides).
- final_benchmark.csv: clean 8-column plot-ready schema
(model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb),
16 rows (2 models x 2 engines x npl 8/32/64/128).
- QWEN36_NVFP4_BENCH.md: embed the two decode-vs-npl plots; add the
internal-consistency note (decode_agg vs perseq*npl is TTFT-governed, holds on
both engines, no stale-baseline carry-over).
- decode-vs-npl PNGs (one per model), llama vs vLLM, per-point llama-%-of-vLLM labels.
Headline (measured, nothing pre-assumed): dense llama 90-117% of vLLM decode
(ahead at npl8), MoE 77-83%, at higher precision (f32 GDN state + q8 act vs vLLM
bf16 GDN + w4a4) and 1.5-3x lower unified memory (on-demand paged KV vs vLLM's
flat ~107 GB pool).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>