LocalAI/backend/cpp/llama-cpp/patches/paged/final_benchmark.csv at 7dd3431040705882663be015da98f9b2bfc2a2d5

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-26 01:16:58 -04:00

Ettore Di Giacinto ae0042f214 docs(paged): publish NVFP4 decode benchmark - plot-ready CSV + decode-vs-npl plots

Public deliverable for the patch-0018..0023 f32 bit-exact paged-attention ship:
the apples-to-apples NVFP4 decode benchmark (llama.cpp paged 0023 vs vLLM 0.23.0
on GB10 / DGX Spark, matched weights, CUDA graphs ON both sides).

- final_benchmark.csv: clean 8-column plot-ready schema
  (model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb),
  16 rows (2 models x 2 engines x npl 8/32/64/128).
- QWEN36_NVFP4_BENCH.md: embed the two decode-vs-npl plots; add the
  internal-consistency note (decode_agg vs perseq*npl is TTFT-governed, holds on
  both engines, no stale-baseline carry-over).
- decode-vs-npl PNGs (one per model), llama vs vLLM, per-point llama-%-of-vLLM labels.

Headline (measured, nothing pre-assumed): dense llama 90-117% of vLLM decode
(ahead at npl8), MoE 77-83%, at higher precision (f32 GDN state + q8 act vs vLLM
bf16 GDN + w4a4) and 1.5-3x lower unified memory (on-demand paged KV vs vLLM's
flat ~107 GB pool).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
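
The headline ratios in the commit message can be re-derived from the CSV. The sketch below is not part of the commit: it embeds only the decode_agg_tps and peak_gb columns from final_benchmark.csv, and the names `load`, `compare`, and `RESULTS` are hypothetical helpers chosen for illustration. It assumes "llama-%-of-vLLM" means llama decode_agg_tps divided by vLLM decode_agg_tps at the same model and npl, and that the memory ratio is vLLM peak_gb over llama peak_gb.

```python
import csv
import io

# decode_agg_tps and peak_gb copied from final_benchmark.csv; other columns omitted.
DATA = """model,engine,npl,decode_agg_tps,peak_gb
q36-27b-nvfp4,llama,8,82.5,53.51
q36-27b-nvfp4,llama,32,192.6,69.63
q36-27b-nvfp4,llama,64,277.8,83.96
q36-27b-nvfp4,llama,128,384.6,93.82
q36-27b-nvfp4,vllm,8,70.4,110.92
q36-27b-nvfp4,vllm,32,211.8,110.87
q36-27b-nvfp4,vllm,64,309.1,110.88
q36-27b-nvfp4,vllm,128,418.8,110.95
q36-35b-a3b-nvfp4,llama,8,211.8,39.66
q36-35b-a3b-nvfp4,llama,32,393.0,47.11
q36-35b-a3b-nvfp4,llama,64,527.0,57.13
q36-35b-a3b-nvfp4,llama,128,726.4,61.51
q36-35b-a3b-nvfp4,vllm,8,256.5,109.62
q36-35b-a3b-nvfp4,vllm,32,500.8,109.63
q36-35b-a3b-nvfp4,vllm,64,686.1,109.63
q36-35b-a3b-nvfp4,vllm,128,882.2,109.64
"""


def load(text):
    """Index rows by (model, engine, npl) -> (decode_agg_tps, peak_gb)."""
    rows = {}
    for r in csv.DictReader(io.StringIO(text)):
        key = (r["model"], r["engine"], int(r["npl"]))
        rows[key] = (float(r["decode_agg_tps"]), float(r["peak_gb"]))
    return rows


def compare(rows):
    """For each llama row, pair it with the vLLM row at the same model and npl."""
    out = []
    for (model, engine, npl), (tps, gb) in sorted(rows.items()):
        if engine != "llama":
            continue
        vllm_tps, vllm_gb = rows[(model, "vllm", npl)]
        pct_of_vllm = round(100.0 * tps / vllm_tps, 1)  # llama decode as % of vLLM
        mem_ratio = round(vllm_gb / gb, 2)              # vLLM peak_gb / llama peak_gb
        out.append((model, npl, pct_of_vllm, mem_ratio))
    return out


RESULTS = compare(load(DATA))
for model, npl, pct, mem in RESULTS:
    print(f"{model:20s} npl={npl:<4d} llama={pct:5.1f}% of vLLM  mem ratio={mem:.2f}x")
```

Under those assumptions the decode percentages come out consistent with the stated 90-117% (dense) and 77-83% (MoE) ranges after rounding. The peak_gb ratio computed this way spans roughly 1.2x to 2.8x across the eight pairs, so the 1.5-3x figure in the message may be using a different basis than the peak_gb column alone.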

2026-06-26 03:51:35 +00:00

983 B

model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb
q36-27b-nvfp4,llama,8,82.5,9.57,507.3,6038.1,53.51
q36-27b-nvfp4,llama,32,192.6,4.79,115.0,133551.7,69.63
q36-27b-nvfp4,llama,64,277.8,3.09,95.9,321618.8,83.96
q36-27b-nvfp4,llama,128,384.6,1.86,69.7,902762.7,93.82
q36-27b-nvfp4,vllm,8,70.4,8.76,2096.2,1861.1,110.92
q36-27b-nvfp4,vllm,32,211.8,6.28,2182.6,5353.2,110.87
q36-27b-nvfp4,vllm,64,309.1,4.38,2088.9,9512.4,110.88
q36-27b-nvfp4,vllm,128,418.8,2.79,1929.1,18449.5,110.95
q36-35b-a3b-nvfp4,llama,8,211.8,24.45,1236.4,2477.1,39.66
q36-35b-a3b-nvfp4,llama,32,393.0,10.02,1213.9,8225.2,47.11
q36-35b-a3b-nvfp4,llama,64,527.0,6.15,1152.3,15849.5,57.13
q36-35b-a3b-nvfp4,llama,128,726.4,3.73,276.8,213017.2,61.51
q36-35b-a3b-nvfp4,vllm,8,256.5,31.84,5186.5,768.8,109.62
q36-35b-a3b-nvfp4,vllm,32,500.8,14.90,6223.4,1830.4,109.63
q36-35b-a3b-nvfp4,vllm,64,686.1,9.83,5926.5,3224.4,109.63
q36-35b-a3b-nvfp4,vllm,128,882.2,6.05,5300.5,6487.7,109.64
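
The internal-consistency note in the commit message says the gap between decode_agg_tps and decode_perseq_tps * npl is TTFT-governed on both engines. A minimal sketch for inspecting that gap follows. It is an illustration only: the exact definitions of the two decode columns are not given on this page, `GAPS` is a hypothetical name, and the ratio is simply decode_agg_tps divided by (decode_perseq_tps * npl), printed next to ttft_mean_ms.

```python
import csv
import io

# Columns copied from final_benchmark.csv; prefill_tps and peak_gb omitted.
DATA = """model,engine,npl,decode_agg_tps,decode_perseq_tps,ttft_mean_ms
q36-27b-nvfp4,llama,8,82.5,9.57,6038.1
q36-27b-nvfp4,llama,32,192.6,4.79,133551.7
q36-27b-nvfp4,llama,64,277.8,3.09,321618.8
q36-27b-nvfp4,llama,128,384.6,1.86,902762.7
q36-27b-nvfp4,vllm,8,70.4,8.76,1861.1
q36-27b-nvfp4,vllm,32,211.8,6.28,5353.2
q36-27b-nvfp4,vllm,64,309.1,4.38,9512.4
q36-27b-nvfp4,vllm,128,418.8,2.79,18449.5
q36-35b-a3b-nvfp4,llama,8,211.8,24.45,2477.1
q36-35b-a3b-nvfp4,llama,32,393.0,10.02,8225.2
q36-35b-a3b-nvfp4,llama,64,527.0,6.15,15849.5
q36-35b-a3b-nvfp4,llama,128,726.4,3.73,213017.2
q36-35b-a3b-nvfp4,vllm,8,256.5,31.84,768.8
q36-35b-a3b-nvfp4,vllm,32,500.8,14.90,1830.4
q36-35b-a3b-nvfp4,vllm,64,686.1,9.83,3224.4
q36-35b-a3b-nvfp4,vllm,128,882.2,6.05,6487.7
"""

# One tuple per CSV row, in file order:
# (model, engine, npl, agg / (perseq * npl), ttft_mean_ms)
GAPS = []
for r in csv.DictReader(io.StringIO(DATA)):
    npl = int(r["npl"])
    agg = float(r["decode_agg_tps"])
    perseq = float(r["decode_perseq_tps"])
    ratio = agg / (perseq * npl)  # 1.0 would mean agg == perseq * npl exactly
    GAPS.append((r["model"], r["engine"], npl, round(ratio, 4), float(r["ttft_mean_ms"])))

for model, engine, npl, ratio, ttft in GAPS:
    print(f"{model:20s} {engine:5s} npl={npl:<4d} agg/(perseq*npl)={ratio:.3f}  ttft_mean_ms={ttft:.1f}")
```

On these rows the ratio is above 1 everywhere and grows with npl for each model and engine, in step with ttft_mean_ms, which is at least compatible with the TTFT-governed explanation. The sketch does not establish the cause.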
