LocalAI/backend/cpp/llama-cpp/patches/paged/final_benchmark.csv at aaaa90ae4bd133c28c2a568171a813e5349909a1

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-26 01:16:58 -04:00


Ettore Di Giacinto aaaa90ae4b bench(paged): final apples-to-apples NVFP4 decode benchmark (0023 vs vLLM 0.23.0, GB10)

Publishable, plot-ready head-to-head on GB10 / DGX Spark with matched NVFP4 weights,
both engines at their best realistic config (CUDA graphs ON both sides; vLLM util 0.85
max-model-len 4096 max-num-seqs 256; llama -c 131072 --parallel 128 LLAMA_KV_PAGED=1
LLAMA_MAX_BATCH_TOKENS=512). Identical async client: 512-tok unique-nonce prompt
(fresh full prefill), max_tokens=256, temp 0, ignore_eos, stream+usage; npl 8/32/64/128.
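
A minimal sketch of how a per-request payload matching the client settings above might be built. The field names assume an OpenAI-compatible completions endpoint on both servers, with `ignore_eos` as a server-side extension; `build_request` and the filler text are hypothetical and not taken from the benchmark client. A word count only approximates a token count, so a real client would size the prompt with the model's tokenizer.

```python
import uuid


def build_request(model, prompt_words=512, max_tokens=256):
    """Build one completion request with a unique nonce at the start of the prompt.

    Putting the nonce first means no two prompts share a prefix, so every
    request should trigger a fresh full prefill rather than a cache hit.
    """
    nonce = uuid.uuid4().hex
    # Assumption: repeated filler words stand in for a 512-token prompt.
    filler = " ".join(["benchmark"] * prompt_words)
    return {
        "model": model,
        "prompt": f"[{nonce}] {filler}",
        "max_tokens": max_tokens,
        "temperature": 0,
        "ignore_eos": True,  # assumed extension: keep decoding to max_tokens
        "stream": True,
        "stream_options": {"include_usage": True},
    }


if __name__ == "__main__":
    # One batch of 8 concurrent requests (npl 8), each with its own nonce.
    batch = [build_request("q36-27b-nvfp4") for _ in range(8)]
    print(len({req["prompt"] for req in batch}))
```

Sending the same payload shape to both engines keeps the comparison down to the servers themselves.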

llama = clean patch 0023 (dev tree f7409c2, bf16 GDN-state work reverted, build-cuda
rebuilt). llama runs at HIGHER precision (f32 GDN state + q8 act) than vLLM (bf16 + w4a4).

decode_agg t/s, llama as % of vLLM:
  DENSE q36-27b-nvfp4:  npl8 117%  npl32 91%  npl64 90%  npl128 92%
  MoE   q36-35b-a3b:    npl8  83%  npl32 78%  npl64 77%  npl128 82%
memory: llama on-demand paged KV 50-90 GB (dense) / 36-58 GB (MoE) vs vLLM fixed ~107 GB
pool at all npl (1.2-3x lower). TTFT: vLLM wins under synchronized burst (llama
decode-first budget trades burst-prefill for decode; decode + memory unaffected).
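
The memory comparison can be recomputed from the peak_engine_gb column of final_benchmark.csv. The sketch below copies those values by hand; `memory_ratios` is a hypothetical helper name, not part of the benchmark tooling.

```python
# peak_engine_gb per npl, copied from final_benchmark.csv
PEAK_ENGINE_GB = {
    "q36-27b-nvfp4": {
        "llama": {8: 50.22, 32: 66.32, 64: 80.64, 128: 90.52},
        "vllm": {8: 107.61, 32: 107.56, 64: 107.57, 128: 107.64},
    },
    "q36-35b-a3b-nvfp4": {
        "llama": {8: 36.13, 32: 43.77, 64: 53.83, 128: 58.23},
        "vllm": {8: 106.34, 32: 106.35, 64: 106.35, 128: 106.36},
    },
}


def memory_ratios(model):
    """Return {npl: vllm_peak / llama_peak} for one model."""
    engines = PEAK_ENGINE_GB[model]
    return {
        npl: engines["vllm"][npl] / engines["llama"][npl]
        for npl in engines["llama"]
    }


if __name__ == "__main__":
    for model in PEAK_ENGINE_GB:
        ratios = memory_ratios(model)
        print(model, {npl: round(r, 2) for npl, r in ratios.items()})
```

The ratio shrinks as npl grows because the llama paged pool grows on demand while the vLLM pool is fixed.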

Outputs: final_benchmark.csv (16 rows, 5 metrics each), refreshed QWEN36_NVFP4_BENCH.md
(FINAL section), BENCHMARK_PROGRESS.md (per-row checkpoint log). Methodology notes:
per-npl llama server restart (paged-pool degrades after high-npl bursts; decode robust),
vLLM npl8 re-check confirms no degradation; clean env (service containers stopped for the
run, restored after).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-26 03:47:24 +00:00

1.2 KiB


model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb,peak_engine_gb,llama_decode_pct_of_vllm
q36-27b-nvfp4,llama,8,82.5,9.57,507.3,6038.1,53.51,50.22,117.2
q36-27b-nvfp4,llama,32,192.6,4.79,115.0,133551.7,69.63,66.32,90.9
q36-27b-nvfp4,llama,64,277.8,3.09,95.9,321618.8,83.96,80.64,89.9
q36-27b-nvfp4,llama,128,384.6,1.86,69.7,902762.7,93.82,90.52,91.8
q36-27b-nvfp4,vllm,8,70.4,8.76,2096.2,1861.1,110.92,107.61,100.0
q36-27b-nvfp4,vllm,32,211.8,6.28,2182.6,5353.2,110.87,107.56,100.0
q36-27b-nvfp4,vllm,64,309.1,4.38,2088.9,9512.4,110.88,107.57,100.0
q36-27b-nvfp4,vllm,128,418.8,2.79,1929.1,18449.5,110.95,107.64,100.0
q36-35b-a3b-nvfp4,llama,8,211.8,24.45,1236.4,2477.1,39.66,36.13,82.6
q36-35b-a3b-nvfp4,llama,32,393.0,10.02,1213.9,8225.2,47.11,43.77,78.5
q36-35b-a3b-nvfp4,llama,64,527.0,6.15,1152.3,15849.5,57.13,53.83,76.8
q36-35b-a3b-nvfp4,llama,128,726.4,3.73,276.8,213017.2,61.51,58.23,82.3
q36-35b-a3b-nvfp4,vllm,8,256.5,31.84,5186.5,768.8,109.62,106.34,100.0
q36-35b-a3b-nvfp4,vllm,32,500.8,14.90,6223.4,1830.4,109.63,106.35,100.0
q36-35b-a3b-nvfp4,vllm,64,686.1,9.83,5926.5,3224.4,109.63,106.35,100.0
q36-35b-a3b-nvfp4,vllm,128,882.2,6.05,5300.5,6487.7,109.64,106.36,100.0
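
The llama_decode_pct_of_vllm column appears to be the llama decode_agg_tps divided by the vLLM decode_agg_tps for the same model and npl, times 100. A small sketch of that check, using only the relevant columns copied from the rows above; `recompute_pct` is a hypothetical helper and the trimmed CSV text is an illustration, not the file itself.

```python
import csv
import io

# Trimmed copy of the benchmark rows: only the columns needed for the check.
CSV_TEXT = """model,engine,npl,decode_agg_tps,llama_decode_pct_of_vllm
q36-27b-nvfp4,llama,8,82.5,117.2
q36-27b-nvfp4,llama,32,192.6,90.9
q36-27b-nvfp4,llama,64,277.8,89.9
q36-27b-nvfp4,llama,128,384.6,91.8
q36-27b-nvfp4,vllm,8,70.4,100.0
q36-27b-nvfp4,vllm,32,211.8,100.0
q36-27b-nvfp4,vllm,64,309.1,100.0
q36-27b-nvfp4,vllm,128,418.8,100.0
q36-35b-a3b-nvfp4,llama,8,211.8,82.6
q36-35b-a3b-nvfp4,llama,32,393.0,78.5
q36-35b-a3b-nvfp4,llama,64,527.0,76.8
q36-35b-a3b-nvfp4,llama,128,726.4,82.3
q36-35b-a3b-nvfp4,vllm,8,256.5,100.0
q36-35b-a3b-nvfp4,vllm,32,500.8,100.0
q36-35b-a3b-nvfp4,vllm,64,686.1,100.0
q36-35b-a3b-nvfp4,vllm,128,882.2,100.0
"""


def recompute_pct(csv_text):
    """Return {(model, npl): (recomputed_pct, stored_pct)} for llama rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Index the vLLM baseline throughput by (model, npl).
    vllm = {
        (r["model"], int(r["npl"])): float(r["decode_agg_tps"])
        for r in rows
        if r["engine"] == "vllm"
    }
    result = {}
    for r in rows:
        if r["engine"] != "llama":
            continue
        key = (r["model"], int(r["npl"]))
        pct = round(100.0 * float(r["decode_agg_tps"]) / vllm[key], 1)
        result[key] = (pct, float(r["llama_decode_pct_of_vllm"]))
    return result


if __name__ == "__main__":
    for key, (recomputed, stored) in recompute_pct(CSV_TEXT).items():
        print(key, recomputed, stored)
```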
