Ettore Di Giacinto
aaaa90ae4b
bench(paged): final apples-to-apples NVFP4 decode benchmark (0023 vs vLLM 0.23.0, GB10)
Publishable, plot-ready head-to-head on GB10 / DGX Spark with matched NVFP4 weights,
both engines at their best realistic config, CUDA graphs ON on both sides:
  vLLM:  gpu-memory-utilization 0.85, max-model-len 4096, max-num-seqs 256
  llama: -c 131072 --parallel 128, LLAMA_KV_PAGED=1, LLAMA_MAX_BATCH_TOKENS=512
Identical async client: 512-token unique-nonce prompt (fresh full prefill),
max_tokens=256, temperature 0, ignore_eos, stream+usage; npl 8/32/64/128.
llama = clean patch 0023 (dev tree f7409c2, bf16 GDN-state work reverted, build-cuda
rebuilt). llama runs at HIGHER precision (f32 GDN state + q8 activations) than vLLM
(bf16 + w4a4).
decode_agg t/s, llama as % of vLLM:
  model                  npl8   npl32  npl64  npl128
  DENSE q36-27b-nvfp4    117%   91%    90%    92%
  MoE   q36-35b-a3b      83%    78%    77%    82%
memory: llama on-demand paged KV uses 50-90 GB (dense) / 36-58 GB (MoE) vs vLLM's fixed
~107 GB pool at all npl (roughly 1.2-3x lower). TTFT: vLLM wins under synchronized burst
(llama's decode-first budget trades burst-prefill speed for decode throughput; decode and
memory results are unaffected).
Outputs: final_benchmark.csv (16 rows, 5 metrics each), refreshed QWEN36_NVFP4_BENCH.md
(FINAL section), BENCHMARK_PROGRESS.md (per-row checkpoint log). Methodology notes:
llama server restarted for each npl (paged pool degrades after high-npl bursts; decode
stays robust); vLLM npl8 re-check confirms no degradation; clean env (service containers
stopped for the run, restored after).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 03:47:24 +00:00
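
The client settings named in the commit message (unique-nonce prompt, max_tokens=256, temperature 0, ignore_eos, stream+usage) can be sketched as a request builder. This is a minimal illustration, not the benchmark's actual client: `build_request`, `pct_of_baseline`, the filler word, and the model name are hypothetical. It also assumes both servers accept an OpenAI-style completions payload with an `ignore_eos` extension field.

```python
import uuid

NPL_LEVELS = (8, 32, 64, 128)  # concurrency levels listed in the commit message


def build_request(model: str, filler_words: int = 512, max_tokens: int = 256) -> dict:
    """Build one streaming completion payload with a unique-nonce prompt (hypothetical helper)."""
    # A fresh nonce at the start of the prompt makes every request unique,
    # so no cached prefix can be reused and each request pays a full prefill.
    nonce = uuid.uuid4().hex
    # Assumption: the word count only approximates a 512-token prompt;
    # the real token count depends on the model's tokenizer.
    prompt = f"[{nonce}] " + " ".join(["benchmark"] * filler_words)
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
        "ignore_eos": True,  # assumed server extension: always decode max_tokens tokens
        "stream": True,
        "stream_options": {"include_usage": True},  # request token usage in the final chunk
    }


def pct_of_baseline(candidate_tps: float, baseline_tps: float) -> float:
    """Express candidate throughput as a percentage of the baseline throughput."""
    return 100.0 * candidate_tps / baseline_tps
```

For one concurrency level, a client would build `npl` such payloads (each with its own nonce), send them concurrently, and report `pct_of_baseline(llama_tps, vllm_tps)` for the aggregate decode rate.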