mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-26 09:26:55 -04:00
Publishable, plot-ready head-to-head on GB10 / DGX Spark with matched NVFP4 weights, both engines at their best realistic config (CUDA graphs ON both sides; vLLM util 0.85 max-model-len 4096 max-num-seqs 256; llama -c 131072 --parallel 128 LLAMA_KV_PAGED=1 LLAMA_MAX_BATCH_TOKENS=512). Identical async client: 512-tok unique-nonce prompt (fresh full prefill), max_tokens=256, temp 0, ignore_eos, stream+usage; npl 8/32/64/128. llama = clean patch 0023 (dev tree f7409c2, bf16 GDN-state work reverted, build-cuda rebuilt). llama runs at HIGHER precision (f32 GDN state + q8 act) than vLLM (bf16 + w4a4). decode_agg t/s, llama as % of vLLM: DENSE q36-27b-nvfp4: npl8 117% npl32 91% npl64 90% npl128 92% MoE q36-35b-a3b: npl8 83% npl32 78% npl64 77% npl128 82% memory: llama on-demand paged KV 50-90 GB (dense) / 36-58 GB (MoE) vs vLLM fixed ~107 GB pool at all npl (1.5-3x lower). TTFT: vLLM wins under synchronized burst (llama decode-first budget trades burst-prefill for decode; decode + memory unaffected). Outputs: final_benchmark.csv (16 rows, 5 metrics each), refreshed QWEN36_NVFP4_BENCH.md (FINAL section), BENCHMARK_PROGRESS.md (per-row checkpoint log). Methodology notes: per-npl llama server restart (paged-pool degrades after high-npl bursts; decode robust), vLLM npl8 re-check confirms no degradation; clean env (service containers stopped for the run, restored after). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>