LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-26 09:26:55 -04:00

Files

Ettore Di Giacinto ae0042f214 docs(paged): publish NVFP4 decode benchmark - plot-ready CSV + decode-vs-npl plots

Public deliverable for the patch-0018..0023 f32 bit-exact paged-attention ship:
the apples-to-apples NVFP4 decode benchmark (llama.cpp paged 0023 vs vLLM 0.23.0
on GB10 / DGX Spark, matched weights, CUDA graphs ON both sides).

- final_benchmark.csv: clean 8-column plot-ready schema
  (model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb),
  16 rows (2 models x 2 engines x npl 8/32/64/128).
- QWEN36_NVFP4_BENCH.md: embed the two decode-vs-npl plots; add the
  internal-consistency note (decode_agg vs perseq*npl is TTFT-governed, holds on
  both engines, no stale-baseline carry-over).
- decode-vs-npl PNGs (one per model), llama vs vLLM, per-point llama-%-of-vLLM labels.

Headline (measured, nothing pre-assumed): dense llama 90-117% of vLLM decode
(ahead at npl8), MoE 77-83%, at higher precision (f32 GDN state + q8 act vs vLLM
bf16 GDN + w4a4) and 1.5-3x lower unified memory (on-demand paged KV vs vLLM's
flat ~107 GB pool).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-26 03:51:35 +00:00

ds4

chore: ⬆️ Update antirez/ds4 to 80ebbc396aee40eedc1d829222f3362d10fa4c6c (#10378 )

2026-06-18 00:32:13 +02:00

grpc

fix: speedup git submodule update with --single-branch (#2847 )

2024-07-13 22:32:25 +02:00

ik-llama-cpp

chore: ⬆️ Update ikawrakow/ik_llama.cpp to d5507e33ae7ee2b7b41475f08044d3bde3b839ee (#10498 )

2026-06-25 00:57:42 +02:00

llama-cpp

docs(paged): publish NVFP4 decode benchmark - plot-ready CSV + decode-vs-npl plots