LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-26 09:26:55 -04:00

Files

Ettore Di Giacinto ae0042f214 docs(paged): publish NVFP4 decode benchmark - plot-ready CSV + decode-vs-npl plots

Public deliverable for the patch-0018..0023 f32 bit-exact paged-attention ship:
the apples-to-apples NVFP4 decode benchmark (llama.cpp paged 0023 vs vLLM 0.23.0
on GB10 / DGX Spark, matched weights, CUDA graphs ON both sides).

- final_benchmark.csv: clean 8-column plot-ready schema
  (model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb),
  16 rows (2 models x 2 engines x npl 8/32/64/128).
- QWEN36_NVFP4_BENCH.md: embed the two decode-vs-npl plots; add the
  internal-consistency note (decode_agg vs perseq*npl is TTFT-governed, holds on
  both engines, no stale-baseline carry-over).
- decode-vs-npl PNGs (one per model), llama vs vLLM, per-point llama-%-of-vLLM labels.

Headline (measured, nothing pre-assumed): dense llama 90-117% of vLLM decode
(ahead at npl8), MoE 77-83%, at higher precision (f32 GDN state + q8 act vs vLLM
bf16 GDN + w4a4) and 1.5-3x lower unified memory (on-demand paged KV vs vLLM's
flat ~107 GB pool).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-26 03:51:35 +00:00

paged

feat(paged): target-readiness for 2xH200 - correctness PASS, load-gen harness, projection

2026-06-21 23:16:28 +00:00

patches

docs(paged): publish NVFP4 decode benchmark - plot-ready CSV + decode-vs-npl plots

2026-06-26 03:51:35 +00:00

CMakeLists.txt

feat: single-build ggml CPU_ALL_VARIANTS for llama-cpp + turboquant (x86/arm64/apple) (#10497 )

2026-06-25 15:47:03 +02:00

grpc-server.cpp

Merge branch 'master' into worktree-feat+paged-attention