LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 08:08:52 -04:00

Files

Ettore Di Giacinto 07985ba45b analysis: measured llama.cpp aggregate vs vLLM - already ~75-80% at npl<=128

llama-batched-bench Qwen3-32B-Q4_K_M: aggregate decode 235/391/540 t/s at
npl=32/64/128 vs vLLM 328/569/667 = 72/69/81%, multiplier 53x (vLLM 56x), still
climbing at 128. The 30x headline is wrong at realistic concurrency: llama.cpp is
ahead single-stream (MXFP4 1153 > 800) and ~75-80% aggregate. Aggregate prefill is
flat ~760 but GB10-compute-capped (vLLM ~800 too), so chunked prefill is a
latency/TTFT win not throughput; paged KV is the high-concurrency (thousands-seqs)
lever for vLLM's 24k regime. ROI: MXFP4 ship -> chunked prefill -> paged KV.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-21 11:32:40 +00:00

ds4

chore: ⬆️ Update antirez/ds4 to 80ebbc396aee40eedc1d829222f3362d10fa4c6c (#10378 )

2026-06-18 00:32:13 +02:00

grpc

fix: speedup git submodule update with --single-branch (#2847 )

2024-07-13 22:32:25 +02:00

ik-llama-cpp

chore: ⬆️ Update ikawrakow/ik_llama.cpp to b3dfb7858cfcb9166e92f366e5af87f19ebc94be (#10395 )

2026-06-19 00:03:37 +02:00

llama-cpp

analysis: measured llama.cpp aggregate vs vLLM - already ~75-80% at npl<=128