LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

Files

Ettore Di Giacinto 0337505dc8 docs(paged): measure paged KV at high concurrency (LLAMA_MAX_SEQ=2048) - no single-GB10 win

Closes the open question from PR22569_EVAL: that eval was blocked by the 256-seq
compile cap and used a compute-bound 32B. Recompiled LLAMA_MAX_SEQ=2048 and swept a
bandwidth-bound model (Qwen3-1.7B) to npl=2048, both KV layouts.

Result: aggregate decode plateaus at the hardware ceiling for BOTH layouts - 1.7B
flattens ~3200-3700 t/s by npl=512 (contiguous and paged alike), 32B-dense ~540 by
npl=128. Pushing concurrency past the plateau collapses per-seq tps (23->1.9) and
explodes TTFT (0.6s->64s) with no aggregate gain. Paged KV is a memory-capacity /
anti-fragmentation / prefix-sharing feature, not a single-node throughput lever; the
24k aggregate is a fleet-level (multi-GPU) result, unreachable on one GB10 regardless
of KV layout.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-21 22:47:20 +00:00

paged

docs(paged): measure paged KV at high concurrency (LLAMA_MAX_SEQ=2048) - no single-GB10 win

2026-06-21 22:47:20 +00:00

patches

bench(dense): root-cause the W4A4 NVFP4 hang; W4A16 vs Q4 is the headline

2026-06-20 06:59:50 +00:00

CMakeLists.txt

fix(turboquant): resolve common.h by detecting llama-common vs common target (#9413 )

2026-04-18 20:30:28 +02:00

grpc-server.cpp

feat: generic chat_template_kwargs (model config + per-request metadata) (#10359 )