LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

Files

Ettore Di Giacinto 92e93dfc34 analysis: paged KV gives ZERO benefit on GB10 (measured) - not the lever

Full sweep, Qwen3-32B: contiguous decode 537/541 t/s at npl=128/256 (plateau);
paged (#22569) 477/471 - SLOWER at matched concurrency. Both FAIL at npl=512/1024
with n_seq_max<=256 - paged does NOT bypass the LLAMA_MAX_SEQ=256 compile cap, its
whole purpose. GB10's limit is the 256-seq cap + the ~540 decode plateau (flat by
npl=128), NOT KV capacity/fragmentation (122 GB unified). Paged KV solves a problem
GB10 doesn't have; it remains valid for memory-constrained datacenter GPUs (24-48GB)
but must be validated there, not GB10. Do not adopt #22569; do not build paged KV
for GB10. Real GB10 questions: the 256 cap (cheap) + the 540 plateau (vs vLLM 667).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-21 13:31:33 +00:00

paged

analysis: paged KV gives ZERO benefit on GB10 (measured) - not the lever

2026-06-21 13:31:33 +00:00

patches

bench(dense): root-cause the W4A4 NVFP4 hang; W4A16 vs Q4 is the headline

2026-06-20 06:59:50 +00:00

CMakeLists.txt

fix(turboquant): resolve common.h by detecting llama-common vs common target (#9413 )

2026-04-18 20:30:28 +02:00

grpc-server.cpp

feat: generic chat_template_kwargs (model config + per-request metadata) (#10359 )