Mirror of https://github.com/mudler/LocalAI.git (synced 2026-06-23 16:19:07 -04:00)
Full sweep, Qwen3-32B: contiguous decode reaches 537/541 t/s at npl=128/256 (plateau); paged (#22569) reaches 477/471 t/s, which is SLOWER at matched concurrency.

Both FAIL at npl=512/1024 with n_seq_max<=256: paged does NOT bypass the LLAMA_MAX_SEQ=256 compile-time cap, which was its whole purpose.

GB10's limits are the 256-seq cap and the ~540 t/s decode plateau (flat by npl=128), NOT KV capacity or fragmentation (122 GB unified memory). Paged KV solves a problem GB10 doesn't have; it remains valid for memory-constrained datacenter GPUs (24-48 GB) but must be validated there, not on GB10.

Do not adopt #22569; do not build paged KV for GB10. The real GB10 questions are the 256 cap (cheap to address) and the 540 t/s plateau (vs vLLM 667).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
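
As a quick sanity check on the figures above, the relative gaps can be computed directly. This is a minimal sketch, not part of the benchmark tooling: the helper name `pct_slower` is hypothetical, and the numbers are copied from the sweep results in this message.

```python
def pct_slower(slow: float, fast: float) -> float:
    """Return how much slower `slow` is than `fast`, as a percentage of `fast`."""
    return (fast - slow) / fast * 100.0


# Decode throughput in t/s from the sweep, keyed by npl (parallel sequences).
contiguous = {128: 537, 256: 541}
paged = {128: 477, 256: 471}

# Gap between paged and contiguous decode at matched concurrency.
paged_gap = {
    npl: round(pct_slower(paged[npl], contiguous[npl]), 1) for npl in contiguous
}

# Gap between the ~540 t/s plateau and the vLLM figure of 667.
vllm_gap = round(pct_slower(540, 667), 1)

print(paged_gap)
print(vllm_gap)
```

On these numbers, paged decode comes out roughly 11 to 13 percent slower than contiguous, and the plateau sits about 19 percent below the vLLM figure.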