LocalAI

mirror/LocalAI

Fork 0

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 08:08:52 -04:00

Commit Graph

Author	SHA1	Message	Date
Ettore Di Giacinto	d6c91b7d62	analysis: finalize PR #22569 paged-KV eval (full detail + compute-bound note) Agent-finalized eval: builds (1-line Qwen3 reshape fix), but on GB10+32B paged is ~12% slower than contiguous and both cap at LLAMA_MAX_SEQ=256 (not OOM; 16GiB/119). Agent argues 32B is compute-bound + plateaus by npl=128 so raising the cap won't help - but 540 t/s << ~1900 bandwidth ceiling, so the plateau cause is unconfirmed (attention-over-KV or CPU sampling, not matmul saturation). Next: raise the cap + remeasure to settle it. Verdict: do not adopt #22569; paged KV not a GB10 lever. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 14:35:02 +00:00
Ettore Di Giacinto	92e93dfc34	analysis: paged KV gives ZERO benefit on GB10 (measured) - not the lever Full sweep, Qwen3-32B: contiguous decode 537/541 t/s at npl=128/256 (plateau); paged (#22569) 477/471 - SLOWER at matched concurrency. Both FAIL at npl=512/1024 with n_seq_max<=256 - paged does NOT bypass the LLAMA_MAX_SEQ=256 compile cap, its whole purpose. GB10's limit is the 256-seq cap + the ~540 decode plateau (flat by npl=128), NOT KV capacity/fragmentation (122 GB unified). Paged KV solves a problem GB10 doesn't have; it remains valid for memory-constrained datacenter GPUs (24-48GB) but must be validated there, not GB10. Do not adopt #22569; do not build paged KV for GB10. Real GB10 questions: the 256 cap (cheap) + the 540 plateau (vs vLLM 667). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 13:31:33 +00:00

Author

SHA1

Message

Date

Ettore Di Giacinto

d6c91b7d62

analysis: finalize PR #22569 paged-KV eval (full detail + compute-bound note)

Agent-finalized eval: builds (1-line Qwen3 reshape fix), but on GB10+32B paged is
~12% slower than contiguous and both cap at LLAMA_MAX_SEQ=256 (not OOM; 16GiB/119).
Agent argues 32B is compute-bound + plateaus by npl=128 so raising the cap won't
help - but 540 t/s << ~1900 bandwidth ceiling, so the plateau cause is unconfirmed
(attention-over-KV or CPU sampling, not matmul saturation). Next: raise the cap +
remeasure to settle it. Verdict: do not adopt #22569; paged KV not a GB10 lever.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-21 14:35:02 +00:00

Ettore Di Giacinto

92e93dfc34

analysis: paged KV gives ZERO benefit on GB10 (measured) - not the lever

Full sweep, Qwen3-32B: contiguous decode 537/541 t/s at npl=128/256 (plateau);
paged (#22569) 477/471 - SLOWER at matched concurrency. Both FAIL at npl=512/1024
with n_seq_max<=256 - paged does NOT bypass the LLAMA_MAX_SEQ=256 compile cap, its
whole purpose. GB10's limit is the 256-seq cap + the ~540 decode plateau (flat by
npl=128), NOT KV capacity/fragmentation (122 GB unified). Paged KV solves a problem
GB10 doesn't have; it remains valid for memory-constrained datacenter GPUs (24-48GB)
but must be validated there, not GB10. Do not adopt #22569; do not build paged KV
for GB10. Real GB10 questions: the 256 cap (cheap) + the 540 plateau (vs vLLM 667).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-21 13:31:33 +00:00

2 Commits