mirror of https://github.com/mudler/LocalAI.git
synced 2026-06-23 16:19:07 -04:00
Agent-finalized eval: builds (with a 1-line Qwen3 reshape fix), but on GB10 + 32B the paged KV cache is ~12% slower than contiguous, and both cap at LLAMA_MAX_SEQ=256 (not OOM; 16GiB/119).

The agent argues that 32B is compute-bound and plateaus by npl=128, so raising the cap won't help. But 540 t/s is far below the ~1900 t/s bandwidth ceiling, so the cause of the plateau is unconfirmed: it could be attention over the KV cache or CPU sampling rather than matmul saturation.

Next: raise the cap and remeasure to settle it.

Verdict: do not adopt #22569; paged KV is not a GB10 lever.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>