Mirror of https://github.com/mudler/LocalAI.git (synced 2026-06-24 08:38:51 -04:00)
Budget 256/512 sweep on the A3B MoE under patch 0013.

This is the mirror image of the dense case. The stock MoE was never prefill-starved (3B active, TTFT 84.8s @npl128), so the budget is a decode-throughput lever paid for in TTFT, not a TTFT fix.

Budget 256 lifts decode_agg +14% (292->333.5 tok/s) and restores monotonic decode scaling: the stock +7.4% plateau is gone, and decode now gains +20% into npl128. This moves llama from 36.0% to 41.1% of vLLM decode.

The gap is not closed: vLLM still has ~2.4x the decode throughput and ~12x lower TTFT @npl128.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
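The decode figures quoted above can be cross-checked with a few lines of arithmetic. This is a minimal sketch that uses only the rounded numbers from this message. The helper names `pct_change` and `implied_reference` are hypothetical, and the vLLM throughput it prints is back-derived from the quoted percentages, not a measured value.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def implied_reference(value: float, share_pct: float) -> float:
    """Reference throughput implied by `value` being `share_pct` percent of it."""
    return value / (share_pct / 100.0)


# Figures quoted in the commit message (aggregate decode tok/s).
stock_decode = 292.0
budget256_decode = 333.5

# Uplift from stock to budget 256; the message rounds this to +14%.
uplift = pct_change(stock_decode, budget256_decode)

# Derived, not measured: the vLLM decode rate implied by each quoted share.
# The two estimates should roughly agree if the quoted numbers are consistent.
vllm_from_stock = implied_reference(stock_decode, 36.0)
vllm_from_budget = implied_reference(budget256_decode, 41.1)

# Remaining decode gap: how many times faster vLLM is at a 41.1% share.
remaining_gap = 100.0 / 41.1

print(f"decode uplift: {uplift:.1f}%")
print(f"implied vLLM decode: {vllm_from_stock:.1f} vs {vllm_from_budget:.1f} tok/s")
print(f"remaining decode gap: {remaining_gap:.2f}x")
```

Both implied vLLM rates come out close to each other, which suggests the quoted uplift, the two shares, and the ~2.4x gap are mutually consistent up to rounding.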