Files
LocalAI/backend/cpp/llama-cpp/patches/paged
Ettore Di Giacinto c7075fb796 docs(paged): MoE 35B-A3B NVFP4 fair re-run with max_prefill_tokens budget
Budget 256/512 sweep on the A3B MoE under patch 0013. Mirror image of the
dense case: stock MoE was never prefill-starved (3B active, TTFT 84.8s @npl128),
so the budget is a decode-throughput lever paid for in TTFT, not a TTFT fix.
Budget 256 lifts decode_agg +14% (292->333.5 tok/s) and restores monotonic
decode scaling (kills the stock +7.4% plateau, now +20% into npl128), moving
llama 36.0%->41.1% of vLLM decode. Gap not closed: vLLM still ~2.4x decode and
~12x lower TTFT @npl128.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-23 21:38:08 +00:00
..