mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-24 16:49:06 -04:00
The prior all-at-once BURST H2H is adversarial to any prefill budget (TTFT is prefill-rate-bound, a cap only slows the drain) and showed 0016 ~= 0013. Run a STAGGERED-arrival benchmark on the GB10 DGX (patch 0016 built @253cbae): a steady-rate client that keeps a mix of in-flight decoders + newly-arriving prefills, capturing per-request TTFT and the full inter-token-latency series. Append the metrics (in-flight decode protection + new-request TTFT, per arm) and an honest verdict to P1_DYNAMIC_BUDGET_RESULTS.md. On staggered traffic stock's in-flight decoders freeze multi-second on every prefill admission while both budget arms keep ITL flat; 0016 (mbt512) sits at a strictly better point on the protection/TTFT frontier than 0013-256 (equal spike-free protection, materially lower TTFT/throughput/wall) and adds a decode-adaptive single-T knob. It does not strictly dominate stock (Pareto tradeoff: smoothness vs raw TTFT). Verdict: 0016 earns its keep over 0013 on staggered traffic; recommend LLAMA_MAX_BATCH_TOKENS=512. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>