Files
LocalAI/backend/cpp
Ettore Di Giacinto 4bc2b4a9b2 feat(paged): add patch 0013 decoupled per-step prefill-token budget
Mirror of the dev-tree paged scheduler patch into the llama.cpp backend's
vendored patch series. Adds LLAMA_PREFILL_BUDGET, a per-step prefill-token
budget for the inherited update_slots() scheduler, decoupled from n_batch
(the analogue of vLLM's --max-num-batched-tokens). It caps how many prompt
tokens a single update_slots() step ingests, splitting a long prefill across
more steps so co-batched decode keeps advancing instead of freezing for the
duration of one fat ~n_batch prefill chunk. Default (env unset or <= 0) =
disabled, so stock behaviour is byte-identical; orthogonal to LLAMA_KV_PAGED.

Measured on GB10 (dense Qwen3-32B-NVFP4, 8 steady decoders + one injected
6000-token prefill, same binary, only the env differs): worst decode freeze
3380 -> 482 ms (7.0x) and decode_stall 3285 -> 387 ms (8.5x) at budget=256,
for a +20% TTFT on the long request; budget=512 gives 4.8x at ~no TTFT cost.
This is a latency/fairness lever, not an aggregate-throughput lever (steady
decode is NVFP4 weight-read-bound on GB10, which the scheduler cannot lift).

Correctness: budget unset or >= n_batch is byte-identical to stock; budget=N
is byte-identical to stock -bN while preserving n_batch for decode width; the
only deviation on long prompts is intrinsic flash-attn chunk-size FP grouping
that pure stock -b exhibits too. Verified applying on the pinned llama.cpp
f3e1828 after patch 0008.

Productisation follow-up: surface as a grpc-server.cpp options knob
(max_prefill_tokens) per CHUNKED_PREFILL_PLAN Phase B.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-23 09:55:32 +00:00
..