Mirror of https://github.com/mudler/LocalAI.git (synced 2026-06-23 16:19:07 -04:00)
feat(paged): add patch 0013 decoupled per-step prefill-token budget

Mirror of the dev-tree paged scheduler patch into the llama.cpp backend's
vendored patch series.

Adds LLAMA_PREFILL_BUDGET, a per-step prefill-token budget for the inherited
update_slots() scheduler, decoupled from n_batch (the analogue of vLLM's
--max-num-batched-tokens). It caps how many prompt tokens a single
update_slots() step ingests, splitting a long prefill across more steps so
co-batched decode keeps advancing instead of freezing for the duration of one
fat ~n_batch prefill chunk. Default (env unset or <= 0) = disabled, so stock
behaviour is byte-identical; orthogonal to LLAMA_KV_PAGED.

Measured on GB10 (dense Qwen3-32B-NVFP4, 8 steady decoders + one injected
6000-token prefill, same binary, only the env differs): worst decode freeze
3380 -> 482 ms (7.0x) and decode_stall 3285 -> 387 ms (8.5x) at budget=256, for
a +20% TTFT on the long request; budget=512 gives a 4.8x shorter decode_stall
at ~no TTFT cost. This is a latency/fairness lever, not an aggregate-throughput
lever (steady decode is NVFP4 weight-read-bound on GB10, which the scheduler
cannot lift).

Correctness: budget unset or >= n_batch is byte-identical to stock; budget=N is
byte-identical to stock -bN while preserving n_batch for decode width; the only
deviation on long prompts is intrinsic flash-attn chunk-size FP grouping that
pure stock -b exhibits too.

Verified to apply on the pinned llama.cpp f3e1828 after patch 0008.

Productisation follow-up: surface as a grpc-server.cpp options knob
(max_prefill_tokens) per CHUNKED_PREFILL_PLAN Phase B.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
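To make the effect of the budget concrete, the following is a minimal toy model of the scheduling rule described above, not the real update_slots(). It assumes every decoding slot contributes exactly one token per step before any prompt tokens are added, and it ignores n_ubatch, multiple prefilling slots and all timing. The names SimResult and simulate_prefill are hypothetical and exist only for this sketch.

```cpp
#include <algorithm>
#include <cassert>

// Toy model only: counts how many scheduler steps a single long prompt needs.
struct SimResult {
    int steps;             // steps needed to ingest the whole prompt
    int max_prompt_chunk;  // largest number of prompt tokens added in one step
};

// n_prompt   : prompt length in tokens
// n_batch    : total token capacity of one step
// n_decoders : slots that each add one sampled token per step (added first)
// budget     : per-step prefill cap; <= 0 means disabled, mirroring an unset
//              or non-positive LLAMA_PREFILL_BUDGET
SimResult simulate_prefill(int n_prompt, int n_batch, int n_decoders, int budget) {
    // The model needs room for at least one prompt token per step.
    assert(n_decoders < n_batch);

    // Same clamping idea as the patch: never exceed n_batch, 0 = disabled.
    const int cap = budget > 0 ? std::min(n_batch, budget) : 0;

    SimResult r{0, 0};
    int done = 0;
    while (done < n_prompt) {
        int batch_tokens = n_decoders;  // decode tokens go into the batch first
        int prompt_this_step = 0;
        // Stop on the n_batch limit or, if enabled, on the prefill budget.
        while (done < n_prompt && batch_tokens < n_batch &&
               (cap == 0 || prompt_this_step < cap)) {
            ++batch_tokens;
            ++prompt_this_step;
            ++done;
        }
        r.max_prompt_chunk = std::max(r.max_prompt_chunk, prompt_this_step);
        ++r.steps;
    }
    return r;
}
```

For the scenario in the commit message (6000-token prompt, n_batch=2048, 8 decoders) this model needs 3 steps with the budget off, 24 steps at budget=256 and 12 steps at budget=512, and a budget >= n_batch behaves like the disabled case. It does not reproduce the measured step counts or latencies, which depend on factors the model leaves out.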
@@ -0,0 +1,137 @@
From 17d97cb74e3e8c93751afd33f5c183e57056fde9 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 23 Jun 2026 11:52:45 +0200
Subject: [PATCH] feat(paged): decoupled per-step prefill-token budget (patch
 0013)

llama-server already co-batches decode with chunked prefill: update_slots()
appends every generating slot's sampled token first, then fills the rest of the
n_batch budget with prompt tokens, deferring the overflow to the next step. But
the prefill chunk size is hard-wired to n_batch (default 2048): one slot's
~2048-token prefill chunk lands in a single compute-heavy step, and every decode
co-batched into that step sees a multi-second inter-token-latency (ITL) spike.
Lowering n_batch shrinks the chunk but also caps decode-concurrency width and
prefill throughput, because they are coupled.

Add LLAMA_PREFILL_BUDGET: a per-step prefill-token budget decoupled from n_batch
(the analogue of vLLM's --max-num-batched-tokens / long_prefill_token_threshold).
The prompt-fill loop and the outer slot loop now also stop once this many prompt
tokens have been added in the current update_slots() step, so a long prefill is
split across more steps that each still advance in-flight decode. Default (env
unset or <= 0) = disabled, so stock behaviour is byte-identical. Orthogonal to
LLAMA_KV_PAGED: this is a pure scheduler knob and works with paged off.

Measured on GB10 (sm_121), dense Qwen3-32B-NVFP4, paged build, 8 steady decode
streams with one 6000-token prefill injected mid-stream; same binary, only
LLAMA_PREFILL_BUDGET differs:

  metric                         stock(off)   budget=256      budget=512
  worst decode freeze (ms)       3380         482 (7.0x)      778 (4.3x)
  median decode ITL in window    2264         411 (5.5x)      689
  decode_stall (ms)              3285         387 (8.5x)      684 (4.8x)
  decode steps during prefill    38           201 (5.3x)      108
  injected-req TTFT (ms)         8493         10172 (+20%)    8432 (~0%)
  steady-state baseline ITL      94           95              94

This is a LATENCY/fairness lever, not an aggregate-throughput lever: it flattens
the decode ITL spike a long prefill inflicts on co-batched decoders (7.0x smaller
worst freeze, 8.5x shorter decode stall and 5.3x more decode progress during the
prefill at budget=256), in exchange for a modest TTFT rise on the long request
(the classic chunked-prefill trade-off; budget=512 still cuts the decode stall
4.8x at ~no TTFT cost). Steady aggregate decode is unchanged: it is
bandwidth/weight-capped on GB10 (the NVFP4 weight-read floor), which the
scheduler cannot lift.

Correctness (same model, greedy temp 0, fa on):
- budget unset or >= n_batch: byte-identical to stock (the added break never
  fires before the existing n_batch break; the off-path is a no-op by
  construction).
- short prompt (<= budget): byte-identical to stock.
- the knob is exactly equivalent to stock's native -b chunking: budget=512 ==
  stock -b512 and budget=256 == stock -b256, both BYTE-IDENTICAL, while keeping
  n_batch=2048 for decode width.
- on a prompt larger than the budget the chunked greedy output diverges from the
  single n_batch chunk only by intrinsic flash-attn chunk-size FP grouping: PURE
  stock -b256 diverges from stock -b2048 the same way with the patch inactive,
  and the output stays coherent and answers correctly.

Productisation (LocalAI): surface as a model options knob (max_prefill_tokens /
mpt) parsed in grpc-server.cpp, default 0 = disabled, per CHUNKED_PREFILL_PLAN
Phase B; the vendored update_slots() hunk here is that plan's scheduler patch and
stays disjoint from the paged allocation hunks.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 tools/server/server-context.cpp | 35 ++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index 04c6361..5d83b30 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -2723,6 +2723,29 @@ private:
         int32_t n_batch = llama_n_batch(ctx_tgt);
         int32_t n_ubatch = llama_n_ubatch(ctx_tgt);

+        // PAGED serving lever (patch 0013): decoupled per-step prefill-token budget.
+        // Analogue of vLLM's --max-num-batched-tokens. Stock llama-server caps the prompt
+        // tokens ingested per update_slots() step at n_batch only; with cont_batching the
+        // sampled decode tokens of every generating slot are appended FIRST, then prompt
+        // tokens fill the batch up to n_batch. A long prompt therefore grabs an ~n_batch
+        // chunk in a SINGLE compute-heavy step, spiking the inter-token latency of every
+        // co-batched decoder (head-of-line jitter). LLAMA_PREFILL_BUDGET caps the prompt
+        // tokens added per step independently of n_batch, splitting a long prefill across
+        // more steps so in-flight decode keeps advancing smoothly. Default (env unset or
+        // <=0) = disabled => stock behavior is byte-identical. Orthogonal to LLAMA_KV_PAGED
+        // (this is a pure scheduler knob; works with paged off).
+        int32_t n_prefill_budget = 0; // 0 = disabled (stock n_batch-only chunking)
+        {
+            const char * env_pb = getenv("LLAMA_PREFILL_BUDGET");
+            if (env_pb) {
+                const int v = atoi(env_pb);
+                if (v > 0) {
+                    n_prefill_budget = std::min(n_batch, std::max(1, v));
+                }
+            }
+        }
+        int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots)
+
         float alora_scale = -1.0f;
         size_t alora_disabled_id = 0;

@@ -3159,7 +3182,10 @@ private:
                     const bool n_before_user_known = n_before_user > 0;

                     // add prompt tokens for processing in the current batch
-                    while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch) {
+                    // (patch 0013) also stop once the per-step prefill budget is spent, so a long
+                    // prompt is split across more steps and leaves batch room for co-batched decode
+                    while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch &&
+                           (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
                         // get next token to process
                         llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
                         if (cur_tok == LLAMA_TOKEN_NULL) {
@@ -3185,6 +3211,7 @@ private:
                         slot.prompt.tokens.push_back(cur_tok);

                         slot.n_prompt_tokens_processed++;
+                        n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget

                         // stop the prompt batch exactly before the latest user input, so a checkpoint
                         // can be created after the previous messages
@@ -3293,6 +3320,12 @@ private:
                 if (batch.n_tokens >= n_batch) {
                     break;
                 }
+
+                // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
+                // leaving the remaining batch capacity for co-batched decode of other slots
+                if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
+                    break;
+                }
             }
         }

--
2.43.0