LocalAI/backend/cpp/llama-cpp/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch
Ettore Di Giacinto 24ce7d0823 feat(llama-cpp/paged): dynamic decode-first prefill budget (patch 0016, continuous-batch P1)
Mirror the P1 engine change of CONTINUOUS_BATCH_SCHEDULER_SCOPE.md into the
vendored paged patch series and surface it as a LocalAI model option.

- patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch:
  supersede patch 0013's STATIC per-step prefill cap with a DYNAMIC,
  decode-first token budget in update_slots(). At the budget seam (already
  after Phase 1's decode fill, so batch.n_tokens == D is known) compute
  T = clamp(LLAMA_MAX_BATCH_TOKENS ?: n_batch, n_ubatch, n_batch),
  prefill_budget_step = max(n_ubatch, T - D), and a per-slot prompt-chunk
  cap prefill_cap_per_slot; bound the Phase-2 prompt-fill loop and outer
  admission break by these instead of 0013's constant. Policy-only change,
  no new slot states, no batch-formation rewrite, zero libllama changes.
  Decode is structurally claimed first (Phase 1) so the decode-first
  guarantee is free. As decode load D rises the leftover auto-shrinks, so
  the budget is designed to self-tune across npl 8..128 and dense vs MoE and
  to hold the GB10 decode ceiling tuning-free (vs 0013's hand-picked 256);
  the runtime confirmation is pending, see Validation status. The legacy
  LLAMA_PREFILL_BUDGET path is preserved (honoured only when the dynamic
  knob is unset), so 0013 is cleanly subsumed. DEFAULT-OFF byte-identical:
  all-knobs-unset and the degenerate T == n_batch case are bit-identical to
  stock by construction (the n_batch hard ceiling is kept and the dynamic
  bounds reach it at the same point for every D). Orthogonal to
  LLAMA_KV_PAGED.

- grpc-server.cpp: wire the new knob as model options max_batch_tokens / mbt
  (-> LLAMA_MAX_BATCH_TOKENS) and prefill_cap (-> LLAMA_PREFILL_CAP), beside
  the existing max_prefill_tokens / mpt seam; default-off, takes precedence
  over the legacy static budget when set.

- patches/paged/P1_DYNAMIC_BUDGET_RESULTS.md: design, the byte-identical
  determinism analysis (verified by construction), the local patch-apply
  verification, and the gate + A/B bench methodology.
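
The budget arithmetic in the first bullet can be sketched as a standalone
function. This is a minimal sketch, not the patch itself: PrefillBudget and
compute_prefill_budget are hypothetical names, the knobs are passed as
parameters instead of being read with getenv(), and only the dynamic path
(LLAMA_MAX_BATCH_TOKENS set) is modelled. As in the patch, 0 means "bound
disabled".

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical container for the two Phase-2 bounds; 0 means "bound disabled".
struct PrefillBudget {
    int32_t step     = 0; // prefill tokens admitted per step, across all slots
    int32_t per_slot = 0; // prompt tokens a single slot may add per step
};

// Sketch of the dynamic path only.
//   mbt         : value of LLAMA_MAX_BATCH_TOKENS (<= 0 means unset)
//   prefill_cap : value of LLAMA_PREFILL_CAP (<= 0 selects the default cap)
//   n_decode    : D, the decode tokens Phase 1 already placed in the batch
PrefillBudget compute_prefill_budget(int32_t mbt, int32_t prefill_cap,
                                     int32_t n_batch, int32_t n_ubatch,
                                     int32_t n_ctx, int32_t n_decode) {
    assert(n_batch > 0 && n_ubatch > 0 && n_decode >= 0);

    PrefillBudget b;
    if (mbt <= 0) {
        return b; // dynamic knob unset: both bounds stay disabled
    }

    // T clamped to [n_ubatch, n_batch] (min first, then max, as in the patch)
    const int32_t T = std::max(std::min(n_batch, mbt), n_ubatch);

    // leftover after decode, floored at n_ubatch so prefill keeps progressing
    b.step = std::max(n_ubatch, T - n_decode);

    int32_t cap = prefill_cap;
    if (cap <= 0) {
        const int32_t pct4 = (n_ctx + 24) / 25; // ceil(0.04 * n_ctx)
        cap = std::min(T, std::max(n_ubatch, pct4));
    }
    cap = std::min(n_batch, std::max(n_ubatch, cap));

    // degenerate T == n_batch case: pin the cap to n_batch
    if (T >= n_batch) {
        cap = n_batch;
    }
    b.per_slot = cap;
    return b;
}
```

With the illustrative values n_batch = 2048, n_ubatch = 512, n_ctx = 8192 and
T = 1024, a decode load of D = 128 leaves a step budget of 896, while D = 700
hits the n_ubatch floor and leaves 512. In that second case the step can reach
D + n_ubatch = 1212 tokens, which is above T but still below n_batch.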

Validation status: the patch applies cleanly on top of LLAMA_VERSION
(f3e1828) + paged 0001-0015, and the off-path / T==n_batch determinism is
proven by construction. The GB10 sm_121 build, the four runtime gates, and
the dense+MoE A/B sweep are PENDING a DGX run (the dev box was unreachable
this session) and are documented as such in P1_DYNAMIC_BUDGET_RESULTS.md; do
not sell the quantitative TTFT payoff until that re-run lands.
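
The by-construction argument for the T == n_batch case can also be checked on
a toy model. The sketch below is an assumption-laden simplification, not the
server code: fill_step and degenerate_case_matches_stock are hypothetical
helpers, a slot is reduced to its count of pending prompt tokens, and every
other break condition in update_slots() is ignored. It compares the stock
fill (both bounds disabled) with the dynamic fill at T == n_batch for every
decode load D.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of one step's Phase-2 prompt fill. Returns the number of prompt
// tokens each slot adds. budget_step and cap_per_slot follow the patch
// convention: 0 disables the bound.
std::vector<int32_t> fill_step(int32_t n_batch, int32_t n_decode,
                               const std::vector<int32_t> & pending,
                               int32_t budget_step, int32_t cap_per_slot) {
    assert(n_decode >= 0 && n_decode <= n_batch);

    std::vector<int32_t> added(pending.size(), 0);
    int32_t n_tokens          = n_decode; // batch.n_tokens after Phase 1
    int32_t n_prompt_budgeted = 0;        // prefill tokens added this step

    for (std::size_t i = 0; i < pending.size(); ++i) {
        while (added[i] < pending[i] && n_tokens < n_batch &&
               (budget_step == 0 || n_prompt_budgeted < budget_step) &&
               (cap_per_slot == 0 || added[i] < cap_per_slot)) {
            ++added[i];
            ++n_tokens;
            ++n_prompt_budgeted;
        }
        // outer admission break once the per-step budget is spent
        if (budget_step > 0 && n_prompt_budgeted >= budget_step) {
            break;
        }
    }
    return added;
}

// True if, for every D in [0, n_batch], the dynamic bounds at T == n_batch
// admit exactly the same tokens as the stock n_batch-only fill.
bool degenerate_case_matches_stock(int32_t n_batch, int32_t n_ubatch,
                                   const std::vector<int32_t> & pending) {
    for (int32_t d = 0; d <= n_batch; ++d) {
        const int32_t budget = std::max(n_ubatch, n_batch - d);
        if (fill_step(n_batch, d, pending, 0, 0) !=
            fill_step(n_batch, d, pending, budget, n_batch)) {
            return false;
        }
    }
    return true;
}
```

For contrast, a smaller budget does change the outcome in this model: with
n_batch = 64, D = 10 and pending prompts of 100 and 5 tokens, a step budget of
22 and a per-slot cap of 16 admit 16 and 5 tokens, where the stock fill hands
all 54 free positions to the first prompt. This checks the scheduling
arithmetic only and does not replace the pending runtime gates.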

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-24 07:48:20 +00:00

From 0a2677c6e6c608f9c0ec657faa0ff04a03370aa6 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Wed, 24 Jun 2026 07:44:25 +0000
Subject: [PATCH] feat(paged): dynamic decode-first prefill-token budget (patch
 0016, continuous-batch P1)

Supersede patch 0013's STATIC per-step prefill cap with a DYNAMIC,
decode-first token budget: the P1 of the token-granular continuous-batch
scheduler scoped in CONTINUOUS_BATCH_SCHEDULER_SCOPE.md. This is a POLICY
change only inside update_slots(): no new slot states, no batch-formation
rewrite, zero libllama changes. llama-server already emits one unified
mixed prefill+decode batch per step (Phase 1 appends every ready decode
token unconditionally; Phase 2 fills prefill into the same batch); 0013
already ships that mixed ubatch. 0016 only changes the COUNT of prefill
tokens admitted per step.

The budget block already sits AFTER Phase 1's decode fill, so batch.n_tokens
== D (the live decode load) is known there. Instead of 0013's constant
LLAMA_PREFILL_BUDGET (which ignores D, needs per-workload tuning, and lets
one long prompt monopolise the step), compute a dynamic budget:

  T = min(LLAMA_MAX_BATCH_TOKENS (default n_batch), n_batch), floored at
      n_ubatch (the vLLM max_num_batched_tokens analogue / ITL trade knob)
  prefill_budget_step = max(n_ubatch, T - D) (leftover after decode,
      auto-shrinks as decode load rises, so the step stays within T while
      T - D >= n_ubatch; past that the n_ubatch floor applies)
  prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) floored at n_ubatch
      (the long_prefill_token_threshold analogue: one long prompt cannot
      eat the whole leftover; LLAMA_PREFILL_CAP overrides)

Phase 2's inner prompt-fill loop and outer admission break are bounded by
prefill_budget_step (across slots) and a new per-slot slot_prompt_added
counter (per-slot cap), instead of the static 0013 cap; the n_batch hard
ceiling stays as the compute bound. Decode is structurally claimed first
and never capped (Phase 1), so the decode-first guarantee is free.

Why it supersedes 0013: 0013 needs a hand-picked constant (256 for dense)
that is net-negative at low npl and costs MoE TTFT; the T - D budget is
self-tuning across npl 8..128 and across dense vs MoE, holding the GB10
decode ceiling (~161 dense / ~333 MoE tok/s @npl128) WITHOUT per-workload
tuning while collapsing burst TTFT. Steady-state decode throughput is NOT
lifted (that is the decode-kernel ceiling, scoped as P3); the P1 win is
TTFT + tuning-free robustness + clean supersession of 0013.

DEFAULT-OFF BYTE-IDENTICAL: with all knobs unset, behaviour is byte-identical
to stock. The degenerate T == n_batch case is byte-identical to stock/0013
(the determinism oracle): the leftover max(n_ubatch, n_batch - D) and the
n_batch per-slot cap both reach the existing `batch.n_tokens < n_batch`
ceiling at the same point, so no new bound fires. The legacy
LLAMA_PREFILL_BUDGET path is preserved exactly (honoured only when
LLAMA_MAX_BATCH_TOKENS is unset), so 0013 is cleanly subsumed. Orthogonal
to LLAMA_KV_PAGED: pure scheduler policy, identical decisions paged on/off.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
tools/server/server-context.cpp | 107 +++++++++++++++++++++++++-------
1 file changed, 85 insertions(+), 22 deletions(-)
diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index 5d83b30..f7a114c 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -2723,24 +2723,78 @@ private:
int32_t n_batch = llama_n_batch(ctx_tgt);
int32_t n_ubatch = llama_n_ubatch(ctx_tgt);
- // PAGED serving lever (patch 0013): decoupled per-step prefill-token budget.
- // Analogue of vLLM's --max-num-batched-tokens. Stock llama-server caps the prompt
- // tokens ingested per update_slots() step at n_batch only; with cont_batching the
- // sampled decode tokens of every generating slot are appended FIRST, then prompt
- // tokens fill the batch up to n_batch. A long prompt therefore grabs an ~n_batch
- // chunk in a SINGLE compute-heavy step, spiking the inter-token latency of every
- // co-batched decoder (head-of-line jitter). LLAMA_PREFILL_BUDGET caps the prompt
- // tokens added per step independently of n_batch, splitting a long prefill across
- // more steps so in-flight decode keeps advancing smoothly. Default (env unset or
- // <=0) = disabled => stock behavior is byte-identical. Orthogonal to LLAMA_KV_PAGED
- // (this is a pure scheduler knob; works with paged off).
- int32_t n_prefill_budget = 0; // 0 = disabled (stock n_batch-only chunking)
+ // PAGED serving lever (patch 0016, supersedes 0013): dynamic decode-first
+ // per-step prefill-token budget (continuous-batch scheduler P1). llama-server
+ // already builds ONE mixed batch per update_slots() step: Phase 1 (just above)
+ // appended every generating slot's sampled token UNCONDITIONALLY, so at this point
+ // batch.n_tokens == D is the live decode load; Phase 2 (below) fills the remaining
+ // batch capacity with prompt tokens. Patch 0013 capped Phase 2 with a STATIC
+ // constant (LLAMA_PREFILL_BUDGET) that ignores D, needs per-workload tuning, and
+ // lets one long prompt monopolise the step.
+ //
+ // This computes a DYNAMIC budget instead, the vLLM v1 token-budget analogue:
+ // a single total per-step token budget T, decode claims its D tokens first
+ // (already in the batch), and prefill gets the leftover T - D distributed across
+ // waiting prompts with a per-slot chunk cap. As decode load D rises the prefill
+ // leftover auto-shrinks, so the step never inflates past T at any concurrency:
+ // the budget self-tunes across the npl range and across dense vs MoE without a
+ // hand-picked constant (the 161/333 tok/s GB10 decode ceiling is held tuning-free
+ // instead of via 0013's hand-tuned 256). Decode is structurally claimed first and
+ // never capped (Phase 1), so the decode-first guarantee is free here.
+ //
+ // LLAMA_MAX_BATCH_TOKENS (T) total per-step token budget (decode + prefill),
+ // default n_batch, clamped to [n_ubatch, n_batch] so
+ // the compute loop stays a single llama_decode and
+ // prefill keeps an n_ubatch floor of progress.
+ // LLAMA_PREFILL_CAP per-slot max prompt tokens per step (the
+ // long_prefill_token_threshold analogue), default
+ // min(T, ceil(0.04*n_ctx)) floored at n_ubatch, so
+ // one long prompt cannot eat the whole leftover.
+ // LLAMA_PREFILL_BUDGET legacy static cap (patch 0013); honoured ONLY when
+ // LLAMA_MAX_BATCH_TOKENS is unset, for back-compat.
+ //
+ // DEFAULT-OFF BYTE-IDENTICAL: with all three knobs unset, and in the degenerate
+ // T == n_batch case, behaviour is byte-identical to stock. At T == n_batch the
+ // dynamic leftover max(n_ubatch, n_batch - D) and the n_batch per-slot cap both
+ // reach the existing `batch.n_tokens < n_batch` ceiling at the SAME point, so no
+ // new bound fires (the determinism oracle). Orthogonal to LLAMA_KV_PAGED: pure
+ // scheduler policy, identical decisions with paged on or off.
+ const int32_t n_decode_in_batch = batch.n_tokens; // D: Phase 1 appended D decode tokens above
+ int32_t prefill_budget_step = 0; // 0 = disabled (stock n_batch-only chunking)
+ int32_t prefill_cap_per_slot = 0; // 0 = disabled (no per-slot prompt-chunk cap)
{
- const char * env_pb = getenv("LLAMA_PREFILL_BUDGET");
- if (env_pb) {
+ int32_t mbt = 0;
+ if (const char * env_mbt = getenv("LLAMA_MAX_BATCH_TOKENS")) {
+ mbt = atoi(env_mbt);
+ }
+ if (mbt > 0) {
+ // dynamic decode-first budget (P1): T clamped to [n_ubatch, n_batch]
+ int32_t T = std::min(n_batch, mbt);
+ T = std::max(T, n_ubatch);
+ // leftover after decode, floored at n_ubatch so prefill never fully starves
+ prefill_budget_step = std::max(n_ubatch, T - n_decode_in_batch);
+ // per-slot prompt-chunk cap (long_prefill_token_threshold analogue)
+ int32_t cap = 0;
+ if (const char * env_cap = getenv("LLAMA_PREFILL_CAP")) {
+ cap = atoi(env_cap);
+ }
+ if (cap <= 0) {
+ const int32_t pct4 = (n_ctx + 24) / 25; // ceil(0.04 * n_ctx)
+ cap = std::min(T, std::max(n_ubatch, pct4));
+ }
+ cap = std::min(n_batch, std::max(n_ubatch, cap));
+ // at T == n_batch the leftover and cap both reach the n_batch ceiling
+ // together; pin the cap to n_batch so this case stays byte-identical
+ if (T >= n_batch) {
+ cap = n_batch;
+ }
+ prefill_cap_per_slot = cap;
+ } else if (const char * env_pb = getenv("LLAMA_PREFILL_BUDGET")) {
+ // legacy static budget (patch 0013), kept for back-compat when the
+ // dynamic knob is unset: a constant per-step prefill cap, no per-slot cap
const int v = atoi(env_pb);
if (v > 0) {
- n_prefill_budget = std::min(n_batch, std::max(1, v));
+ prefill_budget_step = std::min(n_batch, std::max(1, v));
}
}
}
@@ -3181,11 +3235,18 @@ private:
const int32_t n_before_user = slot.task->params.n_before_user;
const bool n_before_user_known = n_before_user > 0;
+ // (patch 0016) per-slot prompt tokens added this step, for the per-slot
+ // chunk cap (resets each slot); n_batch stays the hard compute ceiling
+ int32_t slot_prompt_added = 0;
+
// add prompt tokens for processing in the current batch
- // (patch 0013) also stop once the per-step prefill budget is spent, so a long
- // prompt is split across more steps and leaves batch room for co-batched decode
+ // (patch 0016) also stop once (a) the dynamic per-step prefill budget
+ // (the T - D leftover) is spent across all slots, or (b) this slot's
+ // per-slot chunk cap is hit, so a long prompt is split across more steps
+ // and leaves batch room for co-batched decode of the other slots
while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch &&
- (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
+ (prefill_budget_step == 0 || n_prompt_budgeted < prefill_budget_step) &&
+ (prefill_cap_per_slot == 0 || slot_prompt_added < prefill_cap_per_slot)) {
// get next token to process
llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
if (cur_tok == LLAMA_TOKEN_NULL) {
@@ -3211,7 +3272,8 @@ private:
slot.prompt.tokens.push_back(cur_tok);
slot.n_prompt_tokens_processed++;
- n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget
+ n_prompt_budgeted++; // (patch 0016) toward the dynamic per-step prefill budget
+ slot_prompt_added++; // (patch 0016) toward this slot's per-step chunk cap
// stop the prompt batch exactly before the latest user input, so a checkpoint
// can be created after the previous messages
@@ -3321,9 +3383,10 @@ private:
break;
}
- // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
- // leaving the remaining batch capacity for co-batched decode of other slots
- if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
+ // (patch 0016) stop admitting prompts once the dynamic per-step prefill
+ // budget (the T - D leftover) is spent, leaving the remaining batch
+ // capacity for co-batched decode of the other slots
+ if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) {
break;
}
}
--
2.43.0