mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-27 09:57:14 -04:00
The worktree merge bumped LLAMA_VERSION 8be759e6 -> 9d5d882d. This re-syncs the paged patch-stack (0001-0024) to the new tip: the stack was rebased onto 9d5d882d on the DGX dev tree, rebuilt clean (CUDA sm_121), and re-validated bit-exact before re-exporting the LocalAI .patch files.

Re-exporting each shipped patch from its rebased commit and diffing body-to-body against the committed files identifies exactly 4 that changed and no longer git-apply to 9d5d882d:

- 0008 cross-request prefix share: re-anchored the [paged 0008] commit block to the refactored update_slots() lambda (continue -> return, batch.n_tokens -> batch.size()); identical env-guarded logic.
- 0013 static prefill budget: budget var-block / while-gate / admission-break re-expressed against the refactored loop (add_ok=false idiom).
- 0015 expert-density MoE token-tile auto-select: pure context re-anchor; upstream inserted a test_mul_mat_id case at the hunk anchor in test-backend-ops.cpp. The inserted lines are unchanged. (This one rebased cleanly via 3-way but its committed .patch no longer applies with plain git apply, so it is caught by the per-patch apply-check, not by the rebase conflict count.)
- 0016 dynamic decode-first budget: dynamic budget block + n_decode_in_batch = batch.size() + add_ok=false against the refactored loop.

All four are byte-faithful format-patch exports of the gate-green rebased commits. Applying the full corrected series to a fresh 9d5d882d reproduces the gate-green tree byte-for-byte across every code file.

The other 7 touched patches (0009/0017/0018/0019/0020/0021/0024) are LINENUM-only (hunk bodies byte-identical, only @@ line-numbers shifted) and still apply cleanly, so they are left unchanged. The remaining patches are identical.

Validation on the rebased build (NVFP4 Qwen3.6, GB10 sm_121):

- test-backend-ops CUDA0: GATED_DELTA_NET 36/36, SSM_CONV 45/45, MUL_MAT 1146/1146, MUL_MAT_ID 806/806 all OK.
- greedy md5 (-fa on -n 48 --temp 0 --seed 1): dense q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439 and MoE q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd, both == baseline.
- decode S_TG @npl128: dense 366.41 t/s (ref 373.2, -1.8%), MoE 751.11 t/s (ref 745.7, +0.7%), both within noise.

Details in backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
192 lines
12 KiB
Diff
From 02fa0473a9324b7e12f9b203d221cc4ac80cfd33 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Wed, 24 Jun 2026 10:11:48 +0200
Subject: [PATCH] feat(paged): dynamic decode-first prefill-token budget (patch
 0016, continuous-batch P1)

Supersede patch 0013's STATIC per-step prefill cap with a DYNAMIC,
decode-first token budget: the P1 of the token-granular continuous-batch
scheduler. POLICY change only inside update_slots(): no new slot states, no
batch-formation rewrite, zero libllama changes. llama-server already emits one
unified mixed prefill+decode batch per step (Phase 1 appends every ready decode
token unconditionally; Phase 2 fills prefill into the same batch). 0016 only
changes the COUNT of prefill tokens admitted per step.

The budget block already sits AFTER Phase 1's decode fill, so batch.n_tokens
== D (the live decode load) is known there. Instead of 0013's constant
LLAMA_PREFILL_BUDGET (which ignores D, needs per-workload tuning, and lets one
long prompt monopolise the step), compute a dynamic budget:

  T = clamp(LLAMA_MAX_BATCH_TOKENS (default n_batch), n_ubatch, n_batch)
  prefill_budget_step = max(n_ubatch, T - D)   (leftover after decode,
      auto-shrinks as decode load rises so the step never inflates past T)
  prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) floored at n_ubatch,
      pinned to n_batch when T == n_batch (LLAMA_PREFILL_CAP overrides)

Phase 2's inner prompt-fill loop and outer admission break are bounded by
prefill_budget_step (across slots) and a new per-slot slot_prompt_added
counter; the n_batch hard ceiling stays as the compute bound. Decode is
structurally claimed first and never capped (Phase 1), so the decode-first
guarantee is free.

DEFAULT-OFF BYTE-IDENTICAL: with all knobs unset, behaviour is byte-identical
to stock. The degenerate T == n_batch case is byte-identical to stock/0013 (the
determinism oracle). The legacy LLAMA_PREFILL_BUDGET path is preserved exactly
(honoured only when LLAMA_MAX_BATCH_TOKENS is unset), so 0013 is cleanly
subsumed. Orthogonal to LLAMA_KV_PAGED: pure scheduler policy, identical
decisions paged on or off.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 tools/server/server-context.cpp | 107 +++++++++++++++++++++++++-------
 1 file changed, 85 insertions(+), 22 deletions(-)

diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index afcdebe..b8b8f00 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -3043,24 +3043,78 @@ private:
         int32_t n_batch = llama_n_batch(ctx_tgt);
         int32_t n_ubatch = llama_n_ubatch(ctx_tgt);

-        // PAGED serving lever (patch 0013): decoupled per-step prefill-token budget.
-        // Analogue of vLLM's --max-num-batched-tokens. Stock llama-server caps the prompt
-        // tokens ingested per update_slots() step at n_batch only; with cont_batching the
-        // sampled decode tokens of every generating slot are appended FIRST, then prompt
-        // tokens fill the batch up to n_batch. A long prompt therefore grabs an ~n_batch
-        // chunk in a SINGLE compute-heavy step, spiking the inter-token latency of every
-        // co-batched decoder (head-of-line jitter). LLAMA_PREFILL_BUDGET caps the prompt
-        // tokens added per step independently of n_batch, splitting a long prefill across
-        // more steps so in-flight decode keeps advancing smoothly. Default (env unset or
-        // <=0) = disabled => stock behavior is byte-identical. Orthogonal to LLAMA_KV_PAGED
-        // (this is a pure scheduler knob; works with paged off).
-        int32_t n_prefill_budget = 0; // 0 = disabled (stock n_batch-only chunking)
+        // PAGED serving lever (patch 0016, supersedes 0013): dynamic decode-first
+        // per-step prefill-token budget (continuous-batch scheduler P1). llama-server
+        // already builds ONE mixed batch per update_slots() step: Phase 1 (just above)
+        // appended every generating slot's sampled token UNCONDITIONALLY, so at this point
+        // batch.n_tokens == D is the live decode load; Phase 2 (below) fills the remaining
+        // batch capacity with prompt tokens. Patch 0013 capped Phase 2 with a STATIC
+        // constant (LLAMA_PREFILL_BUDGET) that ignores D, needs per-workload tuning, and
+        // lets one long prompt monopolise the step.
+        //
+        // This computes a DYNAMIC budget instead, the vLLM v1 token-budget analogue:
+        // a single total per-step token budget T, decode claims its D tokens first
+        // (already in the batch), and prefill gets the leftover T - D distributed across
+        // waiting prompts with a per-slot chunk cap. As decode load D rises the prefill
+        // leftover auto-shrinks, so the step never inflates past T at any concurrency:
+        // the budget self-tunes across the npl range and across dense vs MoE without a
+        // hand-picked constant (the 161/333 tok/s GB10 decode ceiling is held tuning-free
+        // instead of via 0013's hand-tuned 256). Decode is structurally claimed first and
+        // never capped (Phase 1), so the decode-first guarantee is free here.
+        //
+        //   LLAMA_MAX_BATCH_TOKENS (T)  total per-step token budget (decode + prefill),
+        //                               default n_batch, clamped to [n_ubatch, n_batch] so
+        //                               the compute loop stays a single llama_decode and
+        //                               prefill keeps an n_ubatch floor of progress.
+        //   LLAMA_PREFILL_CAP           per-slot max prompt tokens per step (the
+        //                               long_prefill_token_threshold analogue), default
+        //                               min(T, ceil(0.04*n_ctx)) floored at n_ubatch, so
+        //                               one long prompt cannot eat the whole leftover.
+        //   LLAMA_PREFILL_BUDGET        legacy static cap (patch 0013); honoured ONLY when
+        //                               LLAMA_MAX_BATCH_TOKENS is unset, for back-compat.
+        //
+        // DEFAULT-OFF BYTE-IDENTICAL: with all three knobs unset, and in the degenerate
+        // T == n_batch case, behaviour is byte-identical to stock. At T == n_batch the
+        // dynamic leftover max(n_ubatch, n_batch - D) and the n_batch per-slot cap both
+        // reach the existing `batch.n_tokens < n_batch` ceiling at the SAME point, so no
+        // new bound fires (the determinism oracle). Orthogonal to LLAMA_KV_PAGED: pure
+        // scheduler policy, identical decisions with paged on or off.
+        const int32_t n_decode_in_batch = batch.size(); // D: Phase 1 appended D decode tokens above
+        int32_t prefill_budget_step = 0; // 0 = disabled (stock n_batch-only chunking)
+        int32_t prefill_cap_per_slot = 0; // 0 = disabled (no per-slot prompt-chunk cap)
         {
-            const char * env_pb = getenv("LLAMA_PREFILL_BUDGET");
-            if (env_pb) {
+            int32_t mbt = 0;
+            if (const char * env_mbt = getenv("LLAMA_MAX_BATCH_TOKENS")) {
+                mbt = atoi(env_mbt);
+            }
+            if (mbt > 0) {
+                // dynamic decode-first budget (P1): T clamped to [n_ubatch, n_batch]
+                int32_t T = std::min(n_batch, mbt);
+                T = std::max(T, n_ubatch);
+                // leftover after decode, floored at n_ubatch so prefill never fully starves
+                prefill_budget_step = std::max(n_ubatch, T - n_decode_in_batch);
+                // per-slot prompt-chunk cap (long_prefill_token_threshold analogue)
+                int32_t cap = 0;
+                if (const char * env_cap = getenv("LLAMA_PREFILL_CAP")) {
+                    cap = atoi(env_cap);
+                }
+                if (cap <= 0) {
+                    const int32_t pct4 = (n_ctx + 24) / 25; // ceil(0.04 * n_ctx)
+                    cap = std::min(T, std::max(n_ubatch, pct4));
+                }
+                cap = std::min(n_batch, std::max(n_ubatch, cap));
+                // at T == n_batch the leftover and cap both reach the n_batch ceiling
+                // together; pin the cap to n_batch so this case stays byte-identical
+                if (T >= n_batch) {
+                    cap = n_batch;
+                }
+                prefill_cap_per_slot = cap;
+            } else if (const char * env_pb = getenv("LLAMA_PREFILL_BUDGET")) {
+                // legacy static budget (patch 0013), kept for back-compat when the
+                // dynamic knob is unset: a constant per-step prefill cap, no per-slot cap
                 const int v = atoi(env_pb);
                 if (v > 0) {
-                    n_prefill_budget = std::min(n_batch, std::max(1, v));
+                    prefill_budget_step = std::min(n_batch, std::max(1, v));
                 }
             }
         }
@@ -3509,11 +3563,18 @@ private:
                 const auto & spans = slot.task->params.message_spans;
                 const auto last_user_pos = spans.last_user_message_pos();

+                // (patch 0016) per-slot prompt tokens added this step, for the per-slot
+                // chunk cap (resets each slot); n_batch stays the hard compute ceiling
+                int32_t slot_prompt_added = 0;
+
                 // add prompt tokens for processing in the current batch
-                // (patch 0013) also stop once the per-step prefill budget is spent, so a long
-                // prompt is split across more steps and leaves batch room for co-batched decode
+                // (patch 0016) also stop once (a) the dynamic per-step prefill budget
+                // (the T - D leftover) is spent across all slots, or (b) this slot's
+                // per-slot chunk cap is hit, so a long prompt is split across more steps
+                // and leaves batch room for co-batched decode of the other slots
                 while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch &&
-                       (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
+                       (prefill_budget_step == 0 || n_prompt_budgeted < prefill_budget_step) &&
+                       (prefill_cap_per_slot == 0 || slot_prompt_added < prefill_cap_per_slot)) {
                     // get next token to process
                     llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
                     if (cur_tok == LLAMA_TOKEN_NULL) {
@@ -3538,7 +3599,8 @@ private:
                     slot.prompt.tokens.push_back(cur_tok);

                     slot.n_prompt_tokens_processed++;
-                    n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget
+                    n_prompt_budgeted++; // (patch 0016) toward the dynamic per-step prefill budget
+                    slot_prompt_added++; // (patch 0016) toward this slot's per-step chunk cap

                     // stop the prompt batch exactly before a user message
                     if (spans.is_user_start(slot.prompt.n_tokens())) {
@@ -3624,9 +3686,10 @@ private:
                 if (!slot_batched) {
                     slot_batched = &slot;
                 }
-                // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
-                // leaving the remaining batch capacity for co-batched decode of other slots
-                if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
+                // (patch 0016) stop admitting prompts once the dynamic per-step prefill
+                // budget (the T - D leftover) is spent, leaving the remaining batch
+                // capacity for co-batched decode of the other slots
+                if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) {
                     add_ok = false;
                 }
             });
--
2.43.0
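
For readers who want to check the budget arithmetic outside the server, the following is a minimal standalone sketch of the policy the patch describes; it is not code from the patch. The names `PrefillLimits`, `compute_prefill_limits` and `simulate_phase2` are hypothetical, the three environment knobs are passed in as already-parsed integers (`<= 0` standing for "unset"), and Phase 2 is reduced to a one-token-per-iteration loop over slots that ignores every other stop condition in `update_slots()` (user-message spans, null tokens, and so on).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical holder for the two limits the patch derives (not a llama.cpp type).
struct PrefillLimits {
    int32_t budget_step  = 0; // prompt tokens allowed across all slots per step; 0 = disabled
    int32_t cap_per_slot = 0; // prompt tokens allowed per slot per step; 0 = disabled
};

// Follows the budget block of patch 0016. Assumption: the env knobs arrive as
// parsed integers (<= 0 means unset) so the sketch needs no getenv().
// n_decode is D, the number of decode tokens Phase 1 already put in the batch.
PrefillLimits compute_prefill_limits(int32_t n_batch, int32_t n_ubatch, int32_t n_ctx,
                                     int32_t n_decode, int32_t max_batch_tokens,
                                     int32_t prefill_cap, int32_t legacy_budget) {
    PrefillLimits lim;
    if (max_batch_tokens > 0) {
        // T clamped to [n_ubatch, n_batch]
        const int32_t T = std::max(std::min(n_batch, max_batch_tokens), n_ubatch);
        // leftover after decode, floored at n_ubatch
        lim.budget_step = std::max(n_ubatch, T - n_decode);
        int32_t cap = prefill_cap;
        if (cap <= 0) {
            const int32_t pct4 = (n_ctx + 24) / 25; // ceil(0.04 * n_ctx)
            cap = std::min(T, std::max(n_ubatch, pct4));
        }
        cap = std::min(n_batch, std::max(n_ubatch, cap));
        if (T >= n_batch) {
            cap = n_batch; // degenerate case: only the n_batch ceiling can fire
        }
        lim.cap_per_slot = cap;
    } else if (legacy_budget > 0) {
        // legacy static cap (patch 0013): per-step budget only, no per-slot cap
        lim.budget_step = std::min(n_batch, std::max(1, legacy_budget));
    }
    return lim;
}

// Simplified Phase 2: pending[i] is the number of prompt tokens slot i still
// has to ingest. Returns how many prompt tokens each slot gets this step.
std::vector<int32_t> simulate_phase2(const std::vector<int32_t> & pending, int32_t n_batch,
                                     int32_t n_decode, const PrefillLimits & lim) {
    std::vector<int32_t> admitted(pending.size(), 0);
    int32_t batch_size        = n_decode; // decode tokens are already in the batch
    int32_t n_prompt_budgeted = 0;
    for (std::size_t i = 0; i < pending.size(); ++i) {
        int32_t slot_prompt_added = 0;
        while (slot_prompt_added < pending[i] && batch_size < n_batch &&
               (lim.budget_step == 0 || n_prompt_budgeted < lim.budget_step) &&
               (lim.cap_per_slot == 0 || slot_prompt_added < lim.cap_per_slot)) {
            ++batch_size;
            ++n_prompt_budgeted;
            ++slot_prompt_added;
        }
        admitted[i] = slot_prompt_added;
        if (lim.budget_step > 0 && n_prompt_budgeted >= lim.budget_step) {
            break; // stands in for the add_ok = false admission stop
        }
    }
    return admitted;
}
```

With `n_batch = 2048`, `n_ubatch = 512`, `n_ctx = 8192`, `D = 128` and a total budget of 1024, this sketch yields a step budget of 896 and a per-slot cap of 512, so two long prompts receive 512 and 384 tokens and the step lands exactly on T. One property that falls out of the `n_ubatch` floor: when D exceeds `T - n_ubatch` (for example D = 900 with T = 1024), the budget stays at 512, so D plus prefill can exceed T in this sketch; the `n_batch` ceiling still bounds the batch.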