feat(paged): add patch 0013 decoupled per-step prefill-token budget

Mirror of the dev-tree paged scheduler patch into the llama.cpp backend's vendored patch series.

Adds LLAMA_PREFILL_BUDGET, a per-step prefill-token budget for the inherited update_slots() scheduler, decoupled from n_batch (the analogue of vLLM's --max-num-batched-tokens). It caps how many prompt tokens a single update_slots() step ingests, splitting a long prefill across more steps so co-batched decode keeps advancing instead of freezing for the duration of one fat ~n_batch prefill chunk. Default (env unset or <= 0) = disabled, so stock behaviour is byte-identical; orthogonal to LLAMA_KV_PAGED.

Measured on GB10 (dense Qwen3-32B-NVFP4, 8 steady decoders + one injected 6000-token prefill, same binary, only the env differs): worst decode freeze 3380 -> 482 ms (7.0x) and decode_stall 3285 -> 387 ms (8.5x) at budget=256, for a +20% TTFT on the long request; budget=512 cuts decode_stall 4.8x at ~no TTFT cost. This is a latency/fairness lever, not an aggregate-throughput lever (steady decode is NVFP4 weight-read-bound on GB10, which the scheduler cannot lift).

Correctness: budget unset or >= n_batch is byte-identical to stock; budget=N is byte-identical to stock -bN while preserving n_batch for decode width; the only deviation on long prompts is intrinsic flash-attn chunk-size FP grouping that pure stock -b exhibits too.

Verified applying on the pinned llama.cpp f3e1828 after patch 0008.

Productisation follow-up: surface as a grpc-server.cpp options knob (max_prefill_tokens) per CHUNKED_PREFILL_PLAN Phase B.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-23 16:19:07 -04:00 · 2026-06-23 09:55:32 +00:00
parent ba6bd94976
commit 4bc2b4a9b2
1 changed file with 137 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/0013-paged-decoupled-prefill-token-budget.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0013-paged-decoupled-prefill-token-budget.patch
@@ -0,0 +1,137 @@
+From 17d97cb74e3e8c93751afd33f5c183e57056fde9 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Tue, 23 Jun 2026 11:52:45 +0200
+Subject: [PATCH] feat(paged): decoupled per-step prefill-token budget (patch
+ 0013)
+
+llama-server already co-batches decode with chunked prefill: update_slots()
+appends every generating slot's sampled token first, then fills the rest of the
+n_batch budget with prompt tokens, deferring the overflow to the next step. But
+the prefill chunk size is hard-wired to n_batch (default 2048): one slot's
+~2048-token prefill chunk lands in a single compute-heavy step, and every decode
+co-batched into that step sees a multi-second inter-token-latency (ITL) spike.
+Lowering n_batch shrinks the chunk but also caps decode-concurrency width and
+prefill throughput, because they are coupled.
+
+Add LLAMA_PREFILL_BUDGET: a per-step prefill-token budget decoupled from n_batch
+(the analogue of vLLM's --max-num-batched-tokens / long_prefill_token_threshold).
+The prompt-fill loop and the outer slot loop now also stop once this many prompt
+tokens have been added in the current update_slots() step, so a long prefill is
+split across more steps that each still advance in-flight decode. Default (env
+unset or <= 0) = disabled, so stock behaviour is byte-identical. Orthogonal to
+LLAMA_KV_PAGED: this is a pure scheduler knob and works with paged off.
+
+Measured on GB10 (sm_121), dense Qwen3-32B-NVFP4, paged build, 8 steady decode
+streams with one 6000-token prefill injected mid-stream; same binary, only
+LLAMA_PREFILL_BUDGET differs:
+
+  metric                            stock(off)  budget=256    budget=512
+  worst decode freeze (ms)                3380    482 (7.0x)   778 (4.3x)
+  median decode ITL in window (ms)        2264    411 (5.5x)   689
+  decode_stall (ms)                       3285    387 (8.5x)   684 (4.8x)
+  decode steps during prefill               38    201 (5.3x)   108
+  injected-req TTFT (ms)                  8493  10172 (+20%)  8432 (~0%)
+  steady-state baseline ITL (ms)            94     95           94
+
+This is a LATENCY/fairness lever, not an aggregate-throughput lever: it flattens
+the decode ITL spike a long prefill inflicts on co-batched decoders (7.0x smaller
+worst freeze and 5.3x more decode progress during the prefill at budget=256), in
+exchange for a modest TTFT rise on the long request (the classic chunked-prefill
+trade-off; budget=512 cuts decode_stall 4.8x at ~no TTFT cost). Steady aggregate
+decode is unchanged: it is bandwidth/weight-capped on GB10 (the NVFP4 weight-read
+floor), which the scheduler cannot lift.
+
+Correctness (same model, greedy temp 0, fa on):
+- budget unset or >= n_batch: byte-identical to stock (the added break never
+  fires before the existing n_batch break; the off-path is a no-op by
+  construction).
+- short prompt (<= budget): byte-identical to stock.
+- the knob is exactly equivalent to stock's native -b chunking: budget=512 ==
+  stock -b512 and budget=256 == stock -b256, both BYTE-IDENTICAL, while keeping
+  n_batch=2048 for decode width.
+- on a prompt larger than the budget the chunked greedy output diverges from the
+  single n_batch chunk only by intrinsic flash-attn chunk-size FP grouping: PURE
+  stock -b256 diverges from stock -b2048 the same way with the patch inactive,
+  and the output stays coherent and answers correctly.
+
+Productisation (LocalAI): surface as a model options knob (max_prefill_tokens /
+mpt) parsed in grpc-server.cpp, default 0 = disabled, per CHUNKED_PREFILL_PLAN
+Phase B; the vendored update_slots() hunk here is that plan's scheduler patch and
+stays disjoint from the paged allocation hunks.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ tools/server/server-context.cpp | 35 ++++++++++++++++++++++++++++++++-
+ 1 file changed, 34 insertions(+), 1 deletion(-)
+
+diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
+index 04c6361..5d83b30 100644
+--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
+@@ -2723,6 +2723,29 @@ private:
+         int32_t n_batch  = llama_n_batch(ctx_tgt);
+         int32_t n_ubatch = llama_n_ubatch(ctx_tgt);
+ 
++        // PAGED serving lever (patch 0013): decoupled per-step prefill-token budget.
++        // Analogue of vLLM's --max-num-batched-tokens. Stock llama-server caps the prompt
++        // tokens ingested per update_slots() step at n_batch only; with cont_batching the
++        // sampled decode tokens of every generating slot are appended FIRST, then prompt
++        // tokens fill the batch up to n_batch. A long prompt therefore grabs an ~n_batch
++        // chunk in a SINGLE compute-heavy step, spiking the inter-token latency of every
++        // co-batched decoder (head-of-line jitter). LLAMA_PREFILL_BUDGET caps the prompt
++        // tokens added per step independently of n_batch, splitting a long prefill across
++        // more steps so in-flight decode keeps advancing smoothly. Default (env unset or
++        // <=0) = disabled => stock behavior is byte-identical. Orthogonal to LLAMA_KV_PAGED
++        // (this is a pure scheduler knob; works with paged off).
++        int32_t n_prefill_budget = 0; // 0 = disabled (stock n_batch-only chunking)
++        {
++            const char * env_pb = getenv("LLAMA_PREFILL_BUDGET");
++            if (env_pb) {
++                const int v = atoi(env_pb);
++                if (v > 0) {
++                    n_prefill_budget = std::min(n_batch, std::max(1, v));
++                }
++            }
++        }
++        int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots)
++
+         float  alora_scale       = -1.0f;
+         size_t alora_disabled_id = 0;
+ 
+@@ -3159,7 +3182,10 @@ private:
+                     const bool n_before_user_known = n_before_user > 0;
+ 
+                     // add prompt tokens for processing in the current batch
+-                    while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch) {
++                    // (patch 0013) also stop once the per-step prefill budget is spent, so a long
++                    // prompt is split across more steps and leaves batch room for co-batched decode
++                    while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch &&
++                           (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
+                         // get next token to process
+                         llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
+                         if (cur_tok == LLAMA_TOKEN_NULL) {
+@@ -3185,6 +3211,7 @@ private:
+                         slot.prompt.tokens.push_back(cur_tok);
+ 
+                         slot.n_prompt_tokens_processed++;
++                        n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget
+ 
+                         // stop the prompt batch exactly before the latest user input, so a checkpoint
+                         // can be created after the previous messages
+@@ -3293,6 +3320,12 @@ private:
+                 if (batch.n_tokens >= n_batch) {
+                     break;
+                 }
++
++                // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
++                // leaving the remaining batch capacity for co-batched decode of other slots
++                if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
++                    break;
++                }
+             }
+         }
+ 
+-- 
+2.43.0
+
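
As a rough illustration of the scheduling rule this patch adds, the sketch below models it in standalone C++ outside llama.cpp. Both `parse_prefill_budget` and `plan_prefill_steps` are hypothetical names that do not exist in the patch or in llama-server. The parser uses `std::strtol` where the hunk uses `atoi`, and the planner is a toy model with a single prefilling slot: it ignores n_ubatch, the checkpoint break before the latest user input, and all timing, so it only shows how the budget changes the number of steps and does not reproduce the measured figures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical helper (not in the patch): resolve the budget from an env-style string.
// A null pointer or a value <= 0 disables the budget (returns 0), as in the hunk;
// larger values are clamped to n_batch.
int32_t parse_prefill_budget(const char * env_value, int32_t n_batch) {
    assert(n_batch > 0);
    if (env_value == nullptr) {
        return 0;
    }
    const long v = std::strtol(env_value, nullptr, 10); // assumption: strtol instead of atoi
    if (v <= 0) {
        return 0;
    }
    return static_cast<int32_t>(std::min<long>(v, n_batch));
}

// Hypothetical toy model of one long prefill. Every step first reserves one token
// per generating slot (n_decode), then adds prompt tokens until the batch is full
// or the per-step budget is spent. Returns the prompt tokens ingested in each step.
std::vector<int32_t> plan_prefill_steps(int32_t n_prompt, int32_t n_batch,
                                        int32_t n_decode, int32_t budget) {
    assert(n_batch > n_decode); // otherwise no prompt token could ever be scheduled
    std::vector<int32_t> steps;
    int32_t done = 0;
    while (done < n_prompt) {
        int32_t batch_tokens = n_decode; // decode tokens are appended first
        int32_t budgeted     = 0;        // prompt tokens added in this step
        while (done < n_prompt && batch_tokens < n_batch &&
               (budget == 0 || budgeted < budget)) {
            ++done;
            ++batch_tokens;
            ++budgeted;
        }
        steps.push_back(budgeted);
    }
    return steps;
}
```

Under these assumptions, a 6000-token prompt with n_batch=2048 and 8 decoders is planned in 3 steps with the budget off, 24 steps at budget=256, and 12 steps at budget=512; each extra step is another point at which the co-batched decoders advance by one token.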