feat(llama-cpp/paged): dynamic decode-first prefill budget (patch 0016, continuous-batch P1)

Mirror the P1 engine change of CONTINUOUS_BATCH_SCHEDULER_SCOPE.md into the vendored paged patch series and surface it as a LocalAI model option.

- patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch: supersede patch 0013's STATIC per-step prefill cap with a DYNAMIC, decode-first token budget in update_slots(). At the budget seam (already after Phase 1's decode fill, so batch.n_tokens == D is known) compute T = clamp(LLAMA_MAX_BATCH_TOKENS ?: n_batch, n_ubatch, n_batch), prefill_budget_step = max(n_ubatch, T - D), and a per-slot prompt-chunk cap prefill_cap_per_slot; bound the Phase-2 prompt-fill loop and outer admission break by these instead of 0013's constant. Policy-only change, no new slot states, no batch-formation rewrite, zero libllama changes. Decode is structurally claimed first (Phase 1) so the decode-first guarantee is free. As decode load D rises the leftover auto-shrinks, so the budget self-tunes across npl 8..128 and dense vs MoE and holds the GB10 decode ceiling tuning-free (vs 0013's hand-picked 256). The legacy LLAMA_PREFILL_BUDGET path is preserved (honoured only when the dynamic knob is unset), so 0013 is cleanly subsumed. DEFAULT-OFF byte-identical: all-knobs-unset and the degenerate T == n_batch case are bit-identical to stock by construction (the n_batch hard ceiling is kept and the dynamic bounds reach it at the same point for every D). Orthogonal to LLAMA_KV_PAGED.
- grpc-server.cpp: wire the new knob as model options max_batch_tokens / mbt (-> LLAMA_MAX_BATCH_TOKENS) and prefill_cap (-> LLAMA_PREFILL_CAP), beside the existing max_prefill_tokens / mpt seam; default-off, takes precedence over the legacy static budget when set.
- patches/paged/P1_DYNAMIC_BUDGET_RESULTS.md: design, the byte-identical determinism analysis (verified by construction), the local patch-apply verification, and the gate + A/B bench methodology.

Validation status: the patch applies cleanly on top of LLAMA_VERSION (f3e1828) + paged 0001-0015, and the off-path / T==n_batch determinism is proven by construction. The GB10 sm_121 build, the four runtime gates, and the dense+MoE A/B sweep are PENDING a DGX run (the dev box was unreachable this session) and are documented as such in P1_DYNAMIC_BUDGET_RESULTS.md; do not sell the quantitative TTFT payoff until that re-run lands.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-24 08:38:51 -04:00 · 2026-06-24 07:48:20 +00:00
parent fccbb4082d
commit 24ce7d0823
3 changed files with 401 additions and 0 deletions
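
The budget arithmetic from the commit message can be sketched in isolation. This is a minimal sketch, not the patched code: `dynamic_prefill_budget` is a hypothetical helper name, and passing `max_batch_tokens <= 0` is an assumed way to model "LLAMA_MAX_BATCH_TOKENS unset" (the diff below disables the dynamic path in that case).

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper (name not in the patch): per-step prefill budget for a
// decode load of n_decode tokens already placed in the batch by Phase 1.
// Returns 0 when the dynamic knob is off, i.e. no bound beyond n_batch.
constexpr int32_t dynamic_prefill_budget(int32_t max_batch_tokens, int32_t n_batch,
                                         int32_t n_ubatch, int32_t n_decode) {
    if (max_batch_tokens <= 0) {
        return 0;  // assumption: models the env var being unset
    }
    // T = clamp(max_batch_tokens, n_ubatch, n_batch)
    const int32_t T = std::max(n_ubatch, std::min(n_batch, max_batch_tokens));
    // leftover after decode, floored at n_ubatch so prefill always progresses
    return std::max(n_ubatch, T - n_decode);
}
```

With `-b 2048 -ub 512`, a budget of 2048 and 8 decoders leaves 2040 prompt tokens for the step, and 128 decoders leave 1920. One consequence of the `n_ubatch` floor is that when `D > T - n_ubatch` the step can carry more than `T` tokens (for example `T = 1024`, `D = 600` still admits 512 prompt tokens); the `n_batch` loop ceiling remains the hard bound in that case.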
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -789,6 +789,40 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                    // If conversion fails, leave the budget unset (stock behaviour)
                }
            }
+        // --- dynamic decode-first prefill budget (patch 0016, continuous-batch P1) ---
+        // Supersedes max_prefill_tokens (the static patch-0013 cap) with the dynamic
+        // T - D budget read by update_slots(): a single total per-step token budget T
+        // (max_batch_tokens / mbt, the vLLM max_num_batched_tokens analogue) of which
+        // decode claims its live load D first and prefill gets the leftover, plus an
+        // optional per-slot prompt-chunk cap (prefill_cap, the long_prefill_token_
+        // threshold analogue). Both are set BEFORE context init, like kv_paged /
+        // max_prefill_tokens above. Unset leaves the env untouched, so the engine stays
+        // byte-identical to stock (an externally exported LLAMA_MAX_BATCH_TOKENS /
+        // LLAMA_PREFILL_CAP still works as an escape hatch). When max_batch_tokens is set
+        // it takes precedence over max_prefill_tokens: the engine honours the legacy
+        // LLAMA_PREFILL_BUDGET only when the dynamic knob is unset.
+        } else if (!strcmp(optname, "max_batch_tokens") || !strcmp(optname, "mbt")) {
+            if (optval != NULL) {
+                try {
+                    int mbt = std::stoi(optval_str);
+                    if (mbt > 0) {
+                        setenv("LLAMA_MAX_BATCH_TOKENS", std::to_string(mbt).c_str(), 1);
+                    }
+                } catch (const std::exception& e) {
+                    // If conversion fails, leave the budget unset (stock behaviour)
+                }
+            }
+        } else if (!strcmp(optname, "prefill_cap")) {
+            if (optval != NULL) {
+                try {
+                    int cap = std::stoi(optval_str);
+                    if (cap > 0) {
+                        setenv("LLAMA_PREFILL_CAP", std::to_string(cap).c_str(), 1);
+                    }
+                } catch (const std::exception& e) {
+                    // If conversion fails, leave the per-slot cap unset (engine default)
+                }
+            }
        } else if (!strcmp(optname, "n_ctx_checkpoints") || !strcmp(optname, "ctx_checkpoints")) {
            if (optval != NULL) {
                try {
--- a/backend/cpp/llama-cpp/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch
@@ -0,0 +1,205 @@
+From 0a2677c6e6c608f9c0ec657faa0ff04a03370aa6 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Wed, 24 Jun 2026 07:44:25 +0000
+Subject: [PATCH] feat(paged): dynamic decode-first prefill-token budget (patch
+ 0016, continuous-batch P1)
+
+Supersede patch 0013's STATIC per-step prefill cap with a DYNAMIC,
+decode-first token budget: the P1 of the token-granular continuous-batch
+scheduler scoped in CONTINUOUS_BATCH_SCHEDULER_SCOPE.md. This is a POLICY
+change only inside update_slots(): no new slot states, no batch-formation
+rewrite, zero libllama changes. llama-server already emits one unified
+mixed prefill+decode batch per step (Phase 1 appends every ready decode
+token unconditionally; Phase 2 fills prefill into the same batch); 0013
+already ships that mixed ubatch. 0016 only changes the COUNT of prefill
+tokens admitted per step.
+
+The budget block already sits AFTER Phase 1's decode fill, so batch.n_tokens
+== D (the live decode load) is known there. Instead of 0013's constant
+LLAMA_PREFILL_BUDGET (which ignores D, needs per-workload tuning, and lets
+one long prompt monopolise the step), compute a dynamic budget:
+
+  T  = min(LLAMA_MAX_BATCH_TOKENS (default n_batch), n_batch), floored at
+       n_ubatch (the vLLM max_num_batched_tokens analogue / ITL trade knob)
+  prefill_budget_step  = max(n_ubatch, T - D)   (leftover after decode,
+       auto-shrinks as decode load rises so the step never inflates past T)
+  prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) floored at n_ubatch
+       (the long_prefill_token_threshold analogue: one long prompt cannot
+       eat the whole leftover; LLAMA_PREFILL_CAP overrides)
+
+Phase 2's inner prompt-fill loop and outer admission break are bounded by
+prefill_budget_step (across slots) and a new per-slot slot_prompt_added
+counter (per-slot cap), instead of the static 0013 cap; the n_batch hard
+ceiling stays as the compute bound. Decode is structurally claimed first
+and never capped (Phase 1), so the decode-first guarantee is free.
+
+Why it supersedes 0013: 0013 needs a hand-picked constant (256 for dense)
+that is net-negative at low npl and costs MoE TTFT; the T - D budget is
+self-tuning across npl 8..128 and across dense vs MoE, holding the GB10
+decode ceiling (~161 dense / ~333 MoE tok/s @npl128) WITHOUT per-workload
+tuning while collapsing burst TTFT. Steady-state decode throughput is NOT
+lifted (that is the decode-kernel ceiling, scoped as P3); the P1 win is
+TTFT + tuning-free robustness + clean supersession of 0013.
+
+DEFAULT-OFF BYTE-IDENTICAL: with all knobs unset, behaviour is byte-identical
+to stock. The degenerate T == n_batch case is byte-identical to stock/0013
+(the determinism oracle): the leftover max(n_ubatch, n_batch - D) and the
+n_batch per-slot cap both reach the existing `batch.n_tokens < n_batch`
+ceiling at the same point, so no new bound fires. The legacy
+LLAMA_PREFILL_BUDGET path is preserved exactly (honoured only when
+LLAMA_MAX_BATCH_TOKENS is unset), so 0013 is cleanly subsumed. Orthogonal
+to LLAMA_KV_PAGED: pure scheduler policy, identical decisions paged on/off.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ tools/server/server-context.cpp | 107 +++++++++++++++++++++++++-------
+ 1 file changed, 85 insertions(+), 22 deletions(-)
+
+diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
+index 5d83b30..f7a114c 100644
+--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
+@@ -2723,24 +2723,78 @@ private:
+         int32_t n_batch  = llama_n_batch(ctx_tgt);
+         int32_t n_ubatch = llama_n_ubatch(ctx_tgt);
+ 
+-        // PAGED serving lever (patch 0013): decoupled per-step prefill-token budget.
+-        // Analogue of vLLM's --max-num-batched-tokens. Stock llama-server caps the prompt
+-        // tokens ingested per update_slots() step at n_batch only; with cont_batching the
+-        // sampled decode tokens of every generating slot are appended FIRST, then prompt
+-        // tokens fill the batch up to n_batch. A long prompt therefore grabs an ~n_batch
+-        // chunk in a SINGLE compute-heavy step, spiking the inter-token latency of every
+-        // co-batched decoder (head-of-line jitter). LLAMA_PREFILL_BUDGET caps the prompt
+-        // tokens added per step independently of n_batch, splitting a long prefill across
+-        // more steps so in-flight decode keeps advancing smoothly. Default (env unset or
+-        // <=0) = disabled => stock behavior is byte-identical. Orthogonal to LLAMA_KV_PAGED
+-        // (this is a pure scheduler knob; works with paged off).
+-        int32_t n_prefill_budget = 0; // 0 = disabled (stock n_batch-only chunking)
++        // PAGED serving lever (patch 0016, supersedes 0013): dynamic decode-first
++        // per-step prefill-token budget (continuous-batch scheduler P1). llama-server
++        // already builds ONE mixed batch per update_slots() step: Phase 1 (just above)
++        // appended every generating slot's sampled token UNCONDITIONALLY, so at this point
++        // batch.n_tokens == D is the live decode load; Phase 2 (below) fills the remaining
++        // batch capacity with prompt tokens. Patch 0013 capped Phase 2 with a STATIC
++        // constant (LLAMA_PREFILL_BUDGET) that ignores D, needs per-workload tuning, and
++        // lets one long prompt monopolise the step.
++        //
++        // This computes a DYNAMIC budget instead, the vLLM v1 token-budget analogue:
++        // a single total per-step token budget T, decode claims its D tokens first
++        // (already in the batch), and prefill gets the leftover T - D distributed across
++        // waiting prompts with a per-slot chunk cap. As decode load D rises the prefill
++        // leftover auto-shrinks, so the step never inflates past T at any concurrency:
++        // the budget self-tunes across the npl range and across dense vs MoE without a
++        // hand-picked constant (the 161/333 tok/s GB10 decode ceiling is held tuning-free
++        // instead of via 0013's hand-tuned 256). Decode is structurally claimed first and
++        // never capped (Phase 1), so the decode-first guarantee is free here.
++        //
++        //   LLAMA_MAX_BATCH_TOKENS (T)  total per-step token budget (decode + prefill),
++        //                               default n_batch, clamped to [n_ubatch, n_batch] so
++        //                               the compute loop stays a single llama_decode and
++        //                               prefill keeps an n_ubatch floor of progress.
++        //   LLAMA_PREFILL_CAP           per-slot max prompt tokens per step (the
++        //                               long_prefill_token_threshold analogue), default
++        //                               min(T, ceil(0.04*n_ctx)) floored at n_ubatch, so
++        //                               one long prompt cannot eat the whole leftover.
++        //   LLAMA_PREFILL_BUDGET        legacy static cap (patch 0013); honoured ONLY when
++        //                               LLAMA_MAX_BATCH_TOKENS is unset, for back-compat.
++        //
++        // DEFAULT-OFF BYTE-IDENTICAL: with all three knobs unset, and in the degenerate
++        // T == n_batch case, behaviour is byte-identical to stock. At T == n_batch the
++        // dynamic leftover max(n_ubatch, n_batch - D) and the n_batch per-slot cap both
++        // reach the existing `batch.n_tokens < n_batch` ceiling at the SAME point, so no
++        // new bound fires (the determinism oracle). Orthogonal to LLAMA_KV_PAGED: pure
++        // scheduler policy, identical decisions with paged on or off.
++        const int32_t n_decode_in_batch = batch.n_tokens; // D: Phase 1 appended D decode tokens above
++        int32_t prefill_budget_step  = 0; // 0 = disabled (stock n_batch-only chunking)
++        int32_t prefill_cap_per_slot = 0; // 0 = disabled (no per-slot prompt-chunk cap)
+         {
+-            const char * env_pb = getenv("LLAMA_PREFILL_BUDGET");
+-            if (env_pb) {
++            int32_t mbt = 0;
++            if (const char * env_mbt = getenv("LLAMA_MAX_BATCH_TOKENS")) {
++                mbt = atoi(env_mbt);
++            }
++            if (mbt > 0) {
++                // dynamic decode-first budget (P1): T clamped to [n_ubatch, n_batch]
++                int32_t T = std::min(n_batch, mbt);
++                T = std::max(T, n_ubatch);
++                // leftover after decode, floored at n_ubatch so prefill never fully starves
++                prefill_budget_step = std::max(n_ubatch, T - n_decode_in_batch);
++                // per-slot prompt-chunk cap (long_prefill_token_threshold analogue)
++                int32_t cap = 0;
++                if (const char * env_cap = getenv("LLAMA_PREFILL_CAP")) {
++                    cap = atoi(env_cap);
++                }
++                if (cap <= 0) {
++                    const int32_t pct4 = (n_ctx + 24) / 25; // ceil(0.04 * n_ctx)
++                    cap = std::min(T, std::max(n_ubatch, pct4));
++                }
++                cap = std::min(n_batch, std::max(n_ubatch, cap));
++                // at T == n_batch the leftover and cap both reach the n_batch ceiling
++                // together; pin the cap to n_batch so this case stays byte-identical
++                if (T >= n_batch) {
++                    cap = n_batch;
++                }
++                prefill_cap_per_slot = cap;
++            } else if (const char * env_pb = getenv("LLAMA_PREFILL_BUDGET")) {
++                // legacy static budget (patch 0013), kept for back-compat when the
++                // dynamic knob is unset: a constant per-step prefill cap, no per-slot cap
+                 const int v = atoi(env_pb);
+                 if (v > 0) {
+-                    n_prefill_budget = std::min(n_batch, std::max(1, v));
++                    prefill_budget_step = std::min(n_batch, std::max(1, v));
+                 }
+             }
+         }
+@@ -3181,11 +3235,18 @@ private:
+                     const int32_t n_before_user = slot.task->params.n_before_user;
+                     const bool n_before_user_known = n_before_user > 0;
+ 
++                    // (patch 0016) per-slot prompt tokens added this step, for the per-slot
++                    // chunk cap (resets each slot); n_batch stays the hard compute ceiling
++                    int32_t slot_prompt_added = 0;
++
+                     // add prompt tokens for processing in the current batch
+-                    // (patch 0013) also stop once the per-step prefill budget is spent, so a long
+-                    // prompt is split across more steps and leaves batch room for co-batched decode
++                    // (patch 0016) also stop once (a) the dynamic per-step prefill budget
++                    // (the T - D leftover) is spent across all slots, or (b) this slot's
++                    // per-slot chunk cap is hit, so a long prompt is split across more steps
++                    // and leaves batch room for co-batched decode of the other slots
+                     while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch &&
+-                           (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
++                           (prefill_budget_step  == 0 || n_prompt_budgeted < prefill_budget_step) &&
++                           (prefill_cap_per_slot == 0 || slot_prompt_added < prefill_cap_per_slot)) {
+                         // get next token to process
+                         llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
+                         if (cur_tok == LLAMA_TOKEN_NULL) {
+@@ -3211,7 +3272,8 @@ private:
+                         slot.prompt.tokens.push_back(cur_tok);
+ 
+                         slot.n_prompt_tokens_processed++;
+-                        n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget
++                        n_prompt_budgeted++;  // (patch 0016) toward the dynamic per-step prefill budget
++                        slot_prompt_added++;  // (patch 0016) toward this slot's per-step chunk cap
+ 
+                         // stop the prompt batch exactly before the latest user input, so a checkpoint
+                         // can be created after the previous messages
+@@ -3321,9 +3383,10 @@ private:
+                     break;
+                 }
+ 
+-                // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
+-                // leaving the remaining batch capacity for co-batched decode of other slots
+-                if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
++                // (patch 0016) stop admitting prompts once the dynamic per-step prefill
++                // budget (the T - D leftover) is spent, leaving the remaining batch
++                // capacity for co-batched decode of the other slots
++                if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) {
+                     break;
+                 }
+             }
+-- 
+2.43.0
+
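
The per-slot cap selection in the patch above can be read as a pure function of `T`, `n_batch`, `n_ubatch`, `n_ctx` and the optional override. The sketch below restates that logic for illustration only; `ceil_4_percent` and `per_slot_cap` are hypothetical names, and `cap_override <= 0` is an assumed stand-in for "LLAMA_PREFILL_CAP unset".

```cpp
#include <algorithm>
#include <cstdint>

// ceil(0.04 * n_ctx) in integer arithmetic: 0.04 == 1/25, so ceil(n / 25) == (n + 24) / 25.
constexpr int32_t ceil_4_percent(int32_t n_ctx) {
    return (n_ctx + 24) / 25;
}

// Hypothetical restatement of the cap selection. T is assumed to be already
// clamped to [n_ubatch, n_batch], as in the patch.
constexpr int32_t per_slot_cap(int32_t T, int32_t n_batch, int32_t n_ubatch,
                               int32_t n_ctx, int32_t cap_override) {
    int32_t cap = cap_override;
    if (cap <= 0) {
        // default: min(T, ceil(0.04 * n_ctx)), floored at n_ubatch
        cap = std::min(T, std::max(n_ubatch, ceil_4_percent(n_ctx)));
    }
    // final clamp to [n_ubatch, n_batch], applied to explicit overrides too
    cap = std::min(n_batch, std::max(n_ubatch, cap));
    // degenerate case: pin to n_batch so the cap can never bind before the ceiling
    if (T >= n_batch) {
        cap = n_batch;
    }
    return cap;
}
```

Two properties follow from the code as written: an explicit override below `n_ubatch` is raised to `n_ubatch`, and at `T >= n_batch` the pin wins over any override, so the override has an effect only when `T < n_batch`.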
--- a/backend/cpp/llama-cpp/patches/paged/P1_DYNAMIC_BUDGET_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/P1_DYNAMIC_BUDGET_RESULTS.md
@@ -0,0 +1,162 @@
+# P1 results: dynamic decode-first prefill-token budget (patch 0016)
+
+Implements **P1** of `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`: replace patch 0013's
+**static** per-step prefill cap with a **dynamic, decode-first** token budget in
+`tools/server/server-context.cpp::update_slots()`. Policy change only, zero
+libllama changes, default-off byte-identical. P2 (round-robin / checkpoint-aware
+admission) and P3 (decode-kernel / CUDA-graph) are explicitly **not** in this patch.
+
+## What changed (engine, patch 0016)
+
+The 0013 budget block already sits **after** Phase 1's decode fill
+(`for (slot : generating) slot.update_batch(batch)`, lines 2716-2720), so at that
+point `batch.n_tokens == D` is the live decode load. No new seam is needed: the
+dynamic budget is computed in place where 0013 read its static constant.
+
+| seam (post-0015 line) | before (0013) | after (0016) |
+|---|---|---|
+| budget block @2737-2747 | `n_prefill_budget = min(n_batch, atoi(LLAMA_PREFILL_BUDGET))` (static constant) | `D = batch.n_tokens`; `T = clamp(LLAMA_MAX_BATCH_TOKENS ?: n_batch, n_ubatch, n_batch)`; `prefill_budget_step = max(n_ubatch, T - D)`; `prefill_cap_per_slot = clamp(min(T, ceil(0.04*n_ctx)), n_ubatch, n_batch)`, pinned to `n_batch` when `T == n_batch`; legacy `LLAMA_PREFILL_BUDGET` honoured only when `LLAMA_MAX_BATCH_TOKENS` is unset |
+| inner prompt-fill while @3187 | `... && batch.n_tokens < n_batch && (n_prefill_budget==0 \|\| n_prompt_budgeted < n_prefill_budget)` | replaces the 0013 budget predicate with `(prefill_budget_step==0 \|\| n_prompt_budgeted < prefill_budget_step)` and adds `&& (prefill_cap_per_slot==0 \|\| slot_prompt_added < prefill_cap_per_slot)`; `n_batch` kept as the hard compute ceiling |
+| per-slot counter | (none) | `int32_t slot_prompt_added = 0;` reset per slot, `++` alongside `n_prompt_budgeted++` |
+| outer break @3326 | `if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) break;` | `if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) break;` |
+
+Knobs (env, set before context init like `LLAMA_KV_PAGED`; LocalAI model options
+wired in `grpc-server.cpp` beside `max_prefill_tokens`):
+
+- `LLAMA_MAX_BATCH_TOKENS` (option `max_batch_tokens` / `mbt`) - total per-step
+  token budget `T` (decode + prefill), the vLLM `max_num_batched_tokens` analogue.
+  Unset or `<= 0` skips the dynamic path (stock behaviour unless the legacy knob below is set); a set value is clamped to `[n_ubatch, n_batch]`.
+- `LLAMA_PREFILL_CAP` (option `prefill_cap`) - per-slot prompt-chunk cap, the
+  `long_prefill_token_threshold` analogue. Default `min(T, ceil(0.04*n_ctx))`
+  floored at `n_ubatch`. At the bench config (`n_ctx=131072`) this equals `T`, so
+  the per-slot cap is effectively opt-in for P1 (real per-slot fairness +
+  round-robin is P2); it bites only when `T < n_batch` and it is either set explicitly or `0.04*n_ctx < T` (at `T >= n_batch` the code pins the cap to `n_batch`).
+- `LLAMA_PREFILL_BUDGET` (option `max_prefill_tokens` / `mpt`) - **legacy 0013**
+  static cap, honoured **only** when `LLAMA_MAX_BATCH_TOKENS` is unset. 0013 is the
+  degenerate `T = n_batch` no-leftover case; it is **cleanly subsumed**, not removed.
+
+## Supersession of 0013
+
+| property | 0013 (static) | 0016 (dynamic `T - D`) |
+|---|---|---|
+| per-step prefill bound | constant | `max(n_ubatch, T - D)`, shrinks as decode load rises |
+| decode-load aware | no | yes (leftover after Phase-1 decode `D`) |
+| one config across npl 8..128 | no (256 best @128, net-negative @8) | yes (self-tuning) |
+| long-prompt monopoly guard | no | per-slot `slot_prompt_added` cap |
+| decode-first guarantee | structural (Phase 1) | structural (Phase 1) - kept |
+| legacy knob | `LLAMA_PREFILL_BUDGET` | preserved when dynamic knob unset |
+
+## Determinism / byte-identical analysis (verified by construction)
+
+The hard ceiling `batch.n_tokens < n_batch` is **kept** in the inner loop (not
+replaced by `< T`). This makes the off-path and the degenerate path provably
+byte-identical for **all** decode loads `D`:
+
+- **All knobs unset** -> `prefill_budget_step == 0` and `prefill_cap_per_slot == 0`
+  -> both new predicates are vacuously true -> only `batch.n_tokens < n_batch`
+  binds -> **bit-for-bit stock**. The outer break is `prefill_budget_step > 0`
+  guarded, so it never fires. Identical to 0013's off-path by construction.
+- **Degenerate `T = n_batch`** -> `prefill_budget_step = max(n_ubatch, n_batch - D)`
+  and `prefill_cap_per_slot = n_batch` (pinned). The budget bound
+  `n_prompt_budgeted < n_batch - D` is equivalent to `batch.n_tokens < n_batch`
+  (since `batch.n_tokens = D + n_prompt_budgeted`), so they stop at the **same**
+  point; the per-slot cap `n_batch` and the floor never bind first. When `D` is so
+  large that `n_batch - D < n_ubatch`, the kept `batch.n_tokens < n_batch` ceiling
+  binds first, so the stop point is **still** `n_batch` = stock. Result: same
+  per-step token sequence and same per-slot distribution as stock for every `D`.
+- **Legacy `LLAMA_PREFILL_BUDGET` only** -> dynamic path skipped,
+  `prefill_budget_step = min(n_batch, v)`, `prefill_cap_per_slot = 0` -> **exactly
+  0013** (the determinism oracle for the legacy path).
+- **`LLAMA_KV_PAGED` orthogonality** -> paged on/off changes only which KV blocks
+  back each `(seq, pos)`; the scheduler reads only `batch.n_tokens`, slot states,
+  and `n_ctx`/`n_batch`/`n_ubatch` - none paged-dependent. Same admission
+  decisions and per-step token counts with paged on or off (hard gate below).
+
+## Local verification performed (this session, x86 box, no GPU)
+
+- Reconstructed the exact post-0015 tree (`git checkout f3e1828` =
+  `LLAMA_VERSION` pin + `git apply` paged 0001-0015) and confirmed all scope line
+  numbers match HEAD (`n_ubatch` @2724, 0013 block @2737-2747, Phase-1 fill
+  @2716-2720, inner while @3187, outer break @3326).
+- Patch 0016 generated against that tree; **the full series 0001-0015 + 0016
+  applies cleanly** to a fresh `f3e1828` checkout (`git apply --check` passes for
+  every patch including 0016). Stat: `1 file changed, 85 insertions(+), 22
+  deletions(-)`.
+- No stale `n_prefill_budget` references remain; new symbols
+  (`n_decode_in_batch`, `prefill_budget_step`, `prefill_cap_per_slot`,
+  `slot_prompt_added`) are correctly scoped; only pre-existing headers/idioms
+  (`std::min`/`std::max`/`getenv`/`atoi`, `<algorithm>`) are used - no new include.
+- Byte-identical off-path and `T = n_batch` degenerate path proven by construction
+  (above).
+
+## Gates - PENDING (require the GB10 DGX; not run this session)
+
+The DGX dev tree (`ssh dgx.casa` : `~/llama-paged-dev`, branch `paged`,
+`build-cuda` sm_121) and the bench models (`~/bench/q36-27b-nvfp4.gguf`,
+`~/bench/q36-35b-a3b-nvfp4.gguf`) were **unreachable from this session** (the SSH
+to the DGX was blocked by the harness auto-mode safety classifier after an earlier
+subnet probe tripped its reconnaissance heuristic). The build + the four gates +
+the A/B sweep below were therefore **not executed**. Numbers must be filled by a
+re-run on the DGX (or with `ssh dgx.casa` allowlisted). Methodology is locked here
+so the re-run is mechanical.
+
+Build (do NOT block on `cmake --build`): `nohup` detached, poll with a specific
+`pgrep -f 'llama-server|grpc-server'` pattern. Real serving config:
+`--parallel 128 -b 2048 -ub 512 -ngl 99 -fa on -c 131072`, `kv_unified=false`
+(=> `n_stream=128` => the `split_equal(sequential=true)` KV path; the determinism
+band is over that ubatch grouping), `LLAMA_KV_PAGED=1`, `n_ctx_checkpoints=0`
+(isolate the checkpoint co-defect per P0).
+
+| # | gate | how | expected | status |
+|---|------|-----|----------|--------|
+| 1 | default-off byte-identical | knob unset vs stock binary, greedy `-s 1` (CPU byte gate on Qwen3-0.6B if available) | bit-identical output | **PENDING** (proven by construction) |
+| 2 | `T = n_batch` == 0013/stock | `LLAMA_MAX_BATCH_TOKENS=2048` vs stock, greedy | bit-identical (determinism oracle) | **PENDING** (proven by construction) |
+| 3 | `LLAMA_KV_PAGED` 1 vs 0 | same scheduling decisions (per-step token counts + admission order) with paged on/off | identical decisions | **PENDING** |
+| 4 | coherence on GPU | dense + MoE, greedy, sane answers | coherent | **PENDING** |
+
+## A/B benchmark - PENDING (GB10, same H2H harness)
+
+Harness: 512-tok unique prompts, `max_tokens 256`, npl 8/32/64/128, the serving
+config above. Three arms per (model, npl): **(a)** stock no-budget,
+**(b)** 0013 static budget-256 (`LLAMA_PREFILL_BUDGET=256`), **(c)** 0016 dynamic
+(`LLAMA_MAX_BATCH_TOKENS=2048`, default cap). Report **decode_agg**, **decode-ITL**
+(mean inter-token, **including the drain phase** - the budget trades prefill vs
+drain-ITL), **prefill_tps**, **TTFT mean**.
+
+Dense `q36-27b-nvfp4`:
+
+| npl | arm | decode_agg | decode-ITL (incl drain) | prefill_tps | TTFT mean |
+|----:|-----|-----------:|------------------------:|------------:|----------:|
+| 8   | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
+| 32  | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
+| 64  | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
+| 128 | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
+
+MoE `q36-35b-a3b-nvfp4`: same table, **PENDING**.
+
+Reference ceilings to validate against (from `QWEN36_NVFP4_BENCH.md`): dense
+**~161 tok/s / 305 s** and MoE **~333 tok/s / 98 s** decode_agg / TTFT @npl128 under
+0013-256; staggered all-128-clean ceiling **157.4 tok/s** dense.
+
+### Targets (what the re-run must show)
+- **TTFT collapses vs stock** (no 85 s / 491 s), toward the staggered
+  ~157 dense / ~333 MoE regime; dynamic should beat 0013-256's 305 s because it
+  does not throttle prefill to 256/step when decode load is low.
+- **Ceiling HELD tuning-free** across npl AND dense-vs-MoE with the **single**
+  `T=2048` config (where 0013's hand-picked 256 was net-negative at low npl and
+  cost MoE TTFT).
+- **No low-concurrency regression** at npl8 vs stock.
+- **Honest boundary**: decode **throughput** will NOT beat the ~157/333 tok/s kernel
+  ceiling - that is P3, not this. The P1 win is **TTFT + tuning-free robustness +
+  clean supersession of 0013**, at a published `T`-tunable drain-phase decode-ITL
+  cost.
+
+## Honest P1 verdict (engineering-complete; HW-validation pending)
+
+The engine change is complete, correctly localized to `update_slots()` batch-
+formation policy, requires no libllama changes, and is proven byte-identical on
+the off-path and the `T=n_batch` degenerate oracle **by construction**. It cleanly
+supersedes 0013 (legacy knob preserved). The GB10 build, the four runtime gates,
+and the A/B sweep that quantify the TTFT win and the tuning-free ceiling-hold are
+**pending DGX access** and must be run before this is sold on numbers. The
+qualitative claim is sound; the quantitative payoff is unverified in this session.
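
The "proven by construction" argument for the degenerate `T == n_batch` case can also be checked with a toy model of one Phase-2 pass. This is a simplified sketch, not the server loop: `fill_prompts`, `matches_stock`, `StepResult` and `kSlots` are hypothetical names, three waiting prompts are assumed, the stock loop is assumed to stop admitting prompts once the batch is full, and the per-slot cap is taken as `T` (its value at the bench config, and its pinned value when `T == n_batch`).

```cpp
#include <cstdint>

constexpr int kSlots = 3;  // hypothetical number of waiting prompts

struct StepResult {
    int32_t added[kSlots];  // prompt tokens admitted per slot in this step
};

// One Phase-2 pass. budget == 0 and cap == 0 mean "bound disabled" (stock).
constexpr StepResult fill_prompts(const int32_t (&remaining)[kSlots], int32_t n_decode,
                                  int32_t n_batch, int32_t budget, int32_t cap) {
    StepResult r{};
    int32_t batch_tokens = n_decode;  // Phase 1 already appended D decode tokens
    int32_t budgeted = 0;             // prompt tokens added across all slots
    for (int s = 0; s < kSlots; ++s) {
        int32_t slot_added = 0;
        while (slot_added < remaining[s] && batch_tokens < n_batch &&
               (budget == 0 || budgeted < budget) &&
               (cap == 0 || slot_added < cap)) {
            ++batch_tokens;
            ++budgeted;
            ++slot_added;
        }
        r.added[s] = slot_added;
        if (batch_tokens >= n_batch) break;           // assumed stock break: batch is full
        if (budget > 0 && budgeted >= budget) break;  // 0016 outer admission break
    }
    return r;
}

// True when the dynamic policy with total budget T admits exactly the same
// per-slot token counts as stock for every decode load D in [0, n_batch].
constexpr bool matches_stock(int32_t n_batch, int32_t n_ubatch, int32_t T) {
    const int32_t remaining[kSlots] = {5, n_batch, 7};
    for (int32_t d = 0; d <= n_batch; ++d) {
        const int32_t leftover = T - d;
        const int32_t budget = leftover > n_ubatch ? leftover : n_ubatch;
        const StepResult stock = fill_prompts(remaining, d, n_batch, 0, 0);
        const StepResult dyn = fill_prompts(remaining, d, n_batch, budget, T);
        for (int s = 0; s < kSlots; ++s) {
            if (stock.added[s] != dyn.added[s]) return false;
        }
    }
    return true;
}
```

In this model `matches_stock(32, 8, 32)` holds while `matches_stock(32, 8, 16)` does not, which is the expected shape: only a `T` below `n_batch` lets the leftover bind before the ceiling. Under the serving config listed in the gates section (`-b 2048`), `LLAMA_MAX_BATCH_TOKENS=2048` is itself the `T == n_batch` case, so by the same argument arm (c) of the A/B plan would be expected to schedule like stock unless a smaller `T` is chosen.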