feat(llama-cpp/paged): dynamic decode-first prefill budget (patch 0016, continuous-batch P1)

Mirror the P1 engine change of CONTINUOUS_BATCH_SCHEDULER_SCOPE.md into
the vendored paged patch series and surface it as a LocalAI model option.

- patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch:
  supersede patch 0013's STATIC per-step prefill cap with a DYNAMIC,
  decode-first token budget in update_slots(). At the budget seam
  (already after Phase 1's decode fill, so batch.n_tokens == D is known)
  compute T = clamp(LLAMA_MAX_BATCH_TOKENS ?: n_batch, n_ubatch, n_batch),
  prefill_budget_step = max(n_ubatch, T - D), and a per-slot prompt-chunk
  cap prefill_cap_per_slot; bound the Phase-2 prompt-fill loop and outer
  admission break by these instead of 0013's constant. Policy-only
  change, no new slot states, no batch-formation rewrite, zero libllama
  changes. Decode is structurally claimed first (Phase 1) so the
  decode-first guarantee is free. As decode load D rises the leftover
  auto-shrinks, so the budget self-tunes across npl 8..128 and dense vs
  MoE and holds the GB10 decode ceiling tuning-free (vs 0013's
  hand-picked 256). The legacy LLAMA_PREFILL_BUDGET path is preserved
  (honoured only when the dynamic knob is unset), so 0013 is cleanly
  subsumed. DEFAULT-OFF byte-identical: all-knobs-unset and the
  degenerate T == n_batch case are bit-identical to stock by construction
  (the n_batch hard ceiling is kept and the dynamic bounds reach it at
  the same point for every D). Orthogonal to LLAMA_KV_PAGED.
- grpc-server.cpp: wire the new knob as model options max_batch_tokens /
  mbt (-> LLAMA_MAX_BATCH_TOKENS) and prefill_cap (-> LLAMA_PREFILL_CAP),
  beside the existing max_prefill_tokens / mpt seam; default-off, takes
  precedence over the legacy static budget when set.
- patches/paged/P1_DYNAMIC_BUDGET_RESULTS.md: design, the byte-identical
  determinism analysis (verified by construction), the local patch-apply
  verification, and the gate + A/B bench methodology.

Validation status: the patch applies cleanly on top of LLAMA_VERSION
(f3e1828) + paged 0001-0015, and the off-path / T==n_batch determinism
is proven by construction. The GB10 sm_121 build, the four runtime
gates, and the dense+MoE A/B sweep are PENDING a DGX run (the dev box
was unreachable this session) and are documented as such in
P1_DYNAMIC_BUDGET_RESULTS.md; do not sell the quantitative TTFT payoff
until that re-run lands.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
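
For readers who want the budget arithmetic in isolation, the following is a minimal sketch of the two formulas quoted in the message (T and prefill_budget_step). It is not the code from patch 0016: the struct, the function name, and the convention that a non-positive max_batch_tokens stands for "LLAMA_MAX_BATCH_TOKENS unset" are assumptions made for illustration. The per-slot cap prefill_cap_per_slot is left out because the message does not give its formula.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container for the two quantities named in the commit message.
struct StepBudget {
    int total;    // T: total per-step token budget
    int prefill;  // prefill_budget_step: tokens offered to prompt fill this step
};

// Assumption: max_batch_tokens <= 0 models "LLAMA_MAX_BATCH_TOKENS is unset",
// in which case T falls back to n_batch (the "?: n_batch" in the message).
// decode_tokens is D, the number of decode tokens already placed in the batch.
inline StepBudget compute_step_budget(int max_batch_tokens, int n_ubatch,
                                      int n_batch, int decode_tokens) {
    // std::clamp requires lo <= hi, so check the bounds up front.
    assert(n_ubatch > 0 && n_ubatch <= n_batch);

    const int requested = max_batch_tokens > 0 ? max_batch_tokens : n_batch;

    // T = clamp(LLAMA_MAX_BATCH_TOKENS ?: n_batch, n_ubatch, n_batch)
    const int total = std::clamp(requested, n_ubatch, n_batch);

    // prefill_budget_step = max(n_ubatch, T - D)
    const int prefill = std::max(n_ubatch, total - decode_tokens);

    return {total, prefill};
}
```

Under these assumptions, with n_batch = 2048, n_ubatch = 512 and a requested budget of 1024, a decode load of D = 8 leaves 1016 tokens for prefill, D = 128 leaves 896, and D = 900 hits the n_ubatch floor of 512. The floor means prefill is still offered n_ubatch tokens even when D exceeds T - n_ubatch. With the knob unset, T equals n_batch.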
2026-06-24 08:38:51 -04:00 · 2026-06-24 07:48:20 +00:00
parent fccbb4082d
commit 24ce7d0823
3 changed files with 401 additions and 0 deletions
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -789,6 +789,40 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                    // If conversion fails, leave the budget unset (stock behaviour)
                }
            }
+        // --- dynamic decode-first prefill budget (patch 0016, continuous-batch P1) ---
+        // Supersedes max_prefill_tokens (the static patch-0013 cap) with the dynamic
+        // T - D budget read by update_slots(): a single total per-step token budget T
+        // (max_batch_tokens / mbt, the vLLM max_num_batched_tokens analogue) of which
+        // decode claims its live load D first and prefill gets the leftover, plus an
+        // optional per-slot prompt-chunk cap (prefill_cap, the long_prefill_token_
+        // threshold analogue). Both are set BEFORE context init, like kv_paged /
+        // max_prefill_tokens above. Unset leaves the env untouched, so the engine stays
+        // byte-identical to stock (an externally exported LLAMA_MAX_BATCH_TOKENS /
+        // LLAMA_PREFILL_CAP still works as an escape hatch). When max_batch_tokens is set
+        // it takes precedence over max_prefill_tokens: the engine honours the legacy
+        // LLAMA_PREFILL_BUDGET only when the dynamic knob is unset.
+        } else if (!strcmp(optname, "max_batch_tokens") || !strcmp(optname, "mbt")) {
+            if (optval != NULL) {
+                try {
+                    int mbt = std::stoi(optval_str);
+                    if (mbt > 0) {
+                        setenv("LLAMA_MAX_BATCH_TOKENS", std::to_string(mbt).c_str(), 1);
+                    }
+                } catch (const std::exception& e) {
+                    // If conversion fails, leave the budget unset (stock behaviour)
+                }
+            }
+        } else if (!strcmp(optname, "prefill_cap")) {
+            if (optval != NULL) {
+                try {
+                    int cap = std::stoi(optval_str);
+                    if (cap > 0) {
+                        setenv("LLAMA_PREFILL_CAP", std::to_string(cap).c_str(), 1);
+                    }
+                } catch (const std::exception& e) {
+                    // If conversion fails, leave the per-slot cap unset (engine default)
+                }
+            }
        } else if (!strcmp(optname, "n_ctx_checkpoints") || !strcmp(optname, "ctx_checkpoints")) {
            if (optval != NULL) {
                try {