feat(llama-cpp/paged): dynamic decode-first prefill budget (patch 0016, continuous-batch P1)

Mirror the P1 engine change of CONTINUOUS_BATCH_SCHEDULER_SCOPE.md into the
vendored paged patch series and surface it as a LocalAI model option.

- patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch:
  supersede patch 0013's STATIC per-step prefill cap with a DYNAMIC,
  decode-first token budget in update_slots(). At the budget seam (already
  after Phase 1's decode fill, so batch.n_tokens == D is known) compute
  T = clamp(LLAMA_MAX_BATCH_TOKENS ?: n_batch, n_ubatch, n_batch),
  prefill_budget_step = max(n_ubatch, T - D), and a per-slot prompt-chunk
  cap prefill_cap_per_slot; bound the Phase-2 prompt-fill loop and outer
  admission break by these instead of 0013's constant. This is a policy-only
  change: no new slot states, no batch-formation rewrite, and no libllama
  changes. Decode tokens are structurally claimed first (Phase 1), so the
  decode-first guarantee comes for free. As the decode load D rises, the
  leftover T - D shrinks automatically, so the budget is designed to
  self-tune across npl 8..128 and dense vs MoE and to hold the GB10 decode
  ceiling without hand tuning (vs 0013's hand-picked 256). The legacy
  LLAMA_PREFILL_BUDGET path is preserved (honoured only when the dynamic
  knob is unset), so 0013 is cleanly subsumed. Default-off and
  byte-identical: with all knobs unset, and in the degenerate T == n_batch
  case, behaviour is identical to stock by construction (the n_batch hard
  ceiling is kept and the dynamic bounds reach it at the same point for
  every D). Orthogonal to LLAMA_KV_PAGED.

- grpc-server.cpp: wire the new knobs as model options max_batch_tokens / mbt
  (-> LLAMA_MAX_BATCH_TOKENS) and prefill_cap (-> LLAMA_PREFILL_CAP), beside
  the existing max_prefill_tokens / mpt seam. Both are default-off; when set,
  max_batch_tokens takes precedence over the legacy static budget.

- patches/paged/P1_DYNAMIC_BUDGET_RESULTS.md: design, the byte-identical
  determinism analysis (verified by construction), the local patch-apply
  verification, and the gate + A/B bench methodology.
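
The budget arithmetic from the first bullet can be sketched as below. This is a minimal illustration, not the code from patch 0016: the struct `prefill_budget`, the function `compute_prefill_budget`, and the convention that a non-positive `max_batch_tokens` means "LLAMA_MAX_BATCH_TOKENS unset" are assumptions made for the example. The per-slot cap `prefill_cap_per_slot` is left out because the message does not state how it is derived.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical result type for illustration; not a type from the patch.
struct prefill_budget {
    int total;   // T: total per-step token budget after clamping
    int prefill; // prefill_budget_step: tokens available to prompt fill
};

// Sketch of the decode-first budget described in the commit message.
// max_batch_tokens <= 0 stands in for "knob unset" (assumption).
// Requires 0 < n_ubatch <= n_batch, since std::clamp needs lo <= hi.
inline prefill_budget compute_prefill_budget(int max_batch_tokens,
                                             int n_batch,
                                             int n_ubatch,
                                             int n_decode) {
    // T = clamp(LLAMA_MAX_BATCH_TOKENS ?: n_batch, n_ubatch, n_batch)
    const int requested = max_batch_tokens > 0 ? max_batch_tokens : n_batch;
    const int total = std::clamp(requested, n_ubatch, n_batch);

    // prefill_budget_step = max(n_ubatch, T - D): decode has already
    // claimed D tokens, prefill gets the leftover but never less than
    // one micro-batch.
    const int prefill = std::max(n_ubatch, total - n_decode);
    return {total, prefill};
}
```

With n_batch = 2048, n_ubatch = 512 and max_batch_tokens = 1024, a decode load of 8 leaves 1016 tokens for prefill, a load of 128 leaves 896, and a load of 600 hits the n_ubatch floor of 512, which is the shrinking-leftover behaviour the bullet describes.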

Validation status: the patch applies cleanly on top of LLAMA_VERSION
(f3e1828) + paged 0001-0015, and the off-path / T==n_batch determinism is
proven by construction. The GB10 sm_121 build, the four runtime gates, and
the dense+MoE A/B sweep are still pending a DGX run (the dev box was
unreachable this session) and are documented as such in
P1_DYNAMIC_BUDGET_RESULTS.md; do not claim the quantitative TTFT payoff
until that re-run lands.
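
The "by construction" argument for the T == n_batch case can be checked exhaustively with a few lines. The sketch below is one reading of the commit message, not code from the patch: it assumes stock behaviour gives prefill exactly n_batch - D tokens per step, and the names `effective_prefill` and `degenerate_case_matches_stock` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Tokens prefill can add in one step once the kept n_batch hard ceiling
// is applied on top of the dynamic bound (assumed model of the patch).
inline int effective_prefill(int total, int n_batch, int n_ubatch, int n_decode) {
    const int dynamic_bound = std::max(n_ubatch, total - n_decode);
    const int ceiling_room  = std::max(0, n_batch - n_decode);
    return std::min(dynamic_bound, ceiling_room);
}

// Returns true when, for T == n_batch, prefill gets exactly the room
// that stock would give it (n_batch - D) for every decode load D.
inline bool degenerate_case_matches_stock(int n_batch, int n_ubatch) {
    for (int d = 0; d <= n_batch; ++d) {
        const int stock_room = n_batch - d;
        if (effective_prefill(n_batch, n_batch, n_ubatch, d) != stock_room) {
            return false;
        }
    }
    return true;
}
```

The check passes because max(n_ubatch, n_batch - D) is never smaller than n_batch - D, so the hard ceiling is always the binding bound when T == n_batch. This only illustrates the arithmetic; it says nothing about the pending GB10 runtime gates.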

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-24 07:48:20 +00:00
parent fccbb4082d
commit 24ce7d0823
3 changed files with 401 additions and 0 deletions


@@ -789,6 +789,40 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                    // If conversion fails, leave the budget unset (stock behaviour)
                }
            }
            // --- dynamic decode-first prefill budget (patch 0016, continuous-batch P1) ---
            // Supersedes max_prefill_tokens (the static patch-0013 cap) with the dynamic
            // T - D budget read by update_slots(): a single total per-step token budget T
            // (max_batch_tokens / mbt, the vLLM max_num_batched_tokens analogue) of which
            // decode claims its live load D first and prefill gets the leftover, plus an
            // optional per-slot prompt-chunk cap (prefill_cap, the
            // long_prefill_token_threshold analogue). Both are set BEFORE context init, like kv_paged /
            // max_prefill_tokens above. Unset leaves the env untouched, so the engine stays
            // byte-identical to stock (an externally exported LLAMA_MAX_BATCH_TOKENS /
            // LLAMA_PREFILL_CAP still works as an escape hatch). When max_batch_tokens is set
            // it takes precedence over max_prefill_tokens: the engine honours the legacy
            // LLAMA_PREFILL_BUDGET only when the dynamic knob is unset.
        } else if (!strcmp(optname, "max_batch_tokens") || !strcmp(optname, "mbt")) {
            if (optval != NULL) {
                try {
                    int mbt = std::stoi(optval_str);
                    if (mbt > 0) {
                        setenv("LLAMA_MAX_BATCH_TOKENS", std::to_string(mbt).c_str(), 1);
                    }
                } catch (const std::exception& e) {
                    // If conversion fails, leave the budget unset (stock behaviour)
                }
            }
        } else if (!strcmp(optname, "prefill_cap")) {
            if (optval != NULL) {
                try {
                    int cap = std::stoi(optval_str);
                    if (cap > 0) {
                        setenv("LLAMA_PREFILL_CAP", std::to_string(cap).c_str(), 1);
                    }
                } catch (const std::exception& e) {
                    // If conversion fails, leave the per-slot cap unset (engine default)
                }
            }
        } else if (!strcmp(optname, "n_ctx_checkpoints") || !strcmp(optname, "ctx_checkpoints")) {
            if (optval != NULL) {
                try {