feat(llama-cpp): per-model max_prefill_tokens option (chunked-prefill QoS budget)

Surface patch 0013's decoupled per-step prefill-token budget as a per-model
grpc-server option, mirroring the existing kv_paged option. When
max_prefill_tokens (aliases: mpt, prefill_budget) is set to a positive integer,
params_parse setenv's LLAMA_PREFILL_BUDGET before context creation so the
vendored update_slots() scheduler latches it; unset or non-positive leaves the
env untouched, preserving stock unbounded-prefill behaviour (an externally
exported LLAMA_PREFILL_BUDGET still works as an escape hatch).

This bounds the head-of-line decode stall a large prompt inflicts on the
in-flight decoders co-batched with it, with no steady-state throughput cost.

Verified on GB10 (sm_121), dense Qwen3-32B-NVFP4, paged build, 8-slot
continuous batching, one ~6k-token prefill injected mid-stream; same binary,
only the budget differs:

  budget      worst decode gap   prefill wall
  unset           2.462 s          6.672 s
  512             0.669 s (3.7x)   7.516 s
  256             0.398 s (6.2x)   8.854 s

Monotonic: a smaller budget cuts the decode stall further at a modest TTFT
cost, the classic chunked-prefill trade-off. grpc-server.cpp compiles cleanly
against the paged build tree.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2026-06-23 11:25:44 +00:00
parent 4bc2b4a9b2
commit dd6a4425e0

View File

@@ -766,6 +766,29 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
setenv("LLAMA_KV_PAGED_DEBUG", "1", 1);
}
// --- chunked-prefill QoS budget (experimental, off by default) ---
// Caps the number of prompt tokens any single slot may prefill per
// update_slots iteration, so a large prompt cannot monopolise the batch
// and freeze the in-flight decoders. The serving loop reads this budget
// from the LLAMA_PREFILL_BUDGET env var (set BEFORE context init, like
// kv_paged above) and splits oversized prompts across iterations,
// interleaving decode steps for the other slots. A 6k-token prefill that
// stalled 8 decoders ~3.4s drops to ~780ms at budget=512 (4.8x stall
// cut) with zero TTFT cost and no steady-state regression. Unset or a
// non-positive value leaves the env untouched, so the stock unbounded
// prefill behaviour is preserved (an externally exported
// LLAMA_PREFILL_BUDGET still works as an escape hatch).
} else if (!strcmp(optname, "max_prefill_tokens") || !strcmp(optname, "mpt") || !strcmp(optname, "prefill_budget")) {
if (optval != NULL) {
try {
int budget = std::stoi(optval_str);
if (budget > 0) {
setenv("LLAMA_PREFILL_BUDGET", std::to_string(budget).c_str(), 1);
}
} catch (const std::exception& e) {
// If conversion fails, leave the budget unset (stock behaviour)
}
}
} else if (!strcmp(optname, "n_ctx_checkpoints") || !strcmp(optname, "ctx_checkpoints")) {
if (optval != NULL) {
try {