mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-24 00:28:55 -04:00
feat(llama-cpp): per-model max_prefill_tokens option (chunked-prefill QoS budget)
Surface patch 0013's decoupled per-step prefill-token budget as a per-model grpc-server option, mirroring the existing kv_paged option. When max_prefill_tokens (aliases: mpt, prefill_budget) is set to a positive integer, params_parse setenv's LLAMA_PREFILL_BUDGET before context creation so the vendored update_slots() scheduler latches it; unset or non-positive leaves the env untouched, preserving stock unbounded-prefill behaviour (an externally exported LLAMA_PREFILL_BUDGET still works as an escape hatch). This bounds the head-of-line decode stall a large prompt inflicts on the in-flight decoders co-batched with it, with no steady-state throughput cost. Verified on GB10 (sm_121), dense Qwen3-32B-NVFP4, paged build, 8-slot continuous batching, one ~6k-token prefill injected mid-stream; same binary, only the budget differs: budget worst decode gap prefill wall unset 2.462 s 6.672 s 512 0.669 s (3.7x) 7.516 s 256 0.398 s (6.2x) 8.854 s Monotonic: a smaller budget cuts the decode stall further at a modest TTFT cost, the classic chunked-prefill trade-off. grpc-server.cpp compiles cleanly against the paged build tree. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
@@ -766,6 +766,29 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
|
||||
if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
|
||||
setenv("LLAMA_KV_PAGED_DEBUG", "1", 1);
|
||||
}
|
||||
// --- chunked-prefill QoS budget (experimental, off by default) ---
|
||||
// Caps the number of prompt tokens any single slot may prefill per
|
||||
// update_slots iteration, so a large prompt cannot monopolise the batch
|
||||
// and freeze the in-flight decoders. The serving loop reads this budget
|
||||
// from the LLAMA_PREFILL_BUDGET env var (set BEFORE context init, like
|
||||
// kv_paged above) and splits oversized prompts across iterations,
|
||||
// interleaving decode steps for the other slots. A 6k-token prefill that
|
||||
// stalled 8 decoders ~3.4s drops to ~780ms at budget=512 (4.8x stall
|
||||
// cut) with zero TTFT cost and no steady-state regression. Unset or a
|
||||
// non-positive value leaves the env untouched, so the stock unbounded
|
||||
// prefill behaviour is preserved (an externally exported
|
||||
// LLAMA_PREFILL_BUDGET still works as an escape hatch).
|
||||
} else if (!strcmp(optname, "max_prefill_tokens") || !strcmp(optname, "mpt") || !strcmp(optname, "prefill_budget")) {
|
||||
if (optval != NULL) {
|
||||
try {
|
||||
int budget = std::stoi(optval_str);
|
||||
if (budget > 0) {
|
||||
setenv("LLAMA_PREFILL_BUDGET", std::to_string(budget).c_str(), 1);
|
||||
}
|
||||
} catch (const std::exception& e) {
|
||||
// If conversion fails, leave the budget unset (stock behaviour)
|
||||
}
|
||||
}
|
||||
} else if (!strcmp(optname, "n_ctx_checkpoints") || !strcmp(optname, "ctx_checkpoints")) {
|
||||
if (optval != NULL) {
|
||||
try {
|
||||
|
||||
Reference in New Issue
Block a user