LocalAI/backend/cpp/llama-cpp
Ettore Di Giacinto dd6a4425e0 feat(llama-cpp): per-model max_prefill_tokens option (chunked-prefill QoS budget)
Surface patch 0013's decoupled per-step prefill-token budget as a per-model
grpc-server option, mirroring the existing kv_paged option. When
max_prefill_tokens (aliases: mpt, prefill_budget) is set to a positive integer,
params_parse sets LLAMA_PREFILL_BUDGET via setenv() before context creation so
the vendored update_slots() scheduler latches it. When the option is unset or
non-positive, the environment is left untouched, which preserves the stock
unbounded-prefill behaviour (an externally exported LLAMA_PREFILL_BUDGET still
works as an escape hatch).
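
The option handling described above can be sketched roughly as follows. This
is a minimal illustration, not the code in grpc-server.cpp: the helper name
apply_prefill_budget and the "key:value" option string format are assumptions
made for the example. Only the option names, the positive-integer rule, and
the LLAMA_PREFILL_BUDGET variable come from this commit.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical helper (not the real params_parse): takes one option string,
// assumed here to look like "key:value", and exports LLAMA_PREFILL_BUDGET
// only when the key matches and the value is a positive integer.
// Returns true if the environment variable was set.
bool apply_prefill_budget(const std::string &opt) {
    const auto sep = opt.find(':');
    if (sep == std::string::npos) {
        return false;
    }
    const std::string key = opt.substr(0, sep);
    const std::string val = opt.substr(sep + 1);

    // Option name and aliases as listed in the commit message.
    if (key != "max_prefill_tokens" && key != "mpt" && key != "prefill_budget") {
        return false;
    }

    // Strict parse: reject empty input, trailing junk, and overflow.
    char *end = nullptr;
    errno = 0;
    const long n = std::strtol(val.c_str(), &end, 10);
    if (end == val.c_str() || *end != '\0' || errno == ERANGE || n <= 0) {
        return false;  // non-positive or invalid: leave the env untouched
    }

    // POSIX setenv; must run before the context (and scheduler) is created.
    return ::setenv("LLAMA_PREFILL_BUDGET", std::to_string(n).c_str(), 1) == 0;
}
```

With this sketch, "mpt:512" exports LLAMA_PREFILL_BUDGET=512, while
"max_prefill_tokens:0" changes nothing, so a value exported by the caller's
shell stays in effect.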

This bounds the head-of-line decode stall a large prompt inflicts on the
in-flight decoders co-batched with it, with no steady-state throughput cost.
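
To see why a budget bounds the stall, here is a toy model of how a prompt is
split across scheduler steps. It is a sketch under simplifying assumptions,
not the vendored update_slots() logic: the function name plan_prefill_chunks
is hypothetical, and batch-size limits and the decode tokens sharing each step
are ignored.

```cpp
#include <algorithm>
#include <vector>

// Toy model: returns how many prefill tokens are scheduled in each step.
// budget <= 0 models the unset case, where the whole prompt lands in one step
// and co-batched decoders wait for all of it.
std::vector<int> plan_prefill_chunks(int prompt_tokens, int budget) {
    std::vector<int> chunks;
    int remaining = prompt_tokens;
    while (remaining > 0) {
        const int take = budget > 0 ? std::min(remaining, budget) : remaining;
        chunks.push_back(take);  // decoders get a turn after every chunk
        remaining -= take;
    }
    return chunks;
}
```

For an illustrative 6000-token prompt, a budget of 512 yields 12 steps (the
last one carrying 368 tokens) and a budget of 256 yields 24. Each step is
shorter, so decoders are blocked for less time per step, while the prompt
itself needs more steps to finish.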

Verified on GB10 (sm_121), dense Qwen3-32B-NVFP4, paged build, 8-slot
continuous batching, one ~6k-token prefill injected mid-stream; same binary,
only the budget differs:

  budget      worst decode gap   prefill wall
  unset           2.462 s          6.672 s
  512             0.669 s (3.7x)   7.516 s
  256             0.398 s (6.2x)   8.854 s

Monotonic: a smaller budget cuts the decode stall further at a TTFT cost
(prefill wall time rises about 13% at 512 and about 33% at 256), the classic
chunked-prefill trade-off. grpc-server.cpp compiles cleanly
against the paged build tree.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-23 11:25:44 +00:00