Add CHUNKED_PREFILL_PLAN.md for the llama.cpp backend. Key finding: the vendored llama.cpp server scheduler (update_slots) already implements chunked prefill with prefill/decode interleaving on the pinned version - decode tokens are seated first each iteration, prefill fills the leftover n_batch budget, both share one llama_decode. The draft upstream PR #10718 goal is already absorbed; no re-implementation needed. The real LocalAI gap is the n_batch/n_ubatch coupling at grpc-server.cpp (both set to nbatch()), which pins the logical scheduling window to the physical ubatch width. The plan scopes the decouple (C++ option + proto NUBatch + options.go), an optional decode-headroom prefill cap as a vendored patch, a token-identical verification harness, and keeps the work orthogonal to paged KV. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Chunked prefill + n_batch/n_ubatch decouple — implementation plan
Scope: LocalAI's llama.cpp backend (backend/cpp/llama-cpp/). Companion to
PHASED_VLLM_PARITY_PLAN.md Phase 3. This document is the concrete, file-cited
plan for what the brief called "chunked prefill".
Line numbers below are from two trees:
- LocalAI: backend/cpp/llama-cpp/grpc-server.cpp, core/backend/options.go, backend/backend.proto, core/backend/hardware_defaults.go (exact).
- Vendored upstream scheduler: llama.cpp/tools/server/server-context.cpp. The build copies llama.cpp/tools/server/* into tools/grpc-server/ (prepare.sh lines 15-17) and only overrides grpc-server.cpp + CMakeLists.txt. So update_slots() is inherited upstream code, not LocalAI code. Line numbers cited for it are from a same-era checkout (d12cc3d, 2026-04-09); the pin is f3e1828 (Makefile line 2). The structure is identical; exact lines may drift a few rows at the pin, so match on the quoted comment strings, not the integers.
TL;DR — the headline finding
Chunked prefill with prefill/decode interleaving is ALREADY implemented in the
llama.cpp server scheduler that LocalAI vendors. It is not a missing feature on
this version. update_slots() in server-context.cpp:
- Adds ongoing decode tokens first: "first, add sampled tokens from any ongoing sequences" (≈ line 2088). Every SLOT_STATE_GENERATING slot gets its one sampled token into the shared llama_batch before any prefill is added.
- Then fills the remaining n_batch budget with prompt (prefill) tokens: "next, batch any pending prompts without exceeding n_batch" (≈ line 2166), gated by params_base.cont_batching (LocalAI sets cont_batching = true by default, grpc-server.cpp:547). The per-slot prefill fill loop (≈ line 2552) is while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch), i.e. it caps each slot's prefill contribution to the remaining budget and defers the rest to the next iteration.
- Decodes the combined batch in one pass (≈ lines 2728-2741): decode tokens and prefill-chunk tokens go through the same llama_decode, which then splits internally into n_ubatch physical sub-batches.
This is exactly the behavior the abandoned-looking draft upstream PR #10718
("server : chunked prefill support") asked for — "the first task is no longer
blocked by the second long prompt processing task." That PR is still marked OPEN
but its goal was absorbed into the natural evolution of update_slots(); we do
not need to port it. A long prefill no longer stalls the decode batch: decode
slots are serviced first every iteration, prefill consumes only the leftover
budget.
Therefore: do not re-implement chunked prefill. The real LocalAI gap is narrow and is the rest of this plan:
- Phase A (the actual gap): the n_batch/n_ubatch decouple. LocalAI ties the scheduler token budget (n_batch) to the physical forward width (n_ubatch) at grpc-server.cpp:515 and :519. This forces n_batch == n_ubatch, so the logical scheduling window can never be wider than one physical ubatch. You cannot keep n_ubatch at the Blackwell GEMM sweet spot (2048) while widening n_batch so concurrent prefills + decodes co-batch into a larger logical window. There is no first-class batch: / ubatch: split on the Go side, and there is only a one-directional ubatch override on the C++ side (you can shrink ubatch below the coupled value, never grow n_batch above it).
- Phase B (optional policy lever): a decode-headroom prefill cap. Upstream caps prefill at the full n_batch shared with decode. Under heavy mixed load one fat prefill chunk per iteration still adds inter-token latency (ITL) jitter to the decoders sharing that forward. vLLM exposes long_prefill_token_threshold / max_num_partial_prefills for this. A LocalAI-specific per-iteration prefill cap (a patch to vendored update_slots) bounds that jitter. This is genuinely not in upstream and is the only place a scheduler-policy change is warranted.
1. Current behavior — precise citations
1.1 The scheduler is upstream, inherited verbatim
- prepare.sh:15-17 copies all of llama.cpp/tools/server/* into the grpc-server build dir; grpc-server.cpp (LocalAI) replaces only the HTTP/gRPC service + params_parse + parse_options.
- update_slots(), the slot state machine, and the batch builder are upstream server-context.cpp, untouched by LocalAI today.
- Slot states (server-context.cpp:36-42): SLOT_STATE_IDLE / WAIT_OTHER / STARTED / PROCESSING_PROMPT / DONE_PROMPT / GENERATING.
1.2 Decode-first, then prefill-fill, one shared batch
- common_batch_clear(batch) (≈ 2078): one batch per update_slots iteration.
- Decode phase (≈ 2088-2156): for each SLOT_STATE_GENERATING slot, common_batch_add(batch, slot.sampled, …, /*logits=*/true) adds exactly one token. Decode is guaranteed a seat before prefill runs.
- Budget fetch (≈ 2158-2160): n_batch = llama_n_batch(ctx), n_ubatch = llama_n_ubatch(ctx).
- Prefill phase (≈ 2166): if (params_base.cont_batching || batch.n_tokens == 0) → with cont_batching ON, prefill is added to the same batch as decode.
- Per-slot prefill fill (≈ 2552-2597): while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch) adds prompt tokens until the slot is done or the shared budget is hit. Whatever does not fit stays for the next iteration (the slot remains SLOT_STATE_PROCESSING_PROMPT).
- Whole-prompt completion (≈ 2603-2615): when the slot's prompt is fully consumed it flips to SLOT_STATE_DONE_PROMPT, sets batch.logits[last] = true, and initializes the sampler. On the next iteration it becomes GENERATING.
- Budget break (≈ 2693-2695): if (batch.n_tokens >= n_batch) break;.
- Decode (≈ 2728-2741): loops over batch_view slices of min(n_batch, remaining) and calls llama_decode; the physical n_ubatch split happens inside llama_decode.
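The decode-first, prefill-fill ordering listed above can be modelled in a few lines. This is a minimal sketch and not the upstream code: SlotSketch, IterationPlan and plan_iteration are hypothetical names, the real loop adds tokens one at a time rather than in bulk, and idle slots are not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical, simplified slot. NOT the upstream server_slot type.
struct SlotSketch {
    bool generating;    // true: slot contributes one decode token per iteration
    int  prompt_total;  // prompt tokens the task needs prefilled
    int  prompt_done;   // prompt tokens already placed in earlier iterations
};

struct IterationPlan {
    int decode_tokens  = 0;  // tokens seated in the decode phase
    int prefill_tokens = 0;  // tokens seated in the prefill phase
};

// One scheduling iteration: decode first, then prefill into the leftover budget.
IterationPlan plan_iteration(std::vector<SlotSketch> & slots, int n_batch) {
    assert(n_batch > 0);
    IterationPlan plan;

    // Phase 1: every generating slot gets exactly one token.
    for (const SlotSketch & s : slots) {
        if (s.generating) {
            plan.decode_tokens++;
        }
    }

    // Phase 2: prefill slots share whatever budget is left.
    int used = plan.decode_tokens;
    for (SlotSketch & s : slots) {
        if (s.generating) {
            continue;
        }
        const int room = std::max(0, n_batch - used);
        const int take = std::min(s.prompt_total - s.prompt_done, room);
        s.prompt_done       += take;
        used                += take;
        plan.prefill_tokens += take;
        if (take > 0 && s.prompt_done == s.prompt_total) {
            s.generating = true;  // prompt finished: decodes from the next iteration on
        }
        if (used >= n_batch) {
            break;                // budget break: the rest waits for the next iteration
        }
    }
    return plan;
}
```

With two decoding slots, one 5000-token prompt and n_batch = 2048, the sketch seats 2 decode tokens plus 2046 prefill tokens in each of the first two iterations and the remaining 908 in the third; the decoders are never skipped.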
1.3 The chunking is gated by can_split()
- server-context.cpp:225-231: can_split() returns true unless the task needs embeddings with non-LAST pooling. So completion/generation tasks always chunk-and-interleave; only embeddings/rerank force the whole prompt into one ubatch (≈ 2234-2244 raises "input is too large… increase the physical batch size"; this is exactly why LocalAI bumped n_ubatch for rerank, see below).
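As a rough illustration of that gate, the predicate can be written as below. The types PoolingSketch and TaskSketch and both function names are hypothetical stand-ins; upstream uses its own task and pooling types, and only the rule stated above is taken from the source.

```cpp
// Hypothetical stand-ins for the upstream pooling and task types.
enum class PoolingSketch { None, Mean, Cls, Last, Rank };

struct TaskSketch {
    bool          needs_embeddings;
    PoolingSketch pooling;
};

// True when the prompt may be spread over several scheduler iterations:
// everything except embedding tasks with non-LAST pooling.
bool can_split_sketch(const TaskSketch & t) {
    return !t.needs_embeddings || t.pooling == PoolingSketch::Last;
}

// A task that cannot be split must fit entirely inside one physical ubatch,
// otherwise the server rejects it as too large.
bool prompt_is_schedulable(const TaskSketch & t, int n_prompt_tokens, int n_ubatch) {
    return can_split_sketch(t) || n_prompt_tokens <= n_ubatch;
}
```

Under these assumptions a 600-token rerank prompt is rejected at n_ubatch = 512 and accepted at 2048, which is the situation the rerank workaround addresses.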
1.4 LocalAI ties n_batch to n_ubatch (the gap)
- grpc-server.cpp:515: params.n_batch = request->nbatch();
- grpc-server.cpp:519: params.n_ubatch = request->nbatch(); with the comment that this fixes reranking being capped at the 512 default n_ubatch.
- grpc-server.cpp:781-784: the only decouple knob today, an n_ubatch / ubatch option that overrides n_ubatch alone (added for embeddings/rerank). There is no batch / n_batch option parse, so n_batch cannot be raised above the coupled value from a model config. Confirmed: grep '"n_batch"|"batch"' in grpc-server.cpp returns nothing.
- Options arrive via request->options(i) parsed as optname:optval (grpc-server.cpp:584-585); these come from ModelOptions.Options ⟵ c.Options (core/backend/options.go:221).
1.5 Go side sends a single batch number
- backend/backend.proto:341: int32 NBatch = 4; is the only batch field; there is no NUBatch.
- core/backend/options.go:108-129 EffectiveBatchSize: returns c.Batch if set, else context size for single-pass (score/embed/rerank), else hardwareDefaultBatchSize (512).
- core/backend/options.go:228: NBatch: int32(b) (single value to the backend; becomes both n_batch and n_ubatch via 1.4).
- core/backend/hardware_defaults.go:28,37-40: BlackwellBatchSize = 2048; on Blackwell an unset batch defaults to 2048, so today n_batch == n_ubatch == 2048 there.
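The selection order described above, followed by the coupling from 1.4, can be mirrored as follows. The real logic is Go code in core/backend/options.go; this C++ mirror exists only to keep the examples in one language. ConfigSketch, its fields, and the convention that 0 means "unset" are assumptions of the sketch.

```cpp
// Values cited in the list above.
constexpr int kDefaultBatch   = 512;   // generic hardware default
constexpr int kBlackwellBatch = 2048;  // BlackwellBatchSize

// Hypothetical flattened view of the model config.
struct ConfigSketch {
    int  batch;         // explicit batch: value, 0 when unset (assumption)
    int  context_size;
    bool single_pass;   // score / embed / rerank usecases
    bool blackwell;     // result of the hardware detection
};

// Mirrors the described order: explicit value, then context size for
// single-pass usecases, then the hardware default.
int effective_batch_size(const ConfigSketch & c) {
    if (c.batch > 0) {
        return c.batch;
    }
    if (c.single_pass) {
        return c.context_size;
    }
    return c.blackwell ? kBlackwellBatch : kDefaultBatch;
}

struct BackendBatch {
    int n_batch;
    int n_ubatch;
};

// Today's coupling: the single NBatch number feeds both backend parameters.
BackendBatch couple(int nbatch) {
    return {nbatch, nbatch};
}
```

The point of the sketch is the last function: whatever effective_batch_size returns, the backend currently sees the same number twice.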
2. Why the decouple matters for serving (not just rerank)
Invariant: n_ubatch <= n_batch. n_ubatch is the physical forward-pass GEMM
width (compute efficiency; GB10 sweet spot ≈ 2048). n_batch is the per-iteration
scheduler token budget — the logical window shared by decode + prefill chunks,
analogous to vLLM's max_num_batched_tokens.
With n_batch == n_ubatch (today), the scheduling window cannot exceed one
physical ubatch. Consequences:
- Under concurrency, the combined (decode + multiple prefill chunks) logical batch is capped at the physical ubatch, so aggregate prefill cannot grow past one ubatch worth of tokens per iteration even when more slots have prompts queued.
- A user who shrinks batch: for memory also shrinks the physical ubatch, degrading prefill GEMM efficiency, and vice versa.
Decoupling lets us hold n_ubatch = 2048 (efficient GEMM) while setting a larger
n_batch (e.g. 4096) so more concurrent prefill+decode tokens co-schedule into one
logical window, lifting aggregate prefill under mixed load — llama_decode still
tiles the physical work at 2048.
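A back-of-the-envelope count makes the effect of the wider window concrete. The two helpers below are hypothetical and only do ceiling division; they count scheduler iterations and physical tiles, not wall-clock time, so whether the wider window actually pays off is still a question for measurement.

```cpp
#include <cassert>

// Physical ubatch tiles needed for one logical batch of n_tokens.
int ubatch_tiles(int n_tokens, int n_ubatch) {
    assert(n_ubatch > 0);
    return (n_tokens + n_ubatch - 1) / n_ubatch;
}

// Scheduler iterations needed to prefill prompt_tokens in total when
// n_decode decode tokens are seated first in every iteration.
int prefill_iterations(int prompt_tokens, int n_decode, int n_batch) {
    const int per_iter = n_batch - n_decode;  // budget left for prefill
    assert(per_iter > 0);
    return (prompt_tokens + per_iter - 1) / per_iter;
}
```

For 16000 queued prompt tokens and 4 active decoders, prefill_iterations gives 8 iterations at n_batch = 2048 and 4 at n_batch = 4096, while ubatch_tiles(4096, 2048) is 2: the wider window halves the number of scheduling rounds and each round runs two physical tiles instead of one.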
3. Phased implementation
Phase 0 — Verification harness (do first; TDD red)
Bite-sized, no code change to the scheduler.
- 0.1 Token-identical greedy under mixed load. Script: start the backend with n_parallel >= 4, greedy sampling (temp 0, fixed seed). Fire (a) several short decode streams and (b) one ~8k-token prompt concurrently (the exact repro from PR #10718's body works). Capture each stream's full token id sequence. Re-run with the prefill request absent. Assert the short streams' token ids are byte-identical in both runs; this proves interleaving does not perturb decode numerics (KV/position correctness across chunk boundaries). Wire as a Ginkgo spec under the backend e2e suite.
- 0.2 Mixed-workload throughput baseline. Use llama-batched-bench (built from the same tree) or a small driver hitting /v1/chat/completions: measure aggregate prefill tok/s and decode tok/s, and p50/p99 ITL of the decode streams, under the mixed workload. Record numbers for the current n_batch==n_ubatch config. This is the "before" of Phase A/B.
Expected result of Phase 0: 0.1 already passes (interleave is correct today); 0.2 gives the baseline the decouple must beat.
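The two checks reduce to a sequence comparison (0.1) and a latency percentile (0.2). The helpers below are a hypothetical sketch of that core logic only: the plan wires the real test as a Ginkgo spec, and the nearest-rank percentile definition is an assumption, since the plan does not say which definition to use.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Index of the first differing token id, or -1 when both runs are identical.
// A length mismatch counts as a divergence at the shorter length.
long first_divergence(const std::vector<int> & a, const std::vector<int> & b) {
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) {
            return static_cast<long>(i);
        }
    }
    return a.size() == b.size() ? -1 : static_cast<long>(n);
}

// Nearest-rank percentile, p in (0, 100], of inter-token latencies.
double percentile(std::vector<double> samples, double p) {
    assert(!samples.empty() && p > 0.0 && p <= 100.0);
    std::sort(samples.begin(), samples.end());
    const std::size_t rank =
        static_cast<std::size_t>(std::ceil(p / 100.0 * samples.size()));
    return samples[rank - 1];  // rank is 1-based
}
```

Reporting the divergence index rather than a bare pass/fail makes a failing 0.1 run easier to relate to a chunk boundary.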
Phase A — Decouple n_batch from n_ubatch
Goal: let model config set the physical ubatch independently of the logical batch, defaulting to today's behavior (no regression).
- A.1 C++: accept a batch / n_batch option (and keep ubatch). In grpc-server.cpp, after the existing ubatch branch (:781-784), add a sibling branch:

      } else if (!strcmp(optname, "n_batch") || !strcmp(optname, "batch")) {
          if (optval != NULL) {
              try { params.n_batch = std::stoi(optval_str); } catch (...) {}
          }

  This is the missing direction (raise n_batch above the coupled value). Order matters: both :515 / :519 run first (coupling as default), then option parsing overrides either independently. Add a clamp note: if a user sets n_ubatch > n_batch, llama.cpp will clamp/upbatch; log a warning. Keep the :519 aliasing for backward compat (rerank still works with no options).
- A.2 Proto: add an explicit physical ubatch field. At backend/backend.proto:341 add int32 NUBatch = <next free tag>; (do not reuse 4). Regenerate with make protogen-go + the C++ proto build.
- A.3 C++: honor NUBatch when present. In grpc-server.cpp params_parse, after :519, add:

      if (request->nubatch() > 0) { params.n_ubatch = request->nubatch(); }

  so an explicit physical ubatch wins over the n_batch alias, with the ubatch string-option as a third path for users who only edit options:.
- A.4 Go: config surface + plumbing.
  - Add UBatch *int (yaml ubatch) to the llama config struct alongside Batch (search core/config for the Batch field; mirror it).
  - In core/backend/options.go: add EffectiveUBatchSize(c) mirroring EffectiveBatchSize (return c.UBatch if set, else min(EffectiveBatchSize(c), BlackwellBatchSize-or-512) so the physical ubatch stays at the hardware sweet spot while n_batch may be larger). Set NUBatch: int32(EffectiveUBatchSize(c)) next to NBatch: (:228). For single-pass usecases, where EffectiveBatchSize returns the context size, skip the min() so NUBatch stays equal to NBatch, as section 4 requires.
  - Keep the default such that when neither is set, NUBatch == NBatch ⇒ byte-identical to today.
- A.5 Serving default (the lever). In hardware_defaults.go, introduce BlackwellLogicalBatch = 4096 (or a measured value) and let EffectiveBatchSize return it for multi-slot serving configs (when n_parallel > 1 and the model is a completion model), while EffectiveUBatchSize stays at BlackwellBatchSize = 2048. Gate behind the same Blackwell detection already used at hardware_defaults.go:37-40. Single-stream/embedding/rerank paths keep n_batch == n_ubatch. This is the only behavioral change shipped by Phase A; Phase 0.2 must show it is net-positive before defaulting it on.
- A.6 Tests. Extend hardware_defaults_internal_test.go with EffectiveUBatchSize cases; add a grpcModelOpts test asserting NUBatch <= NBatch and that unset config yields NUBatch == NBatch. Re-run 0.1 (must still be token-identical) and 0.2 (must show aggregate-prefill gain or neutral ITL) at n_batch=4096, n_ubatch=2048.
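The precedence that A.1 and A.3 describe (coupled default, then the proto field, then string options, then the invariant check) can be exercised in isolation with a sketch like the one below. resolve_batch_params and apply_int_option are hypothetical helpers, the options are assumed to be already split into optname/optval pairs, and clamping n_ubatch down to n_batch is a choice made for this sketch rather than confirmed backend behavior.

```cpp
#include <cstdio>
#include <map>
#include <string>

struct BatchParams {
    int n_batch;
    int n_ubatch;
};

// Overwrite dst when key is present and its value parses as an integer.
static void apply_int_option(const std::map<std::string, std::string> & options,
                             const std::string & key, int & dst) {
    auto it = options.find(key);
    if (it == options.end()) {
        return;
    }
    try {
        dst = std::stoi(it->second);
    } catch (...) {
        // unparsable value: keep the previous setting
    }
}

BatchParams resolve_batch_params(int nbatch, int nubatch,
                                 const std::map<std::string, std::string> & options) {
    BatchParams p{nbatch, nbatch};   // 1. coupled default, as today
    if (nubatch > 0) {
        p.n_ubatch = nubatch;        // 2. explicit proto field wins over the alias
    }
    // 3. string options override either side independently
    apply_int_option(options, "n_batch",  p.n_batch);
    apply_int_option(options, "batch",    p.n_batch);
    apply_int_option(options, "n_ubatch", p.n_ubatch);
    apply_int_option(options, "ubatch",   p.n_ubatch);
    // 4. keep n_ubatch <= n_batch and warn instead of failing (sketch assumption)
    if (p.n_ubatch > p.n_batch) {
        std::fprintf(stderr, "n_ubatch (%d) > n_batch (%d), clamping n_ubatch\n",
                     p.n_ubatch, p.n_batch);
        p.n_ubatch = p.n_batch;
    }
    return p;
}
```

With no proto field and no options the result is the coupled pair, which is the no-regression property A.6 asserts; passing batch = 4096 on top of NBatch = 2048 yields the 4096/2048 serving recipe.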
Phase B — Decode-headroom prefill cap (optional policy, vendored patch)
Only if Phase 0.2 / A shows decode ITL jitter from fat prefill chunks. This is the
one change that touches the inherited scheduler, so it lives as a patch in
backend/cpp/llama-cpp/patches/ (applied by prepare.sh:6-11 / Makefile
:141-145), never as an edit to a checked-in upstream file.
Policy (pseudocode; insert into update_slots() prefill fill loop, the
while (… && batch.n_tokens < n_batch) at ≈ server-context.cpp:2552):
# token budget for THIS iteration, decode already seated:
n_decode_in_batch = batch.n_tokens     # set after the decode phase
prefill_budget    = n_batch            # default == today
# max_prefill_per_iter == 0 disables the cap; n_ubatch (one physical tile)
# is the intended value when the cap is enabled
if serving_mode and max_prefill_per_iter > 0 and n_decode_in_batch > 0:
    # leave room so decoders are not starved/jittered by one giant prefill chunk
    prefill_budget = min(n_batch, n_decode_in_batch + max_prefill_per_iter)
# fill loop guard becomes:
while slot.prompt.n_tokens() < slot.task->n_tokens() \
      and batch.n_tokens < prefill_budget:
    ...
- max_prefill_per_iter is a new common_params field surfaced as an options: knob (max_prefill_tokens / mpt) parsed in grpc-server.cpp exactly like A.1, default 0 = disabled = today's behavior.
- Semantics mirror vLLM long_prefill_token_threshold: cap the prefill share so ongoing decodes keep a steady cadence; the remaining prompt rides the next iteration (already supported by the state machine: the slot stays PROCESSING_PROMPT).
- Correctness: unchanged KV/position path. Chunk boundaries already advance slot.prompt.tokens.pos_next() per added token (≈ 2570) and the slot resumes from slot.prompt.n_tokens() next iteration. Capping the budget only changes how many tokens are added this iteration, not which positions, so 0.1 must remain token-identical.
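The budget rule is small enough to test on its own before it goes into a vendored patch. The two functions below are a hypothetical sketch: they treat max_prefill_per_iter == 0 as "cap disabled", in line with the default described above, and they leave out the serving_mode switch from the pseudocode.

```cpp
#include <algorithm>

// Upper bound on batch.n_tokens for this iteration's prefill fill loop.
// max_prefill_per_iter <= 0 disables the cap and reproduces today's budget.
int prefill_budget(int n_batch, int n_decode_in_batch, int max_prefill_per_iter) {
    if (max_prefill_per_iter <= 0 || n_decode_in_batch == 0) {
        return n_batch;  // no cap, or no decoders to protect
    }
    return std::min(n_batch, n_decode_in_batch + max_prefill_per_iter);
}

// Prompt tokens one slot may add now, given the tokens already in the batch.
int prefill_take(int remaining_prompt, int batch_tokens, int budget) {
    return std::max(0, std::min(remaining_prompt, budget - batch_tokens));
}
```

With n_batch = 4096, four active decoders and a cap of 2048, the budget is 2052, so an 8000-token prompt contributes 2048 tokens this iteration and the rest waits. With no decoders active the full 4096 stays available, so an idle server prefills at the same rate as today.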
Phase C — Docs + defaults rollout
- Document batch / ubatch (and max_prefill_tokens if B ships) in the docs/content/ model-config reference, with the serving recipe (n_parallel>1, n_batch=4096, ubatch=2048).
- Note the orthogonality to paged KV (below) in PHASED_VLLM_PARITY_PLAN.md Phase 3.
4. Risk / correctness
- KV-cache & positions across chunks: already handled upstream. Each prefill token added advances pos_next() (≈ 2570) and is pushed to slot.prompt.tokens (≈ 2573); the next iteration resumes from slot.prompt.n_tokens(). Chunk boundaries are transparent to the KV cache because positions are absolute, not per-chunk. Phase A changes only budgets, not positions; Phase B changes only the per-iteration count. The 0.1 token-identical test is the guardrail.
- Unified KV cache (LocalAI default, n_parallel slots share one cache): unaffected. Co-batching prefill+decode across slots is what the unified cache is for; positions are per-seq_id ({ slot.id } in common_batch_add).
- n_ubatch > n_batch: invalid; A.4 clamps EffectiveUBatchSize <= EffectiveBatchSize and A.1 logs a warning if options violate it.
- Embeddings / rerank: must keep n_ubatch >= prompt length (single pass, can_split()==false). The existing :519 alias + EffectiveBatchSize context-sizing for single-pass usecases (options.go:119-124) must be preserved. Do not let the serving BlackwellLogicalBatch default leak into single-pass configs (A.5 gates on completion + n_parallel>1).
- Turboquant fork: the fork lacks some common_params fields (see the LOCALAI_LEGACY_LLAMA_CPP_SPEC precedent at grpc-server.cpp:755). n_batch / n_ubatch are long-standing fields and safe; if Phase B adds max_prefill_per_iter, guard the new field behind a #ifndef like the checkpoint block does.
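The claim that chunk boundaries are invisible to the KV cache follows from positions being absolute. A minimal model of that argument, with a hypothetical function name and no connection to the real batch types, is:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Split a prompt of n_prompt tokens into per-iteration chunks of at most
// budget tokens and record the absolute position given to each token.
std::vector<int> positions_after_chunking(int n_prompt, int budget) {
    assert(budget > 0);
    std::vector<int> positions;
    int done = 0;  // plays the role of slot.prompt.n_tokens()
    while (done < n_prompt) {
        const int take = std::min(budget, n_prompt - done);
        for (int i = 0; i < take; ++i) {
            positions.push_back(done + i);  // absolute, not relative to the chunk
        }
        done += take;  // the next iteration resumes here
    }
    return positions;
}
```

Whatever budget is used, the resulting position sequence is 0, 1, ..., n_prompt - 1. That is why changing n_batch (Phase A) or capping the per-iteration share (Phase B) should leave the 0.1 token sequences unchanged, with the 0.1 test remaining the actual guardrail.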
5. Orthogonality to paged KV (Phase 2)
Keep them independent. Paged KV (the -kvp / block-manager effort, draft #22569,
and paged/) changes where KV blocks live (allocation/utilization). Chunked
prefill / this decouple changes how many tokens per iteration the scheduler
batches (the n_batch budget and decode/prefill interleave). They compose: paged
KV raises the concurrency ceiling (more slots), the decouple widens the per-iter
scheduling window to feed those slots; neither touches the other's data structures.
The only contact point is update_slots() — if both ship a vendored patch to it,
land them as separate, ordered patches in patches/ and keep the hunks disjoint
(paged touches allocation/seq_rm; chunked-prefill Phase B touches the prefill fill
budget).
6. Bottom line
- Chunked prefill + decode interleave: already present and correct on the pinned llama.cpp — verify (Phase 0.1), do not rebuild.
- Real work: the n_batch/n_ubatch decouple (Phase A), which is small, additive, and default-preserving, plus an optional decode-headroom prefill cap (Phase B) if measurements show ITL jitter. Both are LocalAI-side: A in grpc-server.cpp + proto + options.go; B as a vendored patches/ hunk.