docs(llama-cpp): scope chunked prefill + n_batch/n_ubatch decouple

Add CHUNKED_PREFILL_PLAN.md for the llama.cpp backend. Key finding: the
vendored llama.cpp server scheduler (update_slots) already implements
chunked prefill with prefill/decode interleaving on the pinned version -
decode tokens are seated first each iteration, prefill fills the leftover
n_batch budget, both share one llama_decode. The draft upstream PR #10718
goal is already absorbed; no re-implementation needed.

The real LocalAI gap is the n_batch/n_ubatch coupling at grpc-server.cpp
(both set to nbatch()), which pins the logical scheduling window to the
physical ubatch width. The plan scopes the decouple (C++ option + proto
NUBatch + options.go), an optional decode-headroom prefill cap as a
vendored patch, a token-identical verification harness, and keeps the
work orthogonal to paged KV.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2026-06-21 12:54:22 +00:00
parent 07985ba45b
commit fdb7f56bb7

View File

@@ -0,0 +1,334 @@
# Chunked prefill + n_batch/n_ubatch decouple — implementation plan
Scope: LocalAI's llama.cpp backend (`backend/cpp/llama-cpp/`). Companion to
`PHASED_VLLM_PARITY_PLAN.md` Phase 3. This document is the concrete, file-cited
plan for what the brief called "chunked prefill".
Line numbers below are from two trees:
- LocalAI: `backend/cpp/llama-cpp/grpc-server.cpp`, `core/backend/options.go`,
`backend/backend.proto`, `core/backend/hardware_defaults.go` — exact.
- Vendored upstream scheduler: `llama.cpp/tools/server/server-context.cpp`. The
build copies `llama.cpp/tools/server/*` into `tools/grpc-server/` (`prepare.sh`
lines 15-17) and only overrides `grpc-server.cpp` + `CMakeLists.txt`. So
`update_slots()` is **inherited upstream code, not LocalAI code**. Line numbers
cited for it are from a same-era checkout (`d12cc3d`, 2026-04-09); the pin is
`f3e1828` (Makefile line 2). The structure is identical; exact lines may drift
a few rows at the pin — match on the quoted comment strings, not the integers.
---
## TL;DR — the headline finding
**Chunked prefill with prefill/decode interleaving is ALREADY implemented** in the
llama.cpp server scheduler that LocalAI vendors. It is not a missing feature on
this version. `update_slots()` in `server-context.cpp`:
1. **Adds ongoing decode tokens first** — "first, add sampled tokens from any
ongoing sequences" (≈ line 2088). Every `SLOT_STATE_GENERATING` slot gets its
one sampled token into the shared `llama_batch` before any prefill is added.
2. **Then fills the remaining `n_batch` budget with prompt (prefill) tokens**
"next, batch any pending prompts without exceeding n_batch" (≈ line 2166),
gated by `params_base.cont_batching` (LocalAI sets `cont_batching = true` by
default, `grpc-server.cpp:547`). The per-slot prefill fill loop
(≈ line 2552) is `while (slot.prompt.n_tokens() < slot.task->n_tokens() &&
batch.n_tokens < n_batch)` — i.e. it caps each slot's prefill contribution to
the **remaining** budget and defers the rest to the next iteration.
3. **Decodes the combined batch in one pass** (≈ line 2728-2741): decode tokens
and prefill-chunk tokens go through the **same `llama_decode`**, which then
splits internally into `n_ubatch` physical sub-batches.
This is exactly the behavior the abandoned-looking draft **upstream PR #10718**
("server : chunked prefill support") asked for — "the first task is no longer
blocked by the second long prompt processing task." That PR is still marked OPEN
but its goal was absorbed into the natural evolution of `update_slots()`; we do
**not** need to port it. A long prefill no longer stalls the decode batch: decode
slots are serviced first every iteration, prefill consumes only the leftover
budget.
**Therefore: do not re-implement chunked prefill.** The real LocalAI gap is
narrow and is the rest of this plan:
- **Phase A (the actual gap): the `n_batch`/`n_ubatch` decouple.** LocalAI ties
the scheduler token budget (`n_batch`) to the physical forward width
(`n_ubatch`) at `grpc-server.cpp:515` + `:519`. This forces
`n_batch == n_ubatch`, so the logical scheduling window can never be wider than
one physical ubatch. You cannot keep `n_ubatch` at the Blackwell GEMM sweet
spot (2048) while widening `n_batch` so concurrent prefills + decodes co-batch
into a larger logical window. There is no first-class `batch:`/`ubatch:` split
on the Go side, and there is only a one-directional `ubatch` override on the C++
side (you can shrink ubatch below the coupled value, never grow n_batch above
it).
- **Phase B (optional policy lever): a decode-headroom prefill cap.** Upstream
caps prefill at the full `n_batch` shared with decode. Under heavy mixed load
one fat prefill chunk per iteration still adds inter-token latency (ITL) jitter
to the decoders sharing that forward. vLLM exposes
`long_prefill_token_threshold` / `max_num_partial_prefills` for this. A
LocalAI-specific per-iteration prefill cap (a patch to vendored `update_slots`)
bounds that jitter. This is genuinely not in upstream and is the only place a
scheduler-policy change is warranted.
---
## 1. Current behavior — precise citations
### 1.1 The scheduler is upstream, inherited verbatim
- `prepare.sh:15-17` copies all of `llama.cpp/tools/server/*` into the
`grpc-server` build dir; `grpc-server.cpp` (LocalAI) replaces only the HTTP/gRPC
service + `params_parse` + `parse_options`. `update_slots()`, the slot state
machine, and the batch builder are **upstream `server-context.cpp`**, untouched
by LocalAI today.
- Slot states: `server-context.cpp:36-42`
`SLOT_STATE_IDLE / WAIT_OTHER / STARTED / PROCESSING_PROMPT / DONE_PROMPT /
GENERATING`.
### 1.2 Decode-first, then prefill-fill, one shared batch
- `common_batch_clear(batch)` (≈ 2078) — one batch per `update_slots` iteration.
- Decode phase (≈ 2088-2156): for each `SLOT_STATE_GENERATING` slot,
`common_batch_add(batch, slot.sampled, …, /*logits=*/true)` adds exactly one
token. Decode is guaranteed a seat before prefill runs.
- Budget fetch (≈ 2158-2160): `n_batch = llama_n_batch(ctx)`,
`n_ubatch = llama_n_ubatch(ctx)`.
- Prefill phase (≈ 2166): `if (params_base.cont_batching || batch.n_tokens == 0)`
→ with cont_batching ON, prefill is added to the **same** batch as decode.
- Per-slot prefill fill (≈ 2552-2597):
`while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch)`
— adds prompt tokens until the slot is done **or** the shared budget is hit.
Whatever does not fit stays for the next iteration (the slot remains
`SLOT_STATE_PROCESSING_PROMPT`).
- Whole-prompt completion (≈ 2603-2615): when the slot's prompt is fully consumed
it flips to `SLOT_STATE_DONE_PROMPT`, sets `batch.logits[last] = true`, inits
the sampler. Next iteration it becomes `GENERATING`.
- Budget break (≈ 2693-2695): `if (batch.n_tokens >= n_batch) break;`.
- Decode (≈ 2728-2741): loops `batch_view` slices of `min(n_batch, remaining)` and
calls `llama_decode`; the physical `n_ubatch` split happens inside
`llama_decode`.
### 1.3 The chunking is gated by `can_split()`
- `server-context.cpp:225-231`: `can_split()` returns true unless the task needs
embeddings with non-LAST pooling. So **completion/generation tasks always
chunk-and-interleave**; only embeddings/rerank force the whole prompt into one
ubatch (≈ 2234-2244 raises "input is too large… increase the physical batch
size" — this is exactly why LocalAI bumped `n_ubatch` for rerank, see below).
### 1.4 LocalAI ties n_batch to n_ubatch (the gap)
- `grpc-server.cpp:515``params.n_batch = request->nbatch();`
- `grpc-server.cpp:519``params.n_ubatch = request->nbatch();` with the comment
that this fixes reranking being capped at the 512 default `n_ubatch`.
- `grpc-server.cpp:781-784` — the **only** decouple knob today: an `n_ubatch` /
`ubatch` option that overrides `n_ubatch` alone (added for embeddings/rerank).
There is **no** `batch` / `n_batch` option parse, so `n_batch` cannot be raised
above the coupled value from a model config. Confirmed: `grep '"n_batch"|"batch"'`
in `grpc-server.cpp` returns nothing.
- Options arrive via `request->options(i)` parsed as `optname:optval`
(`grpc-server.cpp:584-585`); these come from `ModelOptions.Options`
`c.Options` (`core/backend/options.go:221`).
### 1.5 Go side sends a single batch number
- `backend/backend.proto:341``int32 NBatch = 4;` is the only batch field; there
is **no** `NUBatch`.
- `core/backend/options.go:108-129` `EffectiveBatchSize`: returns `c.Batch` if set,
else context size for single-pass (score/embed/rerank), else
`hardwareDefaultBatchSize(512)`.
- `core/backend/options.go:228``NBatch: int32(b)` (single value to the
backend; becomes both `n_batch` and `n_ubatch` via 1.4).
- `core/backend/hardware_defaults.go:28,37-40``BlackwellBatchSize = 2048`;
on Blackwell an unset batch defaults to 2048, so today
`n_batch == n_ubatch == 2048` there.
---
## 2. Why the decouple matters for serving (not just rerank)
Invariant: `n_ubatch <= n_batch`. `n_ubatch` is the physical forward-pass GEMM
width (compute efficiency; GB10 sweet spot ≈ 2048). `n_batch` is the per-iteration
**scheduler token budget** — the logical window shared by decode + prefill chunks,
analogous to vLLM's `max_num_batched_tokens`.
With `n_batch == n_ubatch` (today), the scheduling window cannot exceed one
physical ubatch. Consequences:
- Under concurrency, the combined (decode + multiple prefill chunks) logical batch
is capped at the physical ubatch, so aggregate prefill cannot grow past one
ubatch worth of tokens per iteration even when more slots have prompts queued.
- A user who shrinks `batch:` for memory also shrinks the physical ubatch,
degrading prefill GEMM efficiency — and vice versa.
Decoupling lets us hold `n_ubatch = 2048` (efficient GEMM) while setting a larger
`n_batch` (e.g. 4096) so more concurrent prefill+decode tokens co-schedule into one
logical window, lifting aggregate prefill under mixed load — `llama_decode` still
tiles the physical work at 2048.
---
## 3. Phased implementation
### Phase 0 — Verification harness (do first; TDD red)
Bite-sized, no code change to the scheduler.
- **0.1 Token-identical greedy under mixed load.** Script: start the backend with
`n_parallel >= 4`, greedy sampling (temp 0, fixed seed). Fire (a) several short
decode streams and (b) one ~8k-token prompt concurrently (the exact repro from
PR #10718's body works). Capture each stream's full token id sequence. Re-run
with the prefill request absent. **Assert the short streams' token ids are
byte-identical** in both runs — proves interleaving does not perturb decode
numerics (KV/position correctness across chunk boundaries). Wire as a Ginkgo
spec under the backend e2e suite.
- **0.2 Mixed-workload throughput baseline.** Use `llama-batched-bench` (built from
the same tree) or a small driver hitting `/v1/chat/completions`: measure
aggregate prefill tok/s and decode tok/s, and p50/p99 ITL of the decode streams,
under the mixed workload. Record numbers for the current `n_batch==n_ubatch`
config. This is the before of Phase A/B.
Expected result of Phase 0: 0.1 already passes (interleave is correct today);
0.2 gives the baseline the decouple must beat.
### Phase A — Decouple n_batch from n_ubatch
Goal: let model config set the physical ubatch independently of the logical batch,
defaulting to today's behavior (no regression).
- **A.1 C++: accept a `batch`/`n_batch` option (and keep `ubatch`).**
In `grpc-server.cpp`, after the existing `ubatch` branch (`:781-784`), add a
sibling branch:
```cpp
} else if (!strcmp(optname, "n_batch") || !strcmp(optname, "batch")) {
if (optval != NULL) {
try { params.n_batch = std::stoi(optval_str); } catch (...) {}
}
```
This is the missing direction (raise `n_batch` above the coupled value). Order
matters: both `:515/:519` run first (coupling as default), then option parsing
overrides either independently. Add a clamp note: if a user sets
`n_ubatch > n_batch`, llama.cpp will clamp/upbatch; log a warning. Keep the
`:519` aliasing for backward compat (rerank still works with no options).
- **A.2 Proto: add an explicit physical ubatch field.**
`backend/backend.proto:341` add `int32 NUBatch = <next free tag>;` (do not reuse
4). Regenerate with `make protogen-go` + the C++ proto build.
- **A.3 C++: honor `NUBatch` when present.**
In `grpc-server.cpp` `params_parse`, after `:519`, add:
```cpp
if (request->nubatch() > 0) {
params.n_ubatch = request->nubatch();
}
```
so an explicit physical ubatch wins over the `n_batch` alias, with the `ubatch`
string-option as a third path for users who only edit `options:`.
- **A.4 Go: config surface + plumbing.**
- Add `UBatch *int` (yaml `ubatch`) to the llama config struct alongside `Batch`
(search `core/config` for the `Batch` field; mirror it).
- In `core/backend/options.go`: add `EffectiveUBatchSize(c)` mirroring
`EffectiveBatchSize` (return `c.UBatch` if set, else
`min(EffectiveBatchSize(c), BlackwellBatchSize-or-512)` so the physical ubatch
stays at the hardware sweet spot while `n_batch` may be larger). Set
`NUBatch: int32(EffectiveUBatchSize(c))` next to `NBatch:` (`:228`).
- Keep the default such that when neither is set, `NUBatch == NBatch` ⇒
byte-identical to today.
- **A.5 Serving default (the lever).**
In `hardware_defaults.go`, introduce `BlackwellLogicalBatch = 4096` (or a
measured value) and let `EffectiveBatchSize` return it for **multi-slot serving**
configs (when `n_parallel > 1` and the model is a completion model), while
`EffectiveUBatchSize` stays at `BlackwellBatchSize = 2048`. Gate behind the same
Blackwell detection already used at `:37-40`. Single-stream/embedding/rerank
paths keep `n_batch == n_ubatch`. This is the only behavioral change shipped by
Phase A; Phase 0.2 must show it is net-positive before defaulting it on.
- **A.6 Tests.** Extend `hardware_defaults_internal_test.go` with
`EffectiveUBatchSize` cases; add a `grpcModelOpts` test asserting
`NUBatch <= NBatch` and that unset config yields `NUBatch == NBatch`. Re-run
0.1 (must still be token-identical) and 0.2 (must show aggregate-prefill gain or
neutral ITL) at `n_batch=4096, n_ubatch=2048`.
### Phase B — Decode-headroom prefill cap (optional policy, vendored patch)
Only if Phase 0.2 / A shows decode ITL jitter from fat prefill chunks. This is the
one change that touches the inherited scheduler, so it lives as a patch in
`backend/cpp/llama-cpp/patches/` (applied by `prepare.sh:6-11` / Makefile
`:141-145`), never as an edit to a checked-in upstream file.
Policy (pseudocode; insert into `update_slots()` prefill fill loop, the
`while (… && batch.n_tokens < n_batch)` at ≈ `server-context.cpp:2552`):
```
# token budget for THIS iteration, decode already seated:
n_decode_in_batch = batch.n_tokens # set after the decode phase
prefill_budget = n_batch # default == today
if serving_mode and n_decode_in_batch > 0:
# leave room so decoders are not starved/jittered by one giant prefill chunk
# max_prefill_per_iter defaults to n_ubatch (one physical tile) when decode active
prefill_budget = min(n_batch, n_decode_in_batch + max_prefill_per_iter)
# fill loop guard becomes:
while slot.prompt.n_tokens() < slot.task->n_tokens()
and batch.n_tokens < prefill_budget:
...
```
- `max_prefill_per_iter` is a new `common_params` field surfaced as an
`options:` knob (`max_prefill_tokens` / `mpt`) parsed in `grpc-server.cpp`
exactly like A.1, default `0` = disabled = today's behavior.
- Semantics mirror vLLM `long_prefill_token_threshold`: cap the prefill share so
ongoing decodes keep a steady cadence; the remaining prompt rides the next
iteration (already supported by the state machine — slot stays
`PROCESSING_PROMPT`).
- **Correctness:** unchanged KV/position path — chunk boundaries already advance
`slot.prompt.tokens.pos_next()` per added token (≈ 2570) and the slot resumes
from `slot.prompt.n_tokens()` next iteration. Capping the budget only changes
*how many* tokens are added this iteration, not *which* positions, so 0.1 must
remain token-identical.
### Phase C — Docs + defaults rollout
- Document `batch` / `ubatch` (and `max_prefill_tokens` if B ships) in
`docs/content/` model-config reference, with the serving recipe
(`n_parallel>1`, `n_batch=4096`, `ubatch=2048`).
- Note the orthogonality to paged KV (below) in
`PHASED_VLLM_PARITY_PLAN.md` Phase 3.
---
## 4. Risk / correctness
- **KV-cache & positions across chunks:** already handled upstream. Each prefill
token added advances `pos_next()` (≈ 2570) and is pushed to `slot.prompt.tokens`
(≈ 2573); the next iteration resumes from `slot.prompt.n_tokens()`. Chunk
boundaries are transparent to the KV cache because positions are absolute, not
per-chunk. Phase A changes only budgets, not positions; Phase B changes only the
per-iteration count. The 0.1 token-identical test is the guardrail.
- **Unified KV cache (LocalAI default, `n_parallel` slots share one cache):**
unaffected — co-batching prefill+decode across slots is what the unified cache is
for; positions are per-`seq_id` (`{ slot.id }` in `common_batch_add`).
- **`n_ubatch > n_batch`:** invalid; A.4 clamps `EffectiveUBatchSize <=
EffectiveBatchSize` and A.1 logs a warning if options violate it.
- **Embeddings / rerank:** must keep `n_ubatch >= prompt length` (single pass,
`can_split()==false`). The existing `:519` alias + `EffectiveBatchSize`
context-sizing for single-pass usecases (`options.go:119-124`) must be preserved
— do not let the serving `BlackwellLogicalBatch` default leak into single-pass
configs (A.5 gates on completion + `n_parallel>1`).
- **Turboquant fork:** the fork lacks some `common_params` fields (see
`LOCALAI_LEGACY_LLAMA_CPP_SPEC` precedent at `grpc-server.cpp:755`). `n_batch` /
`n_ubatch` are ancient fields and safe; if Phase B adds `max_prefill_per_iter`,
guard the new field behind a `#ifndef` like the checkpoint block does.
## 5. Orthogonality to paged KV (Phase 2)
Keep them independent. Paged KV (the `-kvp` / block-manager effort, draft #22569,
and `paged/`) changes **where** KV blocks live (allocation/utilization). Chunked
prefill / this decouple changes **how many tokens per iteration** the scheduler
batches (the `n_batch` budget and decode/prefill interleave). They compose: paged
KV raises the concurrency ceiling (more slots), the decouple widens the per-iter
scheduling window to feed those slots; neither touches the other's data structures.
The only contact point is `update_slots()` — if both ship a vendored patch to it,
land them as separate, ordered patches in `patches/` and keep the hunks disjoint
(paged touches allocation/seq_rm; chunked-prefill Phase B touches the prefill fill
budget).
---
## 6. Bottom line
- Chunked prefill + decode interleave: **already present and correct** on the
pinned llama.cpp — verify (Phase 0.1), do not rebuild.
- Real work: the **n_batch/n_ubatch decouple** (Phase A) — small, additive,
default-preserving — plus an **optional decode-headroom prefill cap** (Phase B)
if measurements show ITL jitter. Both are LocalAI-side: A in `grpc-server.cpp`
+ proto + `options.go`; B as a vendored `patches/` hunk.