mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 16:19:07 -04:00
docs(llama-cpp): scope chunked prefill + n_batch/n_ubatch decouple
Add CHUNKED_PREFILL_PLAN.md for the llama.cpp backend. Key finding: the vendored llama.cpp server scheduler (update_slots) already implements chunked prefill with prefill/decode interleaving on the pinned version - decode tokens are seated first each iteration, prefill fills the leftover n_batch budget, both share one llama_decode. The draft upstream PR #10718 goal is already absorbed; no re-implementation needed. The real LocalAI gap is the n_batch/n_ubatch coupling at grpc-server.cpp (both set to nbatch()), which pins the logical scheduling window to the physical ubatch width. The plan scopes the decouple (C++ option + proto NUBatch + options.go), an optional decode-headroom prefill cap as a vendored patch, a token-identical verification harness, and keeps the work orthogonal to paged KV. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
334
backend/cpp/llama-cpp/paged/CHUNKED_PREFILL_PLAN.md
Normal file
334
backend/cpp/llama-cpp/paged/CHUNKED_PREFILL_PLAN.md
Normal file
@@ -0,0 +1,334 @@
|
||||
# Chunked prefill + n_batch/n_ubatch decouple — implementation plan
|
||||
|
||||
Scope: LocalAI's llama.cpp backend (`backend/cpp/llama-cpp/`). Companion to
|
||||
`PHASED_VLLM_PARITY_PLAN.md` Phase 3. This document is the concrete, file-cited
|
||||
plan for what the brief called "chunked prefill".
|
||||
|
||||
Line numbers below are from two trees:
|
||||
- LocalAI: `backend/cpp/llama-cpp/grpc-server.cpp`, `core/backend/options.go`,
|
||||
`backend/backend.proto`, `core/backend/hardware_defaults.go` — exact.
|
||||
- Vendored upstream scheduler: `llama.cpp/tools/server/server-context.cpp`. The
|
||||
build copies `llama.cpp/tools/server/*` into `tools/grpc-server/` (`prepare.sh`
|
||||
lines 15-17) and only overrides `grpc-server.cpp` + `CMakeLists.txt`. So
|
||||
`update_slots()` is **inherited upstream code, not LocalAI code**. Line numbers
|
||||
cited for it are from a same-era checkout (`d12cc3d`, 2026-04-09); the pin is
|
||||
`f3e1828` (Makefile line 2). The structure is identical; exact lines may drift
|
||||
a few rows at the pin — match on the quoted comment strings, not the integers.
|
||||
|
||||
---
|
||||
|
||||
## TL;DR — the headline finding
|
||||
|
||||
**Chunked prefill with prefill/decode interleaving is ALREADY implemented** in the
|
||||
llama.cpp server scheduler that LocalAI vendors. It is not a missing feature on
|
||||
this version. `update_slots()` in `server-context.cpp`:
|
||||
|
||||
1. **Adds ongoing decode tokens first** — "first, add sampled tokens from any
|
||||
ongoing sequences" (≈ line 2088). Every `SLOT_STATE_GENERATING` slot gets its
|
||||
one sampled token into the shared `llama_batch` before any prefill is added.
|
||||
2. **Then fills the remaining `n_batch` budget with prompt (prefill) tokens** —
|
||||
"next, batch any pending prompts without exceeding n_batch" (≈ line 2166),
|
||||
gated by `params_base.cont_batching` (LocalAI sets `cont_batching = true` by
|
||||
default, `grpc-server.cpp:547`). The per-slot prefill fill loop
|
||||
(≈ line 2552) is `while (slot.prompt.n_tokens() < slot.task->n_tokens() &&
|
||||
batch.n_tokens < n_batch)` — i.e. it caps each slot's prefill contribution to
|
||||
the **remaining** budget and defers the rest to the next iteration.
|
||||
3. **Decodes the combined batch in one pass** (≈ line 2728-2741): decode tokens
|
||||
and prefill-chunk tokens go through the **same `llama_decode`**, which then
|
||||
splits internally into `n_ubatch` physical sub-batches.
|
||||
|
||||
This is exactly the behavior the abandoned-looking draft **upstream PR #10718**
|
||||
("server : chunked prefill support") asked for — "the first task is no longer
|
||||
blocked by the second long prompt processing task." That PR is still marked OPEN
|
||||
but its goal was absorbed into the natural evolution of `update_slots()`; we do
|
||||
**not** need to port it. A long prefill no longer stalls the decode batch: decode
|
||||
slots are serviced first every iteration, prefill consumes only the leftover
|
||||
budget.
|
||||
|
||||
**Therefore: do not re-implement chunked prefill.** The real LocalAI gap is
|
||||
narrow and is the rest of this plan:
|
||||
|
||||
- **Phase A (the actual gap): the `n_batch`/`n_ubatch` decouple.** LocalAI ties
|
||||
the scheduler token budget (`n_batch`) to the physical forward width
|
||||
(`n_ubatch`) at `grpc-server.cpp:515` + `:519`. This forces
|
||||
`n_batch == n_ubatch`, so the logical scheduling window can never be wider than
|
||||
one physical ubatch. You cannot keep `n_ubatch` at the Blackwell GEMM sweet
|
||||
spot (2048) while widening `n_batch` so concurrent prefills + decodes co-batch
|
||||
into a larger logical window. There is no first-class `batch:`/`ubatch:` split
|
||||
on the Go side, and there is only a one-directional `ubatch` override on the C++
|
||||
side (you can shrink ubatch below the coupled value, never grow n_batch above
|
||||
it).
|
||||
- **Phase B (optional policy lever): a decode-headroom prefill cap.** Upstream
|
||||
caps prefill at the full `n_batch` shared with decode. Under heavy mixed load
|
||||
one fat prefill chunk per iteration still adds inter-token latency (ITL) jitter
|
||||
to the decoders sharing that forward. vLLM exposes
|
||||
`long_prefill_token_threshold` / `max_num_partial_prefills` for this. A
|
||||
LocalAI-specific per-iteration prefill cap (a patch to vendored `update_slots`)
|
||||
bounds that jitter. This is genuinely not in upstream and is the only place a
|
||||
scheduler-policy change is warranted.
|
||||
|
||||
---
|
||||
|
||||
## 1. Current behavior — precise citations
|
||||
|
||||
### 1.1 The scheduler is upstream, inherited verbatim
|
||||
- `prepare.sh:15-17` copies all of `llama.cpp/tools/server/*` into the
|
||||
`grpc-server` build dir; `grpc-server.cpp` (LocalAI) replaces only the HTTP/gRPC
|
||||
service + `params_parse` + `parse_options`. `update_slots()`, the slot state
|
||||
machine, and the batch builder are **upstream `server-context.cpp`**, untouched
|
||||
by LocalAI today.
|
||||
- Slot states: `server-context.cpp:36-42` —
|
||||
`SLOT_STATE_IDLE / WAIT_OTHER / STARTED / PROCESSING_PROMPT / DONE_PROMPT /
|
||||
GENERATING`.
|
||||
|
||||
### 1.2 Decode-first, then prefill-fill, one shared batch
|
||||
- `common_batch_clear(batch)` (≈ 2078) — one batch per `update_slots` iteration.
|
||||
- Decode phase (≈ 2088-2156): for each `SLOT_STATE_GENERATING` slot,
|
||||
`common_batch_add(batch, slot.sampled, …, /*logits=*/true)` adds exactly one
|
||||
token. Decode is guaranteed a seat before prefill runs.
|
||||
- Budget fetch (≈ 2158-2160): `n_batch = llama_n_batch(ctx)`,
|
||||
`n_ubatch = llama_n_ubatch(ctx)`.
|
||||
- Prefill phase (≈ 2166): `if (params_base.cont_batching || batch.n_tokens == 0)`
|
||||
→ with cont_batching ON, prefill is added to the **same** batch as decode.
|
||||
- Per-slot prefill fill (≈ 2552-2597):
|
||||
`while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch)`
|
||||
— adds prompt tokens until the slot is done **or** the shared budget is hit.
|
||||
Whatever does not fit stays for the next iteration (the slot remains
|
||||
`SLOT_STATE_PROCESSING_PROMPT`).
|
||||
- Whole-prompt completion (≈ 2603-2615): when the slot's prompt is fully consumed
|
||||
it flips to `SLOT_STATE_DONE_PROMPT`, sets `batch.logits[last] = true`, inits
|
||||
the sampler. Next iteration it becomes `GENERATING`.
|
||||
- Budget break (≈ 2693-2695): `if (batch.n_tokens >= n_batch) break;`.
|
||||
- Decode (≈ 2728-2741): loops `batch_view` slices of `min(n_batch, remaining)` and
|
||||
calls `llama_decode`; the physical `n_ubatch` split happens inside
|
||||
`llama_decode`.
|
||||
|
||||
### 1.3 The chunking is gated by `can_split()`
|
||||
- `server-context.cpp:225-231`: `can_split()` returns true unless the task needs
|
||||
embeddings with non-LAST pooling. So **completion/generation tasks always
|
||||
chunk-and-interleave**; only embeddings/rerank force the whole prompt into one
|
||||
ubatch (≈ 2234-2244 raises "input is too large… increase the physical batch
|
||||
size" — this is exactly why LocalAI bumped `n_ubatch` for rerank, see below).
|
||||
|
||||
### 1.4 LocalAI ties n_batch to n_ubatch (the gap)
|
||||
- `grpc-server.cpp:515` — `params.n_batch = request->nbatch();`
|
||||
- `grpc-server.cpp:519` — `params.n_ubatch = request->nbatch();` with the comment
|
||||
that this fixes reranking being capped at the 512 default `n_ubatch`.
|
||||
- `grpc-server.cpp:781-784` — the **only** decouple knob today: an `n_ubatch` /
|
||||
`ubatch` option that overrides `n_ubatch` alone (added for embeddings/rerank).
|
||||
There is **no** `batch` / `n_batch` option parse, so `n_batch` cannot be raised
|
||||
above the coupled value from a model config. Confirmed: `grep '"n_batch"|"batch"'`
|
||||
in `grpc-server.cpp` returns nothing.
|
||||
- Options arrive via `request->options(i)` parsed as `optname:optval`
|
||||
(`grpc-server.cpp:584-585`); these come from `ModelOptions.Options` ⟵
|
||||
`c.Options` (`core/backend/options.go:221`).
|
||||
|
||||
### 1.5 Go side sends a single batch number
|
||||
- `backend/backend.proto:341` — `int32 NBatch = 4;` is the only batch field; there
|
||||
is **no** `NUBatch`.
|
||||
- `core/backend/options.go:108-129` `EffectiveBatchSize`: returns `c.Batch` if set,
|
||||
else context size for single-pass (score/embed/rerank), else
|
||||
`hardwareDefaultBatchSize(512)`.
|
||||
- `core/backend/options.go:228` — `NBatch: int32(b)` (single value to the
|
||||
backend; becomes both `n_batch` and `n_ubatch` via 1.4).
|
||||
- `core/backend/hardware_defaults.go:28,37-40` — `BlackwellBatchSize = 2048`;
|
||||
on Blackwell an unset batch defaults to 2048, so today
|
||||
`n_batch == n_ubatch == 2048` there.
|
||||
|
||||
---
|
||||
|
||||
## 2. Why the decouple matters for serving (not just rerank)
|
||||
|
||||
Invariant: `n_ubatch <= n_batch`. `n_ubatch` is the physical forward-pass GEMM
|
||||
width (compute efficiency; GB10 sweet spot ≈ 2048). `n_batch` is the per-iteration
|
||||
**scheduler token budget** — the logical window shared by decode + prefill chunks,
|
||||
analogous to vLLM's `max_num_batched_tokens`.
|
||||
|
||||
With `n_batch == n_ubatch` (today), the scheduling window cannot exceed one
|
||||
physical ubatch. Consequences:
|
||||
- Under concurrency, the combined (decode + multiple prefill chunks) logical batch
|
||||
is capped at the physical ubatch, so aggregate prefill cannot grow past one
|
||||
ubatch worth of tokens per iteration even when more slots have prompts queued.
|
||||
- A user who shrinks `batch:` for memory also shrinks the physical ubatch,
|
||||
degrading prefill GEMM efficiency — and vice versa.
|
||||
|
||||
Decoupling lets us hold `n_ubatch = 2048` (efficient GEMM) while setting a larger
|
||||
`n_batch` (e.g. 4096) so more concurrent prefill+decode tokens co-schedule into one
|
||||
logical window, lifting aggregate prefill under mixed load — `llama_decode` still
|
||||
tiles the physical work at 2048.
|
||||
|
||||
---
|
||||
|
||||
## 3. Phased implementation
|
||||
|
||||
### Phase 0 — Verification harness (do first; TDD red)
|
||||
Bite-sized, no code change to the scheduler.
|
||||
- **0.1 Token-identical greedy under mixed load.** Script: start the backend with
|
||||
`n_parallel >= 4`, greedy sampling (temp 0, fixed seed). Fire (a) several short
|
||||
decode streams and (b) one ~8k-token prompt concurrently (the exact repro from
|
||||
PR #10718's body works). Capture each stream's full token id sequence. Re-run
|
||||
with the prefill request absent. **Assert the short streams' token ids are
|
||||
byte-identical** in both runs — proves interleaving does not perturb decode
|
||||
numerics (KV/position correctness across chunk boundaries). Wire as a Ginkgo
|
||||
spec under the backend e2e suite.
|
||||
- **0.2 Mixed-workload throughput baseline.** Use `llama-batched-bench` (built from
|
||||
the same tree) or a small driver hitting `/v1/chat/completions`: measure
|
||||
aggregate prefill tok/s and decode tok/s, and p50/p99 ITL of the decode streams,
|
||||
under the mixed workload. Record numbers for the current `n_batch==n_ubatch`
|
||||
config. This is the before of Phase A/B.
|
||||
|
||||
Expected result of Phase 0: 0.1 already passes (interleave is correct today);
|
||||
0.2 gives the baseline the decouple must beat.
|
||||
|
||||
### Phase A — Decouple n_batch from n_ubatch
|
||||
Goal: let model config set the physical ubatch independently of the logical batch,
|
||||
defaulting to today's behavior (no regression).
|
||||
|
||||
- **A.1 C++: accept a `batch`/`n_batch` option (and keep `ubatch`).**
|
||||
In `grpc-server.cpp`, after the existing `ubatch` branch (`:781-784`), add a
|
||||
sibling branch:
|
||||
```cpp
|
||||
} else if (!strcmp(optname, "n_batch") || !strcmp(optname, "batch")) {
|
||||
if (optval != NULL) {
|
||||
try { params.n_batch = std::stoi(optval_str); } catch (...) {}
|
||||
}
|
||||
```
|
||||
This is the missing direction (raise `n_batch` above the coupled value). Order
|
||||
matters: both `:515/:519` run first (coupling as default), then option parsing
|
||||
overrides either independently. Add a clamp note: if a user sets
|
||||
`n_ubatch > n_batch`, llama.cpp will clamp/upbatch; log a warning. Keep the
|
||||
`:519` aliasing for backward compat (rerank still works with no options).
|
||||
|
||||
- **A.2 Proto: add an explicit physical ubatch field.**
|
||||
`backend/backend.proto:341` add `int32 NUBatch = <next free tag>;` (do not reuse
|
||||
4). Regenerate with `make protogen-go` + the C++ proto build.
|
||||
|
||||
- **A.3 C++: honor `NUBatch` when present.**
|
||||
In `grpc-server.cpp` `params_parse`, after `:519`, add:
|
||||
```cpp
|
||||
if (request->nubatch() > 0) {
|
||||
params.n_ubatch = request->nubatch();
|
||||
}
|
||||
```
|
||||
so an explicit physical ubatch wins over the `n_batch` alias, with the `ubatch`
|
||||
string-option as a third path for users who only edit `options:`.
|
||||
|
||||
- **A.4 Go: config surface + plumbing.**
|
||||
- Add `UBatch *int` (yaml `ubatch`) to the llama config struct alongside `Batch`
|
||||
(search `core/config` for the `Batch` field; mirror it).
|
||||
- In `core/backend/options.go`: add `EffectiveUBatchSize(c)` mirroring
|
||||
`EffectiveBatchSize` (return `c.UBatch` if set, else
|
||||
`min(EffectiveBatchSize(c), BlackwellBatchSize-or-512)` so the physical ubatch
|
||||
stays at the hardware sweet spot while `n_batch` may be larger). Set
|
||||
`NUBatch: int32(EffectiveUBatchSize(c))` next to `NBatch:` (`:228`).
|
||||
- Keep the default such that when neither is set, `NUBatch == NBatch` ⇒
|
||||
byte-identical to today.
|
||||
|
||||
- **A.5 Serving default (the lever).**
|
||||
In `hardware_defaults.go`, introduce `BlackwellLogicalBatch = 4096` (or a
|
||||
measured value) and let `EffectiveBatchSize` return it for **multi-slot serving**
|
||||
configs (when `n_parallel > 1` and the model is a completion model), while
|
||||
`EffectiveUBatchSize` stays at `BlackwellBatchSize = 2048`. Gate behind the same
|
||||
Blackwell detection already used at `:37-40`. Single-stream/embedding/rerank
|
||||
paths keep `n_batch == n_ubatch`. This is the only behavioral change shipped by
|
||||
Phase A; Phase 0.2 must show it is net-positive before defaulting it on.
|
||||
|
||||
- **A.6 Tests.** Extend `hardware_defaults_internal_test.go` with
|
||||
`EffectiveUBatchSize` cases; add a `grpcModelOpts` test asserting
|
||||
`NUBatch <= NBatch` and that unset config yields `NUBatch == NBatch`. Re-run
|
||||
0.1 (must still be token-identical) and 0.2 (must show aggregate-prefill gain or
|
||||
neutral ITL) at `n_batch=4096, n_ubatch=2048`.
|
||||
|
||||
### Phase B — Decode-headroom prefill cap (optional policy, vendored patch)
|
||||
Only if Phase 0.2 / A shows decode ITL jitter from fat prefill chunks. This is the
|
||||
one change that touches the inherited scheduler, so it lives as a patch in
|
||||
`backend/cpp/llama-cpp/patches/` (applied by `prepare.sh:6-11` / Makefile
|
||||
`:141-145`), never as an edit to a checked-in upstream file.
|
||||
|
||||
Policy (pseudocode; insert into `update_slots()` prefill fill loop, the
|
||||
`while (… && batch.n_tokens < n_batch)` at ≈ `server-context.cpp:2552`):
|
||||
|
||||
```
|
||||
# token budget for THIS iteration, decode already seated:
|
||||
n_decode_in_batch = batch.n_tokens # set after the decode phase
|
||||
prefill_budget = n_batch # default == today
|
||||
|
||||
if serving_mode and n_decode_in_batch > 0:
|
||||
# leave room so decoders are not starved/jittered by one giant prefill chunk
|
||||
# max_prefill_per_iter defaults to n_ubatch (one physical tile) when decode active
|
||||
prefill_budget = min(n_batch, n_decode_in_batch + max_prefill_per_iter)
|
||||
|
||||
# fill loop guard becomes:
|
||||
while slot.prompt.n_tokens() < slot.task->n_tokens()
|
||||
and batch.n_tokens < prefill_budget:
|
||||
...
|
||||
```
|
||||
|
||||
- `max_prefill_per_iter` is a new `common_params` field surfaced as an
|
||||
`options:` knob (`max_prefill_tokens` / `mpt`) parsed in `grpc-server.cpp`
|
||||
exactly like A.1, default `0` = disabled = today's behavior.
|
||||
- Semantics mirror vLLM `long_prefill_token_threshold`: cap the prefill share so
|
||||
ongoing decodes keep a steady cadence; the remaining prompt rides the next
|
||||
iteration (already supported by the state machine — slot stays
|
||||
`PROCESSING_PROMPT`).
|
||||
- **Correctness:** unchanged KV/position path — chunk boundaries already advance
|
||||
`slot.prompt.tokens.pos_next()` per added token (≈ 2570) and the slot resumes
|
||||
from `slot.prompt.n_tokens()` next iteration. Capping the budget only changes
|
||||
*how many* tokens are added this iteration, not *which* positions, so 0.1 must
|
||||
remain token-identical.
|
||||
|
||||
### Phase C — Docs + defaults rollout
|
||||
- Document `batch` / `ubatch` (and `max_prefill_tokens` if B ships) in
|
||||
`docs/content/` model-config reference, with the serving recipe
|
||||
(`n_parallel>1`, `n_batch=4096`, `ubatch=2048`).
|
||||
- Note the orthogonality to paged KV (below) in
|
||||
`PHASED_VLLM_PARITY_PLAN.md` Phase 3.
|
||||
|
||||
---
|
||||
|
||||
## 4. Risk / correctness
|
||||
|
||||
- **KV-cache & positions across chunks:** already handled upstream. Each prefill
|
||||
token added advances `pos_next()` (≈ 2570) and is pushed to `slot.prompt.tokens`
|
||||
(≈ 2573); the next iteration resumes from `slot.prompt.n_tokens()`. Chunk
|
||||
boundaries are transparent to the KV cache because positions are absolute, not
|
||||
per-chunk. Phase A changes only budgets, not positions; Phase B changes only the
|
||||
per-iteration count. The 0.1 token-identical test is the guardrail.
|
||||
- **Unified KV cache (LocalAI default, `n_parallel` slots share one cache):**
|
||||
unaffected — co-batching prefill+decode across slots is what the unified cache is
|
||||
for; positions are per-`seq_id` (`{ slot.id }` in `common_batch_add`).
|
||||
- **`n_ubatch > n_batch`:** invalid; A.4 clamps `EffectiveUBatchSize <=
|
||||
EffectiveBatchSize` and A.1 logs a warning if options violate it.
|
||||
- **Embeddings / rerank:** must keep `n_ubatch >= prompt length` (single pass,
|
||||
`can_split()==false`). The existing `:519` alias + `EffectiveBatchSize`
|
||||
context-sizing for single-pass usecases (`options.go:119-124`) must be preserved
|
||||
— do not let the serving `BlackwellLogicalBatch` default leak into single-pass
|
||||
configs (A.5 gates on completion + `n_parallel>1`).
|
||||
- **Turboquant fork:** the fork lacks some `common_params` fields (see
|
||||
`LOCALAI_LEGACY_LLAMA_CPP_SPEC` precedent at `grpc-server.cpp:755`). `n_batch` /
|
||||
`n_ubatch` are ancient fields and safe; if Phase B adds `max_prefill_per_iter`,
|
||||
guard the new field behind a `#ifndef` like the checkpoint block does.
|
||||
|
||||
## 5. Orthogonality to paged KV (Phase 2)
|
||||
|
||||
Keep them independent. Paged KV (the `-kvp` / block-manager effort, draft #22569,
|
||||
and `paged/`) changes **where** KV blocks live (allocation/utilization). Chunked
|
||||
prefill / this decouple changes **how many tokens per iteration** the scheduler
|
||||
batches (the `n_batch` budget and decode/prefill interleave). They compose: paged
|
||||
KV raises the concurrency ceiling (more slots), the decouple widens the per-iter
|
||||
scheduling window to feed those slots; neither touches the other's data structures.
|
||||
The only contact point is `update_slots()` — if both ship a vendored patch to it,
|
||||
land them as separate, ordered patches in `patches/` and keep the hunks disjoint
|
||||
(paged touches allocation/seq_rm; chunked-prefill Phase B touches the prefill fill
|
||||
budget).
|
||||
|
||||
---
|
||||
|
||||
## 6. Bottom line
|
||||
|
||||
- Chunked prefill + decode interleave: **already present and correct** on the
|
||||
pinned llama.cpp — verify (Phase 0.1), do not rebuild.
|
||||
- Real work: the **n_batch/n_ubatch decouple** (Phase A) — small, additive,
|
||||
default-preserving — plus an **optional decode-headroom prefill cap** (Phase B)
|
||||
if measurements show ITL jitter. Both are LocalAI-side: A in `grpc-server.cpp`
|
||||
+ proto + `options.go`; B as a vendored `patches/` hunk.
|
||||
Reference in New Issue
Block a user