From fdb7f56bb7c266f6fb02533f1cbfa6e24c3853f5 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Sun, 21 Jun 2026 12:54:22 +0000
Subject: [PATCH] docs(llama-cpp): scope chunked prefill + n_batch/n_ubatch decouple

Add CHUNKED_PREFILL_PLAN.md for the llama.cpp backend. Key finding: the
vendored llama.cpp server scheduler (update_slots) already implements
chunked prefill with prefill/decode interleaving on the pinned version -
decode tokens are seated first each iteration, prefill fills the leftover
n_batch budget, both share one llama_decode. The draft upstream PR #10718
goal is already absorbed; no re-implementation needed.

The real LocalAI gap is the n_batch/n_ubatch coupling at grpc-server.cpp
(both set to nbatch()), which pins the logical scheduling window to the
physical ubatch width. The plan scopes the decouple (C++ option + proto
NUBatch + options.go), an optional decode-headroom prefill cap as a
vendored patch, a token-identical verification harness, and keeps the
work orthogonal to paged KV.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto
---
 .../llama-cpp/paged/CHUNKED_PREFILL_PLAN.md | 334 ++++++++++++++++++
 1 file changed, 334 insertions(+)
 create mode 100644 backend/cpp/llama-cpp/paged/CHUNKED_PREFILL_PLAN.md

diff --git a/backend/cpp/llama-cpp/paged/CHUNKED_PREFILL_PLAN.md b/backend/cpp/llama-cpp/paged/CHUNKED_PREFILL_PLAN.md
new file mode 100644
index 000000000..4dc90f97b
--- /dev/null
+++ b/backend/cpp/llama-cpp/paged/CHUNKED_PREFILL_PLAN.md
@@ -0,0 +1,334 @@
+# Chunked prefill + n_batch/n_ubatch decouple — implementation plan
+
+Scope: LocalAI's llama.cpp backend (`backend/cpp/llama-cpp/`). Companion to
+`PHASED_VLLM_PARITY_PLAN.md` Phase 3. This document is the concrete, file-cited
+plan for what the brief called "chunked prefill".
+
+Line numbers below are from two trees:
+- LocalAI: `backend/cpp/llama-cpp/grpc-server.cpp`, `core/backend/options.go`,
+  `backend/backend.proto`, `core/backend/hardware_defaults.go` — exact.
+- Vendored upstream scheduler: `llama.cpp/tools/server/server-context.cpp`. The
+  build copies `llama.cpp/tools/server/*` into `tools/grpc-server/` (`prepare.sh`
+  lines 15-17) and only overrides `grpc-server.cpp` + `CMakeLists.txt`. So
+  `update_slots()` is **inherited upstream code, not LocalAI code**. Line numbers
+  cited for it are from a same-era checkout (`d12cc3d`, 2026-04-09); the pin is
+  `f3e1828` (Makefile line 2). The structure is identical; exact lines may drift
+  a few rows at the pin — match on the quoted comment strings, not the integers.
+
+---
+
+## TL;DR — the headline finding
+
+**Chunked prefill with prefill/decode interleaving is ALREADY implemented** in the
+llama.cpp server scheduler that LocalAI vendors. It is not a missing feature on
+this version. `update_slots()` in `server-context.cpp`:
+
+1. **Adds ongoing decode tokens first** — "first, add sampled tokens from any
+   ongoing sequences" (≈ line 2088). Every `SLOT_STATE_GENERATING` slot gets its
+   one sampled token into the shared `llama_batch` before any prefill is added.
+2. **Then fills the remaining `n_batch` budget with prompt (prefill) tokens** —
+   "next, batch any pending prompts without exceeding n_batch" (≈ line 2166),
+   gated by `params_base.cont_batching` (LocalAI sets `cont_batching = true` by
+   default, `grpc-server.cpp:547`). The per-slot prefill fill loop
+   (≈ line 2552) is `while (slot.prompt.n_tokens() < slot.task->n_tokens() &&
+   batch.n_tokens < n_batch)` — i.e. it caps each slot's prefill contribution to
+   the **remaining** budget and defers the rest to the next iteration.
+3. **Decodes the combined batch in one pass** (≈ line 2728-2741): decode tokens
+   and prefill-chunk tokens go through the **same `llama_decode`**, which then
+   splits internally into `n_ubatch` physical sub-batches.
+
+This is exactly the behavior the abandoned-looking draft **upstream PR #10718**
+("server : chunked prefill support") asked for — "the first task is no longer
+blocked by the second long prompt processing task." That PR is still marked OPEN
+but its goal was absorbed into the natural evolution of `update_slots()`; we do
+**not** need to port it. A long prefill no longer stalls the decode batch: decode
+slots are serviced first every iteration, prefill consumes only the leftover
+budget.
+
+**Therefore: do not re-implement chunked prefill.** The real LocalAI gap is
+narrow and is the rest of this plan:
+
+- **Phase A (the actual gap): the `n_batch`/`n_ubatch` decouple.** LocalAI ties
+  the scheduler token budget (`n_batch`) to the physical forward width
+  (`n_ubatch`) at `grpc-server.cpp:515` + `:519`. This forces
+  `n_batch == n_ubatch`, so the logical scheduling window can never be wider than
+  one physical ubatch. You cannot keep `n_ubatch` at the Blackwell GEMM sweet
+  spot (2048) while widening `n_batch` so concurrent prefills + decodes co-batch
+  into a larger logical window. There is no first-class `batch:`/`ubatch:` split
+  on the Go side, and there is only a one-directional `ubatch` override on the C++
+  side (you can shrink ubatch below the coupled value, never grow n_batch above
+  it).
+- **Phase B (optional policy lever): a decode-headroom prefill cap.** Upstream
+  caps prefill at the full `n_batch` shared with decode. Under heavy mixed load
+  one fat prefill chunk per iteration still adds inter-token latency (ITL) jitter
+  to the decoders sharing that forward. vLLM exposes
+  `long_prefill_token_threshold` / `max_num_partial_prefills` for this. A
+  LocalAI-specific per-iteration prefill cap (a patch to vendored `update_slots`)
+  bounds that jitter. This is genuinely not in upstream and is the only place a
+  scheduler-policy change is warranted.
+
+---
+
+## 1. Current behavior — precise citations
+
+### 1.1 The scheduler is upstream, inherited verbatim
+- `prepare.sh:15-17` copies all of `llama.cpp/tools/server/*` into the
+  `grpc-server` build dir; `grpc-server.cpp` (LocalAI) replaces only the HTTP/gRPC
+  service + `params_parse` + `parse_options`. `update_slots()`, the slot state
+  machine, and the batch builder are **upstream `server-context.cpp`**, untouched
+  by LocalAI today.
+- Slot states: `server-context.cpp:36-42` —
+  `SLOT_STATE_IDLE / WAIT_OTHER / STARTED / PROCESSING_PROMPT / DONE_PROMPT /
+  GENERATING`.
+
+### 1.2 Decode-first, then prefill-fill, one shared batch
+- `common_batch_clear(batch)` (≈ 2078) — one batch per `update_slots` iteration.
+- Decode phase (≈ 2088-2156): for each `SLOT_STATE_GENERATING` slot,
+  `common_batch_add(batch, slot.sampled, …, /*logits=*/true)` adds exactly one
+  token. Decode is guaranteed a seat before prefill runs.
+- Budget fetch (≈ 2158-2160): `n_batch = llama_n_batch(ctx)`,
+  `n_ubatch = llama_n_ubatch(ctx)`.
+- Prefill phase (≈ 2166): `if (params_base.cont_batching || batch.n_tokens == 0)`
+  → with cont_batching ON, prefill is added to the **same** batch as decode.
+- Per-slot prefill fill (≈ 2552-2597):
+  `while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch)`
+  — adds prompt tokens until the slot is done **or** the shared budget is hit.
+  Whatever does not fit stays for the next iteration (the slot remains
+  `SLOT_STATE_PROCESSING_PROMPT`).
+- Whole-prompt completion (≈ 2603-2615): when the slot's prompt is fully consumed
+  it flips to `SLOT_STATE_DONE_PROMPT`, sets `batch.logits[last] = true`, inits
+  the sampler. Next iteration it becomes `GENERATING`.
+- Budget break (≈ 2693-2695): `if (batch.n_tokens >= n_batch) break;`.
+- Decode (≈ 2728-2741): loops `batch_view` slices of `min(n_batch, remaining)` and
+  calls `llama_decode`; the physical `n_ubatch` split happens inside
+  `llama_decode`.
+
+### 1.3 The chunking is gated by `can_split()`
+- `server-context.cpp:225-231`: `can_split()` returns true unless the task needs
+  embeddings with non-LAST pooling. So **completion/generation tasks always
+  chunk-and-interleave**; only embeddings/rerank force the whole prompt into one
+  ubatch (≈ 2234-2244 raises "input is too large… increase the physical batch
+  size" — this is exactly why LocalAI bumped `n_ubatch` for rerank, see below).
+
+### 1.4 LocalAI ties n_batch to n_ubatch (the gap)
+- `grpc-server.cpp:515` — `params.n_batch = request->nbatch();`
+- `grpc-server.cpp:519` — `params.n_ubatch = request->nbatch();` with the comment
+  that this fixes reranking being capped at the 512 default `n_ubatch`.
+- `grpc-server.cpp:781-784` — the **only** decouple knob today: an `n_ubatch` /
+  `ubatch` option that overrides `n_ubatch` alone (added for embeddings/rerank).
+  There is **no** `batch` / `n_batch` option parse, so `n_batch` cannot be raised
+  above the coupled value from a model config. Confirmed: `grep '"n_batch"|"batch"'`
+  in `grpc-server.cpp` returns nothing.
+- Options arrive via `request->options(i)` parsed as `optname:optval`
+  (`grpc-server.cpp:584-585`); these come from `ModelOptions.Options` ⟵
+  `c.Options` (`core/backend/options.go:221`).
+
+### 1.5 Go side sends a single batch number
+- `backend/backend.proto:341` — `int32 NBatch = 4;` is the only batch field; there
+  is **no** `NUBatch`.
+- `core/backend/options.go:108-129` `EffectiveBatchSize`: returns `c.Batch` if set,
+  else context size for single-pass (score/embed/rerank), else
+  `hardwareDefaultBatchSize(512)`.
+- `core/backend/options.go:228` — `NBatch: int32(b)` (single value to the
+  backend; becomes both `n_batch` and `n_ubatch` via 1.4).
+- `core/backend/hardware_defaults.go:28,37-40` — `BlackwellBatchSize = 2048`;
+  on Blackwell an unset batch defaults to 2048, so today
+  `n_batch == n_ubatch == 2048` there.
+
+---
+
+## 2. Why the decouple matters for serving (not just rerank)
+
+Invariant: `n_ubatch <= n_batch`. `n_ubatch` is the physical forward-pass GEMM
+width (compute efficiency; GB10 sweet spot ≈ 2048). `n_batch` is the per-iteration
+**scheduler token budget** — the logical window shared by decode + prefill chunks,
+analogous to vLLM's `max_num_batched_tokens`.
+
+With `n_batch == n_ubatch` (today), the scheduling window cannot exceed one
+physical ubatch. Consequences:
+- Under concurrency, the combined (decode + multiple prefill chunks) logical batch
+  is capped at the physical ubatch, so aggregate prefill cannot grow past one
+  ubatch worth of tokens per iteration even when more slots have prompts queued.
+- A user who shrinks `batch:` for memory also shrinks the physical ubatch,
+  degrading prefill GEMM efficiency — and vice versa.
+
+Decoupling lets us hold `n_ubatch = 2048` (efficient GEMM) while setting a larger
+`n_batch` (e.g. 4096) so more concurrent prefill+decode tokens co-schedule into one
+logical window, lifting aggregate prefill under mixed load — `llama_decode` still
+tiles the physical work at 2048.
+
+---
+
+## 3. Phased implementation
+
+### Phase 0 — Verification harness (do first; TDD red)
+Bite-sized, no code change to the scheduler.
+- **0.1 Token-identical greedy under mixed load.** Script: start the backend with
+  `n_parallel >= 4`, greedy sampling (temp 0, fixed seed). Fire (a) several short
+  decode streams and (b) one ~8k-token prompt concurrently (the exact repro from
+  PR #10718's body works). Capture each stream's full token id sequence. Re-run
+  with the prefill request absent. **Assert the short streams' token ids are
+  byte-identical** in both runs — proves interleaving does not perturb decode
+  numerics (KV/position correctness across chunk boundaries). Wire as a Ginkgo
+  spec under the backend e2e suite.
+- **0.2 Mixed-workload throughput baseline.** Use `llama-batched-bench` (built from
+  the same tree) or a small driver hitting `/v1/chat/completions`: measure
+  aggregate prefill tok/s and decode tok/s, and p50/p99 ITL of the decode streams,
+  under the mixed workload. Record numbers for the current `n_batch==n_ubatch`
+  config. This is the before of Phase A/B.
+
+Expected result of Phase 0: 0.1 already passes (interleave is correct today);
+0.2 gives the baseline the decouple must beat.
+
+### Phase A — Decouple n_batch from n_ubatch
+Goal: let model config set the physical ubatch independently of the logical batch,
+defaulting to today's behavior (no regression).
+
+- **A.1 C++: accept a `batch`/`n_batch` option (and keep `ubatch`).**
+  In `grpc-server.cpp`, after the existing `ubatch` branch (`:781-784`), add a
+  sibling branch:
+  ```cpp
+  } else if (!strcmp(optname, "n_batch") || !strcmp(optname, "batch")) {
+      if (optval != NULL) {
+          try { params.n_batch = std::stoi(optval_str); } catch (...) {}
+      }
+  ```
+  This is the missing direction (raise `n_batch` above the coupled value). Order
+  matters: both `:515/:519` run first (coupling as default), then option parsing
+  overrides either independently. Add a clamp note: if a user sets
+  `n_ubatch > n_batch`, llama.cpp will clamp/upbatch; log a warning. Keep the
+  `:519` aliasing for backward compat (rerank still works with no options).
+
+- **A.2 Proto: add an explicit physical ubatch field.**
+  `backend/backend.proto:341` add `int32 NUBatch = ;` (do not reuse
+  4). Regenerate with `make protogen-go` + the C++ proto build.
+
+- **A.3 C++: honor `NUBatch` when present.**
+  In `grpc-server.cpp` `params_parse`, after `:519`, add:
+  ```cpp
+  if (request->nubatch() > 0) {
+      params.n_ubatch = request->nubatch();
+  }
+  ```
+  so an explicit physical ubatch wins over the `n_batch` alias, with the `ubatch`
+  string-option as a third path for users who only edit `options:`.
+
+- **A.4 Go: config surface + plumbing.**
+  - Add `UBatch *int` (yaml `ubatch`) to the llama config struct alongside `Batch`
+    (search `core/config` for the `Batch` field; mirror it).
+  - In `core/backend/options.go`: add `EffectiveUBatchSize(c)` mirroring
+    `EffectiveBatchSize` (return `c.UBatch` if set, else
+    `min(EffectiveBatchSize(c), BlackwellBatchSize-or-512)` so the physical ubatch
+    stays at the hardware sweet spot while `n_batch` may be larger). Set
+    `NUBatch: int32(EffectiveUBatchSize(c))` next to `NBatch:` (`:228`).
+  - Keep the default such that when neither is set, `NUBatch == NBatch` ⇒
+    byte-identical to today.
+
+- **A.5 Serving default (the lever).**
+  In `hardware_defaults.go`, introduce `BlackwellLogicalBatch = 4096` (or a
+  measured value) and let `EffectiveBatchSize` return it for **multi-slot serving**
+  configs (when `n_parallel > 1` and the model is a completion model), while
+  `EffectiveUBatchSize` stays at `BlackwellBatchSize = 2048`. Gate behind the same
+  Blackwell detection already used at `:37-40`. Single-stream/embedding/rerank
+  paths keep `n_batch == n_ubatch`. This is the only behavioral change shipped by
+  Phase A; Phase 0.2 must show it is net-positive before defaulting it on.
+
+- **A.6 Tests.** Extend `hardware_defaults_internal_test.go` with
+  `EffectiveUBatchSize` cases; add a `grpcModelOpts` test asserting
+  `NUBatch <= NBatch` and that unset config yields `NUBatch == NBatch`. Re-run
+  0.1 (must still be token-identical) and 0.2 (must show aggregate-prefill gain or
+  neutral ITL) at `n_batch=4096, n_ubatch=2048`.
+
+### Phase B — Decode-headroom prefill cap (optional policy, vendored patch)
+Only if Phase 0.2 / A shows decode ITL jitter from fat prefill chunks. This is the
+one change that touches the inherited scheduler, so it lives as a patch in
+`backend/cpp/llama-cpp/patches/` (applied by `prepare.sh:6-11` / Makefile
+`:141-145`), never as an edit to a checked-in upstream file.
+
+Policy (pseudocode; insert into `update_slots()` prefill fill loop, the
+`while (… && batch.n_tokens < n_batch)` at ≈ `server-context.cpp:2552`):
+
+```
+# token budget for THIS iteration, decode already seated:
+n_decode_in_batch = batch.n_tokens   # set after the decode phase
+prefill_budget = n_batch             # default == today
+
+if serving_mode and n_decode_in_batch > 0:
+    # leave room so decoders are not starved/jittered by one giant prefill chunk
+    # max_prefill_per_iter defaults to n_ubatch (one physical tile) when decode active
+    prefill_budget = min(n_batch, n_decode_in_batch + max_prefill_per_iter)
+
+# fill loop guard becomes:
+while slot.prompt.n_tokens() < slot.task->n_tokens()
+      and batch.n_tokens < prefill_budget:
+    ...
+```
+
+- `max_prefill_per_iter` is a new `common_params` field surfaced as an
+  `options:` knob (`max_prefill_tokens` / `mpt`) parsed in `grpc-server.cpp`
+  exactly like A.1, default `0` = disabled = today's behavior.
+- Semantics mirror vLLM `long_prefill_token_threshold`: cap the prefill share so
+  ongoing decodes keep a steady cadence; the remaining prompt rides the next
+  iteration (already supported by the state machine — slot stays
+  `PROCESSING_PROMPT`).
+- **Correctness:** unchanged KV/position path — chunk boundaries already advance
+  `slot.prompt.tokens.pos_next()` per added token (≈ 2570) and the slot resumes
+  from `slot.prompt.n_tokens()` next iteration. Capping the budget only changes
+  *how many* tokens are added this iteration, not *which* positions, so 0.1 must
+  remain token-identical.
+
+### Phase C — Docs + defaults rollout
+- Document `batch` / `ubatch` (and `max_prefill_tokens` if B ships) in
+  `docs/content/` model-config reference, with the serving recipe
+  (`n_parallel>1`, `n_batch=4096`, `ubatch=2048`).
+- Note the orthogonality to paged KV (below) in
+  `PHASED_VLLM_PARITY_PLAN.md` Phase 3.
+
+---
+
+## 4. Risk / correctness
+
+- **KV-cache & positions across chunks:** already handled upstream. Each prefill
+  token added advances `pos_next()` (≈ 2570) and is pushed to `slot.prompt.tokens`
+  (≈ 2573); the next iteration resumes from `slot.prompt.n_tokens()`. Chunk
+  boundaries are transparent to the KV cache because positions are absolute, not
+  per-chunk. Phase A changes only budgets, not positions; Phase B changes only the
+  per-iteration count. The 0.1 token-identical test is the guardrail.
+- **Unified KV cache (LocalAI default, `n_parallel` slots share one cache):**
+  unaffected — co-batching prefill+decode across slots is what the unified cache is
+  for; positions are per-`seq_id` (`{ slot.id }` in `common_batch_add`).
+- **`n_ubatch > n_batch`:** invalid; A.4 clamps `EffectiveUBatchSize <=
+  EffectiveBatchSize` and A.1 logs a warning if options violate it.
+- **Embeddings / rerank:** must keep `n_ubatch >= prompt length` (single pass,
+  `can_split()==false`). The existing `:519` alias + `EffectiveBatchSize`
+  context-sizing for single-pass usecases (`options.go:119-124`) must be preserved
+  — do not let the serving `BlackwellLogicalBatch` default leak into single-pass
+  configs (A.5 gates on completion + `n_parallel>1`).
+- **Turboquant fork:** the fork lacks some `common_params` fields (see
+  `LOCALAI_LEGACY_LLAMA_CPP_SPEC` precedent at `grpc-server.cpp:755`). `n_batch` /
+  `n_ubatch` are ancient fields and safe; if Phase B adds `max_prefill_per_iter`,
+  guard the new field behind a `#ifndef` like the checkpoint block does.
+
+## 5. Orthogonality to paged KV (Phase 2)
+
+Keep them independent. Paged KV (the `-kvp` / block-manager effort, draft #22569,
+and `paged/`) changes **where** KV blocks live (allocation/utilization). Chunked
+prefill / this decouple changes **how many tokens per iteration** the scheduler
+batches (the `n_batch` budget and decode/prefill interleave). They compose: paged
+KV raises the concurrency ceiling (more slots), the decouple widens the per-iter
+scheduling window to feed those slots; neither touches the other's data structures.
+The only contact point is `update_slots()` — if both ship a vendored patch to it,
+land them as separate, ordered patches in `patches/` and keep the hunks disjoint
+(paged touches allocation/seq_rm; chunked-prefill Phase B touches the prefill fill
+budget).
+
+---
+
+## 6. Bottom line
+
+- Chunked prefill + decode interleave: **already present and correct** on the
+  pinned llama.cpp — verify (Phase 0.1), do not rebuild.
+- Real work: the **n_batch/n_ubatch decouple** (Phase A) — small, additive,
+  default-preserving — plus an **optional decode-headroom prefill cap** (Phase B)
+  if measurements show ITL jitter. Both are LocalAI-side: A in `grpc-server.cpp` +
+  proto + `options.go`; B as a vendored `patches/` hunk.
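
Note on the Phase B policy in the plan: the decode-first, capped-prefill budget can be sketched as a small standalone model. This is only an illustration under stated assumptions, not the upstream scheduler: `Slot`, `prefill_budget`, and `schedule_iteration` are hypothetical names that do not exist in llama.cpp, `max_prefill_per_iter == 0` is taken to mean "cap disabled" as in the plan's option description, and the `serving_mode` flag from the pseudocode is omitted for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for one server slot (not the upstream server_slot type).
struct Slot {
    bool generating;   // true: decode slot, contributes one sampled token per iteration
    int  prompt_total; // prompt length in tokens (meaningful for prefill slots)
    int  prompt_done;  // prompt tokens already placed in earlier iterations
};

// Per-iteration token ceiling for the shared batch once decode is seated.
// A cap of 0, or an iteration with no decoders, falls back to the full n_batch.
int prefill_budget(int n_batch, int n_decode_in_batch, int max_prefill_per_iter) {
    assert(n_batch > 0 && n_decode_in_batch >= 0 && max_prefill_per_iter >= 0);
    if (max_prefill_per_iter == 0 || n_decode_in_batch == 0) {
        return n_batch;
    }
    return std::min(n_batch, n_decode_in_batch + max_prefill_per_iter);
}

// One scheduler iteration: seat every decode token first, then let prefill
// slots consume whatever budget is left. Returns the tokens placed in the batch.
int schedule_iteration(std::vector<Slot> & slots, int n_batch, int max_prefill_per_iter) {
    int n_tokens = 0;
    for (const Slot & s : slots) {
        if (s.generating) {
            n_tokens += 1; // decode always gets its seat
        }
    }
    const int budget = prefill_budget(n_batch, n_tokens, max_prefill_per_iter);
    for (Slot & s : slots) {
        if (s.generating) {
            continue;
        }
        const int want = s.prompt_total - s.prompt_done;
        const int room = std::max(0, budget - n_tokens);
        const int take = std::min(want, room);
        s.prompt_done += take; // the remainder waits for the next iteration
        n_tokens += take;
    }
    return n_tokens;
}
```

In this model, two decoding slots plus one 8000-token prompt at `n_batch = 4096` place 4094 prefill tokens in the first iteration when the cap is disabled, and 2048 when `max_prefill_per_iter = 2048`. The cap changes only how many prompt tokens are taken per iteration; `prompt_done` still advances contiguously, which is the property the plan's 0.1 token-identical test is meant to guard.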