From fdb7f56bb7c266f6fb02533f1cbfa6e24c3853f5 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Sun, 21 Jun 2026 12:54:22 +0000
Subject: [PATCH] docs(llama-cpp): scope chunked prefill + n_batch/n_ubatch
 decouple

Add CHUNKED_PREFILL_PLAN.md for the llama.cpp backend. Key finding: the
vendored llama.cpp server scheduler (update_slots) already implements
chunked prefill with prefill/decode interleaving on the pinned version -
decode tokens are seated first each iteration, prefill fills the leftover
n_batch budget, both share one llama_decode. The goal of the draft upstream
PR #10718 is already absorbed; no re-implementation needed.

The real LocalAI gap is the n_batch/n_ubatch coupling at grpc-server.cpp
(both set to nbatch()), which pins the logical scheduling window to the
physical ubatch width. The plan scopes the decouple (C++ option + proto
NUBatch + options.go), an optional decode-headroom prefill cap as a
vendored patch, a token-identical verification harness, and keeps the
work orthogonal to paged KV.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 .../llama-cpp/paged/CHUNKED_PREFILL_PLAN.md   | 334 ++++++++++++++++++
 1 file changed, 334 insertions(+)
 create mode 100644 backend/cpp/llama-cpp/paged/CHUNKED_PREFILL_PLAN.md

diff --git a/backend/cpp/llama-cpp/paged/CHUNKED_PREFILL_PLAN.md b/backend/cpp/llama-cpp/paged/CHUNKED_PREFILL_PLAN.md
new file mode 100644
index 000000000..4dc90f97b
--- /dev/null
+++ b/backend/cpp/llama-cpp/paged/CHUNKED_PREFILL_PLAN.md
@@ -0,0 +1,334 @@
+# Chunked prefill + n_batch/n_ubatch decouple — implementation plan
+
+Scope: LocalAI's llama.cpp backend (`backend/cpp/llama-cpp/`). Companion to
+`PHASED_VLLM_PARITY_PLAN.md` Phase 3. This document is the concrete, file-cited
+plan for what the brief called "chunked prefill".
+
+Line numbers below are from two trees:
+- LocalAI: `backend/cpp/llama-cpp/grpc-server.cpp`, `core/backend/options.go`,
+  `backend/backend.proto`, `core/backend/hardware_defaults.go` — exact.
+- Vendored upstream scheduler: `llama.cpp/tools/server/server-context.cpp`. The
+  build copies `llama.cpp/tools/server/*` into `tools/grpc-server/` (`prepare.sh`
+  lines 15-17) and only overrides `grpc-server.cpp` + `CMakeLists.txt`. So
+  `update_slots()` is **inherited upstream code, not LocalAI code**. Line numbers
+  cited for it are from a same-era checkout (`d12cc3d`, 2026-04-09); the pin is
+  `f3e1828` (Makefile line 2). The structure is identical; exact lines may drift
+  a few rows at the pin — match on the quoted comment strings, not the integers.
+
+---
+
+## TL;DR — the headline finding
+
+**Chunked prefill with prefill/decode interleaving is ALREADY implemented** in the
+llama.cpp server scheduler that LocalAI vendors. It is not a missing feature on
+this version. `update_slots()` in `server-context.cpp`:
+
+1. **Adds ongoing decode tokens first** — "first, add sampled tokens from any
+   ongoing sequences" (≈ line 2088). Every `SLOT_STATE_GENERATING` slot gets its
+   one sampled token into the shared `llama_batch` before any prefill is added.
+2. **Then fills the remaining `n_batch` budget with prompt (prefill) tokens** —
+   "next, batch any pending prompts without exceeding n_batch" (≈ line 2166),
+   gated by `params_base.cont_batching` (LocalAI sets `cont_batching = true` by
+   default, `grpc-server.cpp:547`). The per-slot prefill fill loop
+   (≈ line 2552) is `while (slot.prompt.n_tokens() < slot.task->n_tokens() &&
+   batch.n_tokens < n_batch)` — i.e. it caps each slot's prefill contribution to
+   the **remaining** budget and defers the rest to the next iteration.
+3. **Decodes the combined batch in one pass** (≈ line 2728-2741): decode tokens
+   and prefill-chunk tokens go through the **same `llama_decode`**, which then
+   splits internally into `n_ubatch` physical sub-batches.
+
+This is exactly the behavior the abandoned-looking draft **upstream PR #10718**
+("server : chunked prefill support") asked for — "the first task is no longer
+blocked by the second long prompt processing task." That PR is still marked OPEN
+but its goal was absorbed into the natural evolution of `update_slots()`; we do
+**not** need to port it. A long prefill no longer stalls the decode batch: decode
+slots are serviced first every iteration, prefill consumes only the leftover
+budget.
+
+**Therefore: do not re-implement chunked prefill.** The real LocalAI gap is
+narrow and is the rest of this plan:
+
+- **Phase A (the actual gap): the `n_batch`/`n_ubatch` decouple.** LocalAI ties
+  the scheduler token budget (`n_batch`) to the physical forward width
+  (`n_ubatch`) at `grpc-server.cpp:515` + `:519`. By default this forces
+  `n_batch == n_ubatch`, so the logical scheduling window is no wider than
+  one physical ubatch. You cannot keep `n_ubatch` at the Blackwell GEMM sweet
+  spot (2048) while widening `n_batch` so concurrent prefills + decodes co-batch
+  into a larger logical window. There is no first-class `batch:`/`ubatch:` split
+  on the Go side, and there is only a one-directional `ubatch` override on the C++
+  side (you can shrink ubatch below the coupled value, never grow n_batch above
+  it).
+- **Phase B (optional policy lever): a decode-headroom prefill cap.** Upstream
+  caps prefill at the full `n_batch` shared with decode. Under heavy mixed load
+  one fat prefill chunk per iteration still adds inter-token latency (ITL) jitter
+  to the decoders sharing that forward. vLLM exposes
+  `long_prefill_token_threshold` / `max_num_partial_prefills` for this. A
+  LocalAI-specific per-iteration prefill cap (a patch to vendored `update_slots`)
+  bounds that jitter. This is genuinely not in upstream and is the only place a
+  scheduler-policy change is warranted.
+
+---
+
+## 1. Current behavior — precise citations
+
+### 1.1 The scheduler is upstream, inherited verbatim
+- `prepare.sh:15-17` copies all of `llama.cpp/tools/server/*` into the
+  `grpc-server` build dir; `grpc-server.cpp` (LocalAI) replaces only the HTTP/gRPC
+  service + `params_parse` + `parse_options`. `update_slots()`, the slot state
+  machine, and the batch builder are **upstream `server-context.cpp`**, untouched
+  by LocalAI today.
+- Slot states: `server-context.cpp:36-42` —
+  `SLOT_STATE_IDLE / WAIT_OTHER / STARTED / PROCESSING_PROMPT / DONE_PROMPT /
+  GENERATING`.
+
+### 1.2 Decode-first, then prefill-fill, one shared batch
+- `common_batch_clear(batch)` (≈ 2078) — one batch per `update_slots` iteration.
+- Decode phase (≈ 2088-2156): for each `SLOT_STATE_GENERATING` slot,
+  `common_batch_add(batch, slot.sampled, …, /*logits=*/true)` adds exactly one
+  token. Decode is guaranteed a seat before prefill runs.
+- Budget fetch (≈ 2158-2160): `n_batch = llama_n_batch(ctx)`,
+  `n_ubatch = llama_n_ubatch(ctx)`.
+- Prefill phase (≈ 2166): `if (params_base.cont_batching || batch.n_tokens == 0)`
+  → with cont_batching ON, prefill is added to the **same** batch as decode.
+- Per-slot prefill fill (≈ 2552-2597):
+  `while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch)`
+  — adds prompt tokens until the slot is done **or** the shared budget is hit.
+  Whatever does not fit stays for the next iteration (the slot remains
+  `SLOT_STATE_PROCESSING_PROMPT`).
+- Whole-prompt completion (≈ 2603-2615): when the slot's prompt is fully consumed
+  it flips to `SLOT_STATE_DONE_PROMPT`, sets `batch.logits[last] = true`, inits
+  the sampler. Next iteration it becomes `GENERATING`.
+- Budget break (≈ 2693-2695): `if (batch.n_tokens >= n_batch) break;`.
+- Decode (≈ 2728-2741): loops `batch_view` slices of `min(n_batch, remaining)` and
+  calls `llama_decode`; the physical `n_ubatch` split happens inside
+  `llama_decode`.
+
+### 1.3 The chunking is gated by `can_split()`
+- `server-context.cpp:225-231`: `can_split()` returns true unless the task needs
+  embeddings with non-LAST pooling. So **completion/generation tasks always
+  chunk-and-interleave**; only embeddings/rerank force the whole prompt into one
+  ubatch (≈ 2234-2244 raises "input is too large… increase the physical batch
+  size" — this is exactly why LocalAI bumped `n_ubatch` for rerank, see below).
+
+### 1.4 LocalAI ties n_batch to n_ubatch (the gap)
+- `grpc-server.cpp:515` — `params.n_batch  = request->nbatch();`
+- `grpc-server.cpp:519` — `params.n_ubatch = request->nbatch();` with the comment
+  that this fixes reranking being capped at the 512 default `n_ubatch`.
+- `grpc-server.cpp:781-784` — the **only** decouple knob today: an `n_ubatch` /
+  `ubatch` option that overrides `n_ubatch` alone (added for embeddings/rerank).
+  There is **no** `batch` / `n_batch` option parse, so `n_batch` cannot be raised
+  above the coupled value from a model config. Confirmed: `grep -E '"n_batch"|"batch"'`
+  in `grpc-server.cpp` returns nothing.
+- Options arrive via `request->options(i)` parsed as `optname:optval`
+  (`grpc-server.cpp:584-585`); these come from `ModelOptions.Options` ⟵
+  `c.Options` (`core/backend/options.go:221`).
+
+### 1.5 Go side sends a single batch number
+- `backend/backend.proto:341` — `int32 NBatch = 4;` is the only batch field; there
+  is **no** `NUBatch`.
+- `core/backend/options.go:108-129` `EffectiveBatchSize`: returns `c.Batch` if set,
+  else context size for single-pass (score/embed/rerank), else
+  `hardwareDefaultBatchSize(512)`.
+- `core/backend/options.go:228` — `NBatch: int32(b)` (single value to the
+  backend; becomes both `n_batch` and `n_ubatch` via 1.4).
+- `core/backend/hardware_defaults.go:28,37-40` — `BlackwellBatchSize = 2048`;
+  on Blackwell an unset batch defaults to 2048, so today
+  `n_batch == n_ubatch == 2048` there.
+
+---
+
+## 2. Why the decouple matters for serving (not just rerank)
+
+Invariant: `n_ubatch <= n_batch`. `n_ubatch` is the physical forward-pass GEMM
+width (compute efficiency; GB10 sweet spot ≈ 2048). `n_batch` is the per-iteration
+**scheduler token budget** — the logical window shared by decode + prefill chunks,
+analogous to vLLM's `max_num_batched_tokens`.
+
+With `n_batch == n_ubatch` (today), the scheduling window cannot exceed one
+physical ubatch. Consequences:
+- Under concurrency, the combined (decode + multiple prefill chunks) logical batch
+  is capped at the physical ubatch, so aggregate prefill cannot grow past one
+  ubatch worth of tokens per iteration even when more slots have prompts queued.
+- A user who shrinks `batch:` for memory also shrinks the physical ubatch,
+  degrading prefill GEMM efficiency — and vice versa.
+
+Decoupling lets us hold `n_ubatch = 2048` (efficient GEMM) while setting a larger
+`n_batch` (e.g. 4096) so more concurrent prefill+decode tokens co-schedule into one
+logical window, lifting aggregate prefill under mixed load — `llama_decode` still
+tiles the physical work at 2048.
+
+---
+
+## 3. Phased implementation
+
+### Phase 0 — Verification harness (do first; TDD red)
+Bite-sized, no code change to the scheduler.
+- **0.1 Token-identical greedy under mixed load.** Script: start the backend with
+  `n_parallel >= 4`, greedy sampling (temp 0, fixed seed). Fire (a) several short
+  decode streams and (b) one ~8k-token prompt concurrently (the exact repro from
+  PR #10718's body works). Capture each stream's full token id sequence. Re-run
+  with the prefill request absent. **Assert the short streams' token ids are
+  byte-identical** in both runs — proves interleaving does not perturb decode
+  numerics (KV/position correctness across chunk boundaries). Wire as a Ginkgo
+  spec under the backend e2e suite.
+- **0.2 Mixed-workload throughput baseline.** Use `llama-batched-bench` (built from
+  the same tree) or a small driver hitting `/v1/chat/completions`: measure
+  aggregate prefill tok/s and decode tok/s, and p50/p99 ITL of the decode streams,
+  under the mixed workload. Record numbers for the current `n_batch==n_ubatch`
+  config. This is the before of Phase A/B.
+
+Expected result of Phase 0: 0.1 already passes (interleave is correct today);
+0.2 gives the baseline the decouple must beat.
+
+### Phase A — Decouple n_batch from n_ubatch
+Goal: let model config set the physical ubatch independently of the logical batch,
+defaulting to today's behavior (no regression).
+
+- **A.1 C++: accept a `batch`/`n_batch` option (and keep `ubatch`).**
+  In `grpc-server.cpp`, after the existing `ubatch` branch (`:781-784`), add a
+  sibling branch:
+  ```cpp
+  } else if (!strcmp(optname, "n_batch") || !strcmp(optname, "batch")) {
+      if (optval != NULL) {
+          try { params.n_batch = std::stoi(optval_str); } catch (...) {}
+      }
+  ```
+  This is the missing direction (raise `n_batch` above the coupled value). Order
+  matters: `:515` and `:519` run first (coupling as the default), then option
+  parsing overrides either value independently. If a user sets
+  `n_ubatch > n_batch`, llama.cpp clamps `n_ubatch` to `n_batch`; log a warning.
+  Keep the `:519` aliasing for backward compat (rerank still works with no options).
+
+- **A.2 Proto: add an explicit physical ubatch field.**
+  At `backend/backend.proto:341`, add `int32 NUBatch = <next free tag>;` (do not
+  reuse 4). Regenerate with `make protogen-go` + the C++ proto build.
+
+- **A.3 C++: honor `NUBatch` when present.**
+  In `grpc-server.cpp` `params_parse`, after `:519`, add:
+  ```cpp
+  if (request->nubatch() > 0) {
+      params.n_ubatch = request->nubatch();
+  }
+  ```
+  so an explicit physical ubatch wins over the `n_batch` alias, with the `ubatch`
+  string-option as a third path for users who only edit `options:`.
+
+- **A.4 Go: config surface + plumbing.**
+  - Add `UBatch *int` (yaml `ubatch`) to the llama config struct alongside `Batch`
+    (search `core/config` for the `Batch` field; mirror it).
+  - In `core/backend/options.go`: add `EffectiveUBatchSize(c)` mirroring
+    `EffectiveBatchSize`: `c.UBatch` if set; `EffectiveBatchSize(c)` for
+    single-pass usecases (section 4); else `min(EffectiveBatchSize(c),
+    BlackwellBatchSize-or-512)`, keeping the ubatch at the hardware sweet spot. Set
+    `NUBatch: int32(EffectiveUBatchSize(c))` next to `NBatch:` (`:228`).
+  - Keep the default such that when neither is set, `NUBatch == NBatch` ⇒
+    byte-identical to today.
+
+- **A.5 Serving default (the lever).**
+  In `hardware_defaults.go`, introduce `BlackwellLogicalBatch = 4096` (or a
+  measured value) and let `EffectiveBatchSize` return it for **multi-slot serving**
+  configs (when `n_parallel > 1` and the model is a completion model), while
+  `EffectiveUBatchSize` stays at `BlackwellBatchSize = 2048`. Gate behind the same
+  Blackwell detection already used at `:37-40`. Single-stream/embedding/rerank
+  paths keep `n_batch == n_ubatch`. This is the only behavioral change shipped by
+  Phase A; Phase 0.2 must show it is net-positive before defaulting it on.
+
+- **A.6 Tests.** Extend `hardware_defaults_internal_test.go` with
+  `EffectiveUBatchSize` cases; add a `grpcModelOpts` test asserting
+  `NUBatch <= NBatch` and that unset config yields `NUBatch == NBatch`. Re-run
+  0.1 (must still be token-identical) and 0.2 (must show aggregate-prefill gain or
+  neutral ITL) at `n_batch=4096, n_ubatch=2048`.
+
+### Phase B — Decode-headroom prefill cap (optional policy, vendored patch)
+Only if Phase 0.2 / A shows decode ITL jitter from fat prefill chunks. This is the
+one change that touches the inherited scheduler, so it lives as a patch in
+`backend/cpp/llama-cpp/patches/` (applied by `prepare.sh:6-11` / Makefile
+`:141-145`), never as an edit to a checked-in upstream file.
+
+Policy (pseudocode; insert into `update_slots()` prefill fill loop, the
+`while (… && batch.n_tokens < n_batch)` at ≈ `server-context.cpp:2552`):
+
+```
+# token budget for THIS iteration, decode already seated:
+n_decode_in_batch = batch.n_tokens            # set after the decode phase
+prefill_budget    = n_batch                    # default == today
+
+if serving_mode and n_decode_in_batch > 0 and max_prefill_per_iter > 0:
+    # leave room so decoders are not starved/jittered by one giant prefill chunk
+    # max_prefill_per_iter == 0 disables the cap; n_ubatch (one physical tile) is the suggested value
+    prefill_budget = min(n_batch, n_decode_in_batch + max_prefill_per_iter)
+
+# fill loop guard becomes:
+while slot.prompt.n_tokens() < slot.task->n_tokens()
+      and batch.n_tokens < prefill_budget:
+      ...
+```
+
+- `max_prefill_per_iter` is a new `common_params` field surfaced as an
+  `options:` knob (`max_prefill_tokens` / `mpt`) parsed in `grpc-server.cpp`
+  exactly like A.1, default `0` = disabled = today's behavior.
+- Semantics mirror vLLM `long_prefill_token_threshold`: cap the prefill share so
+  ongoing decodes keep a steady cadence; the remaining prompt rides the next
+  iteration (already supported by the state machine — slot stays
+  `PROCESSING_PROMPT`).
+- **Correctness:** unchanged KV/position path — chunk boundaries already advance
+  `slot.prompt.tokens.pos_next()` per added token (≈ 2570) and the slot resumes
+  from `slot.prompt.n_tokens()` next iteration. Capping the budget only changes
+  *how many* tokens are added this iteration, not *which* positions, so 0.1 must
+  remain token-identical.
+
+### Phase C — Docs + defaults rollout
+- Document `batch` / `ubatch` (and `max_prefill_tokens` if B ships) in
+  `docs/content/` model-config reference, with the serving recipe
+  (`n_parallel>1`, `n_batch=4096`, `ubatch=2048`).
+- Note the orthogonality to paged KV (below) in
+  `PHASED_VLLM_PARITY_PLAN.md` Phase 3.
+
+---
+
+## 4. Risk / correctness
+
+- **KV-cache & positions across chunks:** already handled upstream. Each prefill
+  token added advances `pos_next()` (≈ 2570) and is pushed to `slot.prompt.tokens`
+  (≈ 2573); the next iteration resumes from `slot.prompt.n_tokens()`. Chunk
+  boundaries are transparent to the KV cache because positions are absolute, not
+  per-chunk. Phase A changes only budgets, not positions; Phase B changes only the
+  per-iteration count. The 0.1 token-identical test is the guardrail.
+- **Unified KV cache (LocalAI default, `n_parallel` slots share one cache):**
+  unaffected — co-batching prefill+decode across slots is what the unified cache is
+  for; positions are per-`seq_id` (`{ slot.id }` in `common_batch_add`).
+- **`n_ubatch > n_batch`:** invalid; A.4 clamps `EffectiveUBatchSize <=
+  EffectiveBatchSize` and A.1 logs a warning if options violate it.
+- **Embeddings / rerank:** must keep `n_ubatch >= prompt length` (single pass,
+  `can_split()==false`). The existing `:519` alias + `EffectiveBatchSize`
+  context-sizing for single-pass usecases (`options.go:119-124`) must be preserved
+  — do not let the serving `BlackwellLogicalBatch` default leak into single-pass
+  configs (A.5 gates on completion + `n_parallel>1`).
+- **Turboquant fork:** the fork lacks some `common_params` fields (see
+  `LOCALAI_LEGACY_LLAMA_CPP_SPEC` precedent at `grpc-server.cpp:755`). `n_batch` /
+  `n_ubatch` are ancient fields and safe; if Phase B adds `max_prefill_per_iter`,
+  guard the new field behind an `#ifndef` like the checkpoint block does.
+
+## 5. Orthogonality to paged KV (Phase 2)
+
+Keep them independent. Paged KV (the `-kvp` / block-manager effort, draft #22569,
+and `paged/`) changes **where** KV blocks live (allocation/utilization). Chunked
+prefill / this decouple changes **how many tokens per iteration** the scheduler
+batches (the `n_batch` budget and decode/prefill interleave). They compose: paged
+KV raises the concurrency ceiling (more slots), the decouple widens the per-iter
+scheduling window to feed those slots; neither touches the other's data structures.
+The only contact point is `update_slots()` — if both ship a vendored patch to it,
+land them as separate, ordered patches in `patches/` and keep the hunks disjoint
+(paged touches allocation/seq_rm; chunked-prefill Phase B touches the prefill fill
+budget).
+
+---
+
+## 6. Bottom line
+
+- Chunked prefill + decode interleave: **already present and correct** on the
+  pinned llama.cpp — verify (Phase 0.1), do not rebuild.
+- Real work: the **n_batch/n_ubatch decouple** (Phase A) — small, additive,
+  default-preserving — plus an **optional decode-headroom prefill cap** (Phase B)
+  if measurements show ITL jitter. Both are LocalAI-side: A in `grpc-server.cpp`
+  + proto + `options.go`; B as a vendored `patches/` hunk.