mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

Files

Ettore Di Giacinto fdb7f56bb7 docs(llama-cpp): scope chunked prefill + n_batch/n_ubatch decouple

Add CHUNKED_PREFILL_PLAN.md for the llama.cpp backend. Key finding: the
vendored llama.cpp server scheduler (update_slots) already implements
chunked prefill with prefill/decode interleaving on the pinned version -
decode tokens are seated first each iteration, prefill fills the leftover
n_batch budget, both share one llama_decode. The draft upstream PR #10718
goal is already absorbed; no re-implementation needed.

The real LocalAI gap is the n_batch/n_ubatch coupling at grpc-server.cpp
(both set to nbatch()), which pins the logical scheduling window to the
physical ubatch width. The plan scopes the decouple (C++ option + proto
NUBatch + options.go), an optional decode-headroom prefill cap as a
vendored patch, a token-identical verification harness, and keeps the
work orthogonal to paged KV.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-21 12:54:22 +00:00

18 KiB

Raw Blame History

Chunked prefill + n_batch/n_ubatch decouple — implementation plan

Scope: LocalAI's llama.cpp backend (backend/cpp/llama-cpp/). Companion to PHASED_VLLM_PARITY_PLAN.md Phase 3. This document is the concrete, file-cited plan for what the brief called "chunked prefill".

Line numbers below are from two trees:

LocalAI: backend/cpp/llama-cpp/grpc-server.cpp, core/backend/options.go, backend/backend.proto, core/backend/hardware_defaults.go — exact.
Vendored upstream scheduler: llama.cpp/tools/server/server-context.cpp. The build copies llama.cpp/tools/server/* into tools/grpc-server/ (prepare.sh lines 15-17) and only overrides grpc-server.cpp + CMakeLists.txt. So update_slots() is inherited upstream code, not LocalAI code. Line numbers cited for it are from a same-era checkout (d12cc3d, 2026-04-09); the pin is f3e1828 (Makefile line 2). The structure is identical; exact lines may drift a few rows at the pin — match on the quoted comment strings, not the integers.

TL;DR — the headline finding

Chunked prefill with prefill/decode interleaving is ALREADY implemented in the llama.cpp server scheduler that LocalAI vendors. It is not a missing feature on this version. update_slots() in server-context.cpp:

Adds ongoing decode tokens first — "first, add sampled tokens from any ongoing sequences" (≈ line 2088). Every SLOT_STATE_GENERATING slot gets its one sampled token into the shared llama_batch before any prefill is added.
Then fills the remaining n_batch budget with prompt (prefill) tokens — "next, batch any pending prompts without exceeding n_batch" (≈ line 2166), gated by params_base.cont_batching (LocalAI sets cont_batching = true by default, grpc-server.cpp:547). The per-slot prefill fill loop (≈ line 2552) is while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch) — i.e. it caps each slot's prefill contribution to the remaining budget and defers the rest to the next iteration.
Decodes the combined batch in one pass (≈ line 2728-2741): decode tokens and prefill-chunk tokens go through the same llama_decode, which then splits internally into n_ubatch physical sub-batches.

This is exactly the behavior the abandoned-looking draft upstream PR #10718 ("server : chunked prefill support") asked for — "the first task is no longer blocked by the second long prompt processing task." That PR is still marked OPEN but its goal was absorbed into the natural evolution of update_slots(); we do not need to port it. A long prefill no longer stalls the decode batch: decode slots are serviced first every iteration, prefill consumes only the leftover budget.

Therefore: do not re-implement chunked prefill. The real LocalAI gap is narrow and is the rest of this plan:

Phase A (the actual gap): the n_batch/n_ubatch decouple. LocalAI ties the scheduler token budget (n_batch) to the physical forward width (n_ubatch) at grpc-server.cpp:515 + :519. This forces n_batch == n_ubatch, so the logical scheduling window can never be wider than one physical ubatch. You cannot keep n_ubatch at the Blackwell GEMM sweet spot (2048) while widening n_batch so concurrent prefills + decodes co-batch into a larger logical window. There is no first-class batch:/ubatch: split on the Go side, and there is only a one-directional ubatch override on the C++ side (you can shrink ubatch below the coupled value, never grow n_batch above it).
Phase B (optional policy lever): a decode-headroom prefill cap. Upstream caps prefill at the full n_batch shared with decode. Under heavy mixed load one fat prefill chunk per iteration still adds inter-token latency (ITL) jitter to the decoders sharing that forward. vLLM exposes long_prefill_token_threshold / max_num_partial_prefills for this. A LocalAI-specific per-iteration prefill cap (a patch to vendored update_slots) bounds that jitter. This is genuinely not in upstream and is the only place a scheduler-policy change is warranted.

1. Current behavior — precise citations

1.1 The scheduler is upstream, inherited verbatim

prepare.sh:15-17 copies all of llama.cpp/tools/server/* into the grpc-server build dir; grpc-server.cpp (LocalAI) replaces only the HTTP/gRPC service + params_parse + parse_options. update_slots(), the slot state machine, and the batch builder are upstream server-context.cpp, untouched by LocalAI today.
Slot states: server-context.cpp:36-42 — SLOT_STATE_IDLE / WAIT_OTHER / STARTED / PROCESSING_PROMPT / DONE_PROMPT / GENERATING.

1.2 Decode-first, then prefill-fill, one shared batch

common_batch_clear(batch) (≈ 2078) — one batch per update_slots iteration.
Decode phase (≈ 2088-2156): for each SLOT_STATE_GENERATING slot, common_batch_add(batch, slot.sampled, …, /*logits=*/true) adds exactly one token. Decode is guaranteed a seat before prefill runs.
Budget fetch (≈ 2158-2160): n_batch = llama_n_batch(ctx), n_ubatch = llama_n_ubatch(ctx).
Prefill phase (≈ 2166): if (params_base.cont_batching || batch.n_tokens == 0) → with cont_batching ON, prefill is added to the same batch as decode.
Per-slot prefill fill (≈ 2552-2597): while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch) — adds prompt tokens until the slot is done or the shared budget is hit. Whatever does not fit stays for the next iteration (the slot remains SLOT_STATE_PROCESSING_PROMPT).
Whole-prompt completion (≈ 2603-2615): when the slot's prompt is fully consumed it flips to SLOT_STATE_DONE_PROMPT, sets batch.logits[last] = true, inits the sampler. Next iteration it becomes GENERATING.
Budget break (≈ 2693-2695): if (batch.n_tokens >= n_batch) break;.
Decode (≈ 2728-2741): loops batch_view slices of min(n_batch, remaining) and calls llama_decode; the physical n_ubatch split happens inside llama_decode.

1.3 The chunking is gated by `can_split()`

server-context.cpp:225-231: can_split() returns true unless the task needs embeddings with non-LAST pooling. So completion/generation tasks always chunk-and-interleave; only embeddings/rerank force the whole prompt into one ubatch (≈ 2234-2244 raises "input is too large… increase the physical batch size" — this is exactly why LocalAI bumped n_ubatch for rerank, see below).

1.4 LocalAI ties n_batch to n_ubatch (the gap)

grpc-server.cpp:515 — params.n_batch = request->nbatch();
grpc-server.cpp:519 — params.n_ubatch = request->nbatch(); with the comment that this fixes reranking being capped at the 512 default n_ubatch.
grpc-server.cpp:781-784 — the only decouple knob today: an n_ubatch / ubatch option that overrides n_ubatch alone (added for embeddings/rerank). There is no batch / n_batch option parse, so n_batch cannot be raised above the coupled value from a model config. Confirmed: grep '"n_batch"|"batch"' in grpc-server.cpp returns nothing.
Options arrive via request->options(i) parsed as optname:optval (grpc-server.cpp:584-585); these come from ModelOptions.Options ⟵ c.Options (core/backend/options.go:221).

1.5 Go side sends a single batch number

backend/backend.proto:341 — int32 NBatch = 4; is the only batch field; there is no NUBatch.
core/backend/options.go:108-129 EffectiveBatchSize: returns c.Batch if set, else context size for single-pass (score/embed/rerank), else hardwareDefaultBatchSize(512).
core/backend/options.go:228 — NBatch: int32(b) (single value to the backend; becomes both n_batch and n_ubatch via 1.4).
core/backend/hardware_defaults.go:28,37-40 — BlackwellBatchSize = 2048; on Blackwell an unset batch defaults to 2048, so today n_batch == n_ubatch == 2048 there.

2. Why the decouple matters for serving (not just rerank)

Invariant: n_ubatch <= n_batch. n_ubatch is the physical forward-pass GEMM width (compute efficiency; GB10 sweet spot ≈ 2048). n_batch is the per-iteration scheduler token budget — the logical window shared by decode + prefill chunks, analogous to vLLM's max_num_batched_tokens.

With n_batch == n_ubatch (today), the scheduling window cannot exceed one physical ubatch. Consequences:

Under concurrency, the combined (decode + multiple prefill chunks) logical batch is capped at the physical ubatch, so aggregate prefill cannot grow past one ubatch worth of tokens per iteration even when more slots have prompts queued.
A user who shrinks batch: for memory also shrinks the physical ubatch, degrading prefill GEMM efficiency — and vice versa.

Decoupling lets us hold n_ubatch = 2048 (efficient GEMM) while setting a larger n_batch (e.g. 4096) so more concurrent prefill+decode tokens co-schedule into one logical window, lifting aggregate prefill under mixed load — llama_decode still tiles the physical work at 2048.

3. Phased implementation

Phase 0 — Verification harness (do first; TDD red)

Bite-sized, no code change to the scheduler.

0.1 Token-identical greedy under mixed load. Script: start the backend with n_parallel >= 4, greedy sampling (temp 0, fixed seed). Fire (a) several short decode streams and (b) one ~8k-token prompt concurrently (the exact repro from PR #10718's body works). Capture each stream's full token id sequence. Re-run with the prefill request absent. Assert the short streams' token ids are byte-identical in both runs — proves interleaving does not perturb decode numerics (KV/position correctness across chunk boundaries). Wire as a Ginkgo spec under the backend e2e suite.
0.2 Mixed-workload throughput baseline. Use llama-batched-bench (built from the same tree) or a small driver hitting /v1/chat/completions: measure aggregate prefill tok/s and decode tok/s, and p50/p99 ITL of the decode streams, under the mixed workload. Record numbers for the current n_batch==n_ubatch config. This is the before of Phase A/B.

Expected result of Phase 0: 0.1 already passes (interleave is correct today); 0.2 gives the baseline the decouple must beat.

Phase A — Decouple n_batch from n_ubatch

Goal: let model config set the physical ubatch independently of the logical batch, defaulting to today's behavior (no regression).

A.1 C++: accept a batch/n_batch option (and keep ubatch). In grpc-server.cpp, after the existing ubatch branch (:781-784), add a sibling branch:
```
} else if (!strcmp(optname, "n_batch") || !strcmp(optname, "batch")) {
    if (optval != NULL) {
        try { params.n_batch = std::stoi(optval_str); } catch (...) {}
    }
```
This is the missing direction (raise n_batch above the coupled value). Order matters: both :515/:519 run first (coupling as default), then option parsing overrides either independently. Add a clamp note: if a user sets n_ubatch > n_batch, llama.cpp will clamp/upbatch; log a warning. Keep the :519 aliasing for backward compat (rerank still works with no options).
A.2 Proto: add an explicit physical ubatch field. backend/backend.proto:341 add int32 NUBatch = <next free tag>; (do not reuse 4). Regenerate with make protogen-go + the C++ proto build.
A.3 C++: honor NUBatch when present. In grpc-server.cpp params_parse, after :519, add:
```
if (request->nubatch() > 0) {
    params.n_ubatch = request->nubatch();
}
```
so an explicit physical ubatch wins over the n_batch alias, with the ubatch string-option as a third path for users who only edit options:.
A.4 Go: config surface + plumbing.
- Add UBatch *int (yaml ubatch) to the llama config struct alongside Batch (search core/config for the Batch field; mirror it).
- In core/backend/options.go: add EffectiveUBatchSize(c) mirroring EffectiveBatchSize (return c.UBatch if set, else min(EffectiveBatchSize(c), BlackwellBatchSize-or-512) so the physical ubatch stays at the hardware sweet spot while n_batch may be larger). Set NUBatch: int32(EffectiveUBatchSize(c)) next to NBatch: (:228).
- Keep the default such that when neither is set, NUBatch == NBatch ⇒ byte-identical to today.
A.5 Serving default (the lever). In hardware_defaults.go, introduce BlackwellLogicalBatch = 4096 (or a measured value) and let EffectiveBatchSize return it for multi-slot serving configs (when n_parallel > 1 and the model is a completion model), while EffectiveUBatchSize stays at BlackwellBatchSize = 2048. Gate behind the same Blackwell detection already used at :37-40. Single-stream/embedding/rerank paths keep n_batch == n_ubatch. This is the only behavioral change shipped by Phase A; Phase 0.2 must show it is net-positive before defaulting it on.
A.6 Tests. Extend hardware_defaults_internal_test.go with EffectiveUBatchSize cases; add a grpcModelOpts test asserting NUBatch <= NBatch and that unset config yields NUBatch == NBatch. Re-run 0.1 (must still be token-identical) and 0.2 (must show aggregate-prefill gain or neutral ITL) at n_batch=4096, n_ubatch=2048.

Phase B — Decode-headroom prefill cap (optional policy, vendored patch)

Only if Phase 0.2 / A shows decode ITL jitter from fat prefill chunks. This is the one change that touches the inherited scheduler, so it lives as a patch in backend/cpp/llama-cpp/patches/ (applied by prepare.sh:6-11 / Makefile :141-145), never as an edit to a checked-in upstream file.

Policy (pseudocode; insert into update_slots() prefill fill loop, the while (… && batch.n_tokens < n_batch) at ≈ server-context.cpp:2552):

# token budget for THIS iteration, decode already seated:
n_decode_in_batch = batch.n_tokens            # set after the decode phase
prefill_budget    = n_batch                    # default == today

if serving_mode and n_decode_in_batch > 0:
    # leave room so decoders are not starved/jittered by one giant prefill chunk
    # max_prefill_per_iter defaults to n_ubatch (one physical tile) when decode active
    prefill_budget = min(n_batch, n_decode_in_batch + max_prefill_per_iter)

# fill loop guard becomes:
while slot.prompt.n_tokens() < slot.task->n_tokens()
      and batch.n_tokens < prefill_budget:
      ...

max_prefill_per_iter is a new common_params field surfaced as an options: knob (max_prefill_tokens / mpt) parsed in grpc-server.cpp exactly like A.1, default 0 = disabled = today's behavior.
Semantics mirror vLLM long_prefill_token_threshold: cap the prefill share so ongoing decodes keep a steady cadence; the remaining prompt rides the next iteration (already supported by the state machine — slot stays PROCESSING_PROMPT).
Correctness: unchanged KV/position path — chunk boundaries already advance slot.prompt.tokens.pos_next() per added token (≈ 2570) and the slot resumes from slot.prompt.n_tokens() next iteration. Capping the budget only changes how many tokens are added this iteration, not which positions, so 0.1 must remain token-identical.

Phase C — Docs + defaults rollout

Document batch / ubatch (and max_prefill_tokens if B ships) in docs/content/ model-config reference, with the serving recipe (n_parallel>1, n_batch=4096, ubatch=2048).
Note the orthogonality to paged KV (below) in PHASED_VLLM_PARITY_PLAN.md Phase 3.

4. Risk / correctness

KV-cache & positions across chunks: already handled upstream. Each prefill token added advances pos_next() (≈ 2570) and is pushed to slot.prompt.tokens (≈ 2573); the next iteration resumes from slot.prompt.n_tokens(). Chunk boundaries are transparent to the KV cache because positions are absolute, not per-chunk. Phase A changes only budgets, not positions; Phase B changes only the per-iteration count. The 0.1 token-identical test is the guardrail.
Unified KV cache (LocalAI default, n_parallel slots share one cache): unaffected — co-batching prefill+decode across slots is what the unified cache is for; positions are per-seq_id ({ slot.id } in common_batch_add).
n_ubatch > n_batch: invalid; A.4 clamps EffectiveUBatchSize <= EffectiveBatchSize and A.1 logs a warning if options violate it.
Embeddings / rerank: must keep n_ubatch >= prompt length (single pass, can_split()==false). The existing :519 alias + EffectiveBatchSize context-sizing for single-pass usecases (options.go:119-124) must be preserved — do not let the serving BlackwellLogicalBatch default leak into single-pass configs (A.5 gates on completion + n_parallel>1).
Turboquant fork: the fork lacks some common_params fields (see LOCALAI_LEGACY_LLAMA_CPP_SPEC precedent at grpc-server.cpp:755). n_batch / n_ubatch are ancient fields and safe; if Phase B adds max_prefill_per_iter, guard the new field behind a #ifndef like the checkpoint block does.

5. Orthogonality to paged KV (Phase 2)

Keep them independent. Paged KV (the -kvp / block-manager effort, draft #22569, and paged/) changes where KV blocks live (allocation/utilization). Chunked prefill / this decouple changes how many tokens per iteration the scheduler batches (the n_batch budget and decode/prefill interleave). They compose: paged KV raises the concurrency ceiling (more slots), the decouple widens the per-iter scheduling window to feed those slots; neither touches the other's data structures. The only contact point is update_slots() — if both ship a vendored patch to it, land them as separate, ordered patches in patches/ and keep the hunks disjoint (paged touches allocation/seq_rm; chunked-prefill Phase B touches the prefill fill budget).

6. Bottom line

Chunked prefill + decode interleave: already present and correct on the pinned llama.cpp — verify (Phase 0.1), do not rebuild.
Real work: the n_batch/n_ubatch decouple (Phase A) — small, additive, default-preserving — plus an optional decode-headroom prefill cap (Phase B) if measurements show ITL jitter. Both are LocalAI-side: A in grpc-server.cpp
- proto + options.go; B as a vendored patches/ hunk.

18 KiB Raw Blame History