mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 16:19:07 -04:00
Add CHUNKED_PREFILL_PLAN.md for the llama.cpp backend. Key finding: the vendored llama.cpp server scheduler (update_slots) already implements chunked prefill with prefill/decode interleaving on the pinned version - decode tokens are seated first each iteration, prefill fills the leftover n_batch budget, both share one llama_decode. The draft upstream PR #10718 goal is already absorbed; no re-implementation needed. The real LocalAI gap is the n_batch/n_ubatch coupling at grpc-server.cpp (both set to nbatch()), which pins the logical scheduling window to the physical ubatch width. The plan scopes the decouple (C++ option + proto NUBatch + options.go), an optional decode-headroom prefill cap as a vendored patch, a token-identical verification harness, and keeps the work orthogonal to paged KV. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>