LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 08:08:52 -04:00

Files

Ettore Di Giacinto fdb7f56bb7 docs(llama-cpp): scope chunked prefill + n_batch/n_ubatch decouple

Add CHUNKED_PREFILL_PLAN.md for the llama.cpp backend. Key finding: the
vendored llama.cpp server scheduler (update_slots) already implements
chunked prefill with prefill/decode interleaving on the pinned version -
decode tokens are seated first each iteration, prefill fills the leftover
n_batch budget, both share one llama_decode. The draft upstream PR #10718
goal is already absorbed; no re-implementation needed.

The real LocalAI gap is the n_batch/n_ubatch coupling at grpc-server.cpp
(both set to nbatch()), which pins the logical scheduling window to the
physical ubatch width. The plan scopes the decouple (C++ option + proto
NUBatch + options.go), an optional decode-headroom prefill cap as a
vendored patch, a token-identical verification harness, and keeps the
work orthogonal to paged KV.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-21 12:54:22 +00:00

ds4

chore: ⬆️ Update antirez/ds4 to 80ebbc396aee40eedc1d829222f3362d10fa4c6c (#10378 )

2026-06-18 00:32:13 +02:00

grpc

fix: speedup git submodule update with --single-branch (#2847 )

2024-07-13 22:32:25 +02:00

ik-llama-cpp

chore: ⬆️ Update ikawrakow/ik_llama.cpp to b3dfb7858cfcb9166e92f366e5af87f19ebc94be (#10395 )

2026-06-19 00:03:37 +02:00

llama-cpp

docs(llama-cpp): scope chunked prefill + n_batch/n_ubatch decouple