Maintainers rejected PR #22569 (the upstream paged draft) as "slop" - it rewrites core attention and is unvendorable. Our own series must be additive so it survives llama.cpp pin bumps. This documents the rule and the per-patch core-touch budget: every change is either new code in a new vendored src/ file, or a single env-gated hook at one call site that delegates to it - no logic in core files, no core struct edits. Grounds it in the pinned source: llm_graph_input_i is pure-virtual and res->add_input() lets a new file register a graph input, so paged behavior plugs in without editing core graph types. Redesigns 0003 (gather-read) from the old 4-file surgery to one build_attn hook + a new paged-attn.{h,cpp} (a gather-input subclass) + two thin cache accessors (~8 core lines vs a core-struct rewrite). 0005 lands entirely in LocalAI's grpc-server.cpp (no core patch). Dev tree at the pin with 0001+0002 applied is set up; 0003 implementation is the next focused token-identical Gate-0 block. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
6.6 KiB
Additive layout for the paged-KV patch series - "hook, don't edit"
Goal: ship paged KV as a vendored patch series that survives llama.cpp pin bumps with minimal rebase pain. PR #22569 (the upstream draft) was rejected by maintainers as "slop" and is far too invasive to vendor - it rewrites core attention. Our series must be the opposite: additive. This document is the design rule and the per-patch core-touch budget.
The rule
Every change is either (a) new code in a new vendored file under
src/, or (b) a single, env-gated hook at one call site in a core file that delegates to the new file. No logic lives in a core file. No core struct/signature is edited.
Why it works: a hook is a 1-3 line diff against a core file. When upstream churns that file,
git apply either still lands the hook (context unchanged) or fails only on that tiny
hunk, which is trivial to re-place. Logic embedded inside a core function (the PR #22569 /
old-0003 approach) conflicts on every bump and must be re-understood each time.
This is enforceable as a core-touch budget: each patch declares the core files it touches and the line count; review rejects anything that grows logic in core.
Why it's achievable here (grounded in the pinned source)
The two seams paged KV needs are both already abstract in llama.cpp at the pin
(LLAMA_VERSION=f3e1828), so new behavior plugs in without editing core types:
- KV placement -
llama_kv_cache::find_slotalready returns aslot_infoof physical cell indices. Paged placement is just different indices. 0002 already does this as one gated block (if (paged_mode) { ... continue; }, 41 lines, one file). Ideal. - Graph inputs -
llm_graph_input_iis a pure-virtual base (set_input()), andllm_graph_result::add_input(llm_graph_input_ptr)lets any code register a new input subclass. So a paged graph input (the gather index) can be a new class in a new file, added from a one-line hook - no edit tollm_graph_input_attn_kvorllama-graph.h.
Per-patch core-touch budget
| # | Patch | New files (additive) | Core hooks (gated, minimal) | Core lines |
|---|---|---|---|---|
| 0001 | vendor manager | paged-kv-manager.{h,cpp} |
CMakeLists.txt +1 |
1 |
| 0002 | block placement | - | one if(paged_mode){...continue;} in find_slot |
~41 |
| 0003 | gather-read | paged-attn.{h,cpp} |
CMakeLists.txt +1; one hook in build_attn; 2 tiny accessors on llama_kv_cache_context |
~8 |
| 0004 | on-demand alloc | (uses 0001 manager) | one branch in find_slot calling the manager |
~10 |
| 0005 | continuous batching | - | LocalAI grpc-server.cpp (already a LocalAI override, not a core patch) |
0 core |
| 0006 | prefix caching | (uses 0001 manager) | one hash-lookup hook in the 0004 alloc branch | ~6 |
Net core surface for the entire engine: find_slot (placement/alloc - where physical
cells are already chosen) + one line in build_attn + two accessors. Everything else
is new files or the LocalAI-side server loop.
0003 redesigned to the rule (replaces the 4-file-surgery plan)
The old 0003-gather-read-plan.md edited llama-kv-cache.{h,cpp} + llama-graph.{h,cpp}
(including a field added to llm_graph_input_attn_kv and fill logic in its set_input).
The additive form removes the core-struct and core-set_input edits entirely:
New file src/paged-attn.{h,cpp} holds all logic:
class llm_graph_input_paged_gather : public llm_graph_input_i- owns theI32 [n_gather]gather-index tensor and aconst llama_kv_cache_context * mctx. Itsset_input()fills the index with the sequence's used cells ({ i in [0,n_kv) : !cells.is_empty(i) }, the same set thekq_maskkeeps), in the canonical order.paged_attn::gather(ctx0, res, mctx, v_trans, &k, &v, &kq_mask)- when paged is active, constructs that input viares->add_input(...), and appliesggml_get_rowstok,v, and the transposedkq_maskby the shared index (mask:transpose -> get_rows -> transpose). When not active it returns immediately -> stock path byte-identical.
Core hooks (the whole core diff for 0003):
src/llama-graph.cpp, inbuild_attnright beforebuild_attn_mha(~line 2357):One line. No new field onpaged_attn::gather(ctx0, res, mctx_cur, v_trans, &k, &v, &kq_mask); // no-op unless LLAMA_KV_PAGEDllm_graph_input_attn_kv; the gather input is a separate registered input, sollama-graph.his untouched.src/llama-kv-cache.{h,cpp}: two thin accessors onllama_kv_cache_contextso the new file can read the used-cell set without reaching into internals -uint32_t get_n_gather() const;andvoid get_gather_idxs(int32_t * dst) const;(delegate tokv/sinfos[i_cur], mirroring the existingget_n_kv/set_input_k_idxspattern). ~8 lines total, no signature changes to existing methods.src/CMakeLists.txt:+ paged-attn.cpp.
First cut: gate to flash-attn + single-stream (GGML_ASSERT otherwise) - the V-transposed
(non-FA) and multi-stream gathers are a localized follow-up entirely inside paged-attn.cpp,
no new core touch. Gate 0 stays the same: diff of greedy llama-simple output, stock vs
LLAMA_KV_PAGED=1, must be identical (attention is permutation-invariant over the gathered
KV set; n_gather < n_kv proves compaction, not identity).
Anti-drift practices (already in README.md, restated as policy)
- Stacking patches, one concern each, exported 1:1 from a dev branch via
git format-patch. On a pin bump, rebase the branch; only the conflicting small patch needs a touch, and the failure names the exact step. - Default-off (
LLAMA_KV_PAGED) until each gate is green, so a partial series never changes stock behavior - and the hooks compile to a no-op branch when the env is unset. - Dev tree:
git worktree add <dev> <LLAMA_VERSION>off any checkout that has the pin (e.g. the existing llama.cpp clone),git applythe series, develop the next patch as one commit, re-export. (Set up and verified for this pin during this work.)
Status / next step
- 0001, 0002: done, additive, verified token-identical.
- 0003: redesigned to the additive form above (this doc). Dev tree at the pin with
0001+0002 applied is ready (
pagedbranch). Remaining work is the focused implement-and-verify block forpaged-attn.{h,cpp}+ the onebuild_attnhook, driven to the token-identical Gate 0. That is a numerical-correctness task (mask/gather alignment, FA-first), not a structural one - the structure is settled here. - 0004-0006: follow the budget above; 0005 lands in LocalAI's
grpc-server.cpp(no core patch at all).