docs(paged): additive "hook, don't edit" layout for the patch series

Maintainers rejected PR #22569 (the upstream paged draft) as "slop" - it rewrites
core attention and is unvendorable. Our own series must be additive so it survives
llama.cpp pin bumps. This documents the rule and the per-patch core-touch budget:
every change is either new code in a new vendored src/ file, or a single env-gated
hook at one call site that delegates to it - no logic in core files, no core struct
edits.

Grounds it in the pinned source: llm_graph_input_i is pure-virtual and
res->add_input() lets a new file register a graph input, so paged behavior plugs in
without editing core graph types. Redesigns 0003 (gather-read) from the old 4-file
surgery to one build_attn hook + a new paged-attn.{h,cpp} (a gather-input subclass)
+ two thin cache accessors (~8 core lines vs a core-struct rewrite). 0005 lands
entirely in LocalAI's grpc-server.cpp (no core patch).

Dev tree at the pin with 0001+0002 applied is set up; 0003 implementation is the
next focused token-identical Gate-0 block.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2026-06-22 07:28:44 +00:00
parent 931793aa24
commit 84d59e659b

View File

@@ -0,0 +1,107 @@
# Additive layout for the paged-KV patch series - "hook, don't edit"
Goal: ship paged KV as a vendored patch series that **survives llama.cpp pin bumps with
minimal rebase pain**. PR #22569 (the upstream draft) was rejected by maintainers as
"slop" and is far too invasive to vendor - it rewrites core attention. Our series must be
the opposite: **additive**. This document is the design rule and the per-patch core-touch
budget.
## The rule
> Every change is either (a) **new code in a new vendored file** under `src/`, or (b) a
> **single, env-gated hook** at one call site in a core file that delegates to the new
> file. No logic lives in a core file. No core struct/signature is edited.
Why it works: a hook is a 1-3 line diff against a core file. When upstream churns that file,
`git apply` either still lands the hook (context unchanged) or fails *only on that tiny
hunk*, which is trivial to re-place. Logic embedded inside a core function (the PR #22569 /
old-0003 approach) conflicts on every bump and must be re-understood each time.
This is enforceable as a **core-touch budget**: each patch declares the core files it
touches and the line count; review rejects anything that grows logic in core.
## Why it's achievable here (grounded in the pinned source)
The two seams paged KV needs are both already abstract in llama.cpp at the pin
(`LLAMA_VERSION=f3e1828`), so new behavior plugs in without editing core types:
- **KV placement** - `llama_kv_cache::find_slot` already returns a `slot_info` of physical
cell indices. Paged placement is just *different indices*. 0002 already does this as one
gated block (`if (paged_mode) { ... continue; }`, 41 lines, one file). Ideal.
- **Graph inputs** - `llm_graph_input_i` is a pure-virtual base (`set_input()`), and
`llm_graph_result::add_input(llm_graph_input_ptr)` lets *any* code register a new input
subclass. So a paged graph input (the gather index) can be **a new class in a new file**,
added from a one-line hook - no edit to `llm_graph_input_attn_kv` or `llama-graph.h`.
## Per-patch core-touch budget
| # | Patch | New files (additive) | Core hooks (gated, minimal) | Core lines |
|---|-------|----------------------|------------------------------|-----------:|
| 0001 | vendor manager | `paged-kv-manager.{h,cpp}` | `CMakeLists.txt` +1 | 1 |
| 0002 | block placement | - | one `if(paged_mode){...continue;}` in `find_slot` | ~41 |
| 0003 | gather-read | `paged-attn.{h,cpp}` | `CMakeLists.txt` +1; **one** hook in `build_attn`; 2 tiny accessors on `llama_kv_cache_context` | ~8 |
| 0004 | on-demand alloc | (uses 0001 manager) | one branch in `find_slot` calling the manager | ~10 |
| 0005 | continuous batching | - | **LocalAI `grpc-server.cpp`** (already a LocalAI override, not a core patch) | 0 core |
| 0006 | prefix caching | (uses 0001 manager) | one hash-lookup hook in the 0004 alloc branch | ~6 |
Net core surface for the *entire* engine: `find_slot` (placement/alloc - where physical
cells are already chosen) + **one** line in `build_attn` + two accessors. Everything else
is new files or the LocalAI-side server loop.
## 0003 redesigned to the rule (replaces the 4-file-surgery plan)
The old `0003-gather-read-plan.md` edited `llama-kv-cache.{h,cpp}` + `llama-graph.{h,cpp}`
(including a field added to `llm_graph_input_attn_kv` and fill logic in its `set_input`).
The additive form removes the core-struct and core-`set_input` edits entirely:
**New file `src/paged-attn.{h,cpp}`** holds *all* logic:
- `class llm_graph_input_paged_gather : public llm_graph_input_i` - owns the `I32 [n_gather]`
gather-index tensor and a `const llama_kv_cache_context * mctx`. Its `set_input()` fills
the index with the sequence's used cells (`{ i in [0,n_kv) : !cells.is_empty(i) }`, the
same set the `kq_mask` keeps), in the canonical order.
- `paged_attn::gather(ctx0, res, mctx, v_trans, &k, &v, &kq_mask)` - when paged is active,
constructs that input via `res->add_input(...)`, and applies `ggml_get_rows` to `k`, `v`,
and the transposed `kq_mask` by the shared index (mask: `transpose -> get_rows ->
transpose`). When not active it returns immediately -> **stock path byte-identical**.
**Core hooks (the whole core diff for 0003):**
1. `src/llama-graph.cpp`, in `build_attn` right before `build_attn_mha` (~line 2357):
```cpp
paged_attn::gather(ctx0, res, mctx_cur, v_trans, &k, &v, &kq_mask); // no-op unless LLAMA_KV_PAGED
```
One line. No new field on `llm_graph_input_attn_kv`; the gather input is a *separate*
registered input, so `llama-graph.h` is untouched.
2. `src/llama-kv-cache.{h,cpp}`: two thin accessors on `llama_kv_cache_context` so the new
file can read the used-cell set without reaching into internals -
`uint32_t get_n_gather() const;` and `void get_gather_idxs(int32_t * dst) const;`
(delegate to `kv`/`sinfos[i_cur]`, mirroring the existing `get_n_kv` / `set_input_k_idxs`
pattern). ~8 lines total, no signature changes to existing methods.
3. `src/CMakeLists.txt`: `+ paged-attn.cpp`.
First cut: gate to **flash-attn + single-stream** (`GGML_ASSERT` otherwise) - the V-transposed
(non-FA) and multi-stream gathers are a localized follow-up entirely inside `paged-attn.cpp`,
no new core touch. Gate 0 stays the same: `diff` of greedy `llama-simple` output, stock vs
`LLAMA_KV_PAGED=1`, must be identical (attention is permutation-invariant over the gathered
KV set; `n_gather < n_kv` proves compaction, not identity).
## Anti-drift practices (already in `README.md`, restated as policy)
- **Stacking patches, one concern each**, exported 1:1 from a dev branch via
`git format-patch`. On a pin bump, rebase the branch; only the conflicting small patch
needs a touch, and the failure names the exact step.
- **Default-off (`LLAMA_KV_PAGED`)** until each gate is green, so a partial series never
changes stock behavior - and the hooks compile to a no-op branch when the env is unset.
- **Dev tree:** `git worktree add <dev> <LLAMA_VERSION>` off any checkout that has the pin
(e.g. the existing llama.cpp clone), `git apply` the series, develop the next patch as one
commit, re-export. (Set up and verified for this pin during this work.)
## Status / next step
- 0001, 0002: done, additive, verified token-identical.
- 0003: **redesigned to the additive form above** (this doc). Dev tree at the pin with
0001+0002 applied is ready (`paged` branch). Remaining work is the focused
implement-and-verify block for `paged-attn.{h,cpp}` + the one `build_attn` hook, driven to
the token-identical Gate 0. That is a numerical-correctness task (mask/gather alignment,
FA-first), not a structural one - the structure is settled here.
- 0004-0006: follow the budget above; 0005 lands in LocalAI's `grpc-server.cpp` (no core
patch at all).