mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 08:08:52 -04:00
docs(paged): additive "hook, don't edit" layout for the patch series
Maintainers rejected PR #22569 (the upstream paged draft) as "slop" - it rewrites core attention and is unvendorable. Our own series must be additive so it survives llama.cpp pin bumps. This documents the rule and the per-patch core-touch budget: every change is either new code in a new vendored src/ file, or a single env-gated hook at one call site that delegates to it - no logic in core files, no core struct edits. Grounds it in the pinned source: llm_graph_input_i is pure-virtual and res->add_input() lets a new file register a graph input, so paged behavior plugs in without editing core graph types. Redesigns 0003 (gather-read) from the old 4-file surgery to one build_attn hook + a new paged-attn.{h,cpp} (a gather-input subclass) + two thin cache accessors (~8 core lines vs a core-struct rewrite). 0005 lands entirely in LocalAI's grpc-server.cpp (no core patch). Dev tree at the pin with 0001+0002 applied is set up; 0003 implementation is the next focused token-identical Gate-0 block. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
107
backend/cpp/llama-cpp/patches/ADDITIVE_DESIGN.md
Normal file
107
backend/cpp/llama-cpp/patches/ADDITIVE_DESIGN.md
Normal file
@@ -0,0 +1,107 @@
|
||||
# Additive layout for the paged-KV patch series - "hook, don't edit"
|
||||
|
||||
Goal: ship paged KV as a vendored patch series that **survives llama.cpp pin bumps with
|
||||
minimal rebase pain**. PR #22569 (the upstream draft) was rejected by maintainers as
|
||||
"slop" and is far too invasive to vendor - it rewrites core attention. Our series must be
|
||||
the opposite: **additive**. This document is the design rule and the per-patch core-touch
|
||||
budget.
|
||||
|
||||
## The rule
|
||||
|
||||
> Every change is either (a) **new code in a new vendored file** under `src/`, or (b) a
|
||||
> **single, env-gated hook** at one call site in a core file that delegates to the new
|
||||
> file. No logic lives in a core file. No core struct/signature is edited.
|
||||
|
||||
Why it works: a hook is a 1-3 line diff against a core file. When upstream churns that file,
|
||||
`git apply` either still lands the hook (context unchanged) or fails *only on that tiny
|
||||
hunk*, which is trivial to re-place. Logic embedded inside a core function (the PR #22569 /
|
||||
old-0003 approach) conflicts on every bump and must be re-understood each time.
|
||||
|
||||
This is enforceable as a **core-touch budget**: each patch declares the core files it
|
||||
touches and the line count; review rejects anything that grows logic in core.
|
||||
|
||||
## Why it's achievable here (grounded in the pinned source)
|
||||
|
||||
The two seams paged KV needs are both already abstract in llama.cpp at the pin
|
||||
(`LLAMA_VERSION=f3e1828`), so new behavior plugs in without editing core types:
|
||||
|
||||
- **KV placement** - `llama_kv_cache::find_slot` already returns a `slot_info` of physical
|
||||
cell indices. Paged placement is just *different indices*. 0002 already does this as one
|
||||
gated block (`if (paged_mode) { ... continue; }`, 41 lines, one file). Ideal.
|
||||
- **Graph inputs** - `llm_graph_input_i` is a pure-virtual base (`set_input()`), and
|
||||
`llm_graph_result::add_input(llm_graph_input_ptr)` lets *any* code register a new input
|
||||
subclass. So a paged graph input (the gather index) can be **a new class in a new file**,
|
||||
added from a one-line hook - no edit to `llm_graph_input_attn_kv` or `llama-graph.h`.
|
||||
|
||||
## Per-patch core-touch budget
|
||||
|
||||
| # | Patch | New files (additive) | Core hooks (gated, minimal) | Core lines |
|
||||
|---|-------|----------------------|------------------------------|-----------:|
|
||||
| 0001 | vendor manager | `paged-kv-manager.{h,cpp}` | `CMakeLists.txt` +1 | 1 |
|
||||
| 0002 | block placement | - | one `if(paged_mode){...continue;}` in `find_slot` | ~41 |
|
||||
| 0003 | gather-read | `paged-attn.{h,cpp}` | `CMakeLists.txt` +1; **one** hook in `build_attn`; 2 tiny accessors on `llama_kv_cache_context` | ~8 |
|
||||
| 0004 | on-demand alloc | (uses 0001 manager) | one branch in `find_slot` calling the manager | ~10 |
|
||||
| 0005 | continuous batching | - | **LocalAI `grpc-server.cpp`** (already a LocalAI override, not a core patch) | 0 core |
|
||||
| 0006 | prefix caching | (uses 0001 manager) | one hash-lookup hook in the 0004 alloc branch | ~6 |
|
||||
|
||||
Net core surface for the *entire* engine: `find_slot` (placement/alloc - where physical
|
||||
cells are already chosen) + **one** line in `build_attn` + two accessors. Everything else
|
||||
is new files or the LocalAI-side server loop.
|
||||
|
||||
## 0003 redesigned to the rule (replaces the 4-file-surgery plan)
|
||||
|
||||
The old `0003-gather-read-plan.md` edited `llama-kv-cache.{h,cpp}` + `llama-graph.{h,cpp}`
|
||||
(including a field added to `llm_graph_input_attn_kv` and fill logic in its `set_input`).
|
||||
The additive form removes the core-struct and core-`set_input` edits entirely:
|
||||
|
||||
**New file `src/paged-attn.{h,cpp}`** holds *all* logic:
|
||||
- `class llm_graph_input_paged_gather : public llm_graph_input_i` - owns the `I32 [n_gather]`
|
||||
gather-index tensor and a `const llama_kv_cache_context * mctx`. Its `set_input()` fills
|
||||
the index with the sequence's used cells (`{ i in [0,n_kv) : !cells.is_empty(i) }`, the
|
||||
same set the `kq_mask` keeps), in the canonical order.
|
||||
- `paged_attn::gather(ctx0, res, mctx, v_trans, &k, &v, &kq_mask)` - when paged is active,
|
||||
constructs that input via `res->add_input(...)`, and applies `ggml_get_rows` to `k`, `v`,
|
||||
and the transposed `kq_mask` by the shared index (mask: `transpose -> get_rows ->
|
||||
transpose`). When not active it returns immediately -> **stock path byte-identical**.
|
||||
|
||||
**Core hooks (the whole core diff for 0003):**
|
||||
1. `src/llama-graph.cpp`, in `build_attn` right before `build_attn_mha` (~line 2357):
|
||||
```cpp
|
||||
paged_attn::gather(ctx0, res, mctx_cur, v_trans, &k, &v, &kq_mask); // no-op unless LLAMA_KV_PAGED
|
||||
```
|
||||
One line. No new field on `llm_graph_input_attn_kv`; the gather input is a *separate*
|
||||
registered input, so `llama-graph.h` is untouched.
|
||||
2. `src/llama-kv-cache.{h,cpp}`: two thin accessors on `llama_kv_cache_context` so the new
|
||||
file can read the used-cell set without reaching into internals -
|
||||
`uint32_t get_n_gather() const;` and `void get_gather_idxs(int32_t * dst) const;`
|
||||
(delegate to `kv`/`sinfos[i_cur]`, mirroring the existing `get_n_kv` / `set_input_k_idxs`
|
||||
pattern). ~8 lines total, no signature changes to existing methods.
|
||||
3. `src/CMakeLists.txt`: `+ paged-attn.cpp`.
|
||||
|
||||
First cut: gate to **flash-attn + single-stream** (`GGML_ASSERT` otherwise) - the V-transposed
|
||||
(non-FA) and multi-stream gathers are a localized follow-up entirely inside `paged-attn.cpp`,
|
||||
no new core touch. Gate 0 stays the same: `diff` of greedy `llama-simple` output, stock vs
|
||||
`LLAMA_KV_PAGED=1`, must be identical (attention is permutation-invariant over the gathered
|
||||
KV set; `n_gather < n_kv` proves compaction, not identity).
|
||||
|
||||
## Anti-drift practices (already in `README.md`, restated as policy)
|
||||
|
||||
- **Stacking patches, one concern each**, exported 1:1 from a dev branch via
|
||||
`git format-patch`. On a pin bump, rebase the branch; only the conflicting small patch
|
||||
needs a touch, and the failure names the exact step.
|
||||
- **Default-off (`LLAMA_KV_PAGED`)** until each gate is green, so a partial series never
|
||||
changes stock behavior - and the hooks compile to a no-op branch when the env is unset.
|
||||
- **Dev tree:** `git worktree add <dev> <LLAMA_VERSION>` off any checkout that has the pin
|
||||
(e.g. the existing llama.cpp clone), `git apply` the series, develop the next patch as one
|
||||
commit, re-export. (Set up and verified for this pin during this work.)
|
||||
|
||||
## Status / next step
|
||||
|
||||
- 0001, 0002: done, additive, verified token-identical.
|
||||
- 0003: **redesigned to the additive form above** (this doc). Dev tree at the pin with
|
||||
0001+0002 applied is ready (`paged` branch). Remaining work is the focused
|
||||
implement-and-verify block for `paged-attn.{h,cpp}` + the one `build_attn` hook, driven to
|
||||
the token-identical Gate 0. That is a numerical-correctness task (mask/gather alignment,
|
||||
FA-first), not a structural one - the structure is settled here.
|
||||
- 0004-0006: follow the budget above; 0005 lands in LocalAI's `grpc-server.cpp` (no core
|
||||
patch at all).
|
||||
Reference in New Issue
Block a user