docs(paged): exact executable plan for 0003 gather-read

Every edit mapped (gather-index graph input mirroring k_idxs; gather K/V/mask by one aligned index; n_kv compaction; gated so stock stays byte-identical) with the token-identical gate and the known risks (mask transpose layout, v_trans). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-23 16:19:07 -04:00 · 2026-06-19 23:12:18 +00:00
parent c4b4f3a3e4
commit 145e45b6f2
1 changed files with 81 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/0003-gather-read-plan.md
+++ b/backend/cpp/llama-cpp/patches/0003-gather-read-plan.md
@@ -0,0 +1,81 @@
+# Patch 0003 — paged gather-read: exact implementation plan
+
+**Goal:** a sequence attends only its own (compacted) cells via `ggml_get_rows`, instead of the scattered
+`[0,n_kv)` window. Token-identical (attention is permutation-invariant over the KV set). **Gated**: stock
+path stays byte-identical (no new ops unless `LLAMA_KV_PAGED`).
+
+**Base:** applies on top of 0001+0002 at the pin. Dev tree: `backend/cpp/llama-cpp-paged-dev` (branch `paged`).
+
+## Design
+
+The gather is keyed off one runtime index list (the sequence's used cells, in a fixed order), exposed as a
+graph input (mirroring `k_idxs`). In `build_attn`, gather K, V **and the kq_mask** by that same index, so all
+three stay aligned. `n_gathered` replaces `n_kv` for the attention. Only active when the cache is in paged
+mode (a new `is_paged()` flag set when `LLAMA_KV_PAGED`/find_slot used permuted placement).
+
+ggml note: `ggml_get_rows(a,b)` gathers `a`'s **ne1** by `b` (I32). Raw K is `[n_embd_k_gqa, kv_size, n_stream]`
+→ ne1 = cells → direct. The mask is `[n_kv, n_tokens, 1, n_stream]` → n_kv is **ne0**, so gather as
+`transpose → get_rows → transpose`.
+
+## Edits
+
+### 1. `src/llama-kv-cache.h` — declare gather infra (in `llama_kv_cache`)
+```cpp
+    bool        is_paged() const { return paged_active; }            // near get_size()
+    ggml_tensor * build_input_gather_idxs(ggml_context * ctx, const slot_info & sinfo) const;
+    void          set_input_gather_idxs (ggml_tensor * dst, const slot_info & sinfo) const;
+    uint32_t      get_n_gather(const slot_info & sinfo) const;       // == sum of used cells gathered
+```
+Add member `mutable bool paged_active = false;` and in `llama_kv_cache_context` forward the three (like
+`build_input_k_idxs`/`get_n_kv`).
+
+### 2. `src/llama-kv-cache.cpp`
+- In `find_slot`, in the paged branch (0002), set `paged_active = true;` on success.
+- `get_n_gather(sinfo)` = `sinfo.idxs[0].size()` summed over streams (the count actually placed).
+- `build_input_gather_idxs`: `ggml_new_tensor_1d(ctx, GGML_TYPE_I32, get_n_gather(sinfo)); ggml_set_input(...)`.
+- `set_input_gather_idxs`: fill `data[k++] = strm_off + sinfo.idxs[s][i]` for every placed cell (same order
+  the mask/k/v will see). This is the canonical gather order.
+
+### 3. `src/llama-graph.h` — `llm_graph_input_attn_kv`
+Add `ggml_tensor * gather_idxs = nullptr;` + `ggml_tensor * get_gather_idxs() const { return gather_idxs; }`.
+
+### 4. `src/llama-graph.cpp`
+- `llm_graph_input_attn_kv::set_input`: if `mctx->is_paged()` → `mctx->set_input_gather_idxs(gather_idxs, ...)`.
+- `build_attn_inp_kv` (creates the input): if `mctx_cur->is_paged()` → `inp->gather_idxs =
+  mctx_cur->build_input_gather_idxs(ctx0, ...)`.
+- `build_attn` (the kv overload, ~2356): after `k`,`v`,`kq_mask`:
+```cpp
+if (ggml_tensor * gi = inp->get_gather_idxs()) {
+    k = ggml_get_rows(ctx0, k, gi);                                   // [d, n_gather, ...] (reshape view ok)
+    v = v_trans ? /* gather columns */ : ggml_get_rows(ctx0, v, gi);
+    ggml_tensor * m = ggml_cont(ctx0, ggml_transpose(ctx0, kq_mask)); // [n_tokens, n_kv]
+    m = ggml_get_rows(ctx0, m, gi);                                   // [n_tokens, n_gather]
+    kq_mask = ggml_cont(ctx0, ggml_transpose(ctx0, m));              // [n_gather, n_tokens]
+}
+ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask, sinks, v_mla, kq_scale, il);
+```
+Note: `get_k` returns the reshaped 4d view; gather must run on a cell-major shape. Simplest: add a paged
+variant `get_k(ctx,il)` that returns `ggml_get_rows` of the **raw** `layers[ikv].k` then reshapes to
+`[n_embd_head, n_head_kv, n_gather, ns]`. Do the gather in the cache, not the graph, for K/V; keep only the
+mask gather in the graph. (Cleaner — revisit during impl.)
+
+### 5. V-transposed path
+When `!flash_attn`, V is stored transposed `[kv_size, n_embd_v_gqa]`; gather its **rows** (ne1 = n_embd) won't
+work — gather columns via the same idx on the non-transposed store, OR force `is_paged()` to require
+flash-attn for the first cut (`GGML_ASSERT`) and handle v_trans in a follow-up.
+
+## Verification (the gate)
+```sh
+cmake --build build-cpu --target llama-simple -j
+M=Qwen3-0.6B.Q4_K_M.gguf ; P="<the 0002 prompt>"
+build-cpu/bin/llama-simple -m $M -n 64 "$P" > a.txt                    # stock
+LLAMA_KV_PAGED=1 build-cpu/bin/llama-simple -m $M -n 64 "$P" > b.txt   # paged gather-read
+diff a.txt b.txt        # MUST be identical
+```
+Also assert (debug) that `n_gather < n_kv` on a multi-chunk sequence (proves compaction, not identity).
+Export only when identical: `git format-patch HEAD~1 -o patches/ --start-number 3 -N`.
+
+## Risks
+- Mask transpose/layout: if `b.txt` diverges, dump the gathered mask vs expected for token 0; off-by-order
+  means the `set_input_gather_idxs` order ≠ the get_k gather order — they MUST use the identical loop.
+- flash-attn vs not: do flash-attn first (simpler mask), then v_trans.