From 48fbb9384f293e476f8244c89685ed4f4ea57c28 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Fri, 19 Jun 2026 23:14:25 +0000
Subject: [PATCH] docs(paged): refine 0003 plan - used-cell gather, per-ubatch
 rebuild, single-stream first

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 .../patches/0003-gather-read-plan.md          | 21 +++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/backend/cpp/llama-cpp/patches/0003-gather-read-plan.md b/backend/cpp/llama-cpp/patches/0003-gather-read-plan.md
index 993cb70d4..a4356fa4a 100644
--- a/backend/cpp/llama-cpp/patches/0003-gather-read-plan.md
+++ b/backend/cpp/llama-cpp/patches/0003-gather-read-plan.md
@@ -17,6 +17,27 @@ ggml note: `ggml_get_rows(a,b)` gathers `a`'s **ne1** by `b` (I32). Raw K is `[n
 → ne1 = cells → direct. The mask is `[n_kv, n_tokens, 1, n_stream]` → n_kv is **ne0**, so gather as
 `transpose → get_rows → transpose`.
 
+### KEY CORRECTIONS (found while implementing — these change the edits)
+
+1. **Gather index = ALL used (non-empty) cells in `[0,n_kv)`, NOT `sinfo.idxs`.** `sinfo.idxs` holds only the
+   *current ubatch's write slots*; attention reads the *full history*. Per-token visibility is already
+   enforced by `kq_mask`, so gathering the union of all used cells and gathering the mask with the same
+   index is token-identical and drops exactly the empty (already-masked) cells. So: `gather = { i in [0,n_kv) : !cells.is_empty(i) }`.
+
+2. **Static-graph size is fine because llama.cpp rebuilds the graph every ubatch.** `n_gather` (used-cell
+   count) is therefore a build-time constant for that ubatch: `build_input_gather_idxs` sizes the I32
+   tensor to `get_n_gather(n_kv)` computed at build time, and `set_input_gather_idxs` fills it with the identical
+   cell list. Both MUST use the same loop (`for i in [0,n_kv): if !is_empty(i) push i`) so build-order == fill-order.
+
+3. **K/V gather can live entirely in `build_attn`, with no change to the cache's `get_k`.** The `get_k` 4d view is contiguous
+   in `[ne0,ne1,ne2]` from cell 0 (nb2 == n_embd_head*n_head_kv*elemsz), so for **single stream (ns==1)**:
+   `reshape_3d(k, n_embd_head*n_head_kv, n_kv, 1) → get_rows(., gi) → reshape_4d(., n_embd_head, n_head_kv, n_gather, 1)`.
+   Multi-stream (ns>1) breaks contiguity (nb3 uses kv_size), so gate to ns==1 first and handle multi-stream in a follow-up.
+
+4. So the ONLY cache additions are `is_paged()`, `get_n_gather(n_kv)`, `build_input_gather_idxs(n_kv)` and `set_input_gather_idxs(n_kv)`;
+   everything else (K/V/mask gather) is in `build_attn`. `set_input_kq_mask` is **unchanged** (the mask is built over
+   n_kv, then gathered). This is smaller than the 7-edit estimate above.
+
 ## Edits
 
 ### 1. `src/llama-kv-cache.h` — declare gather infra (in `llama_kv_cache`)