Files
LocalAI/backend/cpp/llama-cpp/patches/README.md
Ettore Di Giacinto d9d846e04b feat(paged): patch 0003 gather-read - Gate 0 green, token-identical, additive
Implements the paged-attention gather-read (the real engine compute): attention
reads ONLY a sequence's used cells by gathering K, V and the kq_mask by the
non-empty-cell index list before build_attn_mha. Verified token-identical to stock
greedy generation, 9/9 across 3 prompts x {32,96,128} tokens on Qwen3-0.6B, with
n_gather=71 < n_kv=256 confirming real compaction (not an identity no-op).

Built in the additive "hook, don't edit" form: all logic in new src/paged-attn.{h,cpp}
(an llm_graph_input_i gather-index subclass + the K/V/mask gather), hooked by one line
in build_attn + two thin accessors on llama_kv_cache_context + one CMake line. No edit
to llm_graph_input_attn_kv or llama-graph.h. 216 insertions; default-off behind
LLAMA_KV_PAGED so stock path stays byte-identical.

Key correctness finding: get_gather_idxs emits cells sorted by token position. CPU
flash-attn's online softmax reduces cells in physical-array order and is FP-order-
sensitive, so 0002's scattered placement alone (full-window read) diverges from stock
past the first block; the position-sorted gather reproduces stock's exact reduction
order -> bit-identical. So 0003 is what makes paged placement token-identical under
flash-attn.

Verified on a dev tree at the pin (0001+0002+0003 on branch paged); not pushed.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 08:26:46 +00:00

83 lines
5.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# llama.cpp patch series — paged attention (vLLM-parity engine)
A **stacking** series: each patch is a small, self-contained, independently-buildable step toward an
in-model paged-attention engine. They apply in numeric order on top of the pinned `LLAMA_VERSION`
(`backend/cpp/llama-cpp/Makefile`). The build applies them automatically after checkout (see the
`llama.cpp:` target). Keeping the work as ordered patches — rather than one big diff — is what lets us
**rebase cleanly across llama.cpp bumps and avoid drift**: when a patch stops applying, only that small
patch needs fixing, and the failure points at exactly which step the upstream change touched.
## Base
- `LLAMA_VERSION` pin in `../Makefile`. **All patches are generated against that exact commit.** Bumping
the pin = re-run the regen workflow below and fix only the patches that no longer apply.
## The series (phases → patches)
| # | Patch | What | Verifies |
|---|-------|------|----------|
| 0001 | `0001-vendor-paged-kv-manager.patch` | Add `src/paged-kv-manager.{h,cpp}` (vLLM-parity block manager, CPU foundation) + CMake; no behavior change | builds; unit-tested separately under `../paged/` |
| 0002 | `0002-paged-kv-storage.patch` | Shared block-pool KV tensor + `set_rows`-by-slot writes, behind `LLAMA_KV_PAGED` | builds; write/gather round-trip |
| 0003 | `0003-paged-gather-read.patch` | `build_attn_paged` gather-read in `llama-graph.cpp` | **Gate 0**: token-identical greedy gen, single + multi-seq |
| 0004 | `0004-paged-ondemand-alloc.patch` | On-demand block allocation via PagedKVManager | max concurrent seqs before OOM |
| 0005 | `0005-paged-continuous-batching.patch` | Block-granular admit/evict in the server slot path | tok/s vs concurrency, mixed-length |
| 0006 | `0006-paged-prefix-caching.patch` | Block-hash cross-request prefix dedup | TTFT + memory on shared prefixes |
Each row is a separate `git commit` on the dev branch (below), exported 1:1 as a patch. Default off
(`LLAMA_KV_PAGED`) until Gate 0 (0003) is green, so partial series never changes stock behavior.
## Regen workflow (the anti-drift recipe)
```sh
# 1. check out the exact pin into a dev tree
git -C /tmp clone https://github.com/ggml-org/llama.cpp llama-dev && cd /tmp/llama-dev
git checkout <LLAMA_VERSION from ../Makefile>
git checkout -b paged
# 2. apply the current series (each becomes a commit), or develop the next patch
git am /path/to/backend/cpp/llama-cpp/patches/00*.patch # or `git apply` + commit per patch
# 3. iterate a phase as ONE commit, then export the whole series 1:1
git format-patch <LLAMA_VERSION>..paged -o /path/to/backend/cpp/llama-cpp/patches/ --zero-commit -N
# 4. on a pin bump: rebase `paged` onto the new pin; only conflicting patches need edits; re-export.
```
## Build integration
`../Makefile`'s `llama.cpp:` target runs, after `git checkout -b build $(LLAMA_VERSION)`:
```
for p in $(CURRENT_MAKEFILE_DIR)/patches/0*.patch; do git apply --verbose "$p"; done
```
All variants (avx/avx2/avx512/cuda/…) copy the patched `llama.cpp/` tree, so the series ships everywhere.
## Status
- **0001 vendor manager — DONE.** Applies clean to the pin; builds into `libllama`.
- **0002 block placement — DONE + VERIFIED.** Built `llama-simple` at the pin; greedy generation is
**token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B), paged branch confirmed firing.
- **0003 gather-read — DONE + VERIFIED (Gate 0 green).** Implemented in the **additive** form
(`ADDITIVE_DESIGN.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index
subclass + the K/V/mask gather), hooked by **one** line in `build_attn` + **two** thin accessors on
`llama_kv_cache_context` + 1 CMake line (216 insertions; no edit to `llm_graph_input_attn_kv` or
`llama-graph.h`). Greedy generation is **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B,
**9/9** across 3 prompts × {32,96,128} tokens), with `n_gather=71 < n_kv=256` confirming real
compaction. Patch: `0003-paged-gather-read-env-LLAMA_KV_PAGED.patch`.
- **Key correctness finding:** `get_gather_idxs` must emit cells **sorted by token position**. The CPU
flash-attn online softmax reduces cells in physical-array order and is FP-order-sensitive, so 0002's
scattered placement *alone* (full-window read, no gather) diverges from stock once a sequence crosses
the first 16-cell block. The position-sorted gather reproduces stock's exact reduction order -> bit-
identical, not merely mathematically equivalent. So 0002 is the placement substrate; **0003 is what
makes paged placement token-identical under flash-attn.**
- 00040006 follow.
### Honest parity note (important)
This series delivers the paged-attention **engine** (capacity + scheduling + prefix sharing). It does **not**
by itself reach vLLM throughput parity, because the measured prefill bottleneck is the **FP4 MoE GEMM kernel**
(Lever 3: `mul_mat_q<MXFP4>` ~22 TFLOP/s, ~27× behind vLLM) — a *per-token compute* gap that paging does not
touch. Paged attention closes the **concurrency/memory** gap (more sequences, prefix reuse); the prefill/throughput
gap additionally needs the tcgen05/CUTLASS grouped-GEMM (deferred, upstream-grade, no shortcut — see
`../paged/UPSTREAM_GGML_ISSUE.md` and `DGX_BLACKWELL_PLAN.md`). So full vLLM parity = this series **AND** the
kernel; neither alone suffices.