mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-27 09:57:14 -04:00
The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv,
dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv.
Restore the invariant that patches/ holds only the .patch series.
Moves:
- patches/paged/README.md -> README.md (canonical doc at the backend root)
- patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md,
final_benchmark.csv, qwen36_*.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/
- patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README)
Deletes:
- patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section)
- patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide)
Repoint every reference to the moved files: README internal links (docs/ + the
.github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md,
.github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml,
the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml,
docs/content/features/backends.md, gallery/index.yaml.
The build apply glob PAGED_PATCHES_DIR/0*.patch (PAGED_PATCHES_DIR := .../patches/paged)
is unchanged and still resolves to the 28 patches.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
87 lines
5.6 KiB
Markdown
# llama.cpp patch series: paged attention (vLLM-parity engine)

A **stacking** series: each patch is a small, self-contained, independently-buildable step toward an
in-model paged-attention engine. They apply in numeric order on top of the pinned `LLAMA_VERSION`
(`backend/cpp/llama-cpp/Makefile`). The build applies them automatically after checkout (see the
`llama.cpp:` target). Keeping the work as ordered patches, rather than one big diff, is what lets us
**rebase cleanly across llama.cpp bumps and avoid drift**: when a patch stops applying, only that small
patch needs fixing, and the failure points at exactly which step the upstream change touched.

## Base

- `LLAMA_VERSION` pin in `../Makefile`. **All patches are generated against that exact commit.** Bumping
  the pin means re-running the regen workflow below and fixing only the patches that no longer apply.

## The series (phases → patches)

| # | Patch | What | Verifies |
|---|-------|------|----------|
| 0001 | `0001-vendor-paged-kv-manager.patch` | Add `src/paged-kv-manager.{h,cpp}` (vLLM-parity block manager, CPU foundation) + CMake; no behavior change | builds; unit-tested separately |
| 0002 | `0002-paged-kv-storage.patch` | Shared block-pool KV tensor + `set_rows`-by-slot writes, behind `LLAMA_KV_PAGED` | builds; write/gather round-trip |
| 0003 | `0003-paged-gather-read.patch` | `build_attn_paged` gather-read in `llama-graph.cpp` | **Gate 0**: token-identical greedy gen, single + multi-seq |
| 0004 | `0004-paged-ondemand-alloc.patch` | On-demand block allocation via PagedKVManager | max concurrent seqs before OOM |
| 0005 | `0005-paged-continuous-batching.patch` | Block-granular admit/evict in the server slot path | tok/s vs concurrency, mixed-length |
| 0006 | `0006-paged-prefix-caching.patch` | Block-hash cross-request prefix dedup | TTFT + memory on shared prefixes |

Each row is a separate `git commit` on the dev branch (below), exported 1:1 as a patch. The paged path
stays off by default (gated behind `LLAMA_KV_PAGED`) until Gate 0 (0003) is green, so a partially applied
series never changes stock behavior.

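Gate 0 in the table above asks for token-identical greedy generation with the paged path off and on. A minimal sketch of such a comparison follows. The helper name `gate0_check` is hypothetical and not part of the repository, and the sketch assumes the command under test writes its generated text to stdout:

```sh
# Hypothetical Gate-0-style check: run one command with the paged path off
# and on, and succeed only if its stdout is byte-identical in both runs.
gate0_check() {
    stock=$(mktemp) || return 2
    paged=$(mktemp) || { rm -f "$stock"; return 2; }
    # Stock run: make sure the toggle is absent from the environment.
    ( unset LLAMA_KV_PAGED; "$@" ) >"$stock" 2>/dev/null
    # Paged run: same command with the toggle exported.
    ( LLAMA_KV_PAGED=1; export LLAMA_KV_PAGED; "$@" ) >"$paged" 2>/dev/null
    rc=0
    cmp -s "$stock" "$paged" || rc=$?
    rm -f "$stock" "$paged"
    return "$rc"
}
```

For example, `gate0_check ./llama-simple -m model.gguf "some prompt"` would compare two greedy runs; the binary path, model file, and arguments there are placeholders.
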
## Regen workflow (the anti-drift recipe)

```sh
# 1. check out the exact pin into a dev tree
git -C /tmp clone https://github.com/ggml-org/llama.cpp llama-dev && cd /tmp/llama-dev
git checkout <LLAMA_VERSION from ../Makefile>
git checkout -b paged

# 2. apply the current series (each becomes a commit), or develop the next patch
git am /path/to/backend/cpp/llama-cpp-localai-paged/patches/paged/00*.patch   # or `git apply` + commit per patch

# 3. iterate a phase as ONE commit, then export the whole series 1:1
git format-patch <LLAMA_VERSION>..paged -o /path/to/backend/cpp/llama-cpp-localai-paged/patches/paged/ --zero-commit -N

# 4. on a pin bump: rebase `paged` onto the new pin; only conflicting patches need edits; re-export.
```

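Step 1 of the recipe needs the pinned commit from the Makefile. A minimal sketch for reading it follows, assuming the pin is declared on a line of the form `LLAMA_VERSION?=<commit>`. The helper name `llama_pin` and the accepted assignment operators are assumptions, not something the repository provides:

```sh
# Hypothetical helper: print the pinned commit declared in a Makefile.
# Assumes a line like `LLAMA_VERSION?=<commit>`; `=` and `:=` are accepted too.
llama_pin() {
    sed -n 's/^LLAMA_VERSION[[:space:]]*[?:]\{0,1\}=[[:space:]]*\([^[:space:]]*\).*$/\1/p' "$1" |
        head -n 1
}
```

With that, step 1 could become `git checkout "$(llama_pin ../Makefile)"`, and the `format-patch` range in step 3 could reuse the same value.
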
## Build integration

The series is owned by this backend (`backend/cpp/llama-cpp-localai-paged`), not by the stock
`llama-cpp` backend, which is pure upstream. `../Makefile` (the paged wrapper) clones the pinned
`llama.cpp` via the copied stock build infra, then applies this series onto the cloned tree with the
same strict `git apply` the stock build uses for base patches:
```
for p in $(PAGED_PATCHES_DIR)/0*.patch; do git apply --verbose "$p" || exit 1; done
```
All variants (avx/avx2/avx512/cuda/…) clone + apply into their own build copy, so the series ships
everywhere without ever touching the stock `llama-cpp` source tree.

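The apply loop takes whatever `0*.patch` matches, in the shell's glob order, so two patches sharing a numeric prefix would make the apply order depend on their names. A minimal sketch of a pre-flight check follows, assuming four-digit prefixes followed by `-` as in the table above. `check_series_prefixes` is a hypothetical helper, not part of the build:

```sh
# Hypothetical helper: verify every patch in a directory has a four-digit
# prefix and that no prefix is used twice. Gaps in the numbering are allowed.
check_series_prefixes() {
    dir=$1
    prev=""
    count=0
    for p in "$dir"/0*.patch; do
        # An unmatched glob stays literal; treat that as an empty series.
        [ -e "$p" ] || { echo "no patches in $dir" >&2; return 1; }
        name=${p##*/}      # strip the directory
        num=${name%%-*}    # keep the text before the first '-'
        case $num in
            [0-9][0-9][0-9][0-9]) ;;
            *) echo "bad prefix: $name" >&2; return 1 ;;
        esac
        # Glob results are sorted, so duplicates are always adjacent.
        if [ "$num" = "$prev" ]; then
            echo "duplicate prefix $num: $name" >&2
            return 1
        fi
        prev=$num
        count=$((count + 1))
    done
    echo "ok: $count patches"
}
```

Running `check_series_prefixes "$PAGED_PATCHES_DIR"` before the apply loop would fail fast on a misnamed or duplicated patch instead of surfacing later as a confusing `git apply` error.
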
## Status

- **0001 vendor manager: DONE.** Applies cleanly to the pin; builds into `libllama`.
- **0002 block placement: DONE + VERIFIED.** Built `llama-simple` at the pin; greedy generation is
  **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B), paged branch confirmed firing.
- **0003 gather-read: DONE + VERIFIED (Gate 0 green).** Implemented in the **additive** form
  (see `../README.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index
  subclass + the K/V/mask gather), hooked by **one** line in `build_attn` + **two** thin accessors on
  `llama_kv_cache_context` + 1 CMake line (216 insertions; no edit to `llm_graph_input_attn_kv` or
  `llama-graph.h`). Greedy generation is **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B,
  **9/9** across 3 prompts × {32,96,128} tokens), with `n_gather=71 < n_kv=256` confirming real
  compaction. Patch: `0003-paged-gather-read-env-LLAMA_KV_PAGED.patch`.
- **Key correctness finding:** `get_gather_idxs` must emit cells **sorted by token position**. The CPU
  flash-attn online softmax reduces cells in physical-array order and is FP-order-sensitive, so 0002's
  scattered placement *alone* (full-window read, no gather) diverges from stock once a sequence crosses
  the first 16-cell block. The position-sorted gather reproduces stock's exact reduction order, so the
  result is bit-identical, not merely mathematically equivalent. So 0002 is the placement substrate;
  **0003 is what makes paged placement token-identical under flash-attn.**
- 0004–0006 follow.

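The order sensitivity behind the key correctness finding can be reproduced in miniature. The sketch below is illustrative only: `sum_in_order` is a hypothetical helper, the three values are made up, and `awk` accumulates in double precision rather than the float types a flash-attn kernel uses. It shows only the principle: summing the same values in physical order can differ in the last bit from summing them in position order, and sorting by position first restores the reference result.

```sh
# Hypothetical helper: accumulate column 2 in the order the rows arrive,
# printing enough digits to expose last-bit differences.
sum_in_order() {
    awk '{ s += $2 } END { printf "%.17g\n", s }'
}

# Rows are "<token position> <value>"; the values are made up.
# Reference layout: cells stored contiguously in position order.
reference=$(printf '0 0.1\n1 0.2\n2 0.3\n' | sum_in_order)

# Scattered layout: the same cells in a different physical order.
scattered_cells=$(printf '1 0.2\n2 0.3\n0 0.1\n')
scattered=$(printf '%s\n' "$scattered_cells" | sum_in_order)

# Position-sorted gather: sort by position before reducing.
gathered=$(printf '%s\n' "$scattered_cells" | sort -n -k1,1 | sum_in_order)

[ "$scattered" != "$reference" ] && echo "physical order diverges from the reference"
[ "$gathered" = "$reference" ] && echo "position-sorted gather matches the reference"
```

Both sums are mathematically equal; only the reduction order differs, which is the distinction the finding draws between bit-identical and merely equivalent.
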
### Honest parity note (important)

This series delivers the paged-attention **engine** (capacity + scheduling + prefix sharing). It does **not**
by itself reach vLLM throughput parity, because the measured prefill bottleneck is the **FP4 MoE GEMM kernel**
(Lever 3: `mul_mat_q<MXFP4>` ~22 TFLOP/s, ~27× behind vLLM), a *per-token compute* gap that paging does not
touch. Paged attention closes the **concurrency/memory** gap (more sequences, prefix reuse); the prefill/throughput
gap additionally needs the tcgen05/CUTLASS grouped-GEMM kernel (deferred, upstream-grade, no shortcut; see
`../README.md`). So full vLLM parity = this series **AND** the kernel; neither alone suffices.