Files
LocalAI/backend/cpp/llama-cpp/patches
Ettore Di Giacinto ec7c1b1f68 feat(paged): pin-sync patchset to llama.cpp 9d5d882d (re-export 4 patches)
The worktree merge bumped LLAMA_VERSION 8be759e6 -> 9d5d882d. This re-syncs the
paged patch-stack (0001-0024) to the new tip: the stack was rebased onto
9d5d882d on the DGX dev tree, rebuilt clean (CUDA sm_121), and re-validated
bit-exact before re-exporting the LocalAI .patch files.

Re-exporting each shipped patch from its rebased commit and diffing body-to-body
against the committed files identifies exactly 4 that changed and no longer
git-apply to 9d5d882d:

- 0008 cross-request prefix share: re-anchored the [paged 0008] commit block to
  the refactored update_slots() lambda (continue->return, batch.n_tokens->
  batch.size()); identical env-guarded logic.
- 0013 static prefill budget: budget var-block / while-gate / admission-break
  re-expressed against the refactored loop (add_ok=false idiom).
- 0015 expert-density MoE token-tile auto-select: pure context re-anchor; upstream
  inserted a test_mul_mat_id case at the hunk anchor in test-backend-ops.cpp. The
  inserted lines are unchanged. (This one rebased cleanly via 3-way but its
  committed .patch no longer applies with plain git apply, so it is caught by the
  per-patch apply-check, not by the rebase conflict count.)
- 0016 dynamic decode-first budget: dynamic budget block + n_decode_in_batch =
  batch.size() + add_ok=false against the refactored loop.

All four are byte-faithful format-patch exports of the gate-green rebased commits.
Applying the full corrected series to a fresh 9d5d882d reproduces the gate-green
tree byte-for-byte across every code file.

The other 7 touched patches (0009/0017/0018/0019/0020/0021/0024) are LINENUM-only
(hunk bodies byte-identical, only @@ line-numbers shifted) and still apply
cleanly, so they are left unchanged. The remaining patches are identical.

Validation on the rebased build (NVFP4 Qwen3.6, GB10 sm_121):
- test-backend-ops CUDA0: GATED_DELTA_NET 36/36, SSM_CONV 45/45, MUL_MAT
  1146/1146, MUL_MAT_ID 806/806 all OK.
- greedy md5 (-fa on -n 48 --temp 0 --seed 1): dense q36-27b-nvfp4
  5951a5b4d624ce891e22ab5fca9bc439 and MoE q36-35b-a3b-nvfp4
  07db32c2bcb78d17a43ed18bc22705cd, both == baseline.
- decode S_TG @npl128: dense 366.41 t/s (ref 373.2, -1.8%), MoE 751.11 t/s
  (ref 745.7, +0.7%), both within noise.

Details in backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 14:12:36 +00:00
..

llama.cpp patch series — paged attention (vLLM-parity engine)

A stacking series: each patch is a small, self-contained, independently-buildable step toward an in-model paged-attention engine. They apply in numeric order on top of the pinned LLAMA_VERSION (backend/cpp/llama-cpp/Makefile). The build applies them automatically after checkout (see the llama.cpp: target). Keeping the work as ordered patches — rather than one big diff — is what lets us rebase cleanly across llama.cpp bumps and avoid drift: when a patch stops applying, only that small patch needs fixing, and the failure points at exactly which step the upstream change touched.

Base

  • LLAMA_VERSION pin in ../Makefile. All patches are generated against that exact commit. Bumping the pin = re-run the regen workflow below and fix only the patches that no longer apply.

The series (phases → patches)

# Patch What Verifies
0001 0001-vendor-paged-kv-manager.patch Add src/paged-kv-manager.{h,cpp} (vLLM-parity block manager, CPU foundation) + CMake; no behavior change builds; unit-tested separately under ../paged/
0002 0002-paged-kv-storage.patch Shared block-pool KV tensor + set_rows-by-slot writes, behind LLAMA_KV_PAGED builds; write/gather round-trip
0003 0003-paged-gather-read.patch build_attn_paged gather-read in llama-graph.cpp Gate 0: token-identical greedy gen, single + multi-seq
0004 0004-paged-ondemand-alloc.patch On-demand block allocation via PagedKVManager max concurrent seqs before OOM
0005 0005-paged-continuous-batching.patch Block-granular admit/evict in the server slot path tok/s vs concurrency, mixed-length
0006 0006-paged-prefix-caching.patch Block-hash cross-request prefix dedup TTFT + memory on shared prefixes

Each row is a separate git commit on the dev branch (below), exported 1:1 as a patch. Default off (LLAMA_KV_PAGED) until Gate 0 (0003) is green, so partial series never changes stock behavior.

Regen workflow (the anti-drift recipe)

# 1. check out the exact pin into a dev tree
git -C /tmp clone https://github.com/ggml-org/llama.cpp llama-dev && cd /tmp/llama-dev
git checkout <LLAMA_VERSION from ../Makefile>
git checkout -b paged

# 2. apply the current series (each becomes a commit), or develop the next patch
git am /path/to/backend/cpp/llama-cpp/patches/00*.patch     # or `git apply` + commit per patch

# 3. iterate a phase as ONE commit, then export the whole series 1:1
git format-patch <LLAMA_VERSION>..paged -o /path/to/backend/cpp/llama-cpp/patches/ --zero-commit -N

# 4. on a pin bump: rebase `paged` onto the new pin; only conflicting patches need edits; re-export.

Build integration

../Makefile's llama.cpp: target runs, after git checkout -b build $(LLAMA_VERSION):

for p in $(CURRENT_MAKEFILE_DIR)/patches/0*.patch; do git apply --verbose "$p"; done

All variants (avx/avx2/avx512/cuda/…) copy the patched llama.cpp/ tree, so the series ships everywhere.

Status

  • 0001 vendor manager — DONE. Applies clean to the pin; builds into libllama.
  • 0002 block placement — DONE + VERIFIED. Built llama-simple at the pin; greedy generation is token-identical stock vs LLAMA_KV_PAGED=1 (Qwen3-0.6B), paged branch confirmed firing.
  • 0003 gather-read — DONE + VERIFIED (Gate 0 green). Implemented in the additive form (ADDITIVE_DESIGN.md): all logic in new src/paged-attn.{h,cpp} (a llm_graph_input_i gather-index subclass + the K/V/mask gather), hooked by one line in build_attn + two thin accessors on llama_kv_cache_context + 1 CMake line (216 insertions; no edit to llm_graph_input_attn_kv or llama-graph.h). Greedy generation is token-identical stock vs LLAMA_KV_PAGED=1 (Qwen3-0.6B, 9/9 across 3 prompts × {32,96,128} tokens), with n_gather=71 < n_kv=256 confirming real compaction. Patch: 0003-paged-gather-read-env-LLAMA_KV_PAGED.patch.
    • Key correctness finding: get_gather_idxs must emit cells sorted by token position. The CPU flash-attn online softmax reduces cells in physical-array order and is FP-order-sensitive, so 0002's scattered placement alone (full-window read, no gather) diverges from stock once a sequence crosses the first 16-cell block. The position-sorted gather reproduces stock's exact reduction order -> bit- identical, not merely mathematically equivalent. So 0002 is the placement substrate; 0003 is what makes paged placement token-identical under flash-attn.
  • 00040006 follow.

Honest parity note (important)

This series delivers the paged-attention engine (capacity + scheduling + prefix sharing). It does not by itself reach vLLM throughput parity, because the measured prefill bottleneck is the FP4 MoE GEMM kernel (Lever 3: mul_mat_q<MXFP4> ~22 TFLOP/s, ~27× behind vLLM) — a per-token compute gap that paging does not touch. Paged attention closes the concurrency/memory gap (more sequences, prefix reuse); the prefill/throughput gap additionally needs the tcgen05/CUTLASS grouped-GEMM (deferred, upstream-grade, no shortcut — see ../paged/UPSTREAM_GGML_ISSUE.md and DGX_BLACKWELL_PLAN.md). So full vLLM parity = this series AND the kernel; neither alone suffices.