From e160041f05f74d2386eba60e98734cd74b6a5677 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Sat, 27 Jun 2026 08:02:37 +0000
Subject: [PATCH] chore(paged): decouple paged llama.cpp pin from the nightly auto-bumper

The llama-cpp-localai-paged backend reused backend/cpp/llama-cpp's
LLAMA_VERSION, which .github/workflows/bump_deps.yaml auto-bumps nightly
to the latest ggml-org/llama.cpp master tip. The stock backend is
patch-free so that bump is safe, but the paged backend applies a
vendored patch series (backend/cpp/llama-cpp/patches/paged/)
hand-verified bit-exact against ONE specific tip. A naive bump moves the
tip out from under the patches and breaks 'git apply' at build time - a
dep-bump PR would go red (or, worse, the break surfaces later in a
release build).

Mirror the turboquant precedent: give the paged wrapper its OWN
LLAMA_VERSION pin (the verified 9d5d882d) and force it into every copied
build via LLAMA_VERSION=$(LLAMA_VERSION), so the nightly stock bump no
longer drags the paged build to an unverified tip. Unlike turboquant
(whose fork branch carries the patches and is safe to auto-bump), the
paged series is vendored, so it gets NO bump_deps.yaml entry: it is
advanced only by the manual PIN_SYNC process. Add cross-referencing
comments in both Makefiles and bump_deps.yaml.

Also add PIN_BUMP_APPLY_CHECK.md: an apply-feasibility report for the
latest tip (c299a92c, 23 commits ahead). The full series applies CLEAN
under 'git apply' with only benign line offsets and zero conflicts; the
lone failure (0019) is a pre-existing stray dev-doc hunk, identical on
the current pin, not a bump regression.

Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto --- .github/workflows/bump_deps.yaml | 9 ++ backend/cpp/llama-cpp-localai-paged/Makefile | 43 ++++--- backend/cpp/llama-cpp/Makefile | 5 + .../patches/paged/PIN_BUMP_APPLY_CHECK.md | 107 ++++++++++++++++++ 4 files changed, 150 insertions(+), 14 deletions(-) create mode 100644 backend/cpp/llama-cpp/patches/paged/PIN_BUMP_APPLY_CHECK.md diff --git a/.github/workflows/bump_deps.yaml b/.github/workflows/bump_deps.yaml index a2c37881f..506e2fe44 100644 --- a/.github/workflows/bump_deps.yaml +++ b/.github/workflows/bump_deps.yaml @@ -9,6 +9,15 @@ jobs: strategy: fail-fast: false matrix: + # NOTE: there is intentionally NO entry for the llama-cpp-localai-paged + # backend. It carries a vendored paged-attention patch series + # (backend/cpp/llama-cpp/patches/paged/) hand-verified bit-exact against + # ONE specific llama.cpp tip; a naive nightly bump would move the tip out + # from under the patches and break `git apply` at build time. Its pin is + # therefore decoupled (its own LLAMA_VERSION in + # backend/cpp/llama-cpp-localai-paged/Makefile) and advanced ONLY by the + # manual PIN_SYNC process. Do not add it here. (turboquant CAN be + # auto-bumped below because its fork branch carries the patches.) include: - repository: "ggml-org/llama.cpp" variable: "LLAMA_VERSION" diff --git a/backend/cpp/llama-cpp-localai-paged/Makefile b/backend/cpp/llama-cpp-localai-paged/Makefile index 09f6bbf76..6a5a4f41b 100644 --- a/backend/cpp/llama-cpp-localai-paged/Makefile +++ b/backend/cpp/llama-cpp-localai-paged/Makefile @@ -1,20 +1,35 @@ -# llama-cpp-localai-paged is LocalAI's paged-attention llama.cpp variant. It is -# the SAME upstream llama.cpp pin as the stock llama-cpp backend, with the -# LocalAI paged-attention patch series (backend/cpp/llama-cpp/patches/paged/) -# applied on top (LLAMA_PAGED=on). 
It reuses backend/cpp/llama-cpp's -# grpc-server.cpp / CMakeLists.txt / prepare.sh sources verbatim via a thin -# wrapper, so there is nothing to keep in sync here. +# llama-cpp-localai-paged is LocalAI's paged-attention llama.cpp variant. It +# builds upstream llama.cpp with the LocalAI paged-attention patch series +# (backend/cpp/llama-cpp/patches/paged/) applied on top (LLAMA_PAGED=on). It +# reuses backend/cpp/llama-cpp's grpc-server.cpp / CMakeLists.txt / prepare.sh +# sources verbatim via a thin wrapper. +# +# Pin handling (mirrors the turboquant wrapper, the precedent this is modelled +# on): the paged patch series is hand-verified bit-exact against ONE specific +# llama.cpp tip and re-exported by the manual PIN_SYNC process +# (backend/cpp/llama-cpp/patches/paged/PIN_SYNC_*.md). A naive pin bump would +# move the tip out from under the patches and break `git apply` at build time, +# so this backend OWNS its pin (LLAMA_VERSION below) instead of inheriting the +# auto-bumped stock pin from backend/cpp/llama-cpp/Makefile. The override is +# forced into every copied build via `LLAMA_VERSION=$(LLAMA_VERSION)`. There is +# deliberately NO bump_deps.yaml entry for it: it is advanced ONLY by PIN_SYNC, +# never nightly. (turboquant CAN auto-bump because its fork branch carries the +# patches; the paged series is vendored as .patch files here, so it cannot.) # -# Differences vs the turboquant wrapper (the precedent this is modelled on): -# - NO LLAMA_REPO / LLAMA_VERSION override: we build the SAME upstream pin as -# stock llama-cpp (it lives in backend/cpp/llama-cpp/Makefile and is -# auto-bumped there), so there is no bump_deps.yaml entry to maintain. # - NO patch-grpc-server.sh and NO apply-patches.sh: the shared # grpc-server.cpp already carries the (runtime-gated) paged option hooks, # and the paged patch series is applied by the copied llama-cpp Makefile's # own `llama.cpp` target whenever LLAMA_PAGED=on (which we force below). 
+# Manually pin-synced llama.cpp tip the paged patch series is verified against. +# Decoupled from the auto-bumped stock pin in backend/cpp/llama-cpp/Makefile so +# the nightly llama.cpp bump cannot silently break the vendored paged patches. +# Advance ONLY via the PIN_SYNC process (rebase patches + bit-exact gate + +# re-export), then update this value. See: +# backend/cpp/llama-cpp/patches/paged/PIN_SYNC_*.md +LLAMA_VERSION?=9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1 + CMAKE_ARGS?= BUILD_TYPE?= NATIVE?=false @@ -45,8 +60,8 @@ define paged-build cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build purge $(info $(GREEN)I llama-cpp-localai-paged build info:$(1)$(RESET)) - LLAMA_PAGED=on $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build llama.cpp - CMAKE_ARGS="$(CMAKE_ARGS) $(2)" TARGET="$(3)" LLAMA_PAGED=on \ + LLAMA_VERSION=$(LLAMA_VERSION) LLAMA_PAGED=on $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build llama.cpp + CMAKE_ARGS="$(CMAKE_ARGS) $(2)" TARGET="$(3)" LLAMA_VERSION=$(LLAMA_VERSION) LLAMA_PAGED=on \ $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build grpc-server cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build/grpc-server llama-cpp-localai-paged-$(1) endef @@ -75,8 +90,8 @@ llama-cpp-localai-paged-cpu-all: cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build purge $(info $(GREEN)I llama-cpp-localai-paged build info:cpu-all-variants$(RESET)) - LLAMA_PAGED=on $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build llama.cpp - SHARED_LIBS=ON EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" TARGET="--target grpc-server --target ggml" LLAMA_PAGED=on \ + LLAMA_VERSION=$(LLAMA_VERSION) LLAMA_PAGED=on $(MAKE) -C 
$(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build llama.cpp + SHARED_LIBS=ON EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" TARGET="--target grpc-server --target ggml" LLAMA_VERSION=$(LLAMA_VERSION) LLAMA_PAGED=on \ $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build grpc-server cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build/grpc-server llama-cpp-localai-paged-cpu-all rm -rf ggml-shared-libs && mkdir -p ggml-shared-libs diff --git a/backend/cpp/llama-cpp/Makefile b/backend/cpp/llama-cpp/Makefile index f4d3f3765..bc404ee6c 100644 --- a/backend/cpp/llama-cpp/Makefile +++ b/backend/cpp/llama-cpp/Makefile @@ -1,4 +1,9 @@ +# This pin is auto-bumped nightly by .github/workflows/bump_deps.yaml (the stock +# llama-cpp backend is patch-free, so a naive bump is safe). The paged backend +# (backend/cpp/llama-cpp-localai-paged) does NOT inherit this pin: it owns its +# own LLAMA_VERSION because its vendored patch series would break on a naive +# bump and is advanced only by the manual PIN_SYNC process. LLAMA_VERSION?=9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp # LLAMA_PAGED controls whether the vendored paged-attention patch series diff --git a/backend/cpp/llama-cpp/patches/paged/PIN_BUMP_APPLY_CHECK.md b/backend/cpp/llama-cpp/patches/paged/PIN_BUMP_APPLY_CHECK.md new file mode 100644 index 000000000..bbccaaeb9 --- /dev/null +++ b/backend/cpp/llama-cpp/patches/paged/PIN_BUMP_APPLY_CHECK.md @@ -0,0 +1,107 @@ +# Pin-bump apply-feasibility check: paged patch series vs latest llama.cpp tip + +Date: 2026-06-27. Scope: textual `git apply` feasibility ONLY. No compile, no +bit-exact gate (those require the DGX GPU and the manual PIN_SYNC process). This +report answers one question: if we bumped the pin to the latest upstream tip, +would the vendored paged patch series still apply? 
+ +## Pins + +| | commit | subject | +|---|---|---| +| Current shipped pin | `9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1` | model : Add label for LFM2.5-230M (#25008) | +| Latest master tip | `c299a92c38b6de6a1139617652b66081828648db` | binaries : Improve rpc-server and export-graph-ops names (#25045) | + +Gap: the pin is **23 commits behind** the latest master tip (`ahead_by: 23`, +GitHub compare API). The upstream range touched many files across the tree +(modifications plus at least one rename). + +## Method + +Two fresh shallow clones of `ggml-org/llama.cpp` (the current pin as a baseline, +and the latest master tip as the target). The series +`backend/cpp/llama-cpp/patches/paged/0*.patch` (28 files: 0001-0030, gaps at +0005 and 0027) was applied IN ORDER to each tree. + +Each patch was classified two ways: + +- **`git apply --check -p1`** - this is the BUILD's real apply method + (`backend/cpp/llama-cpp/Makefile`'s `llama.cpp` target does + `git apply --verbose "$p" || exit 1`). This is the only signal that decides + whether a bumped build succeeds. `git apply` natively tolerates `@@` + line-number offsets but NOT context-line changes. +- **GNU `patch -p1` dry-run** - the `prepare.sh` fallback method, used here as a + recovery probe to tell a fixable offset/fuzz from a genuine conflict. + +Running against BOTH pins isolates bump-induced failures from pre-existing, +pin-independent quirks of the shipped series. + +## Result: the bump is CLEAN / offset-tolerant. Zero re-exports needed for the bump. + +The series behaves **identically** under `git apply` on the latest tip and on +the current pin. + +- **27 / 28 patches apply CLEAN under `git apply`** on the latest tip (same 27 + as on the current pin). +- **1 / 28 fails `git apply` (0019) - and it fails identically on the current + pin too**, for a reason that has nothing to do with the bump (see below). Its + code applies fine. 
+- **No new conflicts.** Not a single patch that applied on the current pin fails + on the latest tip. +- **Zero context-fuzz anywhere.** Every recovery the GNU-patch probe reported is + a pure line-number offset, which `git apply` absorbs natively. + +### What the 23-commit jump actually changed + +Only which patches `git apply` has to place at a line offset (context drift from +the 23 upstream commits). All still apply CLEAN; none needs re-export. + +- Offset-placed on the current pin (6): 0009, 0017, 0018, 0020, 0021, 0024. +- Offset-placed on the latest tip (10): 0009, 0015, 0017, 0018, 0020, 0021, + 0024, 0025, 0026, 0028. +- New offsets introduced by the bump (4): **0015, 0025, 0026, 0028** - all + remain CLEAN under `git apply` (line offset only, no fuzz, no conflict). + +### The single `git apply` failure (0019) is pre-existing, not a bump regression + +`0019-qwen35-ssm-decode-fused-gather.patch` fails `git apply` on BOTH pins. The +sole cause is its first hunk, a *modify* hunk against `SSM_DECODE_FIX_RESULTS.md` +- a dev-only doc that exists on the DGX dev tree (from an unshipped docs commit) +but is absent from any clean upstream checkout: + +``` +error: SSM_DECODE_FIX_RESULTS.md: No such file or directory +``` + +`git apply` is atomic, so that one stray hunk rejects the whole patch. 0019's 8 +real code files (ggml.h, ggml-cpu/ops.cpp, ggml-cuda/gated_delta_net.cu, ggml.c, +delta-net-base.cpp, models.h, qwen35.cpp, qwen35moe.cpp) all apply cleanly (the +GNU-patch probe applies them with only line offsets and reports 0 failed code +hunks). This is exactly the pre-existing finding documented in +`PIN_SYNC_9d5d882d.md` ("Pre-existing finding ... NOT introduced by this +pin-sync, NOT fixed here ... a separate cleanup, out of scope"). It is identical +at both pins, so it is NOT introduced by a bump. 
Stripping the stray dev-doc +hunk from 0019 (and the analogous 0021 *create* hunk for +`CONV_STATE_FUSION_RESULTS.md`, which happens to apply fine) is a cleanup that +should happen regardless of any pin bump. + +## Verdict + +A pin bump from `9d5d882d` to the latest tip `c299a92c` is **textually clean**: +the full paged series applies via the build's `git apply` with only benign +line-number offsets and zero conflicts - no patch needs re-export for the bump. +The lone `git apply` failure (0019) is a pre-existing shipped-series defect (a +stray dev-doc hunk), present identically on the current pin, and unrelated to the +bump. + +## Caveats (why this does NOT authorise shipping a bump) + +This is a textual apply check only. It does NOT verify that the patches are still +SEMANTICALLY correct against upstream's 23 refactor commits, that the result +compiles, or that it stays bit-exact. The 23 upstream commits touched many files; +a clean text-apply can still hide a semantic break (e.g. a function the kernel +patches call was refactored). The manual PIN_SYNC process on the DGX GPU +(rebuild + `test-backend-ops` + the greedy-md5 bit-exact gate + a decode bench) +remains the gate before any pin is advanced. This report only establishes that +the bump's textual conflict surface is empty, so that pin-sync would start from a +clean apply.
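
The two-pin comparison described in the report's Method section can be sketched as a small classification step. This is only an illustrative sketch, not code from the repository: the `classify` helper, its bucket names, and the dict layout are hypothetical. The per-patch outcomes are filled in from the figures the report states (28 patches, only 0019 failing `git apply`, on both pins) rather than from a real `git apply --check` run.

```python
# Hypothetical helper: the name, bucket labels and data layout are
# assumptions made for illustration; nothing here exists in the repo.
def classify(baseline_ok, target_ok):
    """Bucket patches by comparing apply results on the baseline pin
    and on the target tip. Each argument maps patch id -> bool."""
    result = {"clean": [], "fixed": [], "pre_existing": [], "regression": []}
    for patch in sorted(baseline_ok):
        base, tip = baseline_ok[patch], target_ok[patch]
        if tip:
            # Applies on the target tip: clean if it also applied before.
            key = "clean" if base else "fixed"
        else:
            # Fails on the target tip: only a regression if it used to apply.
            key = "regression" if base else "pre_existing"
        result[key].append(patch)
    return result


# Series numbering as stated in the report: 0001-0030, gaps at 0005 and 0027.
series = [f"{n:04d}" for n in range(1, 31) if n not in (5, 27)]

# Outcomes as reported: only 0019 fails `git apply`, identically on both pins.
baseline_ok = {p: p != "0019" for p in series}
target_ok = dict(baseline_ok)

report = classify(baseline_ok, target_ok)

# Offset-placed patches as listed in the report; the set difference gives
# the offsets newly introduced by the bump.
offsets_pin = {"0009", "0017", "0018", "0020", "0021", "0024"}
offsets_tip = {"0009", "0015", "0017", "0018", "0020", "0021",
               "0024", "0025", "0026", "0028"}
new_offsets = sorted(offsets_tip - offsets_pin)
```

With the reported outcomes, the `regression` bucket is empty and 0019 lands in `pre_existing`, which is the distinction the report relies on. A patch counts against the bump only if it applied on the baseline pin and fails on the target tip.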