Files
LocalAI/backend/cpp/llama-cpp-localai-paged/Makefile
Ettore Di Giacinto 2bee7a5ab1 ci(paged): add early-warning canary for vendored llama.cpp paged patches
The paged backend (backend/cpp/llama-cpp-localai-paged) pins its own verified
llama.cpp tip and is excluded from the nightly auto-bumper so a naive bump can
never silently break the shipped build. That exclusion also removed the early
warning of upstream drift. This restores the signal without touching the pin.

Add .github/workflows/llama-cpp-paged-canary.yml (weekly + workflow_dispatch):

- apply-check job (ubuntu-latest, toolchain-free): resolve the latest
  ggml-org/llama.cpp master tip, shallow-checkout it, and apply the full paged
  series 0001-0030 in order with the build's own git-apply method via the new
  shared helper .github/scripts/paged-canary-apply.sh. Red on any apply break.
- compile job (needs apply-check): on the exact tip it validated, build the
  paged backend (cublas) inside the same base-grpc-cuda-12 toolchain and the
  same `make grpc-server` target the shipped build uses, so a red means upstream
  drift, not toolchain noise. nvcc compiles the kernels with no GPU present.

Red here = run a PIN_SYNC (rebase + bit-exact gate + re-export), then bump the
paged Makefile pin. The canary is signal-only: it opens no PR and never moves
the pin, so the shipped build and the dep-bump PRs stay green regardless. It is
fully separate from bump_deps.

The lone pre-existing quirk in the series (patch 0019 carries a stray modify
hunk against the dev-only doc SSM_DECODE_FIX_RESULTS.md, absent from any clean
upstream checkout; git apply is atomic so it rejects the whole patch and
cascades to 0021/0022/0026/0028) is handled path-scoped: the helper excludes
only that dev-doc and still applies 0019's real code hunks atomically, mirroring
prepare.sh's tolerance, so the quirk never false-positives the canary but a
genuine code break in 0019 still turns it red.

Point the existing pin comments in backend/cpp/llama-cpp-localai-paged/Makefile
and .github/workflows/bump_deps.yaml at this canary as the drift signal, and
document it in the PIN_SYNC doc: canary red -> do a pin-sync.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 08:29:09 +00:00

122 lines
6.9 KiB
Makefile

# llama-cpp-localai-paged is LocalAI's paged-attention llama.cpp variant. It
# builds upstream llama.cpp with the LocalAI paged-attention patch series
# (backend/cpp/llama-cpp/patches/paged/) applied on top (LLAMA_PAGED=on). It
# reuses backend/cpp/llama-cpp's grpc-server.cpp / CMakeLists.txt / prepare.sh
# sources verbatim via a thin wrapper.
#
# Pin handling (mirrors the turboquant wrapper, the precedent this is modelled
# on): the paged patch series is hand-verified bit-exact against ONE specific
# llama.cpp tip and re-exported by the manual PIN_SYNC process
# (backend/cpp/llama-cpp/patches/paged/PIN_SYNC_*.md). A naive pin bump would
# move the tip out from under the patches and break `git apply` at build time,
# so this backend OWNS its pin (LLAMA_VERSION below) instead of inheriting the
# auto-bumped stock pin from backend/cpp/llama-cpp/Makefile. The override is
# forced into every copied build via `LLAMA_VERSION=$(LLAMA_VERSION)`. There is
# deliberately NO bump_deps.yaml entry for it: it is advanced ONLY by PIN_SYNC,
# never nightly. (turboquant CAN auto-bump because its fork branch carries the
# patches; the paged series is vendored as .patch files here, so it cannot.)
#
# - NO patch-grpc-server.sh and NO apply-patches.sh: the shared
# grpc-server.cpp already carries the (runtime-gated) paged option hooks,
# and the paged patch series is applied by the copied llama-cpp Makefile's
# own `llama.cpp` target whenever LLAMA_PAGED=on (which we force below).
# Manually pin-synced llama.cpp tip the paged patch series is verified against.
# Decoupled from the auto-bumped stock pin in backend/cpp/llama-cpp/Makefile so
# the nightly llama.cpp bump cannot silently break the vendored paged patches.
# Advance ONLY via the PIN_SYNC process (rebase patches + bit-exact gate +
# re-export), then update this value. See:
# backend/cpp/llama-cpp/patches/paged/PIN_SYNC_*.md
#
# This pin = the manual, verified sync. The signal telling you WHEN to do the
# next sync is the early-warning canary
# (.github/workflows/llama-cpp-paged-canary.yml): weekly it applies + compiles
# this patch series against the latest upstream llama.cpp tip and goes red the
# moment upstream drifts past the patches. Canary red -> run a PIN_SYNC, then
# bump this value. The canary never touches this pin; it is signal-only.
LLAMA_VERSION?=9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1
CMAKE_ARGS?=
BUILD_TYPE?=
NATIVE?=false
ONEAPI_VARS?=/opt/intel/oneapi/setvars.sh
TARGET?=--target grpc-server
JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1)
ARCH?=$(shell uname -m)
CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
LLAMA_CPP_DIR := $(CURRENT_MAKEFILE_DIR)/../llama-cpp
GREEN := \033[0;32m
RESET := \033[0m
# Each flavor target:
# 1. copies backend/cpp/llama-cpp/ (grpc-server.cpp + prepare.sh +
# CMakeLists.txt + Makefile) into a sibling
# llama-cpp-localai-paged-<flavor>-build directory;
# 2. clones the SAME upstream llama.cpp pin into that copy and applies the
# base AND paged patch series via the copy's own `llama.cpp` target with
# LLAMA_PAGED=on;
# 3. runs the copy's `grpc-server` target (LLAMA_PAGED=on) and copies the
# produced binary up as llama-cpp-localai-paged-<flavor>.
# We patch only the *copy*, never the original under backend/cpp/llama-cpp/, so
# the stock llama-cpp build stays untouched.
define paged-build
rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build
cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build purge
$(info $(GREEN)I llama-cpp-localai-paged build info:$(1)$(RESET))
LLAMA_VERSION=$(LLAMA_VERSION) LLAMA_PAGED=on $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build llama.cpp
CMAKE_ARGS="$(CMAKE_ARGS) $(2)" TARGET="$(3)" LLAMA_VERSION=$(LLAMA_VERSION) LLAMA_PAGED=on \
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build grpc-server
cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build/grpc-server llama-cpp-localai-paged-$(1)
endef
llama-cpp-localai-paged-avx2:
$(call paged-build,avx2,-DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on,--target grpc-server)
llama-cpp-localai-paged-avx512:
$(call paged-build,avx512,-DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on,--target grpc-server)
llama-cpp-localai-paged-avx:
$(call paged-build,avx,-DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
llama-cpp-localai-paged-fallback:
$(call paged-build,fallback,-DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
# Single-build CPU backend via ggml CPU_ALL_VARIANTS (mirrors llama-cpp-cpu-all).
# Reuses backend/cpp/llama-cpp's CMakeLists.txt (hw_grpc_proto STATIC) and
# Makefile (SHARED_LIBS make-var + EXTRA_CMAKE_ARGS), so this passes the same
# overrides through to the copied build: SHARED_LIBS=ON, the DL flags, and
# --target ggml (which pulls in the per-microarch libggml-cpu-*.so via ggml's
# add_dependencies). The .so set is collected for package.sh to bundle into
# package/lib.
llama-cpp-localai-paged-cpu-all:
rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build
cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build purge
$(info $(GREEN)I llama-cpp-localai-paged build info:cpu-all-variants$(RESET))
LLAMA_VERSION=$(LLAMA_VERSION) LLAMA_PAGED=on $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build llama.cpp
SHARED_LIBS=ON EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" TARGET="--target grpc-server --target ggml" LLAMA_VERSION=$(LLAMA_VERSION) LLAMA_PAGED=on \
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build grpc-server
cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build/grpc-server llama-cpp-localai-paged-cpu-all
rm -rf ggml-shared-libs && mkdir -p ggml-shared-libs
find $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build/llama.cpp/build \( -name '*.so*' -o -name '*.dylib' \) -exec cp -av {} ggml-shared-libs/ \;
@echo "Collected ggml shared backends:" && ls -la ggml-shared-libs/
llama-cpp-localai-paged-grpc:
$(call paged-build,grpc,-DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server --target rpc-server)
llama-cpp-localai-paged-rpc-server: llama-cpp-localai-paged-grpc
cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-grpc-build/llama.cpp/build/bin/rpc-server llama-cpp-localai-paged-rpc-server
package:
bash package.sh
purge:
rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-*-build
rm -rf llama-cpp-localai-paged-* package
clean: purge