mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-26 01:16:58 -04:00
* feat(llama-cpp): single x86 CPU build via ggml CPU_ALL_VARIANTS
Replace the per-microarch avx/avx2/avx512/fallback multi-binary build on
x86 with a single grpc-server plus the dlopen-able libggml-cpu-*.so set
that ggml's backend registry selects at runtime by probing host CPU
features. One build instead of four, broader microarch coverage (adds
alderlake AVX-VNNI, zen4 AVX512-BF16, sapphirerapids AMX), and the
shell-side /proc/cpuinfo probing in run.sh goes away.
Build/link notes:
- CPU_ALL_VARIANTS requires GGML_BACKEND_DL + BUILD_SHARED_LIBS=ON, so
ggml/llama become shared objects. SHARED_LIBS is now a make variable
(default OFF) so the override survives the recursive sub-make into the
VARIANT build dir instead of being re-clobbered by the base flags.
- The cpu-all target also builds "--target ggml": the per-microarch
backends are runtime-dlopened, not link deps, so they only compile via
ggml's add_dependencies().
- hw_grpc_proto is pinned STATIC. Under BUILD_SHARED_LIBS=ON it would
otherwise become a DSO referencing hidden-visibility symbols in the
static libprotobuf.a, which fails to link ("hidden symbol ... is
referenced by DSO"). Keeping it static links gRPC/protobuf into the
executable while only ggml/llama stay shared, so no PIC or base-image
change is required.
- package.sh bundles the libggml-*.so set into package/lib; ggml finds
them by scanning the bundled ld.so directory (/proc/self/exe), which
run.sh launches from.
Scope: x86 only. arm64/darwin keep the single fallback build. The
ik-llama-cpp / turboquant forks and the other ggml C++ backends are
unchanged; the same recipe applies but is out of scope here.
Validated with a full docker build plus a live inference smoke test:
the model loads, ggml selects the AVX512_BF16 variant on a Zen-class
host, and tokens generate correctly.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(llama-cpp,turboquant): extend CPU_ALL_VARIANTS to arm64 + turboquant
- llama-cpp: x86 AND arm64 now use the single llama-cpp-cpu-all build
(only hipblas keeps the fallback build). ggml's arm64 variant table
(armv8.x / armv9.x, plus apple_m* on darwin) is selected at runtime.
- turboquant: same recipe via a turboquant-cpu-all target. turboquant
copies backend/cpp/llama-cpp's CMakeLists.txt + Makefile per flavor, so
the hw_grpc_proto STATIC fix and the SHARED_LIBS / EXTRA_CMAKE_ARGS
make-vars are inherited; the target just passes SHARED_LIBS=ON, the DL
flags and --target ggml through, then collects the .so set. run.sh and
package.sh updated to ship/select turboquant-cpu-all.
- Makefile lib-collection find now also matches *.dylib (for the darwin
build, which emits dylibs rather than .so).
ik-llama-cpp is intentionally left unchanged: its pinned ggml has no
CPU_ALL_VARIANTS support and its IQK kernels require AVX2, so the
per-microarch dynamic backend set does not apply.
Scope still excludes the darwin packaging wiring (separate change).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(llama-cpp,turboquant): arm64 gcc-14 for SME variants + darwin cpu-all packaging
- arm64: ggml CPU_ALL_VARIANTS builds armv9.2 SME variants whose -march=...+sme
is rejected by the Ubuntu 24.04 default gcc-13. Build the arm64 variants with
gcc-14 (installed in the compile step). The host only selects a variant it
actually supports at runtime, but every variant must still compile.
- darwin: scripts/build/llama-cpp-darwin.sh builds llama-cpp-cpu-all instead of
the fallback binary, keeps Metal (GGML_METAL stays ON; --target ggml also builds
ggml-metal). The per-microarch libggml-cpu-*.dylib are placed in the package
root next to the binary (darwin has no bundled ld.so, so ggml's executable-dir
scan looks there), while the other shared dylibs go in lib/ for DYLD_LIBRARY_PATH.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(llama-cpp-darwin): distribute ggml backends by suffix (.so root, .dylib lib)
ggml emits its loadable backends (per-microarch CPU variants, metal, blas) with a
.so suffix even on darwin, while the core libraries (ggml-base/ggml/llama/
llama-common/mtmd) use .dylib. Split the distribution by suffix: .so DL backends
go in the package root for ggml's executable-directory scan, .dylib core libs go
in lib/ for DYLD_LIBRARY_PATH. The previous .dylib name-pattern matched none of the
variants.
Verified on an M4: ggml loads the apple_m4 CPU variant (SME=1) and Metal, model
loads and generates correct tokens.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(llama-cpp,turboquant): only CPU_ALL_VARIANTS for pure-CPU builds, GPU uses fallback
The previous gate sent every non-hipblas build through llama-cpp-cpu-all, so the
GPU image builds (cublas, sycl_f16/f32, vulkan, nvidia l4t) compiled the whole CPU
microarch variant matrix on top of their already-huge GPU backend - blowing the
build time (the sycl job was only 59% done after 2h11m) - and the arm64 l4t build
failed at `apt-get install gcc-14` (exit 100) on the Jetson base.
Gate on an empty BUILD_TYPE instead: only the pure CPU image (build-type: '' in
.github/backend-matrix.yml) builds the CPU_ALL_VARIANTS set; every GPU build gets a
single fallback CPU grpc-server, since the accelerator does the compute. This also
confines the arm64 gcc-14 step (needed for the armv9.2 SME variants) to the CPU
build, away from the GPU base images.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* docs(llama-cpp): correct run.sh comment for arm64/darwin cpu-all
arm64 and darwin CPU images now also ship llama-cpp-cpu-all (not fallback-only);
only GPU images ship fallback-only. Fix the stale comment to match.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
105 lines
5.7 KiB
Makefile
105 lines
5.7 KiB
Makefile
|
|
# Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
|
|
# Auto-bumped nightly by .github/workflows/bump_deps.yaml.
|
|
TURBOQUANT_VERSION?=7d9715f1f071fa07c7b2ad3dbfd320b314139e65
|
|
LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant
|
|
|
|
CMAKE_ARGS?=
|
|
BUILD_TYPE?=
|
|
NATIVE?=false
|
|
ONEAPI_VARS?=/opt/intel/oneapi/setvars.sh
|
|
TARGET?=--target grpc-server
|
|
JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1)
|
|
ARCH?=$(shell uname -m)
|
|
|
|
CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
|
|
LLAMA_CPP_DIR := $(CURRENT_MAKEFILE_DIR)/../llama-cpp
|
|
|
|
GREEN := \033[0;32m
|
|
RESET := \033[0m
|
|
|
|
# turboquant is a llama.cpp fork. Rather than duplicating grpc-server.cpp / CMakeLists.txt /
|
|
# prepare.sh we reuse the ones in backend/cpp/llama-cpp, and only swap which repo+sha the
|
|
# fetch step pulls. Each flavor target copies ../llama-cpp into a sibling ../turboquant-<flavor>-build
|
|
# directory, then invokes llama-cpp's own build-llama-cpp-grpc-server with LLAMA_REPO/LLAMA_VERSION
|
|
# overridden to point at the fork.
|
|
PATCHES_DIR := $(CURRENT_MAKEFILE_DIR)/patches
|
|
|
|
# Each flavor target:
|
|
# 1. copies backend/cpp/llama-cpp/ (grpc-server.cpp + prepare.sh + CMakeLists.txt + Makefile)
|
|
# into a sibling turboquant-<flavor>-build directory;
|
|
# 2. clones the turboquant fork into turboquant-<flavor>-build/llama.cpp via the copy's
|
|
# own `llama.cpp` target, overriding LLAMA_REPO/LLAMA_VERSION;
|
|
# 3. applies patches from backend/cpp/turboquant/patches/ to the cloned fork sources
|
|
# (needed until the fork catches up with upstream server-context.cpp changes);
|
|
# 4. runs the copy's `grpc-server` target, which produces the binary we copy up as
|
|
# turboquant-<flavor>.
|
|
define turboquant-build
|
|
rm -rf $(CURRENT_MAKEFILE_DIR)/../turboquant-$(1)-build
|
|
cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../turboquant-$(1)-build
|
|
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-$(1)-build purge
|
|
# Augment the copied grpc-server.cpp's KV-cache allow-list with the
|
|
# fork's turbo2/turbo3/turbo4 types. We patch the *copy*, never the
|
|
# original under backend/cpp/llama-cpp/, so the stock llama-cpp build
|
|
# stays compiling against vanilla upstream.
|
|
bash $(CURRENT_MAKEFILE_DIR)/patch-grpc-server.sh $(CURRENT_MAKEFILE_DIR)/../turboquant-$(1)-build/grpc-server.cpp
|
|
$(info $(GREEN)I turboquant build info:$(1)$(RESET))
|
|
LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(TURBOQUANT_VERSION) \
|
|
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-$(1)-build llama.cpp
|
|
bash $(CURRENT_MAKEFILE_DIR)/apply-patches.sh $(CURRENT_MAKEFILE_DIR)/../turboquant-$(1)-build/llama.cpp $(PATCHES_DIR)
|
|
CMAKE_ARGS="$(CMAKE_ARGS) $(2)" TARGET="$(3)" \
|
|
LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(TURBOQUANT_VERSION) \
|
|
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-$(1)-build grpc-server
|
|
cp -rfv $(CURRENT_MAKEFILE_DIR)/../turboquant-$(1)-build/grpc-server turboquant-$(1)
|
|
endef
|
|
|
|
turboquant-avx2:
|
|
$(call turboquant-build,avx2,-DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on,--target grpc-server)
|
|
|
|
turboquant-avx512:
|
|
$(call turboquant-build,avx512,-DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on,--target grpc-server)
|
|
|
|
turboquant-avx:
|
|
$(call turboquant-build,avx,-DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
|
|
|
|
turboquant-fallback:
|
|
$(call turboquant-build,fallback,-DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
|
|
|
|
# Single-build CPU backend via ggml CPU_ALL_VARIANTS (mirrors llama-cpp-cpu-all).
|
|
# turboquant reuses backend/cpp/llama-cpp's CMakeLists.txt (hw_grpc_proto STATIC) and
|
|
# Makefile (SHARED_LIBS make-var + EXTRA_CMAKE_ARGS), so this passes the same overrides
|
|
# through to the copied build: SHARED_LIBS=ON, the DL flags, and --target ggml (which
|
|
# pulls in the per-microarch libggml-cpu-*.so via ggml's add_dependencies). The .so set
|
|
# is collected for package.sh to bundle into package/lib.
|
|
turboquant-cpu-all:
|
|
rm -rf $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build
|
|
cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build
|
|
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build purge
|
|
bash $(CURRENT_MAKEFILE_DIR)/patch-grpc-server.sh $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/grpc-server.cpp
|
|
$(info $(GREEN)I turboquant build info:cpu-all-variants$(RESET))
|
|
LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(TURBOQUANT_VERSION) \
|
|
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build llama.cpp
|
|
bash $(CURRENT_MAKEFILE_DIR)/apply-patches.sh $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/llama.cpp $(PATCHES_DIR)
|
|
SHARED_LIBS=ON EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" TARGET="--target grpc-server --target ggml" \
|
|
LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(TURBOQUANT_VERSION) \
|
|
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build grpc-server
|
|
cp -rfv $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/grpc-server turboquant-cpu-all
|
|
rm -rf ggml-shared-libs && mkdir -p ggml-shared-libs
|
|
find $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/llama.cpp/build \( -name '*.so*' -o -name '*.dylib' \) -exec cp -av {} ggml-shared-libs/ \;
|
|
@echo "Collected ggml shared backends:" && ls -la ggml-shared-libs/
|
|
|
|
turboquant-grpc:
|
|
$(call turboquant-build,grpc,-DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server --target rpc-server)
|
|
|
|
turboquant-rpc-server: turboquant-grpc
|
|
cp -rf $(CURRENT_MAKEFILE_DIR)/../turboquant-grpc-build/llama.cpp/build/bin/rpc-server turboquant-rpc-server
|
|
|
|
package:
|
|
bash package.sh
|
|
|
|
purge:
|
|
rm -rf $(CURRENT_MAKEFILE_DIR)/../turboquant-*-build
|
|
rm -rf turboquant-* package
|
|
|
|
clean: purge
|