From 4ac67d255dcddb6b42b53a48652078058facc9a4 Mon Sep 17 00:00:00 2001
From: "LocalAI [bot]" <139863280+localai-bot@users.noreply.github.com>
Date: Thu, 25 Jun 2026 15:47:03 +0200
Subject: [PATCH] feat: single-build ggml CPU_ALL_VARIANTS for llama-cpp + turboquant (x86/arm64/apple) (#10497)

* feat(llama-cpp): single x86 CPU build via ggml CPU_ALL_VARIANTS

Replace the per-microarch avx/avx2/avx512/fallback multi-binary build on x86
with a single grpc-server plus the dlopen-able libggml-cpu-*.so set that ggml's
backend registry selects at runtime by probing host CPU features. One build
instead of four, broader microarch coverage (adds alderlake AVX-VNNI, zen4
AVX512-BF16, sapphirerapids AMX), and the shell-side /proc/cpuinfo probing in
run.sh goes away.

Build/link notes:
- CPU_ALL_VARIANTS requires GGML_BACKEND_DL + BUILD_SHARED_LIBS=ON, so
  ggml/llama become shared objects. SHARED_LIBS is now a make variable
  (default OFF) so the override survives the recursive sub-make into the
  VARIANT build dir instead of being re-clobbered by the base flags.
- The cpu-all target also builds "--target ggml": the per-microarch backends
  are runtime-dlopened, not link deps, so they only compile via ggml's
  add_dependencies().
- hw_grpc_proto is pinned STATIC. Under BUILD_SHARED_LIBS=ON it would otherwise
  become a DSO referencing hidden-visibility symbols in the static
  libprotobuf.a, which fails to link ("hidden symbol ... is referenced by
  DSO"). Keeping it static links gRPC/protobuf into the executable while only
  ggml/llama stay shared, so no PIC or base-image change is required.
- package.sh bundles the libggml-*.so set into package/lib; ggml finds them by
  scanning the executable's directory, which resolves to the bundled ld.so
  directory (via /proc/self/exe) because run.sh launches through that ld.so.

Scope: x86 only. arm64/darwin keep the single fallback build. The
ik-llama-cpp / turboquant forks and the other ggml C++ backends are unchanged;
the same recipe applies but is out of scope here.

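As a rough illustration of the packaging note above: the per-microarch backends
are found by a directory scan rather than through link dependencies, so a
packaged tree can be sanity-checked by listing what sits in the scanned
directory. The helper name list_cpu_variants is hypothetical and not part of
this change, and the glob only approximates ggml's own discovery logic.

```shell
#!/bin/sh
# Hypothetical helper (not part of this patch): print the per-microarch CPU
# backends found in a directory, one file name per line.
# Assumption: the backends follow the libggml-cpu-*.so naming described above.
list_cpu_variants() {
    dir="$1"
    for lib in "$dir"/libggml-cpu-*.so; do
        # An unmatched glob stays literal, so skip names that are not real files.
        [ -e "$lib" ] || continue
        basename "$lib"
    done
}
```

Running `list_cpu_variants package/lib` after package.sh should print one line
per variant; an empty result suggests the variants were not bundled where the
scan looks.
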
Validated with a full docker build plus a live inference smoke test: the model
loads, ggml selects the AVX512_BF16 variant on a Zen-class host, and tokens
generate correctly.

Signed-off-by: Ettore Di Giacinto
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(llama-cpp,turboquant): extend CPU_ALL_VARIANTS to arm64 + turboquant

- llama-cpp: x86 AND arm64 now use the single llama-cpp-cpu-all build (only
  hipblas keeps the fallback build). ggml's arm64 variant table (armv8.x /
  armv9.x, plus apple_m* on darwin) is selected at runtime.
- turboquant: same recipe via a turboquant-cpu-all target. turboquant copies
  backend/cpp/llama-cpp's CMakeLists.txt + Makefile per flavor, so the
  hw_grpc_proto STATIC fix and the SHARED_LIBS / EXTRA_CMAKE_ARGS make-vars are
  inherited; the target just passes SHARED_LIBS=ON, the DL flags and
  --target ggml through, then collects the .so set. run.sh and package.sh
  updated to ship/select turboquant-cpu-all.
- Makefile lib-collection find now also matches *.dylib (for the darwin build,
  which emits dylibs rather than .so).

ik-llama-cpp is intentionally left unchanged: its pinned ggml has no
CPU_ALL_VARIANTS support and its IQK kernels require AVX2, so the
per-microarch dynamic backend set does not apply. Scope still excludes the
darwin packaging wiring (separate change).

Signed-off-by: Ettore Di Giacinto
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(llama-cpp,turboquant): arm64 gcc-14 for SME variants + darwin cpu-all packaging

- arm64: ggml CPU_ALL_VARIANTS builds armv9.2 SME variants whose
  -march=...+sme is rejected by the Ubuntu 24.04 default gcc-13. Build the
  arm64 variants with gcc-14 (installed in the compile step). The host only
  selects a variant it actually supports at runtime, but every variant must
  still compile.
- darwin: scripts/build/llama-cpp-darwin.sh builds llama-cpp-cpu-all instead of
  the fallback binary, keeps Metal (GGML_METAL stays ON; --target ggml also
  builds ggml-metal).
  The per-microarch libggml-cpu-*.dylib are placed in the package root next to
  the binary (darwin has no bundled ld.so, so ggml's executable-dir scan looks
  there), while the other shared dylibs go in lib/ for DYLD_LIBRARY_PATH.

Signed-off-by: Ettore Di Giacinto
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(llama-cpp-darwin): distribute ggml backends by suffix (.so root, .dylib lib)

ggml emits its loadable backends (per-microarch CPU variants, metal, blas) with
a .so suffix even on darwin, while the core libraries
(ggml-base/ggml/llama/llama-common/mtmd) use .dylib. Split the distribution by
suffix: .so DL backends go in the package root for ggml's executable-directory
scan, .dylib core libs go in lib/ for DYLD_LIBRARY_PATH. The previous .dylib
name-pattern matched none of the variants.

Verified on an M4: ggml loads the apple_m4 CPU variant (SME=1) and Metal, model
loads and generates correct tokens.

Signed-off-by: Ettore Di Giacinto
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(llama-cpp,turboquant): only CPU_ALL_VARIANTS for pure-CPU builds, GPU uses fallback

The previous gate sent every non-hipblas build through llama-cpp-cpu-all, so
the GPU image builds (cublas, sycl_f16/f32, vulkan, nvidia l4t) compiled the
whole CPU microarch variant matrix on top of their already-huge GPU backend -
blowing the build time (the sycl job was only 59% done after 2h11m) - and the
arm64 l4t build failed at `apt-get install gcc-14` (exit 100) on the Jetson
base.

Gate on an empty BUILD_TYPE instead: only the pure CPU image (build-type: '' in
.github/backend-matrix.yml) builds the CPU_ALL_VARIANTS set; every GPU build
gets a single fallback CPU grpc-server, since the accelerator does the compute.
This also confines the arm64 gcc-14 step (needed for the armv9.2 SME variants)
to the CPU build, away from the GPU base images.

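The gate described above reduces to a single test on BUILD_TYPE. A minimal
sketch, assuming a hypothetical helper select_cpu_target that only prints the
make target name (the compile scripts call make directly rather than going
through such a function):

```shell
#!/bin/sh
# Hypothetical helper (not part of this patch): map BUILD_TYPE to the CPU make
# target. An empty or missing argument stands for the pure-CPU image.
select_cpu_target() {
    build_type="${1:-}"
    if [ -z "$build_type" ]; then
        echo "llama-cpp-cpu-all"    # pure CPU: build the full ggml variant set
    else
        echo "llama-cpp-fallback"   # GPU image: one baseline CPU binary
    fi
}
```

For example, `select_cpu_target ""` prints llama-cpp-cpu-all, while
`select_cpu_target cublas` prints llama-cpp-fallback.
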
Signed-off-by: Ettore Di Giacinto Assisted-by: Claude:claude-opus-4-8 [Claude Code] * docs(llama-cpp): correct run.sh comment for arm64/darwin cpu-all arm64 and darwin CPU images now also ship llama-cpp-cpu-all (not fallback-only); only GPU images ship fallback-only. Fix the stale comment to match. Signed-off-by: Ettore Di Giacinto Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto Co-authored-by: Ettore Di Giacinto --- .docker/llama-cpp-compile.sh | 32 ++++++++++++++++--------- .docker/turboquant-compile.sh | 22 ++++++++++------- backend/cpp/llama-cpp/CMakeLists.txt | 9 +++++-- backend/cpp/llama-cpp/Makefile | 36 ++++++++++++++++++++++++++-- backend/cpp/llama-cpp/package.sh | 16 +++++++++++++ backend/cpp/llama-cpp/run.sh | 26 +++++--------------- backend/cpp/turboquant/Makefile | 23 ++++++++++++++++++ backend/cpp/turboquant/package.sh | 9 +++++++ backend/cpp/turboquant/run.sh | 25 ++++--------------- scripts/build/llama-cpp-darwin.sh | 28 +++++++++++++++------- 10 files changed, 154 insertions(+), 72 deletions(-) diff --git a/.docker/llama-cpp-compile.sh b/.docker/llama-cpp-compile.sh index bbc9aa21f..647a1c448 100755 --- a/.docker/llama-cpp-compile.sh +++ b/.docker/llama-cpp-compile.sh @@ -17,19 +17,29 @@ if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then rm -rf /LocalAI/backend/cpp/llama-cpp-*-build fi -if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then - cd /LocalAI/backend/cpp/llama-cpp - make llama-cpp-fallback - make llama-cpp-grpc - make llama-cpp-rpc-server +cd /LocalAI/backend/cpp/llama-cpp +if [ -z "${BUILD_TYPE:-}" ]; then + # Pure CPU image (BUILD_TYPE empty): one build with ggml CPU_ALL_VARIANTS replaces the + # per-microarch binaries (x86: avx/avx2/avx512/fallback; arm64: armv8.x/armv9.x). ggml + # dlopens the best libggml-cpu-*.so at runtime by probing host CPU features. 
+ # + # arm64: the CPU_ALL_VARIANTS table includes armv9.2 SME variants whose -march=...+sme is + # rejected by the Ubuntu 24.04 default gcc-13. gcc-14 accepts it, so build the arm64 + # variants with it (the host never *selects* SME unless it has it, but every variant must + # still compile). + if [ "${TARGETARCH}" = "arm64" ]; then + apt-get update -qq && apt-get install -y -qq gcc-14 g++-14 + export CC=gcc-14 CXX=g++-14 + fi + make llama-cpp-cpu-all else - cd /LocalAI/backend/cpp/llama-cpp - make llama-cpp-avx - make llama-cpp-avx2 - make llama-cpp-avx512 + # GPU build (cublas/hipblas/sycl/vulkan/...): the accelerator does the compute, so a + # single fallback CPU build is enough - no per-microarch CPU variants needed. (This also + # keeps the heavy GPU backend compile from also building the whole CPU variant matrix, + # and avoids the gcc-14 apt step on GPU base images such as nvidia l4t.) make llama-cpp-fallback - make llama-cpp-grpc - make llama-cpp-rpc-server fi +make llama-cpp-grpc +make llama-cpp-rpc-server ccache -s || true diff --git a/.docker/turboquant-compile.sh b/.docker/turboquant-compile.sh index 7468bc1a7..ca6cf2690 100755 --- a/.docker/turboquant-compile.sh +++ b/.docker/turboquant-compile.sh @@ -19,17 +19,21 @@ fi cd /LocalAI/backend/cpp/turboquant -if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then - make turboquant-fallback - make turboquant-grpc - make turboquant-rpc-server +if [ -z "${BUILD_TYPE:-}" ]; then + # Pure CPU image: one ggml CPU_ALL_VARIANTS build replaces the per-microarch binaries. + # arm64: the armv9.2 SME variants need gcc-14 (gcc-13 rejects +sme). + if [ "${TARGETARCH}" = "arm64" ]; then + apt-get update -qq && apt-get install -y -qq gcc-14 g++-14 + export CC=gcc-14 CXX=g++-14 + fi + make turboquant-cpu-all else - make turboquant-avx - make turboquant-avx2 - make turboquant-avx512 + # GPU build (cublas/hipblas/sycl/vulkan/...): single fallback CPU build, the accelerator + # does the compute. 
Keeps the GPU compile from also building the CPU variant matrix and + # avoids the gcc-14 apt step on GPU base images such as nvidia l4t. make turboquant-fallback - make turboquant-grpc - make turboquant-rpc-server fi +make turboquant-grpc +make turboquant-rpc-server ccache -s || true diff --git a/backend/cpp/llama-cpp/CMakeLists.txt b/backend/cpp/llama-cpp/CMakeLists.txt index cb1f5298c..bdf20802a 100644 --- a/backend/cpp/llama-cpp/CMakeLists.txt +++ b/backend/cpp/llama-cpp/CMakeLists.txt @@ -50,8 +50,13 @@ add_custom_command( "${hw_proto}" DEPENDS "${hw_proto}") -# hw_grpc_proto -add_library(hw_grpc_proto +# hw_grpc_proto: force STATIC. Under the CPU_ALL_VARIANTS build BUILD_SHARED_LIBS=ON +# (ggml/llama become shared), which would otherwise make this glue library a DSO. As a +# DSO it references the hidden-visibility symbols in the static libprotobuf.a, which the +# linker cannot satisfy ("hidden symbol ... in libprotobuf.a is referenced by DSO"). +# Keeping it STATIC links protobuf/gRPC directly into the grpc-server executable while +# only ggml/llama stay shared. No effect on the static variants (already BUILD_SHARED_LIBS=OFF). +add_library(hw_grpc_proto STATIC ${hw_grpc_srcs} ${hw_grpc_hdrs} ${hw_proto_srcs} diff --git a/backend/cpp/llama-cpp/Makefile b/backend/cpp/llama-cpp/Makefile index f00fad518..b0fa0423c 100644 --- a/backend/cpp/llama-cpp/Makefile +++ b/backend/cpp/llama-cpp/Makefile @@ -10,8 +10,16 @@ TARGET?=--target grpc-server JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1) ARCH?=$(shell uname -m) -# Disable Shared libs as we are linking on static gRPC and we can't mix shared and static -CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=OFF +# Shared libs default to OFF: we link static gRPC and the avx/avx2/avx512/fallback +# variants are fully static. The CPU_ALL_VARIANTS build flips SHARED_LIBS=ON (ggml/llama +# become shared so the dynamic CPU backends work; gRPC stays static via its imported +# targets). 
SHARED_LIBS is a make variable, not an appended -D, so it survives the +# recursive sub-make into the VARIANT build dir (which re-parses this Makefile) instead +# of being re-clobbered by a second -DBUILD_SHARED_LIBS=OFF. EXTRA_CMAKE_ARGS is the hook +# the CPU_ALL_VARIANTS target uses to inject -DGGML_BACKEND_DL/-DGGML_CPU_ALL_VARIANTS. +SHARED_LIBS?=OFF +EXTRA_CMAKE_ARGS?= +CMAKE_ARGS+=-DBUILD_SHARED_LIBS=$(SHARED_LIBS) -DLLAMA_CURL=OFF $(EXTRA_CMAKE_ARGS) CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST)))) ifeq ($(NATIVE),false) @@ -120,6 +128,30 @@ llama-cpp-fallback: llama.cpp CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) VARIANT="llama-cpp-fallback-build" build-llama-cpp-grpc-server cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-fallback-build/grpc-server llama-cpp-fallback +# Single-build CPU backend using ggml's CPU_ALL_VARIANTS. Produces ONE grpc-server +# plus a set of dlopen-able libggml-cpu-*.so (sandybridge/haswell/skylakex/...) that +# ggml's backend registry selects from at runtime by probing host CPU features. +# Replaces the avx/avx2/avx512/fallback multi-binary build on x86. +# +# CPU_ALL_VARIANTS requires GGML_BACKEND_DL, which requires BUILD_SHARED_LIBS=ON, so we +# pass SHARED_LIBS=ON and the DL flags as make variables (NOT pre-expanded into the +# CMAKE_ARGS env string): command-line make variables propagate through every recursive +# sub-make, so the deepest VARIANT-dir build computes BUILD_SHARED_LIBS=ON consistently. +# Only ggml/llama go shared - gRPC is found via its static imported targets, so the +# grpc-server binary keeps static gRPC and only dynamically links ggml. +# +# TARGET adds "ggml": the per-microarch backends are runtime-dlopened, not link deps of +# grpc-server, so they only build because each is an add_dependencies() of the ggml target. 
+llama-cpp-cpu-all: llama.cpp + cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build + $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build purge + $(info ${GREEN}I llama-cpp build info:cpu-all-variants${RESET}) + $(MAKE) SHARED_LIBS=ON EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" TARGET="--target grpc-server --target ggml" VARIANT="llama-cpp-cpu-all-build" build-llama-cpp-grpc-server + cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build/grpc-server llama-cpp-cpu-all + rm -rf ggml-shared-libs && mkdir -p ggml-shared-libs + find $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build/llama.cpp/build \( -name '*.so*' -o -name '*.dylib' \) -exec cp -av {} ggml-shared-libs/ \; + @echo "Collected ggml shared backends:" && ls -la ggml-shared-libs/ + llama-cpp-grpc: llama.cpp cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp $(CURRENT_MAKEFILE_DIR)/../llama-cpp-grpc-build $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-grpc-build purge diff --git a/backend/cpp/llama-cpp/package.sh b/backend/cpp/llama-cpp/package.sh index d1897e6be..5d2b18c5b 100755 --- a/backend/cpp/llama-cpp/package.sh +++ b/backend/cpp/llama-cpp/package.sh @@ -14,6 +14,22 @@ mkdir -p $CURDIR/package/lib cp -avrf $CURDIR/llama-cpp-* $CURDIR/package/ cp -rfv $CURDIR/run.sh $CURDIR/package/ +# Bundle the ggml shared backends produced by the CPU_ALL_VARIANTS build (libggml-base.so, +# libggml.so, libllama.so and the per-microarch libggml-cpu-*.so), all into package/lib. +# +# Two distinct resolution mechanisms both land here: +# - NEEDED deps (libggml-base/libggml/libllama): resolved by the dynamic linker via the +# LD_LIBRARY_PATH=$CURDIR/lib that run.sh exports. +# - The per-microarch libggml-cpu-*.so are NOT linked; ggml *discovers* them at runtime by +# scanning the executable's own directory (readlink /proc/self/exe). run.sh launches via +# the bundled $CURDIR/lib/ld.so, so /proc/self/exe -> .../lib/ld.so and ggml scans lib/. 
+# That is why the variants must sit in lib/ (next to ld.so), not just on the link path. +# No-op on GPU builds, which ship only the fallback binary and produce no all-variants set. +if [ -d "$CURDIR/ggml-shared-libs" ]; then + echo "Bundling ggml shared backends (CPU_ALL_VARIANTS)..." + cp -avf $CURDIR/ggml-shared-libs/*.so* $CURDIR/package/lib/ +fi + # Detect architecture and copy appropriate libraries if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then # x86_64 architecture diff --git a/backend/cpp/llama-cpp/run.sh b/backend/cpp/llama-cpp/run.sh index 553faeb27..db8498f4b 100755 --- a/backend/cpp/llama-cpp/run.sh +++ b/backend/cpp/llama-cpp/run.sh @@ -12,26 +12,12 @@ grep -e "flags" /proc/cpuinfo | head -1 BINARY=llama-cpp-fallback -if grep -q -e "\savx\s" /proc/cpuinfo ; then - echo "CPU: AVX found OK" - if [ -e $CURDIR/llama-cpp-avx ]; then - BINARY=llama-cpp-avx - fi -fi - -if grep -q -e "\savx2\s" /proc/cpuinfo ; then - echo "CPU: AVX2 found OK" - if [ -e $CURDIR/llama-cpp-avx2 ]; then - BINARY=llama-cpp-avx2 - fi -fi - -# Check avx 512 -if grep -q -e "\savx512f\s" /proc/cpuinfo ; then - echo "CPU: AVX512F found OK" - if [ -e $CURDIR/llama-cpp-avx512 ]; then - BINARY=llama-cpp-avx512 - fi +# CPU images (x86, arm64, darwin) ship a single llama-cpp-cpu-all built with ggml +# CPU_ALL_VARIANTS: ggml's backend registry dlopens the best libggml-cpu-*.so for this +# host, so no shell-side AVX probing. GPU images (cublas/sycl/vulkan/hipblas) ship only +# llama-cpp-fallback (the accelerator does the compute), so fall back to it when absent.
+if [ -e $CURDIR/llama-cpp-cpu-all ]; then + BINARY=llama-cpp-cpu-all fi if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then diff --git a/backend/cpp/turboquant/Makefile b/backend/cpp/turboquant/Makefile index 98f5e4978..a32adf0b6 100644 --- a/backend/cpp/turboquant/Makefile +++ b/backend/cpp/turboquant/Makefile @@ -65,6 +65,29 @@ turboquant-avx: turboquant-fallback: $(call turboquant-build,fallback,-DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server) +# Single-build CPU backend via ggml CPU_ALL_VARIANTS (mirrors llama-cpp-cpu-all). +# turboquant reuses backend/cpp/llama-cpp's CMakeLists.txt (hw_grpc_proto STATIC) and +# Makefile (SHARED_LIBS make-var + EXTRA_CMAKE_ARGS), so this passes the same overrides +# through to the copied build: SHARED_LIBS=ON, the DL flags, and --target ggml (which +# pulls in the per-microarch libggml-cpu-*.so via ggml's add_dependencies). The .so set +# is collected for package.sh to bundle into package/lib. 
+turboquant-cpu-all: + rm -rf $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build + cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build + $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build purge + bash $(CURRENT_MAKEFILE_DIR)/patch-grpc-server.sh $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/grpc-server.cpp + $(info $(GREEN)I turboquant build info:cpu-all-variants$(RESET)) + LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(TURBOQUANT_VERSION) \ + $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build llama.cpp + bash $(CURRENT_MAKEFILE_DIR)/apply-patches.sh $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/llama.cpp $(PATCHES_DIR) + SHARED_LIBS=ON EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" TARGET="--target grpc-server --target ggml" \ + LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(TURBOQUANT_VERSION) \ + $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build grpc-server + cp -rfv $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/grpc-server turboquant-cpu-all + rm -rf ggml-shared-libs && mkdir -p ggml-shared-libs + find $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/llama.cpp/build \( -name '*.so*' -o -name '*.dylib' \) -exec cp -av {} ggml-shared-libs/ \; + @echo "Collected ggml shared backends:" && ls -la ggml-shared-libs/ + turboquant-grpc: $(call turboquant-build,grpc,-DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server --target rpc-server) diff --git a/backend/cpp/turboquant/package.sh b/backend/cpp/turboquant/package.sh index d5402fc31..c4559a68d 100755 --- a/backend/cpp/turboquant/package.sh +++ b/backend/cpp/turboquant/package.sh @@ -14,6 +14,15 @@ mkdir -p $CURDIR/package/lib cp -avrf $CURDIR/turboquant-* $CURDIR/package/ cp -rfv $CURDIR/run.sh $CURDIR/package/ +# Bundle the ggml shared backends from the CPU_ALL_VARIANTS build into package/lib. 
ggml +# discovers the per-microarch libggml-cpu-*.so by scanning the executable directory, which +# (via the bundled lib/ld.so that run.sh launches through) resolves to lib/. See the +# matching comment in backend/cpp/llama-cpp/package.sh. No-op on GPU builds (fallback only). +if [ -d "$CURDIR/ggml-shared-libs" ]; then + echo "Bundling ggml shared backends (CPU_ALL_VARIANTS)..." + cp -avf $CURDIR/ggml-shared-libs/*.so* $CURDIR/package/lib/ +fi + # Detect architecture and copy appropriate libraries if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then # x86_64 architecture diff --git a/backend/cpp/turboquant/run.sh b/backend/cpp/turboquant/run.sh index b0239e237..cd41a0f7f 100755 --- a/backend/cpp/turboquant/run.sh +++ b/backend/cpp/turboquant/run.sh @@ -12,26 +12,11 @@ grep -e "flags" /proc/cpuinfo | head -1 BINARY=turboquant-fallback -if grep -q -e "\savx\s" /proc/cpuinfo ; then - echo "CPU: AVX found OK" - if [ -e $CURDIR/turboquant-avx ]; then - BINARY=turboquant-avx - fi -fi - -if grep -q -e "\savx2\s" /proc/cpuinfo ; then - echo "CPU: AVX2 found OK" - if [ -e $CURDIR/turboquant-avx2 ]; then - BINARY=turboquant-avx2 - fi -fi - -# Check avx 512 -if grep -q -e "\savx512f\s" /proc/cpuinfo ; then - echo "CPU: AVX512F found OK" - if [ -e $CURDIR/turboquant-avx512 ]; then - BINARY=turboquant-avx512 - fi +# CPU images (x86/arm64) ship a single turboquant-cpu-all built with ggml CPU_ALL_VARIANTS: +# ggml's backend registry dlopens the best libggml-cpu-*.so for this host, so no shell-side +# probing. GPU images ship only turboquant-fallback, so fall back to it when cpu-all is absent.
+if [ -e $CURDIR/turboquant-cpu-all ]; then + BINARY=turboquant-cpu-all fi if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then diff --git a/scripts/build/llama-cpp-darwin.sh b/scripts/build/llama-cpp-darwin.sh index 9bdf36875..adec88f04 100644 --- a/scripts/build/llama-cpp-darwin.sh +++ b/scripts/build/llama-cpp-darwin.sh @@ -6,10 +6,11 @@ IMAGE_NAME="${IMAGE_NAME:-localai/llama-cpp-darwin}" pushd backend/cpp/llama-cpp -# make llama-cpp-avx && \ -# make llama-cpp-avx2 && \ -# make llama-cpp-avx512 && \ -make llama-cpp-fallback && \ +# Single build via ggml CPU_ALL_VARIANTS: one binary plus the per-microarch Apple/arm +# backends (apple_m1/m2_m3/m4, armv8.x) that ggml selects at runtime. GGML_METAL stays ON +# and --target ggml also builds ggml-metal (via add_dependencies), so the Metal GPU +# backend is still produced as a loadable module (.so suffix, even on darwin). +make llama-cpp-cpu-all && \ make llama-cpp-grpc && \ make llama-cpp-rpc-server @@ -19,13 +20,24 @@ mkdir -p build/darwin mkdir -p backend-images mkdir -p build/darwin/lib -# cp -rf backend/cpp/llama-cpp/llama-cpp-avx build/darwin/ -# cp -rf backend/cpp/llama-cpp/llama-cpp-avx2 build/darwin/ -# cp -rf backend/cpp/llama-cpp/llama-cpp-avx512 build/darwin/ -cp -rf backend/cpp/llama-cpp/llama-cpp-fallback build/darwin/ +cp -rf backend/cpp/llama-cpp/llama-cpp-cpu-all build/darwin/ cp -rf backend/cpp/llama-cpp/llama-cpp-grpc build/darwin/ cp -rf backend/cpp/llama-cpp/llama-cpp-rpc-server build/darwin/ +# Distribute the shared ggml/llama libraries from the CPU_ALL_VARIANTS build. Unlike the +# old fully-static fallback build, these have @rpath install names, so the otool loop below +# (which only copies deps that exist on disk) will not pick them up. The split is by suffix: +# - ggml emits its loadable backends (per-microarch CPU variants, metal, blas) with a .so +# suffix EVEN ON DARWIN.
These go in the package ROOT next to the binary, because darwin +# run.sh execs the binary directly (no bundled ld.so) so ggml's executable-directory +# scan looks there. +# - the core libraries (libggml-base/libggml/libllama/libllama-common/libmtmd) use the +# platform .dylib suffix and are NEEDED deps; they go in lib/, resolved at load time via +# the DYLD_LIBRARY_PATH=lib that run.sh exports. -a preserves the version symlinks. +SHLIBS=backend/cpp/llama-cpp/ggml-shared-libs +cp -a $SHLIBS/*.so build/darwin/ +cp -a $SHLIBS/*.dylib build/darwin/lib/ + # Set default additional libs only for Darwin on M chips (arm64) if [[ "$(uname -s)" == "Darwin" && "$(uname -m)" == "arm64" ]]; then ADDITIONAL_LIBS=${ADDITIONAL_LIBS:-$(ls /opt/homebrew/Cellar/protobuf/**/lib/libutf8_validity*.dylib 2>/dev/null)}