feat: single-build ggml CPU_ALL_VARIANTS for llama-cpp + turboquant (x86/arm64/apple) (#10497)

* feat(llama-cpp): single x86 CPU build via ggml CPU_ALL_VARIANTS

Replace the per-microarch avx/avx2/avx512/fallback multi-binary build on
x86 with a single grpc-server plus the dlopen-able libggml-cpu-*.so set
that ggml's backend registry selects at runtime by probing host CPU
features. One build instead of four, broader microarch coverage (adds
alderlake AVX-VNNI, zen4 AVX512-BF16, sapphirerapids AMX), and the
shell-side /proc/cpuinfo probing in run.sh goes away.
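
With the CPU feature probing moved into ggml, the launcher side reduces
to a file-existence check. A minimal sketch of that selection, assuming
a hypothetical helper named select_binary that takes the package
directory as its argument (run.sh itself inlines this logic):

```shell
#!/bin/sh
# Sketch only: select_binary is a hypothetical helper, not part of run.sh.
# Prefer the single CPU_ALL_VARIANTS binary when it was shipped; otherwise
# keep the fallback binary.
select_binary() {
    dir=$1
    binary=llama-cpp-fallback
    if [ -e "$dir/llama-cpp-cpu-all" ]; then
        binary=llama-cpp-cpu-all
    fi
    printf '%s\n' "$binary"
}
```

No CPU flags are inspected here: the choice of microarch variant happens
later, inside the process, when ggml loads its backends.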

Build/link notes:
- CPU_ALL_VARIANTS requires GGML_BACKEND_DL + BUILD_SHARED_LIBS=ON, so
  ggml/llama become shared objects. SHARED_LIBS is now a make variable
  (default OFF) so the override survives the recursive sub-make into the
  VARIANT build dir instead of being re-clobbered by the base flags.
- The cpu-all target also builds "--target ggml": the per-microarch
  backends are runtime-dlopened, not link deps, so they only compile via
  ggml's add_dependencies().
- hw_grpc_proto is pinned STATIC. Under BUILD_SHARED_LIBS=ON it would
  otherwise become a DSO referencing hidden-visibility symbols in the
  static libprotobuf.a, which fails to link ("hidden symbol ... is
  referenced by DSO"). Keeping it static links gRPC/protobuf into the
  executable while only ggml/llama stay shared, so no PIC or base-image
  change is required.
- package.sh bundles the libggml-*.so set into package/lib. ggml finds
  them by scanning the directory of its own executable (/proc/self/exe);
  because run.sh launches through the bundled lib/ld.so, that directory
  is lib/.
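
The bundling step in the last note can be sketched as below. This is an
illustration under assumptions, not the package.sh code: the helper name
bundle_ggml_backends and its single root-directory argument are
hypothetical, and it uses a plain cp -f rather than the archive flags.

```shell
#!/bin/sh
# Sketch only: bundle_ggml_backends is a hypothetical helper name.
# Copies the collected ggml shared objects from <root>/ggml-shared-libs
# into <root>/package/lib. It is a no-op when the build produced no
# ggml-shared-libs directory (the fully static variants).
bundle_ggml_backends() {
    src=$1/ggml-shared-libs
    dest=$1/package/lib
    [ -d "$src" ] || return 0
    mkdir -p "$dest"
    for lib in "$src"/*.so*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$lib" ] || continue
        cp -f "$lib" "$dest/"
    done
}
```

Putting everything in lib/ serves both lookups at once: the dynamic
linker resolves the NEEDED libraries there, and ggml's directory scan
finds the per-microarch backends next to ld.so.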

Scope: x86 only. arm64/darwin keep the single fallback build. The
ik-llama-cpp / turboquant forks and the other ggml C++ backends are
unchanged; the same recipe applies but is out of scope here.

Validated with a full docker build plus a live inference smoke test:
the model loads, ggml selects the AVX512_BF16 variant on a Zen-class
host, and tokens generate correctly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(llama-cpp,turboquant): extend CPU_ALL_VARIANTS to arm64 + turboquant

- llama-cpp: x86 AND arm64 now use the single llama-cpp-cpu-all build
  (only hipblas keeps the fallback build). ggml's arm64 variant table
  (armv8.x / armv9.x, plus apple_m* on darwin) is selected at runtime.
- turboquant: same recipe via a turboquant-cpu-all target. turboquant
  copies backend/cpp/llama-cpp's CMakeLists.txt + Makefile per flavor, so
  the hw_grpc_proto STATIC fix and the SHARED_LIBS / EXTRA_CMAKE_ARGS
  make-vars are inherited; the target just passes SHARED_LIBS=ON, the DL
  flags and --target ggml through, then collects the .so set. run.sh and
  package.sh updated to ship/select turboquant-cpu-all.
- Makefile lib-collection find now also matches *.dylib (for the darwin
  build, which emits dylibs rather than .so).
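
The lib-collection step from the last item can be sketched as a small
shell function. The find expression mirrors the one in the Makefile
recipe; the helper name collect_shared_libs and its two arguments are
assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: collect_shared_libs is a hypothetical helper name.
# Gathers every shared library under the build tree into one directory,
# matching both ELF-style names (*.so, *.so.N) and darwin *.dylib.
# out_dir is recreated from scratch and must live outside build_dir.
collect_shared_libs() {
    build_dir=$1
    out_dir=$2
    rm -rf "$out_dir" && mkdir -p "$out_dir"
    find "$build_dir" \( -name '*.so*' -o -name '*.dylib' \) \
        -exec cp -a {} "$out_dir/" \;
}
```

Usage would look like `collect_shared_libs llama.cpp/build ggml-shared-libs`,
with both paths being placeholders for the real build and output dirs.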

ik-llama-cpp is intentionally left unchanged: its pinned ggml has no
CPU_ALL_VARIANTS support and its IQK kernels require AVX2, so the
per-microarch dynamic backend set does not apply.

Scope still excludes the darwin packaging wiring (separate change).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(llama-cpp,turboquant): arm64 gcc-14 for SME variants + darwin cpu-all packaging

- arm64: ggml CPU_ALL_VARIANTS builds armv9.2 SME variants whose -march=...+sme
  is rejected by the Ubuntu 24.04 default gcc-13. Build the arm64 variants with
  gcc-14 (installed in the compile step). The host only selects a variant it
  actually supports at runtime, but every variant must still compile.
- darwin: scripts/build/llama-cpp-darwin.sh builds llama-cpp-cpu-all instead of
  the fallback binary and keeps Metal (GGML_METAL stays ON; --target ggml also builds
  ggml-metal). The per-microarch libggml-cpu-*.dylib are placed in the package
  root next to the binary (darwin has no bundled ld.so, so ggml's executable-dir
  scan looks there), while the other shared dylibs go in lib/ for DYLD_LIBRARY_PATH.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(llama-cpp-darwin): distribute ggml backends by suffix (.so root, .dylib lib)

ggml emits its loadable backends (per-microarch CPU variants, metal, blas) with a
.so suffix even on darwin, while the core libraries (ggml-base/ggml/llama/
llama-common/mtmd) use .dylib. Split the distribution by suffix: .so DL backends
go in the package root for ggml's executable-directory scan, .dylib core libs go
in lib/ for DYLD_LIBRARY_PATH. The previous .dylib name-pattern matched none of the
variants.
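
The split by suffix can be sketched as follows. This is not the code in
scripts/build/llama-cpp-darwin.sh: the helper name distribute_darwin_libs
and its source/package arguments are hypothetical, chosen only to show
the routing rule.

```shell
#!/bin/sh
# Sketch only: distribute_darwin_libs is a hypothetical helper name.
# Routes libraries by suffix: .so loadable backends go to the package
# root (next to the binary, where ggml scans), .dylib core libraries go
# to lib/ (resolved through DYLD_LIBRARY_PATH). Other files are ignored.
distribute_darwin_libs() {
    src=$1
    pkg=$2
    mkdir -p "$pkg/lib"
    for f in "$src"/*; do
        [ -e "$f" ] || continue
        case "$f" in
            *.so)    cp -f "$f" "$pkg/" ;;
            *.dylib) cp -f "$f" "$pkg/lib/" ;;
        esac
    done
}
```

Matching on the suffix rather than on a name pattern avoids the earlier
failure mode, where a pattern written for .dylib variants matched none
of the .so backends that ggml actually emits.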

Verified on an M4: ggml loads the apple_m4 CPU variant (SME=1) and Metal, and the
model loads and generates correct tokens.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(llama-cpp,turboquant): only CPU_ALL_VARIANTS for pure-CPU builds, GPU uses fallback

The previous gate sent every non-hipblas build through llama-cpp-cpu-all, so the
GPU image builds (cublas, sycl_f16/f32, vulkan, nvidia l4t) compiled the whole CPU
microarch variant matrix on top of their already-huge GPU backend - blowing up the
build time (the sycl job was only 59% done after 2h11m) - and the arm64 l4t build
failed at `apt-get install gcc-14` (exit 100) on the Jetson base.

Gate on an empty BUILD_TYPE instead: only the pure CPU image (build-type: '' in
.github/backend-matrix.yml) builds the CPU_ALL_VARIANTS set; every GPU build gets a
single fallback CPU grpc-server, since the accelerator does the compute. This also
confines the arm64 gcc-14 step (needed for the armv9.2 SME variants) to the CPU
build, away from the GPU base images.
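
The gate itself is a single emptiness test on BUILD_TYPE. A minimal
sketch, assuming a hypothetical helper cpu_target_for that maps a build
type to the CPU make target (the real gate lives in the build scripts
and may be phrased differently):

```shell
#!/bin/sh
# Sketch only: cpu_target_for is a hypothetical helper name.
# An empty build type means the pure-CPU image, which gets the full
# CPU_ALL_VARIANTS set; any accelerator build type gets the single
# fallback CPU binary.
cpu_target_for() {
    build_type=$1
    if [ -z "$build_type" ]; then
        printf '%s\n' llama-cpp-cpu-all
    else
        printf '%s\n' llama-cpp-fallback
    fi
}
```

Testing for an empty value, rather than excluding hipblas by name, means
any accelerator build type added later defaults to the cheap fallback
build.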

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* docs(llama-cpp): correct run.sh comment for arm64/darwin cpu-all

arm64 and darwin CPU images now also ship llama-cpp-cpu-all (not fallback-only);
only GPU images ship fallback-only. Fix the stale comment to match.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
LocalAI [bot]
2026-06-25 15:47:03 +02:00
committed by GitHub
parent 3a87d9e48f
commit 4ac67d255d
10 changed files with 154 additions and 72 deletions


@@ -50,8 +50,13 @@ add_custom_command(
"${hw_proto}"
DEPENDS "${hw_proto}")
# hw_grpc_proto
add_library(hw_grpc_proto
# hw_grpc_proto: force STATIC. Under the CPU_ALL_VARIANTS build BUILD_SHARED_LIBS=ON
# (ggml/llama become shared), which would otherwise make this glue library a DSO. As a
# DSO it references the hidden-visibility symbols in the static libprotobuf.a, which the
# linker cannot satisfy ("hidden symbol ... in libprotobuf.a is referenced by DSO").
# Keeping it STATIC links protobuf/gRPC directly into the grpc-server executable while
# only ggml/llama stay shared. No effect on the static variants (already BUILD_SHARED_LIBS=OFF).
add_library(hw_grpc_proto STATIC
${hw_grpc_srcs}
${hw_grpc_hdrs}
${hw_proto_srcs}


@@ -10,8 +10,16 @@ TARGET?=--target grpc-server
JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1)
ARCH?=$(shell uname -m)
# Disable Shared libs as we are linking on static gRPC and we can't mix shared and static
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=OFF
# Shared libs default to OFF: we link static gRPC and the avx/avx2/avx512/fallback
# variants are fully static. The CPU_ALL_VARIANTS build flips SHARED_LIBS=ON (ggml/llama
# become shared so the dynamic CPU backends work; gRPC stays static via its imported
# targets). SHARED_LIBS is a make variable, not an appended -D, so it survives the
# recursive sub-make into the VARIANT build dir (which re-parses this Makefile) instead
# of being re-clobbered by a second -DBUILD_SHARED_LIBS=OFF. EXTRA_CMAKE_ARGS is the hook
# the CPU_ALL_VARIANTS target uses to inject -DGGML_BACKEND_DL/-DGGML_CPU_ALL_VARIANTS.
SHARED_LIBS?=OFF
EXTRA_CMAKE_ARGS?=
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=$(SHARED_LIBS) -DLLAMA_CURL=OFF $(EXTRA_CMAKE_ARGS)
CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
ifeq ($(NATIVE),false)
@@ -120,6 +128,30 @@ llama-cpp-fallback: llama.cpp
CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) VARIANT="llama-cpp-fallback-build" build-llama-cpp-grpc-server
cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-fallback-build/grpc-server llama-cpp-fallback
# Single-build CPU backend using ggml's CPU_ALL_VARIANTS. Produces ONE grpc-server
# plus a set of dlopen-able libggml-cpu-*.so (sandybridge/haswell/skylakex/...) that
# ggml's backend registry selects from at runtime by probing host CPU features.
# Replaces the avx/avx2/avx512/fallback multi-binary build on x86.
#
# CPU_ALL_VARIANTS requires GGML_BACKEND_DL, which requires BUILD_SHARED_LIBS=ON, so we
# pass SHARED_LIBS=ON and the DL flags as make variables (NOT pre-expanded into the
# CMAKE_ARGS env string): command-line make variables propagate through every recursive
# sub-make, so the deepest VARIANT-dir build computes BUILD_SHARED_LIBS=ON consistently.
# Only ggml/llama go shared - gRPC is found via its static imported targets, so the
# grpc-server binary keeps static gRPC and only dynamically links ggml.
#
# TARGET adds "ggml": the per-microarch backends are runtime-dlopened, not link deps of
# grpc-server, so they only build because each is an add_dependencies() of the ggml target.
llama-cpp-cpu-all: llama.cpp
cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build purge
$(info ${GREEN}I llama-cpp build info:cpu-all-variants${RESET})
$(MAKE) SHARED_LIBS=ON EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" TARGET="--target grpc-server --target ggml" VARIANT="llama-cpp-cpu-all-build" build-llama-cpp-grpc-server
cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build/grpc-server llama-cpp-cpu-all
rm -rf ggml-shared-libs && mkdir -p ggml-shared-libs
find $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build/llama.cpp/build \( -name '*.so*' -o -name '*.dylib' \) -exec cp -av {} ggml-shared-libs/ \;
@echo "Collected ggml shared backends:" && ls -la ggml-shared-libs/
llama-cpp-grpc: llama.cpp
cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp $(CURRENT_MAKEFILE_DIR)/../llama-cpp-grpc-build
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-grpc-build purge


@@ -14,6 +14,22 @@ mkdir -p $CURDIR/package/lib
cp -avrf $CURDIR/llama-cpp-* $CURDIR/package/
cp -rfv $CURDIR/run.sh $CURDIR/package/
# Bundle the ggml shared backends produced by the CPU_ALL_VARIANTS build (libggml-base.so,
# libggml.so, libllama.so and the per-microarch libggml-cpu-*.so), all into package/lib.
#
# Two distinct resolution mechanisms both land here:
# - NEEDED deps (libggml-base/libggml/libllama): resolved by the dynamic linker via the
# LD_LIBRARY_PATH=$CURDIR/lib that run.sh exports.
# - The per-microarch libggml-cpu-*.so are NOT linked; ggml *discovers* them at runtime by
# scanning the executable's own directory (readlink /proc/self/exe). run.sh launches via
# the bundled $CURDIR/lib/ld.so, so /proc/self/exe -> .../lib/ld.so and ggml scans lib/.
# That is why the variants must sit in lib/ (next to ld.so), not just on the link path.
# No-op on builds (arm64/darwin) that don't produce the all-variants set.
if [ -d "$CURDIR/ggml-shared-libs" ]; then
echo "Bundling ggml shared backends (CPU_ALL_VARIANTS)..."
cp -avf $CURDIR/ggml-shared-libs/*.so* $CURDIR/package/lib/
fi
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
# x86_64 architecture


@@ -12,26 +12,12 @@ grep -e "flags" /proc/cpuinfo | head -1
BINARY=llama-cpp-fallback
if grep -q -e "\savx\s" /proc/cpuinfo ; then
echo "CPU: AVX found OK"
if [ -e $CURDIR/llama-cpp-avx ]; then
BINARY=llama-cpp-avx
fi
fi
if grep -q -e "\savx2\s" /proc/cpuinfo ; then
echo "CPU: AVX2 found OK"
if [ -e $CURDIR/llama-cpp-avx2 ]; then
BINARY=llama-cpp-avx2
fi
fi
# Check avx 512
if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
echo "CPU: AVX512F found OK"
if [ -e $CURDIR/llama-cpp-avx512 ]; then
BINARY=llama-cpp-avx512
fi
# CPU images (x86, arm64, darwin) ship a single llama-cpp-cpu-all built with ggml
# CPU_ALL_VARIANTS: ggml's backend registry dlopens the best libggml-cpu-*.so for this
# host, so no shell-side AVX probing. GPU images (cublas/sycl/vulkan/hipblas) ship only
# llama-cpp-fallback (the accelerator does the compute), so fall back to it when absent.
if [ -e $CURDIR/llama-cpp-cpu-all ]; then
BINARY=llama-cpp-cpu-all
fi
if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then


@@ -65,6 +65,29 @@ turboquant-avx:
turboquant-fallback:
$(call turboquant-build,fallback,-DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
# Single-build CPU backend via ggml CPU_ALL_VARIANTS (mirrors llama-cpp-cpu-all).
# turboquant reuses backend/cpp/llama-cpp's CMakeLists.txt (hw_grpc_proto STATIC) and
# Makefile (SHARED_LIBS make-var + EXTRA_CMAKE_ARGS), so this passes the same overrides
# through to the copied build: SHARED_LIBS=ON, the DL flags, and --target ggml (which
# pulls in the per-microarch libggml-cpu-*.so via ggml's add_dependencies). The .so set
# is collected for package.sh to bundle into package/lib.
turboquant-cpu-all:
rm -rf $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build
cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build purge
bash $(CURRENT_MAKEFILE_DIR)/patch-grpc-server.sh $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/grpc-server.cpp
$(info $(GREEN)I turboquant build info:cpu-all-variants$(RESET))
LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(TURBOQUANT_VERSION) \
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build llama.cpp
bash $(CURRENT_MAKEFILE_DIR)/apply-patches.sh $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/llama.cpp $(PATCHES_DIR)
SHARED_LIBS=ON EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" TARGET="--target grpc-server --target ggml" \
LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(TURBOQUANT_VERSION) \
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build grpc-server
cp -rfv $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/grpc-server turboquant-cpu-all
rm -rf ggml-shared-libs && mkdir -p ggml-shared-libs
find $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/llama.cpp/build \( -name '*.so*' -o -name '*.dylib' \) -exec cp -av {} ggml-shared-libs/ \;
@echo "Collected ggml shared backends:" && ls -la ggml-shared-libs/
turboquant-grpc:
$(call turboquant-build,grpc,-DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server --target rpc-server)


@@ -14,6 +14,15 @@ mkdir -p $CURDIR/package/lib
cp -avrf $CURDIR/turboquant-* $CURDIR/package/
cp -rfv $CURDIR/run.sh $CURDIR/package/
# Bundle the ggml shared backends from the CPU_ALL_VARIANTS build into package/lib. ggml
# discovers the per-microarch libggml-cpu-*.so by scanning the executable directory, which
# (via the bundled lib/ld.so that run.sh launches through) resolves to lib/. See the
# matching comment in backend/cpp/llama-cpp/package.sh. No-op on the fallback/ROCm builds.
if [ -d "$CURDIR/ggml-shared-libs" ]; then
echo "Bundling ggml shared backends (CPU_ALL_VARIANTS)..."
cp -avf $CURDIR/ggml-shared-libs/*.so* $CURDIR/package/lib/
fi
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
# x86_64 architecture


@@ -12,26 +12,11 @@ grep -e "flags" /proc/cpuinfo | head -1
BINARY=turboquant-fallback
if grep -q -e "\savx\s" /proc/cpuinfo ; then
echo "CPU: AVX found OK"
if [ -e $CURDIR/turboquant-avx ]; then
BINARY=turboquant-avx
fi
fi
if grep -q -e "\savx2\s" /proc/cpuinfo ; then
echo "CPU: AVX2 found OK"
if [ -e $CURDIR/turboquant-avx2 ]; then
BINARY=turboquant-avx2
fi
fi
# Check avx 512
if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
echo "CPU: AVX512F found OK"
if [ -e $CURDIR/turboquant-avx512 ]; then
BINARY=turboquant-avx512
fi
# x86/arm64 ship a single turboquant-cpu-all built with ggml CPU_ALL_VARIANTS: ggml's
# backend registry dlopens the best libggml-cpu-*.so for this host, so no shell-side
# probing. ROCm ships only turboquant-fallback, so fall back to it when cpu-all is absent.
if [ -e $CURDIR/turboquant-cpu-all ]; then
BINARY=turboquant-cpu-all
fi
if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then