fix(llama-cpp,turboquant): only CPU_ALL_VARIANTS for pure-CPU builds, GPU uses fallback

The previous gate sent every non-hipblas build through llama-cpp-cpu-all, so the
GPU image builds (cublas, sycl_f16/f32, vulkan, nvidia l4t) compiled the whole CPU
microarch variant matrix on top of their already-huge GPU backend - blowing the
build time (the sycl job was only 59% done after 2h11m) - and the arm64 l4t build
failed at `apt-get install gcc-14` (exit 100) on the Jetson base.

Gate on an empty BUILD_TYPE instead: only the pure CPU image (build-type: '' in
.github/backend-matrix.yml) builds the CPU_ALL_VARIANTS set; every GPU build gets a
single fallback CPU grpc-server, since the accelerator does the compute. This also
confines the arm64 gcc-14 step (needed for the armv9.2 SME variants) to the CPU
build, away from the GPU base images.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
This commit is contained in:
Ettore Di Giacinto
2026-06-25 07:04:06 +00:00
parent 4e9bb4f879
commit 292c1cab94
2 changed files with 23 additions and 17 deletions

View File

@@ -18,22 +18,26 @@ if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
fi
cd /LocalAI/backend/cpp/llama-cpp
if [ "${BUILD_TYPE}" = "hipblas" ]; then
# ROCm: the GPU does the compute, so a single fallback CPU build is enough.
make llama-cpp-fallback
else
# arm64: ggml's CPU_ALL_VARIANTS table includes armv9.2 SME variants whose
# -march=...+sme is rejected by the Ubuntu 24.04 default gcc-13. gcc-14 accepts it, so
# build the arm64 variants with gcc-14 (the host never *selects* SME unless it has it,
# but every variant must still compile).
if [ -z "${BUILD_TYPE:-}" ]; then
# Pure CPU image (BUILD_TYPE empty): one build with ggml CPU_ALL_VARIANTS replaces the
# per-microarch binaries (x86: avx/avx2/avx512/fallback; arm64: armv8.x/armv9.x). ggml
# dlopens the best libggml-cpu-*.so at runtime by probing host CPU features.
#
# arm64: the CPU_ALL_VARIANTS table includes armv9.2 SME variants whose -march=...+sme is
# rejected by the Ubuntu 24.04 default gcc-13. gcc-14 accepts it, so build the arm64
# variants with it (the host never *selects* SME unless it has it, but every variant must
# still compile).
if [ "${TARGETARCH}" = "arm64" ]; then
apt-get update -qq && apt-get install -y -qq gcc-14 g++-14
export CC=gcc-14 CXX=g++-14
fi
# x86 and arm64: one build with ggml CPU_ALL_VARIANTS replaces the per-microarch
# binaries (x86: avx/avx2/avx512/fallback; arm64: armv8.x/armv9.x). ggml dlopens the
# best libggml-cpu-*.so at runtime by probing host CPU features.
make llama-cpp-cpu-all
else
# GPU build (cublas/hipblas/sycl/vulkan/...): the accelerator does the compute, so a
# single fallback CPU build is enough - no per-microarch CPU variants needed. (This also
# keeps the heavy GPU backend compile from also building the whole CPU variant matrix,
# and avoids the gcc-14 apt step on GPU base images such as nvidia l4t.)
make llama-cpp-fallback
fi
make llama-cpp-grpc
make llama-cpp-rpc-server

View File

@@ -19,17 +19,19 @@ fi
cd /LocalAI/backend/cpp/turboquant
if [ "${BUILD_TYPE}" = "hipblas" ]; then
# ROCm: single fallback CPU build (GPU does the compute).
make turboquant-fallback
else
# arm64: the CPU_ALL_VARIANTS armv9.2 SME variants need gcc-14 (gcc-13 rejects +sme).
if [ -z "${BUILD_TYPE:-}" ]; then
# Pure CPU image: one ggml CPU_ALL_VARIANTS build replaces the per-microarch binaries.
# arm64: the armv9.2 SME variants need gcc-14 (gcc-13 rejects +sme).
if [ "${TARGETARCH}" = "arm64" ]; then
apt-get update -qq && apt-get install -y -qq gcc-14 g++-14
export CC=gcc-14 CXX=g++-14
fi
# x86 and arm64: one ggml CPU_ALL_VARIANTS build replaces the per-microarch binaries.
make turboquant-cpu-all
else
# GPU build (cublas/hipblas/sycl/vulkan/...): single fallback CPU build, the accelerator
# does the compute. Keeps the GPU compile from also building the CPU variant matrix and
# avoids the gcc-14 apt step on GPU base images such as nvidia l4t.
make turboquant-fallback
fi
make turboquant-grpc
make turboquant-rpc-server