Compare commits

4 Commits

Author SHA1 Message Date
Ettore Di Giacinto
4e9bb4f879 fix(llama-cpp-darwin): distribute ggml backends by suffix (.so root, .dylib lib)
ggml emits its loadable backends (per-microarch CPU variants, metal, blas) with a
.so suffix even on darwin, while the core libraries (ggml-base/ggml/llama/
llama-common/mtmd) use .dylib. Split the distribution by suffix: .so DL backends
go in the package root for ggml's executable-directory scan, .dylib core libs go
in lib/ for DYLD_LIBRARY_PATH. The previous .dylib name-pattern matched none of the
variants.

Verified on an M4: ggml loads the apple_m4 CPU variant (SME=1) and Metal, model
loads and generates correct tokens.
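
A minimal sketch of the suffix split described above. The helper name
distribute_ggml_libs and its two arguments are hypothetical, chosen for
illustration; the real logic lives in scripts/build/llama-cpp-darwin.sh, which
is not shown here and may be structured differently.

```shell
#!/bin/bash
# Hypothetical helper (not taken from the repository): split the collected
# ggml libraries by suffix.
#   $1 = directory holding the collected libraries (e.g. ggml-shared-libs)
#   $2 = package directory that contains the grpc-server binary
distribute_ggml_libs() {
    local src="$1" pkg="$2" f
    mkdir -p "$pkg/lib"
    for f in "$src"/*; do
        [ -e "$f" ] || continue
        case "$f" in
            # dlopen-able backends: package root, where ggml's
            # executable-directory scan looks
            *.so)    cp -f "$f" "$pkg/" ;;
            # core libraries: lib/, resolved through DYLD_LIBRARY_PATH
            *.dylib) cp -f "$f" "$pkg/lib/" ;;
        esac
    done
}
```

Matching on the suffix rather than on a name pattern means a new backend
variant needs no change to the packaging step, under the assumption that ggml
keeps emitting backends as .so and core libraries as .dylib.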

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
2026-06-24 21:59:29 +00:00
Ettore Di Giacinto
3b47122e54 feat(llama-cpp,turboquant): arm64 gcc-14 for SME variants + darwin cpu-all packaging
- arm64: ggml CPU_ALL_VARIANTS builds armv9.2 SME variants whose -march=...+sme
  is rejected by the Ubuntu 24.04 default gcc-13. Build the arm64 variants with
  gcc-14 (installed in the compile step). The host only selects a variant it
  actually supports at runtime, but every variant must still compile.
- darwin: scripts/build/llama-cpp-darwin.sh builds llama-cpp-cpu-all instead of
  the fallback binary, keeps Metal (GGML_METAL stays ON; --target ggml also builds
  ggml-metal). The per-microarch libggml-cpu-*.dylib are placed in the package
  root next to the binary (darwin has no bundled ld.so, so ggml's executable-dir
  scan looks there), while the other shared dylibs go in lib/ for DYLD_LIBRARY_PATH.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
2026-06-24 21:50:29 +00:00
Ettore Di Giacinto
379fa3e525 feat(llama-cpp,turboquant): extend CPU_ALL_VARIANTS to arm64 + turboquant
- llama-cpp: x86 AND arm64 now use the single llama-cpp-cpu-all build
  (only hipblas keeps the fallback build). ggml's arm64 variant table
  (armv8.x / armv9.x, plus apple_m* on darwin) is selected at runtime.
- turboquant: same recipe via a turboquant-cpu-all target. turboquant
  copies backend/cpp/llama-cpp's CMakeLists.txt + Makefile per flavor, so
  the hw_grpc_proto STATIC fix and the SHARED_LIBS / EXTRA_CMAKE_ARGS
  make-vars are inherited; the target just passes SHARED_LIBS=ON, the DL
  flags and --target ggml through, then collects the .so set. run.sh and
  package.sh updated to ship/select turboquant-cpu-all.
- Makefile lib-collection find now also matches *.dylib (for the darwin
  build, which emits dylibs rather than .so).

ik-llama-cpp is intentionally left unchanged: its pinned ggml has no
CPU_ALL_VARIANTS support and its IQK kernels require AVX2, so the
per-microarch dynamic backend set does not apply.

Scope still excludes the darwin packaging wiring (separate change).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
2026-06-24 21:33:32 +00:00
Ettore Di Giacinto
e47c58656f feat(llama-cpp): single x86 CPU build via ggml CPU_ALL_VARIANTS
Replace the per-microarch avx/avx2/avx512/fallback multi-binary build on
x86 with a single grpc-server plus the dlopen-able libggml-cpu-*.so set
that ggml's backend registry selects at runtime by probing host CPU
features. One build instead of four, broader microarch coverage (adds
alderlake AVX-VNNI, zen4 AVX512-BF16, sapphirerapids AMX), and the
shell-side /proc/cpuinfo probing in run.sh goes away.

Build/link notes:
- CPU_ALL_VARIANTS requires GGML_BACKEND_DL + BUILD_SHARED_LIBS=ON, so
  ggml/llama become shared objects. SHARED_LIBS is now a make variable
  (default OFF) so the override survives the recursive sub-make into the
  VARIANT build dir instead of being re-clobbered by the base flags.
- The cpu-all target also builds "--target ggml": the per-microarch
  backends are runtime-dlopened, not link deps, so they only compile via
  ggml's add_dependencies().
- hw_grpc_proto is pinned STATIC. Under BUILD_SHARED_LIBS=ON it would
  otherwise become a DSO referencing hidden-visibility symbols in the
  static libprotobuf.a, which fails to link ("hidden symbol ... is
  referenced by DSO"). Keeping it static links gRPC/protobuf into the
  executable while only ggml/llama stay shared, so no PIC or base-image
  change is required.
- package.sh bundles the libggml-*.so set into package/lib. ggml finds
  them by scanning the executable's directory (via /proc/self/exe);
  because run.sh launches through the bundled lib/ld.so, that directory
  resolves to lib/.
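
The layout requirement in the last note can be checked with a small script.
The following is only a sketch: check_cpu_all_layout is a hypothetical helper,
not part of package.sh, and it assumes the Linux package layout described
above (bundled loader at lib/ld.so, per-microarch backends beside it).

```shell
#!/bin/bash
# Hypothetical sanity check for a Linux cpu-all package (illustration only).
# Succeeds only if lib/ holds both the bundled loader and at least one
# per-microarch CPU backend, which is the layout ggml's directory scan needs.
#   $1 = package directory
check_cpu_all_layout() {
    local pkg="$1" f
    if [ ! -f "$pkg/lib/ld.so" ]; then
        echo "missing $pkg/lib/ld.so" >&2
        return 1
    fi
    for f in "$pkg"/lib/libggml-cpu-*.so; do
        # An unmatched glob stays literal, so test that the file exists.
        if [ -e "$f" ]; then
            return 0
        fi
    done
    echo "no libggml-cpu-*.so next to ld.so in $pkg/lib" >&2
    return 1
}
```

Calling check_cpu_all_layout "$CURDIR/package" after packaging would catch the
case where the backends are on the link path but not beside ld.so, in which
case ggml would find no CPU variant to load.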

Scope: x86 only. arm64/darwin keep the single fallback build. The
ik-llama-cpp / turboquant forks and the other ggml C++ backends are
unchanged; the same recipe applies but is out of scope here.

Validated with a full docker build plus a live inference smoke test:
the model loads, ggml selects the AVX512_BF16 variant on a Zen-class
host, and tokens generate correctly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
2026-06-24 21:21:03 +00:00
43 changed files with 261 additions and 1024 deletions

View File

@@ -17,19 +17,25 @@ if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
rm -rf /LocalAI/backend/cpp/llama-cpp-*-build
fi
if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
cd /LocalAI/backend/cpp/llama-cpp
cd /LocalAI/backend/cpp/llama-cpp
if [ "${BUILD_TYPE}" = "hipblas" ]; then
# ROCm: the GPU does the compute, so a single fallback CPU build is enough.
make llama-cpp-fallback
make llama-cpp-grpc
make llama-cpp-rpc-server
else
cd /LocalAI/backend/cpp/llama-cpp
make llama-cpp-avx
make llama-cpp-avx2
make llama-cpp-avx512
make llama-cpp-fallback
make llama-cpp-grpc
make llama-cpp-rpc-server
# arm64: ggml's CPU_ALL_VARIANTS table includes armv9.2 SME variants whose
# -march=...+sme is rejected by the Ubuntu 24.04 default gcc-13. gcc-14 accepts it, so
# build the arm64 variants with gcc-14 (the host never *selects* SME unless it has it,
# but every variant must still compile).
if [ "${TARGETARCH}" = "arm64" ]; then
apt-get update -qq && apt-get install -y -qq gcc-14 g++-14
export CC=gcc-14 CXX=g++-14
fi
# x86 and arm64: one build with ggml CPU_ALL_VARIANTS replaces the per-microarch
# binaries (x86: avx/avx2/avx512/fallback; arm64: armv8.x/armv9.x). ggml dlopens the
# best libggml-cpu-*.so at runtime by probing host CPU features.
make llama-cpp-cpu-all
fi
make llama-cpp-grpc
make llama-cpp-rpc-server
ccache -s || true

View File

@@ -19,17 +19,19 @@ fi
cd /LocalAI/backend/cpp/turboquant
if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
if [ "${BUILD_TYPE}" = "hipblas" ]; then
# ROCm: single fallback CPU build (GPU does the compute).
make turboquant-fallback
make turboquant-grpc
make turboquant-rpc-server
else
make turboquant-avx
make turboquant-avx2
make turboquant-avx512
make turboquant-fallback
make turboquant-grpc
make turboquant-rpc-server
# arm64: the CPU_ALL_VARIANTS armv9.2 SME variants need gcc-14 (gcc-13 rejects +sme).
if [ "${TARGETARCH}" = "arm64" ]; then
apt-get update -qq && apt-get install -y -qq gcc-14 g++-14
export CC=gcc-14 CXX=g++-14
fi
# x86 and arm64: one ggml CPU_ALL_VARIANTS build replaces the per-microarch binaries.
make turboquant-cpu-all
fi
make turboquant-grpc
make turboquant-rpc-server
ccache -s || true

View File

@@ -4974,16 +4974,6 @@ includeDarwin:
- backend: "kitten-tts"
tag-suffix: "-metal-darwin-arm64-kitten-tts"
build-type: "mps"
# vLLM on Apple Silicon via vllm-metal (MLX). The install is custom
# (backend/python/vllm/install.sh has a darwin branch); lang stays python so
# backend_build_darwin.yml drives it through build-darwin-python-backend ->
# scripts/build/python-darwin.sh, which runs the backend's install.sh.
- backend: "vllm"
tag-suffix: "-metal-darwin-arm64-vllm"
build-type: "mps"
- backend: "liquid-audio"
tag-suffix: "-metal-darwin-arm64-liquid-audio"
build-type: "mps"
- backend: "piper"
tag-suffix: "-metal-darwin-arm64-piper"
build-type: "metal"
@@ -5000,10 +4990,6 @@ includeDarwin:
tag-suffix: "-metal-darwin-arm64-sherpa-onnx"
build-type: "metal"
lang: "go"
- backend: "supertonic"
tag-suffix: "-metal-darwin-arm64-supertonic"
build-type: "metal"
lang: "go"
- backend: "local-store"
tag-suffix: "-metal-darwin-arm64-local-store"
build-type: "metal"

View File

@@ -1,55 +0,0 @@
#!/bin/bash
# Bump the single vllm-metal pin (VLLM_METAL_VERSION) in the vLLM backend's
# darwin (Apple Silicon) install path. The macOS/Metal build
# (backend/python/vllm/install.sh, Darwin branch) installs vllm-metal, which is
# version-locked to a specific vLLM source release. install.sh derives that vLLM
# version at build time from vllm-metal's own installer (`vllm_v=`) at the pinned
# tag, so there is only ONE value to bump here -- mirroring bump_vllm_wheel.sh,
# which bumps the Linux cu130 wheel pin.
#
# This deliberately tracks vllm-project/vllm-metal, NOT vllm-project/vllm: the
# darwin build can only use the exact vLLM version vllm-metal supports, so it may
# lag the Linux pin (requirements-cublas13-after.txt) until vllm-metal catches up.
set -xe
REPO=$1 # vllm-project/vllm-metal
FILE=$2 # backend/python/vllm/install.sh
VAR=$3 # VLLM_METAL_VERSION (used for the workflow's output file names)
if [ -z "$FILE" ] || [ -z "$REPO" ] || [ -z "$VAR" ]; then
echo "usage: $0 <repo> <install-file> <var-name>" >&2
exit 1
fi
# vllm-metal ships frequent dev releases, all flagged as non-prerelease, so
# /releases/latest returns the newest one (with its cp312 wheel asset).
LATEST_TAG=$(curl -sS -H "Accept: application/vnd.github+json" \
"https://api.github.com/repos/$REPO/releases/latest" \
| python3 -c "import json,sys; print(json.load(sys.stdin)['tag_name'])")
# The coupled vLLM source version lives in vllm-metal's installer at that tag.
NEW_VLLM_VERSION=$(curl -fsSL \
"https://raw.githubusercontent.com/$REPO/$LATEST_TAG/install.sh" \
| grep -oE 'vllm_v="[0-9]+\.[0-9]+\.[0-9]+"' | head -1 | cut -d'"' -f2)
if [ -z "$LATEST_TAG" ] || [ -z "$NEW_VLLM_VERSION" ]; then
echo "Could not resolve vllm-metal tag ($LATEST_TAG) or its vllm_v ($NEW_VLLM_VERSION)." >&2
exit 1
fi
set +e
CURRENT_TAG=$(grep -oE 'VLLM_METAL_VERSION="[^"]*"' "$FILE" | head -1 | cut -d'"' -f2)
set -e
# Rewrite the single pin. install.sh derives VLLM_VERSION from this tag at build
# time, so there is nothing else to touch. peter-evans/create-pull-request opens
# no PR on a clean tree, so a no-op rewrite (already current) is safe.
sed -i "$FILE" \
-e "s|VLLM_METAL_VERSION=\"[^\"]*\"|VLLM_METAL_VERSION=\"$LATEST_TAG\"|"
if [ -z "$CURRENT_TAG" ]; then
echo "Could not find VLLM_METAL_VERSION=\"...\" in $FILE." >&2
exit 0
fi
echo "vllm-metal ${CURRENT_TAG} -> ${LATEST_TAG} (builds vLLM ${NEW_VLLM_VERSION}): https://github.com/$REPO/releases/tag/${LATEST_TAG}" >> "${VAR}_message.txt"
echo "${LATEST_TAG}" >> "${VAR}_commit.txt"

View File

@@ -154,39 +154,3 @@ jobs:
branch: "update/VLLM_VERSION"
body: ${{ steps.bump.outputs.message }}
signoff: true
bump-vllm-metal:
# The darwin (Apple Silicon) vLLM build installs vllm-metal, which is locked
# to a specific vLLM source release. install.sh pins both VLLM_METAL_VERSION
# (the wheel release) and VLLM_VERSION (the vLLM it builds against); this job
# tracks vllm-project/vllm-metal and rewrites both atomically. Separate from
# bump-vllm-wheel because darwin follows vllm-metal, not vllm/vllm latest.
if: github.repository == 'mudler/LocalAI'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v7
- name: Bump vllm-metal pin 🔧
id: bump
run: |
bash .github/bump_vllm_metal.sh vllm-project/vllm-metal backend/python/vllm/install.sh VLLM_METAL_VERSION
{
echo 'message<<EOF'
cat "VLLM_METAL_VERSION_message.txt"
echo EOF
} >> "$GITHUB_OUTPUT"
{
echo 'commit<<EOF'
cat "VLLM_METAL_VERSION_commit.txt"
echo EOF
} >> "$GITHUB_OUTPUT"
rm -rfv VLLM_METAL_VERSION_message.txt VLLM_METAL_VERSION_commit.txt
- name: Create Pull Request
uses: peter-evans/create-pull-request@v8
with:
token: ${{ secrets.UPDATE_BOT_TOKEN }}
push-to-fork: ci-forks/LocalAI
commit-message: ':arrow_up: Update vllm-project/vllm-metal (darwin)'
title: 'chore: :arrow_up: Update vllm-metal (darwin) to `${{ steps.bump.outputs.commit }}`'
branch: "update/VLLM_METAL_VERSION"
body: ${{ steps.bump.outputs.message }}
signoff: true

View File

@@ -50,8 +50,13 @@ add_custom_command(
"${hw_proto}"
DEPENDS "${hw_proto}")
# hw_grpc_proto
add_library(hw_grpc_proto
# hw_grpc_proto: force STATIC. Under the CPU_ALL_VARIANTS build BUILD_SHARED_LIBS=ON
# (ggml/llama become shared), which would otherwise make this glue library a DSO. As a
# DSO it references the hidden-visibility symbols in the static libprotobuf.a, which the
# linker cannot satisfy ("hidden symbol ... in libprotobuf.a is referenced by DSO").
# Keeping it STATIC links protobuf/gRPC directly into the grpc-server executable while
# only ggml/llama stay shared. No effect on the static variants (already BUILD_SHARED_LIBS=OFF).
add_library(hw_grpc_proto STATIC
${hw_grpc_srcs}
${hw_grpc_hdrs}
${hw_proto_srcs}

View File

@@ -10,8 +10,16 @@ TARGET?=--target grpc-server
JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1)
ARCH?=$(shell uname -m)
# Disable Shared libs as we are linking on static gRPC and we can't mix shared and static
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=OFF
# Shared libs default to OFF: we link static gRPC and the avx/avx2/avx512/fallback
# variants are fully static. The CPU_ALL_VARIANTS build flips SHARED_LIBS=ON (ggml/llama
# become shared so the dynamic CPU backends work; gRPC stays static via its imported
# targets). SHARED_LIBS is a make variable, not an appended -D, so it survives the
# recursive sub-make into the VARIANT build dir (which re-parses this Makefile) instead
# of being re-clobbered by a second -DBUILD_SHARED_LIBS=OFF. EXTRA_CMAKE_ARGS is the hook
# the CPU_ALL_VARIANTS target uses to inject -DGGML_BACKEND_DL/-DGGML_CPU_ALL_VARIANTS.
SHARED_LIBS?=OFF
EXTRA_CMAKE_ARGS?=
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=$(SHARED_LIBS) -DLLAMA_CURL=OFF $(EXTRA_CMAKE_ARGS)
CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
ifeq ($(NATIVE),false)
@@ -120,6 +128,30 @@ llama-cpp-fallback: llama.cpp
CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) VARIANT="llama-cpp-fallback-build" build-llama-cpp-grpc-server
cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-fallback-build/grpc-server llama-cpp-fallback
# Single-build CPU backend using ggml's CPU_ALL_VARIANTS. Produces ONE grpc-server
# plus a set of dlopen-able libggml-cpu-*.so (sandybridge/haswell/skylakex/...) that
# ggml's backend registry selects from at runtime by probing host CPU features.
# Replaces the avx/avx2/avx512/fallback multi-binary build on x86.
#
# CPU_ALL_VARIANTS requires GGML_BACKEND_DL, which requires BUILD_SHARED_LIBS=ON, so we
# pass SHARED_LIBS=ON and the DL flags as make variables (NOT pre-expanded into the
# CMAKE_ARGS env string): command-line make variables propagate through every recursive
# sub-make, so the deepest VARIANT-dir build computes BUILD_SHARED_LIBS=ON consistently.
# Only ggml/llama go shared - gRPC is found via its static imported targets, so the
# grpc-server binary keeps static gRPC and only dynamically links ggml.
#
# TARGET adds "ggml": the per-microarch backends are runtime-dlopened, not link deps of
# grpc-server, so they only build because each is an add_dependencies() of the ggml target.
llama-cpp-cpu-all: llama.cpp
cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build purge
$(info ${GREEN}I llama-cpp build info:cpu-all-variants${RESET})
$(MAKE) SHARED_LIBS=ON EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" TARGET="--target grpc-server --target ggml" VARIANT="llama-cpp-cpu-all-build" build-llama-cpp-grpc-server
cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build/grpc-server llama-cpp-cpu-all
rm -rf ggml-shared-libs && mkdir -p ggml-shared-libs
find $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build/llama.cpp/build \( -name '*.so*' -o -name '*.dylib' \) -exec cp -av {} ggml-shared-libs/ \;
@echo "Collected ggml shared backends:" && ls -la ggml-shared-libs/
llama-cpp-grpc: llama.cpp
cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp $(CURRENT_MAKEFILE_DIR)/../llama-cpp-grpc-build
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-grpc-build purge

View File

@@ -14,6 +14,22 @@ mkdir -p $CURDIR/package/lib
cp -avrf $CURDIR/llama-cpp-* $CURDIR/package/
cp -rfv $CURDIR/run.sh $CURDIR/package/
# Bundle the ggml shared backends produced by the CPU_ALL_VARIANTS build (libggml-base.so,
# libggml.so, libllama.so and the per-microarch libggml-cpu-*.so), all into package/lib.
#
# Two distinct resolution mechanisms both land here:
# - NEEDED deps (libggml-base/libggml/libllama): resolved by the dynamic linker via the
# LD_LIBRARY_PATH=$CURDIR/lib that run.sh exports.
# - The per-microarch libggml-cpu-*.so are NOT linked; ggml *discovers* them at runtime by
# scanning the executable's own directory (readlink /proc/self/exe). run.sh launches via
# the bundled $CURDIR/lib/ld.so, so /proc/self/exe -> .../lib/ld.so and ggml scans lib/.
# That is why the variants must sit in lib/ (next to ld.so), not just on the link path.
# No-op on builds (arm64/darwin) that don't produce the all-variants set.
if [ -d "$CURDIR/ggml-shared-libs" ]; then
echo "Bundling ggml shared backends (CPU_ALL_VARIANTS)..."
cp -avf $CURDIR/ggml-shared-libs/*.so* $CURDIR/package/lib/
fi
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
# x86_64 architecture

View File

@@ -12,26 +12,11 @@ grep -e "flags" /proc/cpuinfo | head -1
BINARY=llama-cpp-fallback
if grep -q -e "\savx\s" /proc/cpuinfo ; then
echo "CPU: AVX found OK"
if [ -e $CURDIR/llama-cpp-avx ]; then
BINARY=llama-cpp-avx
fi
fi
if grep -q -e "\savx2\s" /proc/cpuinfo ; then
echo "CPU: AVX2 found OK"
if [ -e $CURDIR/llama-cpp-avx2 ]; then
BINARY=llama-cpp-avx2
fi
fi
# Check avx 512
if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
echo "CPU: AVX512F found OK"
if [ -e $CURDIR/llama-cpp-avx512 ]; then
BINARY=llama-cpp-avx512
fi
# x86 ships a single llama-cpp-cpu-all built with ggml CPU_ALL_VARIANTS: ggml's backend
# registry dlopens the best libggml-cpu-*.so for this host, so no shell-side AVX probing.
# arm64/darwin builds ship only llama-cpp-fallback, so fall back to it when cpu-all absent.
if [ -e $CURDIR/llama-cpp-cpu-all ]; then
BINARY=llama-cpp-cpu-all
fi
if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then

View File

@@ -65,6 +65,29 @@ turboquant-avx:
turboquant-fallback:
$(call turboquant-build,fallback,-DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
# Single-build CPU backend via ggml CPU_ALL_VARIANTS (mirrors llama-cpp-cpu-all).
# turboquant reuses backend/cpp/llama-cpp's CMakeLists.txt (hw_grpc_proto STATIC) and
# Makefile (SHARED_LIBS make-var + EXTRA_CMAKE_ARGS), so this passes the same overrides
# through to the copied build: SHARED_LIBS=ON, the DL flags, and --target ggml (which
# pulls in the per-microarch libggml-cpu-*.so via ggml's add_dependencies). The .so set
# is collected for package.sh to bundle into package/lib.
turboquant-cpu-all:
rm -rf $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build
cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build purge
bash $(CURRENT_MAKEFILE_DIR)/patch-grpc-server.sh $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/grpc-server.cpp
$(info $(GREEN)I turboquant build info:cpu-all-variants$(RESET))
LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(TURBOQUANT_VERSION) \
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build llama.cpp
bash $(CURRENT_MAKEFILE_DIR)/apply-patches.sh $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/llama.cpp $(PATCHES_DIR)
SHARED_LIBS=ON EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" TARGET="--target grpc-server --target ggml" \
LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(TURBOQUANT_VERSION) \
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build grpc-server
cp -rfv $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/grpc-server turboquant-cpu-all
rm -rf ggml-shared-libs && mkdir -p ggml-shared-libs
find $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/llama.cpp/build \( -name '*.so*' -o -name '*.dylib' \) -exec cp -av {} ggml-shared-libs/ \;
@echo "Collected ggml shared backends:" && ls -la ggml-shared-libs/
turboquant-grpc:
$(call turboquant-build,grpc,-DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server --target rpc-server)

View File

@@ -14,6 +14,15 @@ mkdir -p $CURDIR/package/lib
cp -avrf $CURDIR/turboquant-* $CURDIR/package/
cp -rfv $CURDIR/run.sh $CURDIR/package/
# Bundle the ggml shared backends from the CPU_ALL_VARIANTS build into package/lib. ggml
# discovers the per-microarch libggml-cpu-*.so by scanning the executable directory, which
# (via the bundled lib/ld.so that run.sh launches through) resolves to lib/. See the
# matching comment in backend/cpp/llama-cpp/package.sh. No-op on the fallback/ROCm builds.
if [ -d "$CURDIR/ggml-shared-libs" ]; then
echo "Bundling ggml shared backends (CPU_ALL_VARIANTS)..."
cp -avf $CURDIR/ggml-shared-libs/*.so* $CURDIR/package/lib/
fi
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
# x86_64 architecture

View File

@@ -12,26 +12,11 @@ grep -e "flags" /proc/cpuinfo | head -1
BINARY=turboquant-fallback
if grep -q -e "\savx\s" /proc/cpuinfo ; then
echo "CPU: AVX found OK"
if [ -e $CURDIR/turboquant-avx ]; then
BINARY=turboquant-avx
fi
fi
if grep -q -e "\savx2\s" /proc/cpuinfo ; then
echo "CPU: AVX2 found OK"
if [ -e $CURDIR/turboquant-avx2 ]; then
BINARY=turboquant-avx2
fi
fi
# Check avx 512
if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
echo "CPU: AVX512F found OK"
if [ -e $CURDIR/turboquant-avx512 ]; then
BINARY=turboquant-avx512
fi
# x86/arm64 ship a single turboquant-cpu-all built with ggml CPU_ALL_VARIANTS: ggml's
# backend registry dlopens the best libggml-cpu-*.so for this host, so no shell-side
# probing. ROCm ships only turboquant-fallback, so fall back to it when cpu-all is absent.
if [ -e $CURDIR/turboquant-cpu-all ]; then
BINARY=turboquant-cpu-all
fi
if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then

View File

@@ -16,7 +16,6 @@ import (
"os"
"path/filepath"
"regexp"
"runtime"
"strings"
"time"
"unicode"
@@ -944,13 +943,7 @@ func InitializeONNXRuntime() error {
}
}
if libPath == "" {
// LocalAI: default to the platform-native shared library
// extension when nothing else is found (dyld vs ld.so).
if runtime.GOOS == "darwin" {
libPath = "/usr/local/lib/libonnxruntime.dylib"
} else {
libPath = "/usr/local/lib/libonnxruntime.so"
}
libPath = "/usr/local/lib/libonnxruntime.so"
}
}
ort.SetSharedLibraryPath(libPath)

View File

@@ -32,10 +32,6 @@ elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ $(uname -s) = "Darwin" ]; then
# macOS: dyld resolves the bundled .dylib via DYLD_LIBRARY_PATH (set in
# run.sh); there is no ld.so loader nor glibc to bundle.
echo "Detected Darwin"
else
echo "Error: Could not detect architecture"
exit 1

View File

@@ -3,19 +3,12 @@ set -ex
CURDIR=$(dirname "$(realpath $0)")
if [ "$(uname)" = "Darwin" ]; then
# macOS uses dyld: there is no ld.so loader, and the search path env
# var is DYLD_LIBRARY_PATH. ONNX Runtime ships as a .dylib here.
export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
export ONNXRUNTIME_LIB_PATH=$CURDIR/lib/libonnxruntime.dylib
else
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
export ONNXRUNTIME_LIB_PATH=$CURDIR/lib/libonnxruntime.so
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
export ONNXRUNTIME_LIB_PATH=$CURDIR/lib/libonnxruntime.so
if [ -f $CURDIR/lib/ld.so ]; then
echo "Using lib/ld.so"
exec $CURDIR/lib/ld.so $CURDIR/supertonic "$@"
fi
if [ -f $CURDIR/lib/ld.so ]; then
echo "Using lib/ld.so"
exec $CURDIR/lib/ld.so $CURDIR/supertonic "$@"
fi
exec $CURDIR/supertonic "$@"

View File

@@ -645,7 +645,6 @@
nvidia-cuda-13: "cuda13-vllm"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm"
cpu: "cpu-vllm"
metal: "metal-vllm"
- &sglang
name: "sglang"
license: apache-2.0
@@ -1285,7 +1284,6 @@
nvidia-cuda-13: "cuda13-liquid-audio"
nvidia-cuda-12: "cuda12-liquid-audio"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-liquid-audio"
metal: "metal-liquid-audio"
icon: https://cdn-avatars.huggingface.co/v1/production/uploads/61b8e2ba285851687028d395/7_6D7rWrLxp2hb6OHSV1p.png
- &qwen-tts
urls:
@@ -1571,7 +1569,6 @@
- TTS
capabilities:
default: "cpu-supertonic"
metal: "metal-supertonic"
- !!merge <<: *neutts
name: "neutts-development"
capabilities:
@@ -2930,17 +2927,6 @@
nvidia-cuda-13: "cuda13-vllm-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm-development"
cpu: "cpu-vllm-development"
metal: "metal-vllm-development"
- !!merge <<: *vllm
name: "metal-vllm"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-vllm"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-vllm
- !!merge <<: *vllm
name: "metal-vllm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-vllm"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-vllm
- !!merge <<: *vllm
name: "cuda12-vllm"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-vllm"
@@ -4626,7 +4612,6 @@
nvidia-cuda-13: "cuda13-liquid-audio-development"
nvidia-cuda-12: "cuda12-liquid-audio-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-liquid-audio-development"
metal: "metal-liquid-audio-development"
- !!merge <<: *liquid-audio
name: "cpu-liquid-audio"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-liquid-audio"
@@ -4637,16 +4622,6 @@
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-liquid-audio"
mirrors:
- localai/localai-backends:master-cpu-liquid-audio
- !!merge <<: *liquid-audio
name: "metal-liquid-audio"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-liquid-audio"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-liquid-audio
- !!merge <<: *liquid-audio
name: "metal-liquid-audio-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-liquid-audio"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-liquid-audio
- !!merge <<: *liquid-audio
name: "cuda12-liquid-audio"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-liquid-audio"
@@ -5509,7 +5484,6 @@
name: "supertonic-development"
capabilities:
default: "cpu-supertonic-development"
metal: "metal-supertonic-development"
- !!merge <<: *supertonic
name: "cpu-supertonic"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-supertonic"
@@ -5520,13 +5494,3 @@
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-supertonic"
mirrors:
- localai/localai-backends:master-cpu-supertonic
- !!merge <<: *supertonic
name: "metal-supertonic"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-supertonic"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-supertonic
- !!merge <<: *supertonic
name: "metal-supertonic-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-supertonic"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-supertonic

View File

@@ -14,11 +14,5 @@ else
fi
# liquid-audio's torch wheels are large; allow upgrades to satisfy transitive pins
EXTRA_PIP_INSTALL_FLAGS+=" --upgrade"
# --index-strategy is a uv-only flag. The darwin/MPS build installs with pip
# (USE_PIP=true in scripts/build/python-darwin.sh), which rejects it. Only add
# it on the uv path; Linux/CUDA resolution is unchanged.
if [ "x${USE_PIP:-}" != "xtrue" ]; then
EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-first-match"
fi
EXTRA_PIP_INSTALL_FLAGS+=" --upgrade --index-strategy=unsafe-first-match"
installRequirements

View File

@@ -1,4 +1,3 @@
# MPS (Apple Silicon / Metal) build profile - installed by the darwin CI job.
torch>=2.8.0
torchaudio>=2.8.0
torchcodec>=0.9.1

View File

@@ -457,14 +457,9 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
except Exception:
pass
_pl = getattr(last_output, "prompt_logprobs", None) if last_output is not None else None
# Some engines accept the prompt_logprobs request but return a
# list of all-None entries instead of computing them (observed
# with vllm-metal's MLX backend on macOS). Treat that as
# unsupported rather than silently scoring every candidate as 0.
if not _pl or all(e is None for e in _pl):
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details("This backend did not return prompt_logprobs; scoring is unsupported on this engine (e.g. vllm-metal / MLX on macOS).")
if last_output is None or not getattr(last_output, "prompt_logprobs", None):
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details("vLLM did not return prompt_logprobs")
return backend_pb2.ScoreResponse()
prompt_logprobs = last_output.prompt_logprobs

View File

@@ -43,24 +43,6 @@ if [ "x${BUILD_PROFILE}" == "xcublas13" ]; then
EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi
# Apple Silicon (Metal/MLX) via vllm-metal.
# vllm-metal (github.com/vllm-project/vllm-metal) brings vLLM to macOS on Apple
# Silicon: it registers through vLLM's platform-plugin entry point
# (metal -> vllm_metal:register), MetalPlatform activates, and the vLLM v1
# AsyncLLM engine runs on the GPU through MLX. LocalAI's backend.py is UNCHANGED
# on darwin — AsyncEngineArgs(...) -> AsyncLLMEngine.from_engine_args transparently
# resolves to the MLX engine (proven on a real M4 / macOS 26.5 against Qwen3-0.6B).
#
# vllm-metal REQUIRES Python 3.12, so force the portable CPython before the venv
# is created (ensureVenv reads PYTHON_VERSION/PYTHON_PATCH/PY_STANDALONE_TAG).
# The patch + standalone tag mirror the l4t13 cp312 pin — a known-good
# python-build-standalone release that also ships an aarch64-apple-darwin asset.
if [ "$(uname -s)" = "Darwin" ]; then
PYTHON_VERSION="3.12"
PYTHON_PATCH="12"
PY_STANDALONE_TAG="20251120"
fi
# JetPack 7 / L4T arm64 vllm + torch wheels come straight from PyPI now
# (torch 2.11+ ships aarch64 + cu130 manylinux wheels and vllm 0.20+ ships
# an aarch64 wheel pinned to that torch). They're cp312-only, so bump the
@@ -75,87 +57,11 @@ if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
PY_STANDALONE_TAG="20251120"
fi
# ===================== Apple Silicon (Metal/MLX) =====================
# Reproduce vllm-metal's upstream installer
# (curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm-metal/main/install.sh)
# but INTO LocalAI's managed venv (ensureVenv) instead of a throwaway
# ~/.venv-vllm-metal, so the backend integrates with LocalAI's venv lifecycle
# (portable CPython, _makeVenvPortable relocation, runtime activation). The
# normal CUDA/CPU installRequirements is skipped on darwin — there is no
# macOS/arm64 vLLM wheel on PyPI; vLLM is built from source and the MLX engine
# is layered on by the vllm-metal wheel.
if [ "$(uname -s)" = "Darwin" ]; then
# Create/activate the portable 3.12 venv. On darwin USE_PIP=true and
# PORTABLE_PYTHON=true (set by scripts/build/python-darwin.sh), so this is a
# `python -m venv` based, relocatable venv.
ensureVenv
# vllm-metal's installer drives everything through `uv`: building vLLM from
# the CPU requirements needs `--index-strategy unsafe-best-match` (mixes the
# pytorch CPU channel with PyPI), a flag plain pip does not have. The darwin
# venv is pip-based, so bootstrap uv into it. uv honours $VIRTUAL_ENV (set by
# libbackend's _activateVenv) and installs into THIS venv — same pattern the
# intel branch below relies on.
pip install uv
# The ONLY darwin version pin -- AUTO-BUMPED by .github/bump_vllm_metal.sh,
# which tracks vllm-project/vllm-metal releases (NOT vllm/vllm latest). Keep
# it as a plain double-quoted assignment on its own line so the bumper's sed
# can rewrite it. Darwin therefore follows vllm-metal and can lag the Linux
# vllm pin (requirements-cublas13-after.txt, bumped independently against
# vllm/vllm) until vllm-metal supports a newer vLLM.
VLLM_METAL_VERSION="v0.3.0.dev20260622062346"
# The coupled vLLM source version is whatever this vllm-metal release builds
# against -- it declares it in its own installer as `vllm_v=`. Derive it from
# the PINNED tag rather than hardcoding a second value that could drift. The
# tag is immutable, so this stays reproducible across rebuilds.
VLLM_VERSION=$(curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm-metal/${VLLM_METAL_VERSION}/install.sh" \
| grep -oE 'vllm_v="[0-9]+\.[0-9]+\.[0-9]+"' | head -n1 | cut -d'"' -f2)
if [ -z "${VLLM_VERSION}" ]; then
echo "ERROR: could not derive the vLLM version from vllm-metal ${VLLM_METAL_VERSION}" >&2
exit 1
fi
echo "vllm-metal ${VLLM_METAL_VERSION} builds against vLLM ${VLLM_VERSION}"
_vllm_src=$(mktemp -d)
trap 'rm -rf "${_vllm_src}"' EXIT
pushd "${_vllm_src}"
# 1) Build vLLM ${VLLM_VERSION} from the release source tarball against
# the CPU requirements. vllm-metal layers its MLX platform plugin on
# top of this exact build.
curl -fsSL -o "vllm-${VLLM_VERSION}.tar.gz" \
"https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}.tar.gz"
tar -xzf "vllm-${VLLM_VERSION}.tar.gz"
pushd "vllm-${VLLM_VERSION}"
uv pip install -r requirements/cpu.txt --index-strategy unsafe-best-match
# -Wno-parentheses: clang on macOS treats one of vLLM's C++ warnings
# as an error without it (matches the upstream installer's CXXFLAGS).
CXXFLAGS="-Wno-parentheses" uv pip install .
popd
popd
# 2) Install the prebuilt vllm-metal wheel for the PINNED release. It pulls
# mlx / mlx-metal as deps and registers the `metal` platform plugin that
# backend.py resolves to at engine-init time. Build the release-asset URL
# deterministically (tag + the cp312/arm64 wheel name) rather than querying
# api.github.com, whose unauthenticated rate limit (60/hr per IP) 403s on
# shared CI runners. The wheel version is the tag without its leading 'v'.
_metal_wheel="vllm_metal-${VLLM_METAL_VERSION#v}-cp312-cp312-macosx_11_0_arm64.whl"
_metal_wheel_url="https://github.com/vllm-project/vllm-metal/releases/download/${VLLM_METAL_VERSION}/${_metal_wheel}"
echo "Installing vllm-metal wheel: ${_metal_wheel_url}"
uv pip install "${_metal_wheel_url}"
# Generate the gRPC stubs (backend_pb2*). installRequirements normally does
# this via runProtogen at the end; we skipped installRequirements on darwin,
# so call it explicitly here.
runProtogen
# Intel XPU has no upstream-published vllm wheels, so we always build vllm
# from source against torch-xpu and replace the default triton with
# triton-xpu (matching torch 2.11). Mirrors the upstream procedure:
# https://github.com/vllm-project/vllm/blob/main/docs/getting_started/installation/gpu.xpu.inc.md
elif [ "x${BUILD_TYPE}" == "xintel" ]; then
if [ "x${BUILD_TYPE}" == "xintel" ]; then
# Hide requirements-intel-after.txt so installRequirements doesn't
# try `pip install vllm` (would either fail or grab a non-XPU wheel).
_intel_after="${backend_dir}/requirements-intel-after.txt"


@@ -4,7 +4,4 @@
# instead — the cublas13 case in install.sh adds --index-strategy=unsafe-best-match
# so uv consults this index alongside PyPI.
--extra-index-url https://wheels.vllm.ai/0.23.0/cu130
# VERSION COUPLING: darwin/Apple-Silicon builds use vllm-metal (see install.sh),
# which pins this exact vLLM version. Bumping vllm here means coordinating with a
# vllm-metal release that supports the new version, or macOS/Metal builds break.
vllm==0.23.0


@@ -54,35 +54,8 @@ func (g GPU) IsNVIDIABlackwell() bool {
return maj >= 12
}
// Compute-buffer headroom guard for the raised physical batch.
//
// Raising n_ubatch grows the CUDA *compute buffer* (the scratch for the forward
// graph), which is allocated PER DEVICE — it does not benefit from a second GPU
// the way weights or KV (which are split across devices) do. The buffer scales
// ~linearly with n_ubatch * n_ctx, so a large context turns the GB10-tuned
// ub2048 into multi-GiB of extra scratch that must fit on a SINGLE card. On a
// 16 GiB consumer Blackwell with a 200k context that overflows (issue #10485),
// even though the GB10 it was measured on (128 GiB unified memory) had room.
//
// These constants size a conservative guard: only raise the batch when the
// extra scratch fits the per-device VRAM ceiling.
const (
// computeBufferBytesPerCell approximates the CUDA compute-buffer cost of one
// (n_ubatch * n_ctx) cell. Derived from an observed allocation (ub2048 *
// ctx204800 ~= 4.5 GiB => ~11 B/cell) and rounded up to 16 for margin, since
// the real cost also grows with model width (heads / embedding dim) which we
// don't know at config time.
computeBufferBytesPerCell = 16
// blackwellBatchHeadroomDivisor caps the extra compute buffer from raising the
// physical batch at VRAM/divisor. /4 keeps the bulk of a device for weights +
// KV, which already dominate VRAM use.
blackwellBatchHeadroomDivisor = 4
)
// PhysicalBatch returns the canonical physical batch (n_batch/n_ubatch) for the
// given hardware class, ignoring context/VRAM headroom. Use
// PhysicalBatchForContext when a model context and per-device VRAM are known
// (the load paths) so the raised batch can't overflow a single device.
// given hardware, used when the model config leaves batch unset.
func PhysicalBatch(g GPU) int {
if g.IsNVIDIABlackwell() {
return BlackwellPhysicalBatch
@@ -90,32 +63,6 @@ func PhysicalBatch(g GPU) int {
return DefaultPhysicalBatch
}
// PhysicalBatchForContext is PhysicalBatch gated on per-device VRAM headroom for
// the given context: it only raises the batch above the conservative default
// when the extra compute buffer (which is allocated on a single device and grows
// with n_ubatch * n_ctx) fits within 1/blackwellBatchHeadroomDivisor of the GPU's
// VRAM. g.VRAM must be the PER-DEVICE ceiling (the smallest device on a
// multi-GPU host), not the summed total — the compute buffer can't be split.
//
// VRAM 0 (unknown) stays conservative rather than risk a per-device OOM; the
// GB10 / unified-memory path reports system RAM, so it still clears the guard.
func PhysicalBatchForContext(g GPU, ctx int) int {
if !g.IsNVIDIABlackwell() {
return DefaultPhysicalBatch
}
if ctx <= 0 {
ctx = DefaultContextSize
}
if g.VRAM == 0 {
return DefaultPhysicalBatch
}
extra := uint64(ctx) * uint64(BlackwellPhysicalBatch-DefaultPhysicalBatch) * computeBufferBytesPerCell
if extra <= g.VRAM/blackwellBatchHeadroomDivisor {
return BlackwellPhysicalBatch
}
return DefaultPhysicalBatch
}
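As a worked instance of the headroom arithmetic in PhysicalBatchForContext, the sketch below reproduces the guard with stand-in constants. The batch values 2048 and 512 are assumptions inferred from the test description ("raises an unset batch to 2048") and the comment about "the default 512"; the lowercase names are hypothetical and are not the package's identifiers. The Blackwell check is left out so that only the memory estimate is exercised.

```go
package main

import "fmt"

const (
	gib = uint64(1) << 30

	// Assumed values: 2048 is taken from the test description and 512 from
	// the "default 512" comment. Neither is copied from the package itself.
	blackwellBatch = 2048
	defaultBatch   = 512

	bytesPerCell    = 16 // mirrors computeBufferBytesPerCell in the diff
	headroomDivisor = 4  // mirrors blackwellBatchHeadroomDivisor in the diff
)

// extraScratch estimates the additional compute-buffer bytes needed when the
// physical batch is raised from defaultBatch to blackwellBatch at context ctx.
func extraScratch(ctx int) uint64 {
	return uint64(ctx) * uint64(blackwellBatch-defaultBatch) * bytesPerCell
}

// fits reports whether the extra scratch stays within vram/headroomDivisor.
// An unknown VRAM (0) never fits, matching the conservative branch.
func fits(vram uint64, ctx int) bool {
	return vram != 0 && extraScratch(ctx) <= vram/headroomDivisor
}

func main() {
	cases := []struct {
		vramGiB uint64
		ctx     int
	}{{16, 8192}, {16, 204800}, {119, 204800}}
	for _, c := range cases {
		extra := extraScratch(c.ctx)
		fmt.Printf("vram=%d GiB ctx=%d extra=%.4f GiB fits=%v\n",
			c.vramGiB, c.ctx, float64(extra)/float64(gib), fits(c.vramGiB*gib, c.ctx))
	}
}
```

Under these assumptions a 204800-token context costs 204800 * 1536 * 16 bytes, about 4.69 GiB of extra scratch. That exceeds the 4 GiB ceiling of a 16 GiB card but is well below the 29.75 GiB ceiling of a 119 GiB unified-memory device, which is the split the tests in this diff assert.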
// IsManagedPhysicalBatch reports whether n is a value PhysicalBatch assigns.
// Callers that re-tune a value chosen by an upstream host (the distributed
// router correcting the frontend's guess) use this to avoid clobbering an
@@ -175,12 +122,7 @@ func hasParallelOption(opts []string) bool {
// deterministic device — detection does a live nvidia-smi call.
var localGPU = func() GPU {
vendor, _ := xsysinfo.DetectGPUVendor()
// Use the SMALLEST device's VRAM, not the summed total: the parallel-slot
// tier and the batch headroom guard both reason about what fits on a single
// card, and per-device compute buffers can't be split across GPUs. Summing
// two 16 GiB cards into "32 GiB" is what over-provisioned multi-GPU hosts
// into OOM (issue #10485).
vram, _ := xsysinfo.MinPerGPUVRAM()
vram, _ := xsysinfo.TotalAvailableVRAM()
return GPU{
Vendor: vendor,
ComputeCapability: xsysinfo.NVIDIAComputeCapability(),
@@ -195,20 +137,10 @@ func ApplyHardwareDefaults(cfg *ModelConfig, gpu GPU) {
if cfg == nil {
return
}
// Raise the physical batch on Blackwell only when the resulting compute
// buffer fits the per-device VRAM at THIS model's context. Leaving Batch at 0
// (rather than writing the default 512) preserves the downstream single-pass
// sizing in core/backend.EffectiveBatchSize for embedding/score/rerank.
if cfg.Batch == 0 {
ctx := DefaultContextSize
if cfg.ContextSize != nil {
ctx = *cfg.ContextSize
}
if PhysicalBatchForContext(gpu, ctx) == BlackwellPhysicalBatch {
cfg.Batch = BlackwellPhysicalBatch
xlog.Debug("[hardware_defaults] Blackwell GPU: defaulting physical batch",
"batch", cfg.Batch, "compute_cap", gpu.ComputeCapability, "context", ctx, "vram_gib", gpu.VRAM>>30)
}
if cfg.Batch == 0 && gpu.IsNVIDIABlackwell() {
cfg.Batch = BlackwellPhysicalBatch
xlog.Debug("[hardware_defaults] Blackwell GPU: defaulting physical batch",
"batch", cfg.Batch, "compute_cap", gpu.ComputeCapability)
}
// Enable concurrent serving by default on a capable GPU: without this the
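The localGPU comment in this file contrasts the smallest device's VRAM with the summed total. A minimal sketch of that difference follows; minPerDevice and total are hypothetical helpers written for illustration, not the xsysinfo functions named in the diff, whose implementations are not shown here.

```go
package main

import "fmt"

// minPerDevice returns the smallest VRAM among the devices, or 0 when the
// list is empty (treated here as "unknown"). Hypothetical helper.
func minPerDevice(vram []uint64) uint64 {
	if len(vram) == 0 {
		return 0
	}
	m := vram[0]
	for _, v := range vram[1:] {
		if v < m {
			m = v
		}
	}
	return m
}

// total sums VRAM across all devices. Hypothetical helper.
func total(vram []uint64) uint64 {
	var sum uint64
	for _, v := range vram {
		sum += v
	}
	return sum
}

func main() {
	const gib = uint64(1) << 30
	cards := []uint64{16 * gib, 16 * gib}
	// A per-device buffer has to fit the smaller number, not the sum.
	fmt.Println(minPerDevice(cards)/gib, total(cards)/gib)
}
```

For two 16 GiB cards the per-device figure is 16 GiB while the sum is 32 GiB. Any check that sizes a buffer living on a single card against the sum is therefore twice as permissive as the hardware allows.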


@@ -9,37 +9,26 @@ import (
// GPU. The detection seam (localGPU) is injected so the path is deterministic
// without a real GPU.
var _ = Describe("SetDefaults hardware defaults (single-instance)", func() {
const gib = uint64(1) << 30
var orig func() GPU
BeforeEach(func() { orig = localGPU })
AfterEach(func() { localGPU = orig })
It("sets the physical batch on a local Blackwell GPU with headroom", func() {
localGPU = func() GPU { return GPU{ComputeCapability: "12.1", VRAM: 119 * gib} }
It("sets the physical batch on a local Blackwell GPU", func() {
localGPU = func() GPU { return GPU{ComputeCapability: "12.1"} }
cfg := &ModelConfig{}
cfg.SetDefaults()
Expect(cfg.Batch).To(Equal(BlackwellPhysicalBatch))
})
It("leaves batch unset when a large context would overflow the device", func() {
// Regression guard for issue #10485: 16 GiB consumer Blackwell + ~200k ctx.
localGPU = func() GPU { return GPU{ComputeCapability: "12.0", VRAM: 16 * gib} }
ctx := 204800
cfg := &ModelConfig{LLMConfig: LLMConfig{ContextSize: &ctx}}
cfg.SetDefaults()
Expect(cfg.Batch).To(Equal(0))
})
It("leaves batch unset on a non-Blackwell local GPU", func() {
localGPU = func() GPU { return GPU{ComputeCapability: "8.9", VRAM: 119 * gib} }
localGPU = func() GPU { return GPU{ComputeCapability: "8.9"} }
cfg := &ModelConfig{}
cfg.SetDefaults()
Expect(cfg.Batch).To(Equal(0))
})
It("never overrides an explicit batch", func() {
localGPU = func() GPU { return GPU{ComputeCapability: "12.1", VRAM: 119 * gib} }
localGPU = func() GPU { return GPU{ComputeCapability: "12.1"} }
cfg := &ModelConfig{}
cfg.Batch = 1024
cfg.SetDefaults()


@@ -7,8 +7,6 @@ import (
)
var _ = Describe("Hardware-driven config defaults", func() {
const gib = uint64(1) << 30
DescribeTable("GPU.IsNVIDIABlackwell (sm_12x consumer family)",
func(cc string, want bool) {
Expect(GPU{ComputeCapability: cc}.IsNVIDIABlackwell()).To(Equal(want))
@@ -37,54 +35,21 @@ var _ = Describe("Hardware-driven config defaults", func() {
})
})
Describe("PhysicalBatchForContext (per-device VRAM headroom)", func() {
It("raises the batch when the compute buffer fits the device", func() {
// 16 GiB Blackwell with a small context: the extra scratch is tiny.
Expect(PhysicalBatchForContext(GPU{ComputeCapability: "12.0", VRAM: 16 * gib}, 8192)).
To(Equal(BlackwellPhysicalBatch))
})
It("keeps the default batch when a large context would overflow one device", func() {
// The issue #10485 case: 16 GiB consumer Blackwell, ~200k context.
Expect(PhysicalBatchForContext(GPU{ComputeCapability: "12.0", VRAM: 16 * gib}, 204800)).
To(Equal(DefaultPhysicalBatch))
})
It("still raises the batch on a large unified-memory device (GB10)", func() {
// GB10 reports system RAM (~119 GiB) as its single device's VRAM.
Expect(PhysicalBatchForContext(GPU{ComputeCapability: "12.1", VRAM: 119 * gib}, 204800)).
To(Equal(BlackwellPhysicalBatch))
})
It("stays conservative when VRAM is unknown", func() {
Expect(PhysicalBatchForContext(GPU{ComputeCapability: "12.1"}, 8192)).
To(Equal(DefaultPhysicalBatch))
})
It("never raises the batch on non-Blackwell", func() {
Expect(PhysicalBatchForContext(GPU{ComputeCapability: "9.0", VRAM: 80 * gib}, 8192)).
To(Equal(DefaultPhysicalBatch))
})
})
Describe("ApplyHardwareDefaults", func() {
It("raises an unset batch to 2048 on Blackwell with headroom", func() {
It("raises an unset batch to 2048 on Blackwell", func() {
cfg := &ModelConfig{}
ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.1", VRAM: 119 * gib})
ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.1"})
Expect(cfg.Batch).To(Equal(BlackwellPhysicalBatch))
})
It("leaves batch unset when a large context would overflow one device", func() {
// Regression guard for issue #10485: 16 GiB card + ~200k context.
ctx := 204800
cfg := &ModelConfig{LLMConfig: LLMConfig{ContextSize: &ctx}}
ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.0", VRAM: 16 * gib})
Expect(cfg.Batch).To(Equal(0))
})
It("leaves batch unset on non-Blackwell", func() {
cfg := &ModelConfig{}
ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "9.0", VRAM: 119 * gib})
ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "9.0"})
Expect(cfg.Batch).To(Equal(0))
})
It("never overrides an explicit batch", func() {
cfg := &ModelConfig{}
cfg.Batch = 1024
ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.1", VRAM: 119 * gib})
ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.1"})
Expect(cfg.Batch).To(Equal(1024))
})
It("no-ops on nil", func() {
@@ -92,6 +57,8 @@ var _ = Describe("Hardware-driven config defaults", func() {
})
})
const gib = uint64(1) << 30
DescribeTable("DefaultParallelSlots (by VRAM)",
func(vramGiB uint64, want int) {
Expect(DefaultParallelSlots(GPU{VRAM: vramGiB * gib})).To(Equal(want))


@@ -1204,6 +1204,11 @@ func (cfg *ModelConfig) SetDefaults(opts ...ConfigLoaderOption) {
// This ensures gallery-installed and runtime-loaded models get optimal parameters.
ApplyInferenceDefaults(cfg, cfg.Name, cfg.Model)
// Apply hardware-driven defaults (e.g. a larger physical batch on Blackwell).
// Uses the local GPU here; in distributed mode the router re-applies the same
// heuristics for the selected node's GPU before loading. Explicit config wins.
ApplyHardwareDefaults(cfg, localGPU())
// Apply serving-policy defaults (device-independent): cross-request prefix
// caching. Propagates to distributed nodes via the model options.
ApplyServingDefaults(cfg)
@@ -1242,16 +1247,6 @@ func (cfg *ModelConfig) SetDefaults(opts ...ConfigLoaderOption) {
cfg.ContextSize = &ctx
}
runBackendHooks(cfg, lo.modelPath)
// Apply hardware-driven defaults (e.g. a larger physical batch on Blackwell)
// LAST, after the context size is fully resolved (explicit config, LoadOptions,
// then the GGUF guess inside runBackendHooks): the Blackwell batch guard sizes
// the per-device compute buffer against this model's context, so it must see
// the final value, not a pre-guess nil. Uses the local GPU here; in distributed
// mode the router re-applies the same heuristics for the selected node's GPU
// before loading. Explicit config always wins.
ApplyHardwareDefaults(cfg, localGPU())
cfg.syncKnownUsecasesFromString()
}
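The comment about applying hardware defaults after the context size is resolved can be illustrated with a small sketch. Everything here is a stand-in: config, applyHardware and resolveContext are hypothetical, assumedDefaultContext is an invented value (the real DefaultContextSize is not shown in the diff), the batch values are inferred from tests and comments, and the GPU-class check is omitted.

```go
package main

import "fmt"

const (
	gib = uint64(1) << 30

	// Hypothetical stand-ins, not the package's constants.
	assumedDefaultContext = 4096
	raisedBatch           = 2048
	baseBatch             = 512
)

type config struct {
	ContextSize *int
	Batch       int
}

// applyHardware raises an unset batch only if the extra scratch at the
// config's current context fits in a quarter of vram.
func applyHardware(cfg *config, vram uint64) {
	if cfg.Batch != 0 {
		return // an explicit batch always wins
	}
	ctx := assumedDefaultContext
	if cfg.ContextSize != nil {
		ctx = *cfg.ContextSize
	}
	extra := uint64(ctx) * uint64(raisedBatch-baseBatch) * 16
	if extra <= vram/4 {
		cfg.Batch = raisedBatch
	}
}

// resolveContext stands in for the later steps that fill in the context,
// for example a guess read from the model file.
func resolveContext(cfg *config, guessed int) {
	if cfg.ContextSize == nil {
		cfg.ContextSize = &guessed
	}
}

func main() {
	early := &config{}
	applyHardware(early, 16*gib) // sees a nil context and uses the assumed default
	resolveContext(early, 204800)

	late := &config{}
	resolveContext(late, 204800)
	applyHardware(late, 16*gib) // sees the resolved context

	fmt.Println(early.Batch, late.Batch)
}
```

With these assumed numbers the early call raises the batch because it sizes the buffer against a small default context, while the late call leaves it unset once the 204800-token context is known. This is the ordering hazard the comment describes.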


@@ -86,7 +86,6 @@
"input": {
"placeholder": "Message...",
"attachFile": "Attach file",
"send": "Send message",
"stopGenerating": "Stop generating",
"canvasTitle": "Canvas — extract code blocks and media into a side panel for preview, copy, and download",
"canvasLabel": "Canvas",


@@ -77,20 +77,6 @@
"noModelsTitle": "No Models Available",
"noModelsBody": "There are no models installed yet. Ask your administrator to set up models so you can start chatting."
},
"starters": {
"title": "Recommended for your hardware",
"tier": {
"cpu": "CPU-only",
"gpu-small": "GPU",
"gpu-large": "GPU"
},
"cpuNote": "No GPU detected — these small models stay responsive on CPU.",
"gpuNote": "Picked to fit your available VRAM with room for context.",
"install": "Install",
"installing": "Installing",
"installStarted": "Installing {{model}}…",
"installFailed": "Install failed: {{message}}"
},
"connect": {
"title": "One endpoint, every API",
"subtitle": "LocalAI serves its own full API — image & video generation, depth, object detection, reranking, audio, face & voice recognition, and realtime voice over WebRTC and WebSocket. On top of that, a drop-in compatibility layer lets any app built for OpenAI, Anthropic, Ollama or OpenAI Responses talk to it unchanged.",


@@ -6363,59 +6363,6 @@ select.input {
justify-content: center;
}
/* ──────────────────── Home: hardware-aware starter models ──────────────────── */
.home-starters {
margin: var(--spacing-lg) 0;
padding: var(--spacing-lg);
}
.home-starters-head {
display: flex;
align-items: center;
justify-content: space-between;
gap: var(--spacing-md);
}
.home-starters-head strong {
font-size: 0.9375rem;
}
.home-starters-tier {
display: inline-flex;
align-items: center;
gap: var(--spacing-xs);
font-size: 0.75rem;
color: var(--color-text-muted);
}
.home-starters-sub {
margin: var(--spacing-xs) 0 var(--spacing-md);
font-size: 0.8125rem;
color: var(--color-text-secondary);
}
.home-starters-list {
list-style: none;
margin: 0;
padding: 0;
display: flex;
flex-direction: column;
gap: var(--spacing-xs);
}
.home-starters-item {
display: flex;
align-items: center;
gap: var(--spacing-md);
padding: var(--spacing-xs) 0;
}
.home-starters-name {
font-weight: 500;
font-size: 0.875rem;
word-break: break-all;
}
.home-starters-size {
margin-left: auto;
font-size: 0.75rem;
color: var(--color-text-muted);
white-space: nowrap;
}
/* ──────────────────── Home: drop-in endpoint / API compatibility ──────────────────── */
.home-connect {


@@ -1,25 +1,8 @@
import { useEffect, useMemo, useCallback } from 'react'
import { useEffect, useMemo } from 'react'
import { useModels } from '../hooks/useModels'
import SearchableSelect from './SearchableSelect'
import { useTranslation } from 'react-i18next'
// Remember the last model the user picked, keyed by capability, so returning to
// a page (Home chat box, Image, TTS, Talk...) defaults to that model instead of
// whatever happens to sort first. Only persisted when a capability key exists —
// `externalOptions` callers pass no capability and get the old first-item
// behaviour. localStorage access is wrapped because private-browsing modes throw.
const LAST_MODEL_PREFIX = 'localai_last_model:'
function readLastModel(capability) {
if (!capability) return null
try { return localStorage.getItem(LAST_MODEL_PREFIX + capability) } catch { return null }
}
function writeLastModel(capability, model) {
if (!capability || !model) return
try { localStorage.setItem(LAST_MODEL_PREFIX + capability, model) } catch { /* ignore */ }
}
export default function ModelSelector({
value, onChange, capability, className = '',
options: externalOptions, loading: externalLoading,
@@ -36,27 +19,16 @@ export default function ModelSelector({
const isLoading = externalOptions ? (externalLoading || false) : hookLoading
const isDisabled = isLoading || (externalDisabled || false)
// Persist genuine selections so the next visit can restore them.
const handleChange = useCallback((next) => {
writeLastModel(capability, next)
onChange(next)
}, [capability, onChange])
useEffect(() => {
if (modelNames.length > 0 && (!value || !modelNames.includes(value))) {
// Prefer the remembered model when it's still available; otherwise fall
// back to the first option. Don't re-persist here — auto-select is not a
// user choice, and writing back the stored value would be a harmless but
// pointless round-trip.
const remembered = readLastModel(capability)
onChange(remembered && modelNames.includes(remembered) ? remembered : modelNames[0])
onChange(modelNames[0])
}
}, [modelNames, value, onChange, capability])
}, [modelNames, value, onChange])
return (
<SearchableSelect
value={value || ''}
onChange={handleChange}
onChange={onChange}
options={modelNames}
placeholder={isLoading ? t('selector.loading') : (modelNames.length === 0 ? t('selector.noModels') : t('selector.selectModel'))}
searchPlaceholder={searchPlaceholder || t('selector.searchPlaceholder')}


@@ -1,129 +0,0 @@
import { useState, useEffect, useMemo } from 'react'
import { useTranslation } from 'react-i18next'
import { modelsApi } from '../utils/api'
import { useResources } from '../hooks/useResources'
// Curated, hardware-tiered starter models for the empty-state onboarding. Names
// are real gallery entries (gallery/index.yaml); we intersect them against the
// live gallery at render time so a custom/trimmed gallery degrades gracefully
// (unmatched entries simply don't render).
//
// The guiding rule the maintainer asked for: CPU-only machines should be
// steered to genuinely small models (1-4B, Q4) that stay responsive without a
// GPU. GPU tiers scale the suggestion up with available VRAM.
const SMALL = [
{ name: 'llama-3.2-1b-instruct:q4_k_m', size: '~0.8 GB' },
{ name: 'llama-3.2-3b-instruct:q4_k_m', size: '~2 GB' },
{ name: 'qwen3-1.7b', size: '~1.4 GB' },
{ name: 'gemma-3-1b-it', size: '~0.8 GB' },
]
const MID = [
{ name: 'qwen3-4b', size: '~2.5 GB' },
{ name: 'gemma-3-4b-it', size: '~3 GB' },
{ name: 'llama-3.2-3b-instruct:q4_k_m', size: '~2 GB' },
]
const LARGE = [
{ name: 'meta-llama-3.1-8b-instruct', size: '~5 GB' },
{ name: 'qwen3-4b', size: '~2.5 GB' },
{ name: 'mistral-7b-instruct-v0.3', size: '~4 GB' },
]
const GB = 1024 * 1024 * 1024
// Pick a tier from detected hardware. total_memory is GPU VRAM in bytes (0 when
// CPU-only). Thresholds are deliberately conservative so a suggestion that
// "fits" really does.
function pickTier(resources) {
const isGpu = resources?.type === 'gpu'
const vram = resources?.aggregate?.total_memory || 0
if (!isGpu || vram <= 0) return { id: 'cpu', list: SMALL }
if (vram < 8 * GB) return { id: 'gpu-small', list: MID }
return { id: 'gpu-large', list: LARGE }
}
export default function StarterModels({ addToast, onInstallStarted }) {
const { t } = useTranslation('home')
const { resources } = useResources()
const [available, setAvailable] = useState(null) // Set of gallery names, or null while loading
const [installing, setInstalling] = useState(() => new Set())
const tier = useMemo(() => pickTier(resources), [resources])
const candidates = tier.list
// Verify candidates exist in the live gallery. One search per name (the tier
// has at most a handful) keeps this resilient to gallery customization.
useEffect(() => {
let cancelled = false
const names = [...new Set(candidates.map(c => c.name))]
Promise.all(names.map(name =>
modelsApi.list({ search: name, page: 1 })
.then(data => (data?.models || []).some(m => (m.name || m.id) === name) ? name : null)
.catch(() => null)
)).then(found => {
if (cancelled) return
const hits = found.filter(Boolean)
// If verification yielded nothing (e.g. gallery unreachable), fall back to
// showing the curated list rather than an empty widget.
setAvailable(hits.length > 0 ? new Set(hits) : null)
})
return () => { cancelled = true }
}, [candidates])
const visible = available === null
? candidates
: candidates.filter(c => available.has(c.name))
if (visible.length === 0) return null
const install = async (name) => {
setInstalling(prev => new Set(prev).add(name))
try {
await modelsApi.install(name)
addToast?.(t('starters.installStarted', { model: name }), 'success')
onInstallStarted?.(name)
} catch (err) {
addToast?.(t('starters.installFailed', { message: err.message }), 'error')
setInstalling(prev => {
const next = new Set(prev)
next.delete(name)
return next
})
}
}
return (
<section className="home-starters card">
<div className="home-starters-head">
<strong>{t('starters.title')}</strong>
<span className="home-starters-tier">
<i className={`fas ${tier.id === 'cpu' ? 'fa-memory' : 'fa-microchip'}`} aria-hidden="true" />
{t(`starters.tier.${tier.id}`)}
</span>
</div>
<p className="home-starters-sub">
{tier.id === 'cpu' ? t('starters.cpuNote') : t('starters.gpuNote')}
</p>
<ul className="home-starters-list">
{visible.map(c => {
const busy = installing.has(c.name)
return (
<li key={c.name} className="home-starters-item">
<span className="home-starters-name">{c.name}</span>
<span className="home-starters-size">{c.size}</span>
<button
type="button"
className="btn btn-primary btn-sm"
disabled={busy}
onClick={() => install(c.name)}
>
{busy
? (<><i className="fas fa-spinner fa-spin" aria-hidden="true" /> {t('starters.installing')}</>)
: (<><i className="fas fa-download" aria-hidden="true" /> {t('starters.install')}</>)}
</button>
</li>
)
})}
</ul>
</section>
)
}


@@ -1,66 +0,0 @@
import { useEffect, useRef, useCallback } from 'react'
// usePolling runs `fn` immediately and then on a fixed interval, with two
// behaviours every hand-rolled setInterval in this app was missing:
//
// 1. Visibility-aware: the timer pauses while the tab is hidden
// (document.hidden) and fires an immediate catch-up poll when the tab
// becomes visible again. A backgrounded dashboard no longer hammers the
// server every few seconds for data nobody is looking at.
// 2. Non-overlapping: if `fn` returns a promise that takes longer than the
// interval, the next tick waits for it instead of stacking requests.
//
// `enabled: false` stops polling entirely (one-shot or gated polls). The
// returned `refetch` runs `fn` on demand and is stable across renders.
export function usePolling(fn, intervalMs = 5000, { enabled = true, immediate = true } = {}) {
const fnRef = useRef(fn)
fnRef.current = fn
const runningRef = useRef(false)
const refetch = useCallback(async () => {
// Guard against overlap: a slow poll shouldn't pile up behind a fast timer.
if (runningRef.current) return
runningRef.current = true
try {
return await fnRef.current()
} finally {
runningRef.current = false
}
}, [])
useEffect(() => {
if (!enabled) return
let timer = null
const tick = () => { refetch() }
const start = () => {
if (timer != null) return
timer = setInterval(tick, intervalMs)
}
const stop = () => {
if (timer != null) { clearInterval(timer); timer = null }
}
const onVisibility = () => {
if (document.hidden) {
stop()
} else {
// Catch up immediately on return, then resume the cadence.
tick()
start()
}
}
if (immediate) tick()
if (!document.hidden) start()
document.addEventListener('visibilitychange', onVisibility)
return () => {
stop()
document.removeEventListener('visibilitychange', onVisibility)
}
}, [enabled, intervalMs, immediate, refetch])
return { refetch }
}


@@ -1,11 +1,11 @@
import { useState, useCallback } from 'react'
import { useState, useEffect, useCallback, useRef } from 'react'
import { resourcesApi } from '../utils/api'
import { usePolling } from './usePolling'
export function useResources(pollInterval = 5000) {
const [resources, setResources] = useState(null)
const [loading, setLoading] = useState(true)
const [error, setError] = useState(null)
const intervalRef = useRef(null)
const fetchResources = useCallback(async () => {
try {
@@ -19,10 +19,13 @@ export function useResources(pollInterval = 5000) {
}
}, [])
// Visibility-aware polling: pauses while the tab is hidden and catches up on
// return (see usePolling). Resource stats are pure dashboard data, so there's
// no reason to keep fetching them for a backgrounded tab.
const { refetch } = usePolling(fetchResources, pollInterval)
useEffect(() => {
fetchResources()
intervalRef.current = setInterval(fetchResources, pollInterval)
return () => {
if (intervalRef.current) clearInterval(intervalRef.current)
}
}, [fetchResources, pollInterval])
return { resources, loading, error, refetch }
return { resources, loading, error, refetch: fetchResources }
}


@@ -765,10 +765,8 @@ export default function AgentChat() {
className="chat-send-btn"
onClick={handleSend}
disabled={processing || !input.trim()}
aria-label="Send message"
title="Send message"
>
<i className="fas fa-paper-plane" aria-hidden="true" />
<i className="fas fa-paper-plane" />
</button>
</div>
</div>


@@ -1427,10 +1427,8 @@ export default function Chat() {
className="chat-send-btn"
onClick={handleSend}
disabled={!input.trim() && files.length === 0}
aria-label={t('input.send')}
title={t('input.send')}
>
<i className="fas fa-paper-plane" aria-hidden="true" />
<i className="fas fa-paper-plane" />
</button>
)}
</div>


@@ -10,7 +10,6 @@ import UnifiedMCPDropdown from '../components/UnifiedMCPDropdown'
import ConfirmDialog from '../components/ConfirmDialog'
import HomeConnect from '../components/HomeConnect'
import { useResources } from '../hooks/useResources'
import { usePolling } from '../hooks/usePolling'
import { fileToBase64, backendControlApi, systemApi, modelsApi, mcpApi, nodesApi } from '../utils/api'
import { API_CONFIG } from '../utils/config'
import { greetingKey } from '../utils/greeting'
@@ -18,7 +17,6 @@ import StatusPill from '../components/StatusPill'
import Skeleton from '../components/Skeleton'
import SectionHeading from '../components/SectionHeading'
import EmptyState from '../components/EmptyState'
import StarterModels from '../components/StarterModels'
import { staggerStyle } from '../hooks/useStagger'
export default function Home() {
@@ -70,36 +68,40 @@ export default function Home() {
.catch(() => {})
}, [])
// Poll cluster node data in distributed mode. Visibility-aware + gated on
// distributedMode so a non-distributed or backgrounded tab makes no calls.
const fetchCluster = useCallback(async () => {
try {
const data = await nodesApi.list()
const nodes = Array.isArray(data) ? data : []
const backendNodes = nodes.filter(n => !n.node_type || n.node_type === 'backend')
const totalVRAM = backendNodes.reduce((sum, n) => sum + (n.total_vram || 0), 0)
const usedVRAM = backendNodes.reduce((sum, n) => {
if (n.total_vram && n.available_vram != null) return sum + (n.total_vram - n.available_vram)
return sum
}, 0)
const totalRAM = backendNodes.reduce((sum, n) => sum + (n.total_ram || 0), 0)
const usedRAM = backendNodes.reduce((sum, n) => {
if (n.total_ram && n.available_ram != null) return sum + (n.total_ram - n.available_ram)
return sum
}, 0)
const isGPU = totalVRAM > 0
const healthyCount = backendNodes.filter(n => n.status === 'healthy').length
const totalCount = backendNodes.length
setClusterData({
totalMem: isGPU ? totalVRAM : totalRAM,
usedMem: isGPU ? usedVRAM : usedRAM,
isGPU,
healthyCount,
totalCount,
})
} catch { setClusterData(null) }
}, [])
usePolling(fetchCluster, 5000, { enabled: distributedMode })
// Poll cluster node data in distributed mode
useEffect(() => {
if (!distributedMode) return
const fetchCluster = async () => {
try {
const data = await nodesApi.list()
const nodes = Array.isArray(data) ? data : []
const backendNodes = nodes.filter(n => !n.node_type || n.node_type === 'backend')
const totalVRAM = backendNodes.reduce((sum, n) => sum + (n.total_vram || 0), 0)
const usedVRAM = backendNodes.reduce((sum, n) => {
if (n.total_vram && n.available_vram != null) return sum + (n.total_vram - n.available_vram)
return sum
}, 0)
const totalRAM = backendNodes.reduce((sum, n) => sum + (n.total_ram || 0), 0)
const usedRAM = backendNodes.reduce((sum, n) => {
if (n.total_ram && n.available_ram != null) return sum + (n.total_ram - n.available_ram)
return sum
}, 0)
const isGPU = totalVRAM > 0
const healthyCount = backendNodes.filter(n => n.status === 'healthy').length
const totalCount = backendNodes.length
setClusterData({
totalMem: isGPU ? totalVRAM : totalRAM,
usedMem: isGPU ? usedVRAM : usedRAM,
isGPU,
healthyCount,
totalCount,
})
} catch { setClusterData(null) }
}
fetchCluster()
const interval = setInterval(fetchCluster, 5000)
return () => clearInterval(interval)
}, [distributedMode])
// Fetch configured models (to know if any exist) and loaded models (currently running)
const fetchSystemInfo = useCallback(async () => {
@@ -121,7 +123,11 @@ export default function Home() {
}
}, [])
usePolling(fetchSystemInfo, 5000)
useEffect(() => {
fetchSystemInfo()
const interval = setInterval(fetchSystemInfo, 5000)
return () => clearInterval(interval)
}, [fetchSystemInfo])
// Check MCP availability when selected model changes
useEffect(() => {
@@ -517,8 +523,6 @@ export default function Home() {
</div>
</div>
<StarterModels addToast={addToast} onInstallStarted={fetchSystemInfo} />
<div className="home-wizard-actions">
<button className="btn btn-primary" onClick={() => navigate('/app/models')}>
<i className="fas fa-store" /> {t('wizard.browseGallery')}


@@ -24,37 +24,7 @@ function formatNumber(n) {
return String(n)
}
// Opt-in token pricing. LocalAI is self-hosted and has no inherent monetary
// cost, but multi-user deployments use estimated cost for chargeback/budgeting.
// Prices are admin-supplied $ per 1M tokens, stored locally (per-browser), and
// the whole cost surface stays hidden until a non-zero price is set.
const TOKEN_PRICING_KEY = 'localai_token_pricing'
function loadPricing() {
try {
const p = JSON.parse(localStorage.getItem(TOKEN_PRICING_KEY) || '{}')
return { prompt: Number(p.prompt) || 0, completion: Number(p.completion) || 0 }
} catch { return { prompt: 0, completion: 0 } }
}
function savePricing(p) {
try { localStorage.setItem(TOKEN_PRICING_KEY, JSON.stringify(p)) } catch { /* ignore */ }
}
function pricingEnabled(p) { return (p?.prompt || 0) > 0 || (p?.completion || 0) > 0 }
function costOf(row, p) {
return (row.prompt_tokens / 1_000_000) * (p.prompt || 0)
+ (row.completion_tokens / 1_000_000) * (p.completion || 0)
}
function formatCost(n) {
if (!n) return '$0.00'
if (n < 0.01) return '<$0.01'
return '$' + n.toFixed(2)
}
function StatCard({ icon, label, value, muted, text }) {
function StatCard({ icon, label, value, muted }) {
return (
<div className="card" style={{ padding: 'var(--spacing-sm) var(--spacing-md)', flex: '1 1 0', minWidth: 120, opacity: muted ? 0.7 : 1 }}>
<div style={{ display: 'flex', alignItems: 'center', gap: 6, marginBottom: 2 }}>
@@ -62,7 +32,7 @@ function StatCard({ icon, label, value, muted, text }) {
<span style={{ fontSize: '0.6875rem', color: 'var(--color-text-muted)', fontWeight: 500, textTransform: 'uppercase', letterSpacing: '0.03em' }}>{label}</span>
</div>
<div style={{ fontSize: '1.375rem', fontWeight: 700, fontFamily: 'var(--font-mono)', color: muted ? 'var(--color-text-secondary)' : 'var(--color-text-primary)' }}>
{text != null ? text : `${muted ? '~' : ''}${formatNumber(value)}`}
{muted ? '~' : ''}{formatNumber(value)}
</div>
</div>
)
@@ -672,10 +642,6 @@ export default function Usage() {
const [activeTab, setActiveTab] = useState('models')
const [quotas, setQuotas] = useState([])
const [selectedUserId, setSelectedUserId] = useState(null)
const [pricing, setPricingState] = useState(loadPricing)
const [showPricing, setShowPricing] = useState(false)
const setPricing = (p) => { setPricingState(p); savePricing(p) }
const costEnabled = pricingEnabled(pricing)
const fetchUsage = useCallback(async () => {
setLoading(true)
@@ -777,50 +743,11 @@ export default function Usage() {
<i className="fas fa-key" style={{ fontSize: '0.7rem' }} /> {t('usage.sources.tab')}
</button>
<div style={{ flex: 1 }} />
<button
className={`btn btn-sm ${costEnabled ? 'btn-primary' : 'btn-secondary'}`}
onClick={() => setShowPricing(v => !v)}
style={{ gap: 4 }}
title="Set token pricing to estimate cost"
>
<i className="fas fa-dollar-sign" /> {costEnabled ? 'Pricing' : 'Set pricing'}
</button>
<button className="btn btn-secondary btn-sm" onClick={fetchUsage} disabled={loading} style={{ gap: 4 }}>
<i className={`fas fa-rotate${loading ? ' fa-spin' : ''}`} /> Refresh
</button>
</div>
{showPricing && (
<div className="card" style={{ display: 'flex', alignItems: 'flex-end', gap: 'var(--spacing-md)', flexWrap: 'wrap', padding: 'var(--spacing-md)', marginBottom: 'var(--spacing-md)' }}>
<div style={{ display: 'flex', flexDirection: 'column', gap: 2 }}>
<label style={{ fontSize: '0.6875rem', color: 'var(--color-text-muted)', textTransform: 'uppercase', letterSpacing: '0.03em' }}>Prompt $/1M tokens</label>
<input
className="input" type="number" min="0" step="0.01" style={{ width: 140 }}
value={pricing.prompt || ''}
placeholder="0.00"
onChange={e => setPricing({ ...pricing, prompt: Number(e.target.value) || 0 })}
/>
</div>
<div style={{ display: 'flex', flexDirection: 'column', gap: 2 }}>
<label style={{ fontSize: '0.6875rem', color: 'var(--color-text-muted)', textTransform: 'uppercase', letterSpacing: '0.03em' }}>Completion $/1M tokens</label>
<input
className="input" type="number" min="0" step="0.01" style={{ width: 140 }}
value={pricing.completion || ''}
placeholder="0.00"
onChange={e => setPricing({ ...pricing, completion: Number(e.target.value) || 0 })}
/>
</div>
{costEnabled && (
<button className="btn btn-secondary btn-sm" onClick={() => setPricing({ prompt: 0, completion: 0 })} style={{ gap: 4 }}>
<i className="fas fa-times" /> Clear
</button>
)}
<span style={{ fontSize: '0.75rem', color: 'var(--color-text-muted)', flex: '1 1 200px' }}>
Estimated cost only. Prices are stored in this browser and applied to recorded token counts.
</span>
</div>
)}
{loading ? (
<div style={{ display: 'flex', justifyContent: 'center', padding: 'var(--spacing-xl)' }}>
<LoadingSpinner size="lg" />
@@ -833,9 +760,6 @@ export default function Usage() {
<StatCard icon="fas fa-arrow-up" label="Prompt" value={displayTotals.prompt_tokens} />
<StatCard icon="fas fa-arrow-down" label="Completion" value={displayTotals.completion_tokens} />
<StatCard icon="fas fa-coins" label="Total" value={displayTotals.total_tokens} />
{costEnabled && (
<StatCard icon="fas fa-dollar-sign" label="Est. Cost" text={formatCost(costOf(displayTotals, pricing))} />
)}
</div>
{/* Predictions */}
@@ -865,7 +789,6 @@ export default function Usage() {
<th style={{ width: 110 }}>Prompt</th>
<th style={{ width: 110 }}>Completion</th>
<th style={{ width: 110 }}>Total</th>
{costEnabled && <th style={{ width: 100 }}>Est. Cost</th>}
<th style={{ width: 140 }}></th>
</tr>
</thead>
@@ -877,7 +800,6 @@ export default function Usage() {
<td style={monoCell}>{formatNumber(row.prompt_tokens)}</td>
<td style={monoCell}>{formatNumber(row.completion_tokens)}</td>
<td style={{ ...monoCell, fontWeight: 600 }}>{formatNumber(row.total_tokens)}</td>
{costEnabled && <td style={monoCell}>{formatCost(costOf(row, pricing))}</td>}
<td><UsageBar value={row.total_tokens} max={maxTokens} /></td>
</tr>
))}
@@ -905,7 +827,6 @@ export default function Usage() {
<th style={{ width: 110 }}>Prompt</th>
<th style={{ width: 110 }}>Completion</th>
<th style={{ width: 110 }}>Total</th>
{costEnabled && <th style={{ width: 100 }}>Est. Cost</th>}
<th style={{ width: 110 }}>Proj. Total</th>
<th style={{ width: 140 }}></th>
</tr>
@@ -928,7 +849,6 @@ export default function Usage() {
<td style={monoCell}>{formatNumber(row.prompt_tokens)}</td>
<td style={monoCell}>{formatNumber(row.completion_tokens)}</td>
<td style={{ ...monoCell, fontWeight: 600 }}>{formatNumber(row.total_tokens)}</td>
{costEnabled && <td style={monoCell}>{formatCost(costOf(row, pricing))}</td>}
<td style={{ ...monoCell, color: 'var(--color-text-muted)', fontStyle: 'italic' }}>
{up?.predictions ? `~${formatNumber(up.predictions.projectedTotals.total_tokens)}` : '-'}
</td>
@@ -936,7 +856,7 @@ export default function Usage() {
</tr>
{isExpanded && up && (
<tr>
<td colSpan={costEnabled ? 9 : 8} style={{ padding: 0, background: 'var(--color-bg-secondary)' }}>
<td colSpan={8} style={{ padding: 0, background: 'var(--color-bg-secondary)' }}>
<div style={{ padding: 'var(--spacing-md)' }}>
{up.predictions && (
<div style={{ display: 'grid', gridTemplateColumns: 'repeat(auto-fit, minmax(100px, 1fr))', gap: 'var(--spacing-xs)', marginBottom: 'var(--spacing-sm)' }}>


@@ -156,10 +156,7 @@ func applyNodeHardwareDefaults(opts *pb.ModelOptions, node *BackendNode) {
VRAM: node.TotalVRAM,
}
if config.IsManagedPhysicalBatch(int(opts.NBatch)) {
// Gate the raised batch on the selected node's per-device VRAM at this
// model's context, so a large context can't overflow the node's compute
// buffer (issue #10485). node.TotalVRAM is the node's reported ceiling.
opts.NBatch = int32(config.PhysicalBatchForContext(gpu, int(opts.ContextSize)))
opts.NBatch = int32(config.PhysicalBatch(gpu))
}
// Default concurrent serving for the selected node (the frontend that built
// the options may have no GPU). Only adds when no parallel option is set.


@@ -8,19 +8,12 @@ import (
)
var _ = Describe("applyNodeHardwareDefaults", func() {
It("raises a managed default batch on a Blackwell node with headroom", func() {
opts := &pb.ModelOptions{NBatch: config.DefaultPhysicalBatch, ContextSize: 8192}
applyNodeHardwareDefaults(opts, &BackendNode{GPUComputeCapability: "12.1", TotalVRAM: 119 << 30})
It("raises a managed default batch on a Blackwell node", func() {
opts := &pb.ModelOptions{NBatch: config.DefaultPhysicalBatch}
applyNodeHardwareDefaults(opts, &BackendNode{GPUComputeCapability: "12.1"})
Expect(opts.NBatch).To(BeEquivalentTo(config.BlackwellPhysicalBatch))
})
It("keeps the default batch when a large context would overflow the node", func() {
// Regression guard for issue #10485 on the distributed path.
opts := &pb.ModelOptions{NBatch: config.DefaultPhysicalBatch, ContextSize: 204800}
applyNodeHardwareDefaults(opts, &BackendNode{GPUComputeCapability: "12.0", TotalVRAM: 16 << 30})
Expect(opts.NBatch).To(BeEquivalentTo(config.DefaultPhysicalBatch))
})
It("resets a Blackwell guess on a non-Blackwell node", func() {
// frontend (Blackwell) guessed high, but the selected node is not Blackwell
opts := &pb.ModelOptions{NBatch: config.BlackwellPhysicalBatch}


@@ -1,3 +1,3 @@
{
"version": "v4.5.0"
"version": "v4.4.3"
}


@@ -3,7 +3,24 @@
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
urls:
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct-GGUF
description: "Try LFM • Docs • LEAP • Discord\n\n# LFM2.5-1.2B-Instruct\n\nLFM2.5 is a new family of hybrid models designed for **on-device deployment**. It builds on the LFM2 architecture with extended pre-training and reinforcement learning.\n\n - **Best-in-class performance**: A 1.2B model rivaling much larger models, bringing high-quality AI to your pocket.\n - **Fast edge inference**: 239 tok/s decode on AMD CPU, 82 tok/s on mobile NPU. Runs under 1GB of memory with day-one support for llama.cpp, MLX, and vLLM.\n - **Scaled training**: Extended pre-training from 10T to 28T tokens and large-scale multi-stage reinforcement learning.\n\nFind more information about LFM2.5 in our blog post.\n\n## \U0001F5D2 Model Details\n\nLFM2.5-1.2B-Instruct is a general-purpose text-only model with the following features:\n\n...\n"
description: |
Try LFM • Docs • LEAP • Discord
# LFM2.5-1.2B-Instruct
LFM2.5 is a new family of hybrid models designed for **on-device deployment**. It builds on the LFM2 architecture with extended pre-training and reinforcement learning.
- **Best-in-class performance**: A 1.2B model rivaling much larger models, bringing high-quality AI to your pocket.
- **Fast edge inference**: 239 tok/s decode on AMD CPU, 82 tok/s on mobile NPU. Runs under 1GB of memory with day-one support for llama.cpp, MLX, and vLLM.
- **Scaled training**: Extended pre-training from 10T to 28T tokens and large-scale multi-stage reinforcement learning.
Find more information about LFM2.5 in our blog post.
## 🗒️ Model Details
LFM2.5-1.2B-Instruct is a general-purpose text-only model with the following features:
...
license: "other"
tags:
- llm
@@ -825,8 +842,8 @@
use_tokenizer_template: true
files:
- filename: llama-cpp/models/Qwopus3.6-27B-Coder-MTP-GGUF/Qwopus3.6-27B-Coder-MTP-Q4_K_M.gguf
sha256: b2898667ed7b2388f0ab7691393833ae777f247492bbe62fdb4b2bd3e3cf3f79
uri: https://huggingface.co/Jackrong/Qwopus3.6-27B-Coder-MTP-GGUF/resolve/main/Qwopus3.6-27B-Coder-MTP-Q4_K_M.gguf
sha256: b2b9180093496da2e00439e3fa23227c591355901bfa579bc6897bbc01b755ef
- filename: llama-cpp/mmproj/Qwopus3.6-27B-Coder-MTP-GGUF/mmproj-F32.gguf
sha256: 32f7ea0600c07272547da401d460f8abbd980f3a57b69d6df87be0e2505e0b9c
uri: https://huggingface.co/Jackrong/Qwopus3.6-27B-Coder-MTP-GGUF/resolve/main/mmproj-F32.gguf


@@ -129,61 +129,6 @@ func TotalAvailableVRAM() (uint64, error) {
return 0, nil
}
// MinPerGPUVRAM returns the total VRAM of the SMALLEST GPU on the host (in
// bytes), or 0 when no per-device VRAM is known. Unlike TotalAvailableVRAM
// (which sums across devices) this reports a single device's ceiling, which is
// the right figure for decisions about what must fit on one card: the compute
// buffer (sized by n_ubatch) and the parallel-slot tier. Summing a multi-GPU
// host's VRAM over-provisions those into a per-device OOM (issue #10485).
//
// Unified-memory devices (GB10, Apple) report system RAM as their single
// device's VRAM, so they are unaffected.
func MinPerGPUVRAM() (uint64, error) {
// Prefer per-device binary detection (nvidia-smi/rocm-smi report true
// per-card VRAM); ghw's per-card memory can reflect NUMA node RAM on some
// hosts, which is why TotalAvailableVRAM treats it as a sum.
if infos := GetGPUMemoryUsage(); len(infos) > 0 {
if v := minNonZeroVRAM(infos); v > 0 {
return v, nil
}
}
// Fallback: ghw per-card memory, taking the minimum non-zero card.
if gpus, err := GPUs(); err == nil {
var min uint64
for _, gpu := range gpus {
if gpu == nil || gpu.Node == nil || gpu.Node.Memory == nil {
continue
}
if b := gpu.Node.Memory.TotalUsableBytes; b > 0 {
if u := uint64(b); min == 0 || u < min {
min = u
}
}
}
if min > 0 {
return min, nil
}
}
return 0, nil
}
// minNonZeroVRAM returns the smallest non-zero TotalVRAM across the given GPUs,
// or 0 when none report VRAM.
func minNonZeroVRAM(infos []GPUMemoryInfo) uint64 {
var min uint64
for _, g := range infos {
if g.TotalVRAM == 0 {
continue
}
if min == 0 || g.TotalVRAM < min {
min = g.TotalVRAM
}
}
return min
}
func HasGPU(vendor string) bool {
gpus, err := GPUs()
if err != nil {


@@ -1,37 +0,0 @@
package xsysinfo
import (
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("minNonZeroVRAM", func() {
const gib = uint64(1) << 30
It("returns the smallest device on a multi-GPU host", func() {
// Two unequal cards (e.g. RTX 5070 Ti + 5060 Ti, both 16 GiB, or a
// mixed pair): the smallest device is the per-card allocation ceiling.
infos := []GPUMemoryInfo{
{TotalVRAM: 16 * gib},
{TotalVRAM: 12 * gib},
}
Expect(minNonZeroVRAM(infos)).To(Equal(12 * gib))
})
It("ignores devices that report zero VRAM", func() {
infos := []GPUMemoryInfo{
{TotalVRAM: 0},
{TotalVRAM: 24 * gib},
}
Expect(minNonZeroVRAM(infos)).To(Equal(24 * gib))
})
It("returns the single device's VRAM on a one-GPU host", func() {
Expect(minNonZeroVRAM([]GPUMemoryInfo{{TotalVRAM: 16 * gib}})).To(Equal(16 * gib))
})
It("returns 0 when no device reports VRAM", func() {
Expect(minNonZeroVRAM([]GPUMemoryInfo{{TotalVRAM: 0}})).To(BeZero())
Expect(minNonZeroVRAM(nil)).To(BeZero())
})
})
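
Reviewer note: the specs above pin down the per-device ceiling that this compare drops. A minimal standalone sketch of why the smallest non-zero device, rather than the sum across devices, was treated as the ceiling for anything that must fit on one card. The type `gpu` and the helpers `sumVRAM` and `smallestNonZeroVRAM` are hypothetical stand-ins for illustration, not the package's API; only the min-non-zero rule itself is taken from the diff.

```go
package main

import "fmt"

// gpu is a hypothetical stand-in for the diff's GPUMemoryInfo;
// only the TotalVRAM field (bytes) is modelled here.
type gpu struct {
	TotalVRAM uint64
}

// sumVRAM adds VRAM across all devices. This is the "summed" view,
// which can overstate what a single card can hold.
func sumVRAM(gpus []gpu) uint64 {
	var total uint64
	for _, g := range gpus {
		total += g.TotalVRAM
	}
	return total
}

// smallestNonZeroVRAM follows the rule exercised by the specs above:
// skip devices reporting 0, return the smallest remaining value,
// or 0 when no device reports VRAM.
func smallestNonZeroVRAM(gpus []gpu) uint64 {
	var smallest uint64
	for _, g := range gpus {
		if g.TotalVRAM == 0 {
			continue
		}
		if smallest == 0 || g.TotalVRAM < smallest {
			smallest = g.TotalVRAM
		}
	}
	return smallest
}

func main() {
	const gib = uint64(1) << 30
	host := []gpu{{TotalVRAM: 16 * gib}, {TotalVRAM: 12 * gib}, {TotalVRAM: 0}}
	fmt.Printf("sum: %d GiB, per-device ceiling: %d GiB\n",
		sumVRAM(host)/gib, smallestNonZeroVRAM(host)/gib)
}
```

On this mixed host the summed figure is 28 GiB while the per-device ceiling is 12 GiB, which is the gap the removed comment ties to issue #10485.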


@@ -6,10 +6,11 @@ IMAGE_NAME="${IMAGE_NAME:-localai/llama-cpp-darwin}"
pushd backend/cpp/llama-cpp
# make llama-cpp-avx && \
# make llama-cpp-avx2 && \
# make llama-cpp-avx512 && \
make llama-cpp-fallback && \
# Single build via ggml CPU_ALL_VARIANTS: one binary plus the per-microarch Apple/arm
# dylibs (apple_m1/m2_m3/m4, armv8.x) that ggml selects at runtime. GGML_METAL stays ON
# and --target ggml also builds ggml-metal (via add_dependencies), so the Metal GPU
# backend is still produced as a loadable libggml-metal.dylib.
make llama-cpp-cpu-all && \
make llama-cpp-grpc && \
make llama-cpp-rpc-server
@@ -19,13 +20,24 @@ mkdir -p build/darwin
mkdir -p backend-images
mkdir -p build/darwin/lib
# cp -rf backend/cpp/llama-cpp/llama-cpp-avx build/darwin/
# cp -rf backend/cpp/llama-cpp/llama-cpp-avx2 build/darwin/
# cp -rf backend/cpp/llama-cpp/llama-cpp-avx512 build/darwin/
cp -rf backend/cpp/llama-cpp/llama-cpp-fallback build/darwin/
cp -rf backend/cpp/llama-cpp/llama-cpp-cpu-all build/darwin/
cp -rf backend/cpp/llama-cpp/llama-cpp-grpc build/darwin/
cp -rf backend/cpp/llama-cpp/llama-cpp-rpc-server build/darwin/
# Distribute the shared ggml/llama libraries from the CPU_ALL_VARIANTS build. Unlike the
# old fully-static fallback build, these have @rpath install names, so the otool loop below
# (which only copies deps that exist on disk) will not pick them up. The split is by suffix:
# - ggml emits its loadable backends (per-microarch CPU variants, metal, blas) with a .so
# suffix EVEN ON DARWIN. These go in the package ROOT next to the binary, because darwin
# run.sh execs the binary directly (no bundled ld.so) so ggml's executable-directory
# scan looks there.
# - the core libraries (libggml-base/libggml/libllama/libllama-common/libmtmd) use the
# platform .dylib suffix and are NEEDED deps; they go in lib/, resolved at load time via
# the DYLD_LIBRARY_PATH=lib that run.sh exports. -a preserves the version symlinks.
SHLIBS=backend/cpp/llama-cpp/ggml-shared-libs
cp -a $SHLIBS/*.so build/darwin/
cp -a $SHLIBS/*.dylib build/darwin/lib/
# Set default additional libs only for Darwin on M chips (arm64)
if [[ "$(uname -s)" == "Darwin" && "$(uname -m)" == "arm64" ]]; then
ADDITIONAL_LIBS=${ADDITIONAL_LIBS:-$(ls /opt/homebrew/Cellar/protobuf/**/lib/libutf8_validity*.dylib 2>/dev/null)}