Compare commits

...

12 Commits

Author SHA1 Message Date
Ettore Di Giacinto
4f7bf33b2d Merge origin/master into feat/darwin-vllm-metal
Resolve includeDarwin conflict in backend-matrix.yml: keep both the vllm and
the newly-merged liquid-audio darwin entries (additive).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
2026-06-24 22:15:55 +00:00
LocalAI [bot]
0d6de15ae9 fix(config): per-device VRAM headroom for Blackwell defaults (#10485) (#10494)
The hardware-tuned defaults from #10411 were measured on a GB10 / DGX Spark
(128 GiB unified memory) and over-provisioned multi-GPU consumer Blackwell
(e.g. 2x16 GiB RTX 50-series) into CUDA OOM during model init:

  - The Blackwell physical batch (512 -> 2048) sets both n_batch and n_ubatch.
    The compute buffer scales ~n_ubatch * n_ctx and is allocated PER DEVICE
    (it can't be split across GPUs), so a large context turns ub2048 into
    multi-GiB of scratch that must fit one 16 GiB card.
  - The VRAM-scaled parallel-slot default tiered off TotalAvailableVRAM(),
    which SUMS all GPUs (2x16 -> "32 GiB" -> 8 slots), but the allocations
    are per-device.

Make both decisions per-device and context-aware:

  - xsysinfo.MinPerGPUVRAM() reports the smallest device's VRAM; localGPU()
    uses it so the parallel tier and batch guard reason about one card.
  - PhysicalBatchForContext(gpu, ctx) raises the batch only when the extra
    compute buffer fits VRAM/4 at this model's context (16 GiB crosses over
    ~174k ctx, 32 GiB ~349k; GB10 reports system RAM so it still clears it).
  - Apply hardware defaults AFTER runBackendHooks in SetDefaults so the
    GGUF-guessed context is resolved before the batch decision.
  - The distributed router gates the node batch the same way.

Unified-memory devices (GB10, Apple) report system RAM as their single
device's VRAM, so they keep the prefill win.


Assisted-by: Claude:opus-4.8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-25 00:07:48 +02:00
Ettore Di Giacinto
5e3774dfe3 fix(vllm): fail Score cleanly when the engine returns no prompt_logprobs
Audit of the Score path against vllm-metal (MLX on macOS): the engine accepts
SamplingParams(prompt_logprobs=1) but returns an all-None prompt_logprobs list
rather than computing it, so scoring is not supported there. The old guard
treated the truthy [None] list as valid and silently scored every candidate as
0. Detect the all-None case and return UNIMPLEMENTED instead. No-op on
Linux/CUDA, which populate real entries.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
2026-06-24 21:31:41 +00:00
LocalAI [bot]
5c3d48ab50 feat(ui): usage & UX enhancements (last-used model, polling, starter models, usage cost, a11y) (#10496)
* feat(ui): remember last-used model per capability

ModelSelector auto-selected the first option whenever the bound value was
empty or stale, so every visit to the Home chat box, Image, TTS or Talk
pages reset the choice to whatever sorted first. Persist the user's pick
in localStorage keyed by capability and prefer it on auto-select when the
model is still available, falling back to the first option otherwise.

Because every modality picker funnels through ModelSelector, this fixes
the friction everywhere at once. External-options callers pass no
capability and keep the previous first-item behaviour.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(ui): add visibility-aware polling hook

The app had 26 hand-rolled setInterval polls, none of which paused when
the browser tab was hidden, so backgrounded dashboards kept hitting the
server every few seconds for data nobody was looking at.

Add usePolling: runs immediately, polls on a fixed interval, pauses while
document.hidden, fires a catch-up poll on return, and guards against
overlapping slow requests. Route useResources (the highest-frequency
shared poll) through it. Further callers can be migrated incrementally.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(ui): hardware-aware starter models on empty home

A fresh install dropped admins straight into a 1000+ model gallery with
no guidance. Add a StarterModels widget to the empty-state wizard that
recommends a small, curated set tuned to the detected hardware:

- CPU-only machines (no GPU VRAM) are steered to genuinely small models
  (1-4B, Q4) that stay responsive without a GPU.
- GPU machines get suggestions scaled to available VRAM.

Curated names are real gallery entries, intersected against the live
gallery at render time so a trimmed/custom gallery degrades gracefully.
Install is one click via the existing model-install API.

Also routes Home's cluster and system-info polls through usePolling so a
backgrounded home page stops fetching.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(ui): optional token-cost estimates on usage dashboard

The usage dashboard tracked tokens but had no monetary view. Multi-user
deployments that bill back or budget compute had to export and compute
cost elsewhere.

Add an opt-in pricing control: admins set $ per 1M prompt/completion
tokens (stored per-browser). When set, an estimated-cost summary card and
per-model / per-user cost columns appear, computed from recorded token
counts. The entire cost surface stays hidden until a price is entered, so
the default view is unchanged. Cost is clearly labelled an estimate -
LocalAI itself has no notion of price.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(ui): label icon-only send buttons for screen readers

The chat and agent-chat send buttons were a bare paper-plane icon with
no accessible name, so screen readers announced only "button". Add an
aria-label/title ("Send message") and mark the icon aria-hidden. An audit
of all icon-only buttons found these were the only two unlabeled controls;
the rest already carry visible text.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-24 23:30:08 +02:00
LocalAI [bot]
764b0352b9 docs: ⬆️ update docs version mudler/LocalAI (#10491)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-24 23:18:24 +02:00
LocalAI [bot]
75ba2daba1 chore(model-gallery): ⬆️ update checksum (#10495)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-24 23:18:04 +02:00
LocalAI [bot]
62b14fd635 feat(backends): add darwin/metal build for liquid-audio (#10486)
* feat(backends): add darwin/metal build for liquid-audio

Wire the already-MPS-ready liquid-audio backend (it ships
requirements-mps.txt) into the darwin CI matrix and the gallery so
metal-darwin-arm64 images are built and selectable.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* ci(liquid-audio): trigger darwin build via requirements-mps note

The changed-backends path filter only builds a backend when a file under
its directory changes. The metal wiring lived in index.yaml + the matrix,
so the darwin job was skipped. Add a documenting comment to the MPS
requirements so CI actually exercises the darwin build.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* fix(liquid-audio): guard uv-only --index-strategy for the pip/darwin path

Same fix as trl: the darwin/MPS build installs with pip (USE_PIP=true), which
rejects the uv-only --index-strategy flag and failed the darwin backend build.
Add it only on the uv path; Linux/CUDA resolution is unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-24 23:16:27 +02:00
Ettore Di Giacinto
bfb9a40d58 fix(vllm): fetch the vllm-metal wheel without the GitHub API
The darwin build resolved the wheel URL via api.github.com, whose
unauthenticated rate limit (60/hr per IP) 403s on shared macOS runners
(observed after the 9-min vLLM source build). Construct the release-asset
download URL deterministically from the pinned tag and the cp312/arm64 wheel
name instead - no API call, no rate limit. Verified the URL resolves (200).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
2026-06-24 20:54:30 +00:00
Ettore Di Giacinto
af7d0e8b40 chore(vllm): derive the darwin vLLM version, drop the second pin
Follow-up: VLLM_VERSION was still a hardcoded string duplicating what
VLLM_METAL_VERSION already determines. Derive it at install time from
vllm-metal's own installer (vllm_v=) at the pinned tag - one source of truth,
no second value to drift. The bumper now touches only VLLM_METAL_VERSION;
the derivation is immutable per tag, so builds stay reproducible.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
2026-06-24 20:25:07 +00:00
LocalAI [bot]
193d0e6aef fix(backends): darwin/metal support for supertonic (#10488)
The supertonic Go TTS backend dlopens ONNX Runtime, but its runtime and
packaging scripts were Linux-only: run.sh exported LD_LIBRARY_PATH, pointed
ONNXRUNTIME_LIB_PATH at libonnxruntime.so, and always tried the ld.so exec
path, while package.sh hard-failed on any non-Linux host. On macOS dyld has
no ld.so loader, uses DYLD_LIBRARY_PATH, and ONNX Runtime ships as a .dylib.

This applies the same purego .dylib/DYLD_LIBRARY_PATH fix that PR #10481
landed for 15 other ONNX/purego backends (sherpa-onnx, silero-vad, etc.) but
which omitted supertonic:

- run.sh: on darwin export DYLD_LIBRARY_PATH and point ONNXRUNTIME_LIB_PATH
  at libonnxruntime.dylib; guard the ld.so exec path to Linux only.
- package.sh: recognize Darwin instead of erroring out; the bundled .dylib is
  resolved via DYLD_LIBRARY_PATH, no glibc/ld.so to bundle.
- helper.go: platform-native default library extension (dylib on darwin) for
  the last-resort dlopen fallback.

It also wires the darwin CI build and gallery entries, resolving the
inconsistency where backend/index.yaml advertised metal for supertonic but no
includeDarwin matrix entry built the image:

- .github/backend-matrix.yml: add the -metal-darwin-arm64-supertonic Go entry.
- backend/index.yaml: declare metal capabilities and add the concrete
  metal-supertonic / metal-supertonic-development child entries.

The Makefile already detects Darwin/osx/arm64 and stages the per-OS ONNX
Runtime tarball, mirroring sherpa-onnx, so no Makefile change is required.


Assisted-by: Claude:opus-4.8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-24 22:19:03 +02:00
Ettore Di Giacinto
7743a0abc0 chore(vllm): track the darwin vllm-metal pin via the autobumper
The Apple Silicon build pinned vLLM 0.23.0 as a hidden string in install.sh
while floating the vllm-metal wheel on releases/latest - the two could drift
apart silently. Make both a tracked, reproducible pair (VLLM_METAL_VERSION +
VLLM_VERSION), fetch the wheel by tag, and add .github/bump_vllm_metal.sh wired
into bump_deps.yaml. It tracks vllm-project/vllm-metal (not vllm/vllm latest),
reading the coupled vLLM source version from vllm-metal's own installer, and
opens a bump PR - mirroring the existing bump_vllm_wheel.sh for the cu130 wheel.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
2026-06-24 20:03:14 +00:00
Ettore Di Giacinto
3447b28bbd feat(vllm): macOS/Metal support via vllm-metal (MLX)
Add an additive Apple-Silicon path to the existing vllm Python backend so
vLLM runs on macOS via vllm-metal (github.com/vllm-project/vllm-metal).

Spike outcome (proven on a real M4 / macOS 26.5, Qwen3-0.6B):
- vllm-metal registers through vLLM's platform-plugin entry point
  (metal -> vllm_metal:register); MetalPlatform activates and runs on the
  GPU through MLX.
- LocalAI's backend.py is UNCHANGED: AsyncEngineArgs(...) ->
  AsyncLLMEngine.from_engine_args transparently resolves to vLLM 0.23's v1
  AsyncLLM MLX engine, and async generate produced correct output.
- backend.py is NOT touched: its only empty_cache() call is CUDA-only
  (guarded by torch.cuda.is_available()), so the benign shutdown-only
  "Allocator for mps is not a DeviceAllocator" noise comes from vLLM's
  internal EngineCore teardown, not from our code.

Changes (all gated behind a darwin condition; Linux/CUDA/ROCm/Intel paths
are byte-for-byte unchanged):
- install.sh: darwin branch forces PYTHON_VERSION=3.12 (vllm-metal
  requirement), creates/activates LocalAI's managed venv via ensureVenv,
  then reproduces vllm-metal's installer INTO that venv (build vLLM 0.23.0
  from the release source tarball against requirements/cpu.txt, then install
  the prebuilt vllm-metal wheel from its latest GitHub release), and runs
  runProtogen. installRequirements is skipped on darwin.
- backend-matrix.yml: add a vllm includeDarwin entry (mps, python).
- index.yaml: add metal capability + concrete metal-vllm /
  metal-vllm-development child entries mirroring the metal-kitten-tts
  template.

Version coupling: vllm-metal pins vLLM 0.23.0, equal to LocalAI's current
vllm pin. Bumping vllm must be coordinated with a supporting vllm-metal
release; documented in install.sh and requirements-cublas13-after.txt.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
2026-06-24 17:17:50 +00:00
33 changed files with 952 additions and 114 deletions

View File

@@ -4974,6 +4974,16 @@ includeDarwin:
- backend: "kitten-tts"
tag-suffix: "-metal-darwin-arm64-kitten-tts"
build-type: "mps"
# vLLM on Apple Silicon via vllm-metal (MLX). The install is custom
# (backend/python/vllm/install.sh has a darwin branch); lang stays python so
# backend_build_darwin.yml drives it through build-darwin-python-backend ->
# scripts/build/python-darwin.sh, which runs the backend's install.sh.
- backend: "vllm"
tag-suffix: "-metal-darwin-arm64-vllm"
build-type: "mps"
- backend: "liquid-audio"
tag-suffix: "-metal-darwin-arm64-liquid-audio"
build-type: "mps"
- backend: "piper"
tag-suffix: "-metal-darwin-arm64-piper"
build-type: "metal"
@@ -4990,6 +5000,10 @@ includeDarwin:
tag-suffix: "-metal-darwin-arm64-sherpa-onnx"
build-type: "metal"
lang: "go"
- backend: "supertonic"
tag-suffix: "-metal-darwin-arm64-supertonic"
build-type: "metal"
lang: "go"
- backend: "local-store"
tag-suffix: "-metal-darwin-arm64-local-store"
build-type: "metal"

55
.github/bump_vllm_metal.sh vendored Executable file
View File

@@ -0,0 +1,55 @@
#!/bin/bash
# Bump the single vllm-metal pin (VLLM_METAL_VERSION) in the vLLM backend's
# darwin (Apple Silicon) install path. The macOS/Metal build
# (backend/python/vllm/install.sh, Darwin branch) installs vllm-metal, which is
# version-locked to a specific vLLM source release. install.sh derives that vLLM
# version at build time from vllm-metal's own installer (`vllm_v=`) at the pinned
# tag, so there is only ONE value to bump here -- mirroring bump_vllm_wheel.sh,
# which bumps the Linux cu130 wheel pin.
#
# This deliberately tracks vllm-project/vllm-metal, NOT vllm-project/vllm: the
# darwin build can only use the exact vLLM version vllm-metal supports, so it may
# lag the Linux pin (requirements-cublas13-after.txt) until vllm-metal catches up.
set -xe
REPO=$1 # vllm-project/vllm-metal
FILE=$2 # backend/python/vllm/install.sh
VAR=$3 # VLLM_METAL_VERSION (used for the workflow's output file names)
if [ -z "$FILE" ] || [ -z "$REPO" ] || [ -z "$VAR" ]; then
echo "usage: $0 <repo> <install-file> <var-name>" >&2
exit 1
fi
# vllm-metal ships frequent dev releases, all flagged as non-prerelease, so
# /releases/latest returns the newest one (with its cp312 wheel asset).
LATEST_TAG=$(curl -sS -H "Accept: application/vnd.github+json" \
"https://api.github.com/repos/$REPO/releases/latest" \
| python3 -c "import json,sys; print(json.load(sys.stdin)['tag_name'])")
# The coupled vLLM source version lives in vllm-metal's installer at that tag.
NEW_VLLM_VERSION=$(curl -fsSL \
"https://raw.githubusercontent.com/$REPO/$LATEST_TAG/install.sh" \
| grep -oE 'vllm_v="[0-9]+\.[0-9]+\.[0-9]+"' | head -1 | cut -d'"' -f2)
if [ -z "$LATEST_TAG" ] || [ -z "$NEW_VLLM_VERSION" ]; then
echo "Could not resolve vllm-metal tag ($LATEST_TAG) or its vllm_v ($NEW_VLLM_VERSION)." >&2
exit 1
fi
set +e
CURRENT_TAG=$(grep -oE 'VLLM_METAL_VERSION="[^"]*"' "$FILE" | head -1 | cut -d'"' -f2)
set -e
# Rewrite the single pin. install.sh derives VLLM_VERSION from this tag at build
# time, so there is nothing else to touch. peter-evans/create-pull-request opens
# no PR on a clean tree, so a no-op rewrite (already current) is safe.
sed -i "$FILE" \
-e "s|VLLM_METAL_VERSION=\"[^\"]*\"|VLLM_METAL_VERSION=\"$LATEST_TAG\"|"
if [ -z "$CURRENT_TAG" ]; then
echo "Could not find VLLM_METAL_VERSION=\"...\" in $FILE." >&2
exit 0
fi
echo "vllm-metal ${CURRENT_TAG} -> ${LATEST_TAG} (builds vLLM ${NEW_VLLM_VERSION}): https://github.com/$REPO/releases/tag/${LATEST_TAG}" >> "${VAR}_message.txt"
echo "${LATEST_TAG}" >> "${VAR}_commit.txt"

View File

@@ -154,3 +154,39 @@ jobs:
branch: "update/VLLM_VERSION"
body: ${{ steps.bump.outputs.message }}
signoff: true
bump-vllm-metal:
# The darwin (Apple Silicon) vLLM build installs vllm-metal, which is locked
# to a specific vLLM source release. install.sh pins both VLLM_METAL_VERSION
# (the wheel release) and VLLM_VERSION (the vLLM it builds against); this job
# tracks vllm-project/vllm-metal and rewrites both atomically. Separate from
# bump-vllm-wheel because darwin follows vllm-metal, not vllm/vllm latest.
if: github.repository == 'mudler/LocalAI'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v7
- name: Bump vllm-metal pin 🔧
id: bump
run: |
bash .github/bump_vllm_metal.sh vllm-project/vllm-metal backend/python/vllm/install.sh VLLM_METAL_VERSION
{
echo 'message<<EOF'
cat "VLLM_METAL_VERSION_message.txt"
echo EOF
} >> "$GITHUB_OUTPUT"
{
echo 'commit<<EOF'
cat "VLLM_METAL_VERSION_commit.txt"
echo EOF
} >> "$GITHUB_OUTPUT"
rm -rfv VLLM_METAL_VERSION_message.txt VLLM_METAL_VERSION_commit.txt
- name: Create Pull Request
uses: peter-evans/create-pull-request@v8
with:
token: ${{ secrets.UPDATE_BOT_TOKEN }}
push-to-fork: ci-forks/LocalAI
commit-message: ':arrow_up: Update vllm-project/vllm-metal (darwin)'
title: 'chore: :arrow_up: Update vllm-metal (darwin) to `${{ steps.bump.outputs.commit }}`'
branch: "update/VLLM_METAL_VERSION"
body: ${{ steps.bump.outputs.message }}
signoff: true

View File

@@ -16,6 +16,7 @@ import (
"os"
"path/filepath"
"regexp"
"runtime"
"strings"
"time"
"unicode"
@@ -943,7 +944,13 @@ func InitializeONNXRuntime() error {
}
}
if libPath == "" {
libPath = "/usr/local/lib/libonnxruntime.so"
// LocalAI: default to the platform-native shared library
// extension when nothing else is found (dyld vs ld.so).
if runtime.GOOS == "darwin" {
libPath = "/usr/local/lib/libonnxruntime.dylib"
} else {
libPath = "/usr/local/lib/libonnxruntime.so"
}
}
}
ort.SetSharedLibraryPath(libPath)

View File

@@ -32,6 +32,10 @@ elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ $(uname -s) = "Darwin" ]; then
# macOS: dyld resolves the bundled .dylib via DYLD_LIBRARY_PATH (set in
# run.sh); there is no ld.so loader nor glibc to bundle.
echo "Detected Darwin"
else
echo "Error: Could not detect architecture"
exit 1

View File

@@ -3,12 +3,19 @@ set -ex
CURDIR=$(dirname "$(realpath $0)")
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
export ONNXRUNTIME_LIB_PATH=$CURDIR/lib/libonnxruntime.so
if [ "$(uname)" = "Darwin" ]; then
# macOS uses dyld: there is no ld.so loader, and the search path env
# var is DYLD_LIBRARY_PATH. ONNX Runtime ships as a .dylib here.
export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
export ONNXRUNTIME_LIB_PATH=$CURDIR/lib/libonnxruntime.dylib
else
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
export ONNXRUNTIME_LIB_PATH=$CURDIR/lib/libonnxruntime.so
if [ -f $CURDIR/lib/ld.so ]; then
echo "Using lib/ld.so"
exec $CURDIR/lib/ld.so $CURDIR/supertonic "$@"
if [ -f $CURDIR/lib/ld.so ]; then
echo "Using lib/ld.so"
exec $CURDIR/lib/ld.so $CURDIR/supertonic "$@"
fi
fi
exec $CURDIR/supertonic "$@"

View File

@@ -645,6 +645,7 @@
nvidia-cuda-13: "cuda13-vllm"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm"
cpu: "cpu-vllm"
metal: "metal-vllm"
- &sglang
name: "sglang"
license: apache-2.0
@@ -1284,6 +1285,7 @@
nvidia-cuda-13: "cuda13-liquid-audio"
nvidia-cuda-12: "cuda12-liquid-audio"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-liquid-audio"
metal: "metal-liquid-audio"
icon: https://cdn-avatars.huggingface.co/v1/production/uploads/61b8e2ba285851687028d395/7_6D7rWrLxp2hb6OHSV1p.png
- &qwen-tts
urls:
@@ -1569,6 +1571,7 @@
- TTS
capabilities:
default: "cpu-supertonic"
metal: "metal-supertonic"
- !!merge <<: *neutts
name: "neutts-development"
capabilities:
@@ -2927,6 +2930,17 @@
nvidia-cuda-13: "cuda13-vllm-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm-development"
cpu: "cpu-vllm-development"
metal: "metal-vllm-development"
- !!merge <<: *vllm
name: "metal-vllm"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-vllm"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-vllm
- !!merge <<: *vllm
name: "metal-vllm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-vllm"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-vllm
- !!merge <<: *vllm
name: "cuda12-vllm"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-vllm"
@@ -4612,6 +4626,7 @@
nvidia-cuda-13: "cuda13-liquid-audio-development"
nvidia-cuda-12: "cuda12-liquid-audio-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-liquid-audio-development"
metal: "metal-liquid-audio-development"
- !!merge <<: *liquid-audio
name: "cpu-liquid-audio"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-liquid-audio"
@@ -4622,6 +4637,16 @@
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-liquid-audio"
mirrors:
- localai/localai-backends:master-cpu-liquid-audio
- !!merge <<: *liquid-audio
name: "metal-liquid-audio"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-liquid-audio"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-liquid-audio
- !!merge <<: *liquid-audio
name: "metal-liquid-audio-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-liquid-audio"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-liquid-audio
- !!merge <<: *liquid-audio
name: "cuda12-liquid-audio"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-liquid-audio"
@@ -5484,6 +5509,7 @@
name: "supertonic-development"
capabilities:
default: "cpu-supertonic-development"
metal: "metal-supertonic-development"
- !!merge <<: *supertonic
name: "cpu-supertonic"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-supertonic"
@@ -5494,3 +5520,13 @@
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-supertonic"
mirrors:
- localai/localai-backends:master-cpu-supertonic
- !!merge <<: *supertonic
name: "metal-supertonic"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-supertonic"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-supertonic
- !!merge <<: *supertonic
name: "metal-supertonic-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-supertonic"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-supertonic

View File

@@ -14,5 +14,11 @@ else
fi
# liquid-audio's torch wheels are large; allow upgrades to satisfy transitive pins
EXTRA_PIP_INSTALL_FLAGS+=" --upgrade --index-strategy=unsafe-first-match"
EXTRA_PIP_INSTALL_FLAGS+=" --upgrade"
# --index-strategy is a uv-only flag. The darwin/MPS build installs with pip
# (USE_PIP=true in scripts/build/python-darwin.sh), which rejects it. Only add
# it on the uv path; Linux/CUDA resolution is unchanged.
if [ "x${USE_PIP:-}" != "xtrue" ]; then
EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-first-match"
fi
installRequirements

View File

@@ -1,3 +1,4 @@
# MPS (Apple Silicon / Metal) build profile - installed by the darwin CI job.
torch>=2.8.0
torchaudio>=2.8.0
torchcodec>=0.9.1

View File

@@ -457,9 +457,14 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
except Exception:
pass
if last_output is None or not getattr(last_output, "prompt_logprobs", None):
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details("vLLM did not return prompt_logprobs")
_pl = getattr(last_output, "prompt_logprobs", None) if last_output is not None else None
# Some engines accept the prompt_logprobs request but return a
# list of all-None entries instead of computing them (observed
# with vllm-metal's MLX backend on macOS). Treat that as
# unsupported rather than silently scoring every candidate as 0.
if not _pl or all(e is None for e in _pl):
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details("This backend did not return prompt_logprobs; scoring is unsupported on this engine (e.g. vllm-metal / MLX on macOS).")
return backend_pb2.ScoreResponse()
prompt_logprobs = last_output.prompt_logprobs

View File

@@ -43,6 +43,24 @@ if [ "x${BUILD_PROFILE}" == "xcublas13" ]; then
EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi
# Apple Silicon (Metal/MLX) via vllm-metal.
# vllm-metal (github.com/vllm-project/vllm-metal) brings vLLM to macOS on Apple
# Silicon: it registers through vLLM's platform-plugin entry point
# (metal -> vllm_metal:register), MetalPlatform activates, and the vLLM v1
# AsyncLLM engine runs on the GPU through MLX. LocalAI's backend.py is UNCHANGED
# on darwin — AsyncEngineArgs(...) -> AsyncLLMEngine.from_engine_args transparently
# resolves to the MLX engine (proven on a real M4 / macOS 26.5 against Qwen3-0.6B).
#
# vllm-metal REQUIRES Python 3.12, so force the portable CPython before the venv
# is created (ensureVenv reads PYTHON_VERSION/PYTHON_PATCH/PY_STANDALONE_TAG).
# The patch + standalone tag mirror the l4t13 cp312 pin — a known-good
# python-build-standalone release that also ships an aarch64-apple-darwin asset.
if [ "$(uname -s)" = "Darwin" ]; then
PYTHON_VERSION="3.12"
PYTHON_PATCH="12"
PY_STANDALONE_TAG="20251120"
fi
# JetPack 7 / L4T arm64 vllm + torch wheels come straight from PyPI now
# (torch 2.11+ ships aarch64 + cu130 manylinux wheels and vllm 0.20+ ships
# an aarch64 wheel pinned to that torch). They're cp312-only, so bump the
@@ -57,11 +75,87 @@ if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
PY_STANDALONE_TAG="20251120"
fi
# ===================== Apple Silicon (Metal/MLX) =====================
# Reproduce vllm-metal's upstream installer
# (curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm-metal/main/install.sh)
# but INTO LocalAI's managed venv (ensureVenv) instead of a throwaway
# ~/.venv-vllm-metal, so the backend integrates with LocalAI's venv lifecycle
# (portable CPython, _makeVenvPortable relocation, runtime activation). The
# normal CUDA/CPU installRequirements is skipped on darwin — there is no
# macOS/arm64 vLLM wheel on PyPI; vLLM is built from source and the MLX engine
# is layered on by the vllm-metal wheel.
if [ "$(uname -s)" = "Darwin" ]; then
# Create/activate the portable 3.12 venv. On darwin USE_PIP=true and
# PORTABLE_PYTHON=true (set by scripts/build/python-darwin.sh), so this is a
# `python -m venv` based, relocatable venv.
ensureVenv
# vllm-metal's installer drives everything through `uv`: building vLLM from
# the CPU requirements needs `--index-strategy unsafe-best-match` (mixes the
# pytorch CPU channel with PyPI), a flag plain pip does not have. The darwin
# venv is pip-based, so bootstrap uv into it. uv honours $VIRTUAL_ENV (set by
# libbackend's _activateVenv) and installs into THIS venv — same pattern the
# intel branch below relies on.
pip install uv
# The ONLY darwin version pin -- AUTO-BUMPED by .github/bump_vllm_metal.sh,
# which tracks vllm-project/vllm-metal releases (NOT vllm/vllm latest). Keep
# it as a plain double-quoted assignment on its own line so the bumper's sed
# can rewrite it. Darwin therefore follows vllm-metal and can lag the Linux
# vllm pin (requirements-cublas13-after.txt, bumped independently against
# vllm/vllm) until vllm-metal supports a newer vLLM.
VLLM_METAL_VERSION="v0.3.0.dev20260622062346"
# The coupled vLLM source version is whatever this vllm-metal release builds
# against -- it declares it in its own installer as `vllm_v=`. Derive it from
# the PINNED tag rather than hardcoding a second value that could drift. The
# tag is immutable, so this stays reproducible across rebuilds.
VLLM_VERSION=$(curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm-metal/${VLLM_METAL_VERSION}/install.sh" \
| grep -oE 'vllm_v="[0-9]+\.[0-9]+\.[0-9]+"' | head -n1 | cut -d'"' -f2)
if [ -z "${VLLM_VERSION}" ]; then
echo "ERROR: could not derive the vLLM version from vllm-metal ${VLLM_METAL_VERSION}" >&2
exit 1
fi
echo "vllm-metal ${VLLM_METAL_VERSION} builds against vLLM ${VLLM_VERSION}"
_vllm_src=$(mktemp -d)
trap 'rm -rf "${_vllm_src}"' EXIT
pushd "${_vllm_src}"
# 1) Build vLLM ${VLLM_VERSION} from the release source tarball against
# the CPU requirements. vllm-metal layers its MLX platform plugin on
# top of this exact build.
curl -fsSL -o "vllm-${VLLM_VERSION}.tar.gz" \
"https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}.tar.gz"
tar -xzf "vllm-${VLLM_VERSION}.tar.gz"
pushd "vllm-${VLLM_VERSION}"
uv pip install -r requirements/cpu.txt --index-strategy unsafe-best-match
# -Wno-parentheses: clang on macOS treats one of vLLM's C++ warnings
# as an error without it (matches the upstream installer's CXXFLAGS).
CXXFLAGS="-Wno-parentheses" uv pip install .
popd
popd
# 2) Install the prebuilt vllm-metal wheel for the PINNED release. It pulls
# mlx / mlx-metal as deps and registers the `metal` platform plugin that
# backend.py resolves to at engine-init time. Build the release-asset URL
# deterministically (tag + the cp312/arm64 wheel name) rather than querying
# api.github.com, whose unauthenticated rate limit (60/hr per IP) 403s on
# shared CI runners. The wheel version is the tag without its leading 'v'.
_metal_wheel="vllm_metal-${VLLM_METAL_VERSION#v}-cp312-cp312-macosx_11_0_arm64.whl"
_metal_wheel_url="https://github.com/vllm-project/vllm-metal/releases/download/${VLLM_METAL_VERSION}/${_metal_wheel}"
echo "Installing vllm-metal wheel: ${_metal_wheel_url}"
uv pip install "${_metal_wheel_url}"
# Generate the gRPC stubs (backend_pb2*). installRequirements normally does
# this via runProtogen at the end; we skipped installRequirements on darwin,
# so call it explicitly here.
runProtogen
# Intel XPU has no upstream-published vllm wheels, so we always build vllm
# from source against torch-xpu and replace the default triton with
# triton-xpu (matching torch 2.11). Mirrors the upstream procedure:
# https://github.com/vllm-project/vllm/blob/main/docs/getting_started/installation/gpu.xpu.inc.md
if [ "x${BUILD_TYPE}" == "xintel" ]; then
elif [ "x${BUILD_TYPE}" == "xintel" ]; then
# Hide requirements-intel-after.txt so installRequirements doesn't
# try `pip install vllm` (would either fail or grab a non-XPU wheel).
_intel_after="${backend_dir}/requirements-intel-after.txt"

View File

@@ -4,4 +4,7 @@
# instead — the cublas13 case in install.sh adds --index-strategy=unsafe-best-match
# so uv consults this index alongside PyPI.
--extra-index-url https://wheels.vllm.ai/0.23.0/cu130
# VERSION COUPLING: darwin/Apple-Silicon builds use vllm-metal (see install.sh),
# which pins this exact vLLM version. Bumping vllm here means coordinating with a
# vllm-metal release that supports the new version, or macOS/Metal builds break.
vllm==0.23.0

View File

@@ -54,8 +54,35 @@ func (g GPU) IsNVIDIABlackwell() bool {
return maj >= 12
}
// Compute-buffer headroom guard for the raised physical batch.
//
// Raising n_ubatch grows the CUDA *compute buffer* (the scratch for the forward
// graph), which is allocated PER DEVICE — it does not benefit from a second GPU
// the way weights or KV (which are split across devices) do. The buffer scales
// ~linearly with n_ubatch * n_ctx, so a large context turns the GB10-tuned
// ub2048 into multi-GiB of extra scratch that must fit on a SINGLE card. On a
// 16 GiB consumer Blackwell with a 200k context that overflows (issue #10485),
// even though the GB10 it was measured on (128 GiB unified memory) had room.
//
// These constants size a conservative guard: only raise the batch when the
// extra scratch fits the per-device VRAM ceiling.
const (
// computeBufferBytesPerCell approximates the CUDA compute-buffer cost of one
// (n_ubatch * n_ctx) cell. Derived from an observed allocation (ub2048 *
// ctx204800 ~= 4.5 GiB => ~11 B/cell) and rounded up to 16 for margin, since
// the real cost also grows with model width (heads / embedding dim) which we
// don't know at config time.
computeBufferBytesPerCell = 16
// blackwellBatchHeadroomDivisor caps the extra compute buffer from raising the
// physical batch at VRAM/divisor. /4 keeps the bulk of a device for weights +
// KV, which already dominate VRAM use.
blackwellBatchHeadroomDivisor = 4
)
// PhysicalBatch returns the canonical physical batch (n_batch/n_ubatch) for the
// given hardware, used when the model config leaves batch unset.
// given hardware class, ignoring context/VRAM headroom. Use
// PhysicalBatchForContext when a model context and per-device VRAM are known
// (the load paths) so the raised batch can't overflow a single device.
func PhysicalBatch(g GPU) int {
if g.IsNVIDIABlackwell() {
return BlackwellPhysicalBatch
@@ -63,6 +90,32 @@ func PhysicalBatch(g GPU) int {
return DefaultPhysicalBatch
}
// PhysicalBatchForContext is PhysicalBatch gated on per-device VRAM headroom for
// the given context: it only raises the batch above the conservative default
// when the extra compute buffer (which is allocated on a single device and grows
// with n_ubatch * n_ctx) fits within blackwellBatchHeadroomDivisor of the GPU's
// VRAM. g.VRAM must be the PER-DEVICE ceiling (the smallest device on a
// multi-GPU host), not the summed total — the compute buffer can't be split.
//
// VRAM 0 (unknown) stays conservative rather than risk a per-device OOM; the
// GB10 / unified-memory path reports system RAM, so it still clears the guard.
func PhysicalBatchForContext(g GPU, ctx int) int {
if !g.IsNVIDIABlackwell() {
return DefaultPhysicalBatch
}
if ctx <= 0 {
ctx = DefaultContextSize
}
if g.VRAM == 0 {
return DefaultPhysicalBatch
}
extra := uint64(ctx) * uint64(BlackwellPhysicalBatch-DefaultPhysicalBatch) * computeBufferBytesPerCell
if extra <= g.VRAM/blackwellBatchHeadroomDivisor {
return BlackwellPhysicalBatch
}
return DefaultPhysicalBatch
}
// IsManagedPhysicalBatch reports whether n is a value PhysicalBatch assigns.
// Callers that re-tune a value chosen by an upstream host (the distributed
// router correcting the frontend's guess) use this to avoid clobbering an
@@ -122,7 +175,12 @@ func hasParallelOption(opts []string) bool {
// deterministic device — detection does a live nvidia-smi call.
var localGPU = func() GPU {
vendor, _ := xsysinfo.DetectGPUVendor()
vram, _ := xsysinfo.TotalAvailableVRAM()
// Use the SMALLEST device's VRAM, not the summed total: the parallel-slot
// tier and the batch headroom guard both reason about what fits on a single
// card, and per-device compute buffers can't be split across GPUs. Summing
// two 16 GiB cards into "32 GiB" is what over-provisioned multi-GPU hosts
// into OOM (issue #10485).
vram, _ := xsysinfo.MinPerGPUVRAM()
return GPU{
Vendor: vendor,
ComputeCapability: xsysinfo.NVIDIAComputeCapability(),
@@ -137,10 +195,20 @@ func ApplyHardwareDefaults(cfg *ModelConfig, gpu GPU) {
if cfg == nil {
return
}
if cfg.Batch == 0 && gpu.IsNVIDIABlackwell() {
cfg.Batch = BlackwellPhysicalBatch
xlog.Debug("[hardware_defaults] Blackwell GPU: defaulting physical batch",
"batch", cfg.Batch, "compute_cap", gpu.ComputeCapability)
// Raise the physical batch on Blackwell only when the resulting compute
// buffer fits the per-device VRAM at THIS model's context. Leaving Batch at 0
// (rather than writing the default 512) preserves the downstream single-pass
// sizing in core/backend.EffectiveBatchSize for embedding/score/rerank.
if cfg.Batch == 0 {
ctx := DefaultContextSize
if cfg.ContextSize != nil {
ctx = *cfg.ContextSize
}
if PhysicalBatchForContext(gpu, ctx) == BlackwellPhysicalBatch {
cfg.Batch = BlackwellPhysicalBatch
xlog.Debug("[hardware_defaults] Blackwell GPU: defaulting physical batch",
"batch", cfg.Batch, "compute_cap", gpu.ComputeCapability, "context", ctx, "vram_gib", gpu.VRAM>>30)
}
}
// Enable concurrent serving by default on a capable GPU: without this the

View File

@@ -9,26 +9,37 @@ import (
// GPU. The detection seam (localGPU) is injected so the path is deterministic
// without a real GPU.
var _ = Describe("SetDefaults hardware defaults (single-instance)", func() {
const gib = uint64(1) << 30
var orig func() GPU
BeforeEach(func() { orig = localGPU })
AfterEach(func() { localGPU = orig })
It("sets the physical batch on a local Blackwell GPU", func() {
localGPU = func() GPU { return GPU{ComputeCapability: "12.1"} }
It("sets the physical batch on a local Blackwell GPU with headroom", func() {
localGPU = func() GPU { return GPU{ComputeCapability: "12.1", VRAM: 119 * gib} }
cfg := &ModelConfig{}
cfg.SetDefaults()
Expect(cfg.Batch).To(Equal(BlackwellPhysicalBatch))
})
It("leaves batch unset when a large context would overflow the device", func() {
// Regression guard for issue #10485: 16 GiB consumer Blackwell + ~200k ctx.
localGPU = func() GPU { return GPU{ComputeCapability: "12.0", VRAM: 16 * gib} }
ctx := 204800
cfg := &ModelConfig{LLMConfig: LLMConfig{ContextSize: &ctx}}
cfg.SetDefaults()
Expect(cfg.Batch).To(Equal(0))
})
It("leaves batch unset on a non-Blackwell local GPU", func() {
localGPU = func() GPU { return GPU{ComputeCapability: "8.9"} }
localGPU = func() GPU { return GPU{ComputeCapability: "8.9", VRAM: 119 * gib} }
cfg := &ModelConfig{}
cfg.SetDefaults()
Expect(cfg.Batch).To(Equal(0))
})
It("never overrides an explicit batch", func() {
localGPU = func() GPU { return GPU{ComputeCapability: "12.1"} }
localGPU = func() GPU { return GPU{ComputeCapability: "12.1", VRAM: 119 * gib} }
cfg := &ModelConfig{}
cfg.Batch = 1024
cfg.SetDefaults()

View File

@@ -7,6 +7,8 @@ import (
)
var _ = Describe("Hardware-driven config defaults", func() {
const gib = uint64(1) << 30
DescribeTable("GPU.IsNVIDIABlackwell (sm_12x consumer family)",
func(cc string, want bool) {
Expect(GPU{ComputeCapability: cc}.IsNVIDIABlackwell()).To(Equal(want))
@@ -35,21 +37,54 @@ var _ = Describe("Hardware-driven config defaults", func() {
})
})
Describe("PhysicalBatchForContext (per-device VRAM headroom)", func() {
It("raises the batch when the compute buffer fits the device", func() {
// 16 GiB Blackwell with a small context: the extra scratch is tiny.
Expect(PhysicalBatchForContext(GPU{ComputeCapability: "12.0", VRAM: 16 * gib}, 8192)).
To(Equal(BlackwellPhysicalBatch))
})
It("keeps the default batch when a large context would overflow one device", func() {
// The issue #10485 case: 16 GiB consumer Blackwell, ~200k context.
Expect(PhysicalBatchForContext(GPU{ComputeCapability: "12.0", VRAM: 16 * gib}, 204800)).
To(Equal(DefaultPhysicalBatch))
})
It("still raises the batch on a large unified-memory device (GB10)", func() {
// GB10 reports system RAM (~119 GiB) as its single device's VRAM.
Expect(PhysicalBatchForContext(GPU{ComputeCapability: "12.1", VRAM: 119 * gib}, 204800)).
To(Equal(BlackwellPhysicalBatch))
})
It("stays conservative when VRAM is unknown", func() {
Expect(PhysicalBatchForContext(GPU{ComputeCapability: "12.1"}, 8192)).
To(Equal(DefaultPhysicalBatch))
})
It("never raises the batch on non-Blackwell", func() {
Expect(PhysicalBatchForContext(GPU{ComputeCapability: "9.0", VRAM: 80 * gib}, 8192)).
To(Equal(DefaultPhysicalBatch))
})
})
Describe("ApplyHardwareDefaults", func() {
It("raises an unset batch to 2048 on Blackwell", func() {
It("raises an unset batch to 2048 on Blackwell with headroom", func() {
cfg := &ModelConfig{}
ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.1"})
ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.1", VRAM: 119 * gib})
Expect(cfg.Batch).To(Equal(BlackwellPhysicalBatch))
})
It("leaves batch unset when a large context would overflow one device", func() {
// Regression guard for issue #10485: 16 GiB card + ~200k context.
ctx := 204800
cfg := &ModelConfig{LLMConfig: LLMConfig{ContextSize: &ctx}}
ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.0", VRAM: 16 * gib})
Expect(cfg.Batch).To(Equal(0))
})
It("leaves batch unset on non-Blackwell", func() {
cfg := &ModelConfig{}
ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "9.0"})
ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "9.0", VRAM: 119 * gib})
Expect(cfg.Batch).To(Equal(0))
})
It("never overrides an explicit batch", func() {
cfg := &ModelConfig{}
cfg.Batch = 1024
ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.1"})
ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.1", VRAM: 119 * gib})
Expect(cfg.Batch).To(Equal(1024))
})
It("no-ops on nil", func() {
@@ -57,8 +92,6 @@ var _ = Describe("Hardware-driven config defaults", func() {
})
})
const gib = uint64(1) << 30
DescribeTable("DefaultParallelSlots (by VRAM)",
func(vramGiB uint64, want int) {
Expect(DefaultParallelSlots(GPU{VRAM: vramGiB * gib})).To(Equal(want))

View File

@@ -1204,11 +1204,6 @@ func (cfg *ModelConfig) SetDefaults(opts ...ConfigLoaderOption) {
// This ensures gallery-installed and runtime-loaded models get optimal parameters.
ApplyInferenceDefaults(cfg, cfg.Name, cfg.Model)
// Apply hardware-driven defaults (e.g. a larger physical batch on Blackwell).
// Uses the local GPU here; in distributed mode the router re-applies the same
// heuristics for the selected node's GPU before loading. Explicit config wins.
ApplyHardwareDefaults(cfg, localGPU())
// Apply serving-policy defaults (device-independent): cross-request prefix
// caching. Propagates to distributed nodes via the model options.
ApplyServingDefaults(cfg)
@@ -1247,6 +1242,16 @@ func (cfg *ModelConfig) SetDefaults(opts ...ConfigLoaderOption) {
cfg.ContextSize = &ctx
}
runBackendHooks(cfg, lo.modelPath)
// Apply hardware-driven defaults (e.g. a larger physical batch on Blackwell)
// LAST, after the context size is fully resolved (explicit config, LoadOptions,
// then the GGUF guess inside runBackendHooks): the Blackwell batch guard sizes
// the per-device compute buffer against this model's context, so it must see
// the final value, not a pre-guess nil. Uses the local GPU here; in distributed
// mode the router re-applies the same heuristics for the selected node's GPU
// before loading. Explicit config always wins.
ApplyHardwareDefaults(cfg, localGPU())
cfg.syncKnownUsecasesFromString()
}

View File

@@ -86,6 +86,7 @@
"input": {
"placeholder": "Message...",
"attachFile": "Attach file",
"send": "Send message",
"stopGenerating": "Stop generating",
"canvasTitle": "Canvas — extract code blocks and media into a side panel for preview, copy, and download",
"canvasLabel": "Canvas",

View File

@@ -77,6 +77,20 @@
"noModelsTitle": "No Models Available",
"noModelsBody": "There are no models installed yet. Ask your administrator to set up models so you can start chatting."
},
"starters": {
"title": "Recommended for your hardware",
"tier": {
"cpu": "CPU-only",
"gpu-small": "GPU",
"gpu-large": "GPU"
},
"cpuNote": "No GPU detected — these small models stay responsive on CPU.",
"gpuNote": "Picked to fit your available VRAM with room for context.",
"install": "Install",
"installing": "Installing",
"installStarted": "Installing {{model}}…",
"installFailed": "Install failed: {{message}}"
},
"connect": {
"title": "One endpoint, every API",
"subtitle": "LocalAI serves its own full API — image & video generation, depth, object detection, reranking, audio, face & voice recognition, and realtime voice over WebRTC and WebSocket. On top of that, a drop-in compatibility layer lets any app built for OpenAI, Anthropic, Ollama or OpenAI Responses talk to it unchanged.",

View File

@@ -6363,6 +6363,59 @@ select.input {
justify-content: center;
}
/* ──────────────────── Home: hardware-aware starter models ──────────────────── */
.home-starters {
margin: var(--spacing-lg) 0;
padding: var(--spacing-lg);
}
.home-starters-head {
display: flex;
align-items: center;
justify-content: space-between;
gap: var(--spacing-md);
}
.home-starters-head strong {
font-size: 0.9375rem;
}
.home-starters-tier {
display: inline-flex;
align-items: center;
gap: var(--spacing-xs);
font-size: 0.75rem;
color: var(--color-text-muted);
}
.home-starters-sub {
margin: var(--spacing-xs) 0 var(--spacing-md);
font-size: 0.8125rem;
color: var(--color-text-secondary);
}
.home-starters-list {
list-style: none;
margin: 0;
padding: 0;
display: flex;
flex-direction: column;
gap: var(--spacing-xs);
}
.home-starters-item {
display: flex;
align-items: center;
gap: var(--spacing-md);
padding: var(--spacing-xs) 0;
}
.home-starters-name {
font-weight: 500;
font-size: 0.875rem;
word-break: break-all;
}
.home-starters-size {
margin-left: auto;
font-size: 0.75rem;
color: var(--color-text-muted);
white-space: nowrap;
}
/* ──────────────────── Home: drop-in endpoint / API compatibility ──────────────────── */
.home-connect {

View File

@@ -1,8 +1,25 @@
import { useEffect, useMemo } from 'react'
import { useEffect, useMemo, useCallback } from 'react'
import { useModels } from '../hooks/useModels'
import SearchableSelect from './SearchableSelect'
import { useTranslation } from 'react-i18next'
// Remember the last model the user picked, keyed by capability, so returning to
// a page (Home chat box, Image, TTS, Talk...) defaults to that model instead of
// whatever happens to sort first. Only persisted when a capability key exists —
// `externalOptions` callers pass no capability and get the old first-item
// behaviour. localStorage access is wrapped because private-browsing modes throw.
const LAST_MODEL_PREFIX = 'localai_last_model:'
function readLastModel(capability) {
if (!capability) return null
try { return localStorage.getItem(LAST_MODEL_PREFIX + capability) } catch { return null }
}
function writeLastModel(capability, model) {
if (!capability || !model) return
try { localStorage.setItem(LAST_MODEL_PREFIX + capability, model) } catch { /* ignore */ }
}
export default function ModelSelector({
value, onChange, capability, className = '',
options: externalOptions, loading: externalLoading,
@@ -19,16 +36,27 @@ export default function ModelSelector({
const isLoading = externalOptions ? (externalLoading || false) : hookLoading
const isDisabled = isLoading || (externalDisabled || false)
// Persist genuine selections so the next visit can restore them.
const handleChange = useCallback((next) => {
writeLastModel(capability, next)
onChange(next)
}, [capability, onChange])
useEffect(() => {
if (modelNames.length > 0 && (!value || !modelNames.includes(value))) {
onChange(modelNames[0])
// Prefer the remembered model when it's still available; otherwise fall
// back to the first option. Don't re-persist here — auto-select is not a
// user choice, and writing back the stored value would be a harmless but
// pointless round-trip.
const remembered = readLastModel(capability)
onChange(remembered && modelNames.includes(remembered) ? remembered : modelNames[0])
}
}, [modelNames, value, onChange])
}, [modelNames, value, onChange, capability])
return (
<SearchableSelect
value={value || ''}
onChange={onChange}
onChange={handleChange}
options={modelNames}
placeholder={isLoading ? t('selector.loading') : (modelNames.length === 0 ? t('selector.noModels') : t('selector.selectModel'))}
searchPlaceholder={searchPlaceholder || t('selector.searchPlaceholder')}

View File

@@ -0,0 +1,129 @@
import { useState, useEffect, useMemo } from 'react'
import { useTranslation } from 'react-i18next'
import { modelsApi } from '../utils/api'
import { useResources } from '../hooks/useResources'
// Curated, hardware-tiered starter models for the empty-state onboarding. Names
// are real gallery entries (gallery/index.yaml); we intersect them against the
// live gallery at render time so a custom/trimmed gallery degrades gracefully
// (unmatched entries simply don't render).
//
// The guiding rule the maintainer asked for: CPU-only machines should be
// steered to genuinely small models (1-4B, Q4) that stay responsive without a
// GPU. GPU tiers scale the suggestion up with available VRAM.
const SMALL = [
{ name: 'llama-3.2-1b-instruct:q4_k_m', size: '~0.8 GB' },
{ name: 'llama-3.2-3b-instruct:q4_k_m', size: '~2 GB' },
{ name: 'qwen3-1.7b', size: '~1.4 GB' },
{ name: 'gemma-3-1b-it', size: '~0.8 GB' },
]
const MID = [
{ name: 'qwen3-4b', size: '~2.5 GB' },
{ name: 'gemma-3-4b-it', size: '~3 GB' },
{ name: 'llama-3.2-3b-instruct:q4_k_m', size: '~2 GB' },
]
const LARGE = [
{ name: 'meta-llama-3.1-8b-instruct', size: '~5 GB' },
{ name: 'qwen3-4b', size: '~2.5 GB' },
{ name: 'mistral-7b-instruct-v0.3', size: '~4 GB' },
]
const GB = 1024 * 1024 * 1024
// Pick a tier from detected hardware. total_memory is GPU VRAM in bytes (0 when
// CPU-only). Thresholds are deliberately conservative so a suggestion that
// "fits" really does.
function pickTier(resources) {
const isGpu = resources?.type === 'gpu'
const vram = resources?.aggregate?.total_memory || 0
if (!isGpu || vram <= 0) return { id: 'cpu', list: SMALL }
if (vram < 8 * GB) return { id: 'gpu-small', list: MID }
return { id: 'gpu-large', list: LARGE }
}
export default function StarterModels({ addToast, onInstallStarted }) {
const { t } = useTranslation('home')
const { resources } = useResources()
const [available, setAvailable] = useState(null) // Set of gallery names, or null while loading
const [installing, setInstalling] = useState(() => new Set())
const tier = useMemo(() => pickTier(resources), [resources])
const candidates = tier.list
// Verify candidates exist in the live gallery. One search per name (the tier
// has at most a handful) keeps this resilient to gallery customization.
useEffect(() => {
let cancelled = false
const names = [...new Set(candidates.map(c => c.name))]
Promise.all(names.map(name =>
modelsApi.list({ search: name, page: 1 })
.then(data => (data?.models || []).some(m => (m.name || m.id) === name) ? name : null)
.catch(() => null)
)).then(found => {
if (cancelled) return
const hits = found.filter(Boolean)
// If verification yielded nothing (e.g. gallery unreachable), fall back to
// showing the curated list rather than an empty widget.
setAvailable(hits.length > 0 ? new Set(hits) : null)
})
return () => { cancelled = true }
}, [candidates])
const visible = available === null
? candidates
: candidates.filter(c => available.has(c.name))
if (visible.length === 0) return null
const install = async (name) => {
setInstalling(prev => new Set(prev).add(name))
try {
await modelsApi.install(name)
addToast?.(t('starters.installStarted', { model: name }), 'success')
onInstallStarted?.(name)
} catch (err) {
addToast?.(t('starters.installFailed', { message: err.message }), 'error')
setInstalling(prev => {
const next = new Set(prev)
next.delete(name)
return next
})
}
}
return (
<section className="home-starters card">
<div className="home-starters-head">
<strong>{t('starters.title')}</strong>
<span className="home-starters-tier">
<i className={`fas ${tier.id === 'cpu' ? 'fa-memory' : 'fa-microchip'}`} aria-hidden="true" />
{t(`starters.tier.${tier.id}`)}
</span>
</div>
<p className="home-starters-sub">
{tier.id === 'cpu' ? t('starters.cpuNote') : t('starters.gpuNote')}
</p>
<ul className="home-starters-list">
{visible.map(c => {
const busy = installing.has(c.name)
return (
<li key={c.name} className="home-starters-item">
<span className="home-starters-name">{c.name}</span>
<span className="home-starters-size">{c.size}</span>
<button
type="button"
className="btn btn-primary btn-sm"
disabled={busy}
onClick={() => install(c.name)}
>
{busy
? (<><i className="fas fa-spinner fa-spin" aria-hidden="true" /> {t('starters.installing')}</>)
: (<><i className="fas fa-download" aria-hidden="true" /> {t('starters.install')}</>)}
</button>
</li>
)
})}
</ul>
</section>
)
}

View File

@@ -0,0 +1,66 @@
import { useEffect, useRef, useCallback } from 'react'
// usePolling runs `fn` immediately and then on a fixed interval, with two
// behaviours every hand-rolled setInterval in this app was missing:
//
// 1. Visibility-aware: the timer pauses while the tab is hidden
// (document.hidden) and fires an immediate catch-up poll when the tab
// becomes visible again. A backgrounded dashboard no longer hammers the
// server every few seconds for data nobody is looking at.
// 2. Non-overlapping: if `fn` returns a promise that takes longer than the
// interval, the next tick waits for it instead of stacking requests.
//
// `enabled: false` stops polling entirely (one-shot or gated polls). The
// returned `refetch` runs `fn` on demand and is stable across renders.
export function usePolling(fn, intervalMs = 5000, { enabled = true, immediate = true } = {}) {
const fnRef = useRef(fn)
fnRef.current = fn
const runningRef = useRef(false)
const refetch = useCallback(async () => {
// Guard against overlap: a slow poll shouldn't pile up behind a fast timer.
if (runningRef.current) return
runningRef.current = true
try {
return await fnRef.current()
} finally {
runningRef.current = false
}
}, [])
useEffect(() => {
if (!enabled) return
let timer = null
const tick = () => { refetch() }
const start = () => {
if (timer != null) return
timer = setInterval(tick, intervalMs)
}
const stop = () => {
if (timer != null) { clearInterval(timer); timer = null }
}
const onVisibility = () => {
if (document.hidden) {
stop()
} else {
// Catch up immediately on return, then resume the cadence.
tick()
start()
}
}
if (immediate) tick()
if (!document.hidden) start()
document.addEventListener('visibilitychange', onVisibility)
return () => {
stop()
document.removeEventListener('visibilitychange', onVisibility)
}
}, [enabled, intervalMs, immediate, refetch])
return { refetch }
}

View File

@@ -1,11 +1,11 @@
import { useState, useEffect, useCallback, useRef } from 'react'
import { useState, useCallback } from 'react'
import { resourcesApi } from '../utils/api'
import { usePolling } from './usePolling'
export function useResources(pollInterval = 5000) {
const [resources, setResources] = useState(null)
const [loading, setLoading] = useState(true)
const [error, setError] = useState(null)
const intervalRef = useRef(null)
const fetchResources = useCallback(async () => {
try {
@@ -19,13 +19,10 @@ export function useResources(pollInterval = 5000) {
}
}, [])
useEffect(() => {
fetchResources()
intervalRef.current = setInterval(fetchResources, pollInterval)
return () => {
if (intervalRef.current) clearInterval(intervalRef.current)
}
}, [fetchResources, pollInterval])
// Visibility-aware polling: pauses while the tab is hidden and catches up on
// return (see usePolling). Resource stats are pure dashboard data, so there's
// no reason to keep fetching them for a backgrounded tab.
const { refetch } = usePolling(fetchResources, pollInterval)
return { resources, loading, error, refetch: fetchResources }
return { resources, loading, error, refetch }
}

View File

@@ -765,8 +765,10 @@ export default function AgentChat() {
className="chat-send-btn"
onClick={handleSend}
disabled={processing || !input.trim()}
aria-label="Send message"
title="Send message"
>
<i className="fas fa-paper-plane" />
<i className="fas fa-paper-plane" aria-hidden="true" />
</button>
</div>
</div>

View File

@@ -1427,8 +1427,10 @@ export default function Chat() {
className="chat-send-btn"
onClick={handleSend}
disabled={!input.trim() && files.length === 0}
aria-label={t('input.send')}
title={t('input.send')}
>
<i className="fas fa-paper-plane" />
<i className="fas fa-paper-plane" aria-hidden="true" />
</button>
)}
</div>

View File

@@ -10,6 +10,7 @@ import UnifiedMCPDropdown from '../components/UnifiedMCPDropdown'
import ConfirmDialog from '../components/ConfirmDialog'
import HomeConnect from '../components/HomeConnect'
import { useResources } from '../hooks/useResources'
import { usePolling } from '../hooks/usePolling'
import { fileToBase64, backendControlApi, systemApi, modelsApi, mcpApi, nodesApi } from '../utils/api'
import { API_CONFIG } from '../utils/config'
import { greetingKey } from '../utils/greeting'
@@ -17,6 +18,7 @@ import StatusPill from '../components/StatusPill'
import Skeleton from '../components/Skeleton'
import SectionHeading from '../components/SectionHeading'
import EmptyState from '../components/EmptyState'
import StarterModels from '../components/StarterModels'
import { staggerStyle } from '../hooks/useStagger'
export default function Home() {
@@ -68,40 +70,36 @@ export default function Home() {
.catch(() => {})
}, [])
// Poll cluster node data in distributed mode
useEffect(() => {
if (!distributedMode) return
const fetchCluster = async () => {
try {
const data = await nodesApi.list()
const nodes = Array.isArray(data) ? data : []
const backendNodes = nodes.filter(n => !n.node_type || n.node_type === 'backend')
const totalVRAM = backendNodes.reduce((sum, n) => sum + (n.total_vram || 0), 0)
const usedVRAM = backendNodes.reduce((sum, n) => {
if (n.total_vram && n.available_vram != null) return sum + (n.total_vram - n.available_vram)
return sum
}, 0)
const totalRAM = backendNodes.reduce((sum, n) => sum + (n.total_ram || 0), 0)
const usedRAM = backendNodes.reduce((sum, n) => {
if (n.total_ram && n.available_ram != null) return sum + (n.total_ram - n.available_ram)
return sum
}, 0)
const isGPU = totalVRAM > 0
const healthyCount = backendNodes.filter(n => n.status === 'healthy').length
const totalCount = backendNodes.length
setClusterData({
totalMem: isGPU ? totalVRAM : totalRAM,
usedMem: isGPU ? usedVRAM : usedRAM,
isGPU,
healthyCount,
totalCount,
})
} catch { setClusterData(null) }
}
fetchCluster()
const interval = setInterval(fetchCluster, 5000)
return () => clearInterval(interval)
}, [distributedMode])
// Poll cluster node data in distributed mode. Visibility-aware + gated on
// distributedMode so a non-distributed or backgrounded tab makes no calls.
const fetchCluster = useCallback(async () => {
try {
const data = await nodesApi.list()
const nodes = Array.isArray(data) ? data : []
const backendNodes = nodes.filter(n => !n.node_type || n.node_type === 'backend')
const totalVRAM = backendNodes.reduce((sum, n) => sum + (n.total_vram || 0), 0)
const usedVRAM = backendNodes.reduce((sum, n) => {
if (n.total_vram && n.available_vram != null) return sum + (n.total_vram - n.available_vram)
return sum
}, 0)
const totalRAM = backendNodes.reduce((sum, n) => sum + (n.total_ram || 0), 0)
const usedRAM = backendNodes.reduce((sum, n) => {
if (n.total_ram && n.available_ram != null) return sum + (n.total_ram - n.available_ram)
return sum
}, 0)
const isGPU = totalVRAM > 0
const healthyCount = backendNodes.filter(n => n.status === 'healthy').length
const totalCount = backendNodes.length
setClusterData({
totalMem: isGPU ? totalVRAM : totalRAM,
usedMem: isGPU ? usedVRAM : usedRAM,
isGPU,
healthyCount,
totalCount,
})
} catch { setClusterData(null) }
}, [])
usePolling(fetchCluster, 5000, { enabled: distributedMode })
// Fetch configured models (to know if any exist) and loaded models (currently running)
const fetchSystemInfo = useCallback(async () => {
@@ -123,11 +121,7 @@ export default function Home() {
}
}, [])
useEffect(() => {
fetchSystemInfo()
const interval = setInterval(fetchSystemInfo, 5000)
return () => clearInterval(interval)
}, [fetchSystemInfo])
usePolling(fetchSystemInfo, 5000)
// Check MCP availability when selected model changes
useEffect(() => {
@@ -523,6 +517,8 @@ export default function Home() {
</div>
</div>
<StarterModels addToast={addToast} onInstallStarted={fetchSystemInfo} />
<div className="home-wizard-actions">
<button className="btn btn-primary" onClick={() => navigate('/app/models')}>
<i className="fas fa-store" /> {t('wizard.browseGallery')}

View File

@@ -24,7 +24,37 @@ function formatNumber(n) {
return String(n)
}
function StatCard({ icon, label, value, muted }) {
// Opt-in token pricing. LocalAI is self-hosted and has no inherent monetary
// cost, but multi-user deployments use estimated cost for chargeback/budgeting.
// Prices are admin-supplied $ per 1M tokens, stored locally (per-browser), and
// the whole cost surface stays hidden until a non-zero price is set.
const TOKEN_PRICING_KEY = 'localai_token_pricing'
function loadPricing() {
try {
const p = JSON.parse(localStorage.getItem(TOKEN_PRICING_KEY) || '{}')
return { prompt: Number(p.prompt) || 0, completion: Number(p.completion) || 0 }
} catch { return { prompt: 0, completion: 0 } }
}
function savePricing(p) {
try { localStorage.setItem(TOKEN_PRICING_KEY, JSON.stringify(p)) } catch { /* ignore */ }
}
function pricingEnabled(p) { return (p?.prompt || 0) > 0 || (p?.completion || 0) > 0 }
function costOf(row, p) {
return (row.prompt_tokens / 1_000_000) * (p.prompt || 0)
+ (row.completion_tokens / 1_000_000) * (p.completion || 0)
}
function formatCost(n) {
if (!n) return '$0.00'
if (n < 0.01) return '<$0.01'
return '$' + n.toFixed(2)
}
function StatCard({ icon, label, value, muted, text }) {
return (
<div className="card" style={{ padding: 'var(--spacing-sm) var(--spacing-md)', flex: '1 1 0', minWidth: 120, opacity: muted ? 0.7 : 1 }}>
<div style={{ display: 'flex', alignItems: 'center', gap: 6, marginBottom: 2 }}>
@@ -32,7 +62,7 @@ function StatCard({ icon, label, value, muted }) {
<span style={{ fontSize: '0.6875rem', color: 'var(--color-text-muted)', fontWeight: 500, textTransform: 'uppercase', letterSpacing: '0.03em' }}>{label}</span>
</div>
<div style={{ fontSize: '1.375rem', fontWeight: 700, fontFamily: 'var(--font-mono)', color: muted ? 'var(--color-text-secondary)' : 'var(--color-text-primary)' }}>
{muted ? '~' : ''}{formatNumber(value)}
{text != null ? text : `${muted ? '~' : ''}${formatNumber(value)}`}
</div>
</div>
)
@@ -642,6 +672,10 @@ export default function Usage() {
const [activeTab, setActiveTab] = useState('models')
const [quotas, setQuotas] = useState([])
const [selectedUserId, setSelectedUserId] = useState(null)
const [pricing, setPricingState] = useState(loadPricing)
const [showPricing, setShowPricing] = useState(false)
const setPricing = (p) => { setPricingState(p); savePricing(p) }
const costEnabled = pricingEnabled(pricing)
const fetchUsage = useCallback(async () => {
setLoading(true)
@@ -743,11 +777,50 @@ export default function Usage() {
<i className="fas fa-key" style={{ fontSize: '0.7rem' }} /> {t('usage.sources.tab')}
</button>
<div style={{ flex: 1 }} />
<button
className={`btn btn-sm ${costEnabled ? 'btn-primary' : 'btn-secondary'}`}
onClick={() => setShowPricing(v => !v)}
style={{ gap: 4 }}
title="Set token pricing to estimate cost"
>
<i className="fas fa-dollar-sign" /> {costEnabled ? 'Pricing' : 'Set pricing'}
</button>
<button className="btn btn-secondary btn-sm" onClick={fetchUsage} disabled={loading} style={{ gap: 4 }}>
<i className={`fas fa-rotate${loading ? ' fa-spin' : ''}`} /> Refresh
</button>
</div>
{showPricing && (
<div className="card" style={{ display: 'flex', alignItems: 'flex-end', gap: 'var(--spacing-md)', flexWrap: 'wrap', padding: 'var(--spacing-md)', marginBottom: 'var(--spacing-md)' }}>
<div style={{ display: 'flex', flexDirection: 'column', gap: 2 }}>
<label style={{ fontSize: '0.6875rem', color: 'var(--color-text-muted)', textTransform: 'uppercase', letterSpacing: '0.03em' }}>Prompt $/1M tokens</label>
<input
className="input" type="number" min="0" step="0.01" style={{ width: 140 }}
value={pricing.prompt || ''}
placeholder="0.00"
onChange={e => setPricing({ ...pricing, prompt: Number(e.target.value) || 0 })}
/>
</div>
<div style={{ display: 'flex', flexDirection: 'column', gap: 2 }}>
<label style={{ fontSize: '0.6875rem', color: 'var(--color-text-muted)', textTransform: 'uppercase', letterSpacing: '0.03em' }}>Completion $/1M tokens</label>
<input
className="input" type="number" min="0" step="0.01" style={{ width: 140 }}
value={pricing.completion || ''}
placeholder="0.00"
onChange={e => setPricing({ ...pricing, completion: Number(e.target.value) || 0 })}
/>
</div>
{costEnabled && (
<button className="btn btn-secondary btn-sm" onClick={() => setPricing({ prompt: 0, completion: 0 })} style={{ gap: 4 }}>
<i className="fas fa-times" /> Clear
</button>
)}
<span style={{ fontSize: '0.75rem', color: 'var(--color-text-muted)', flex: '1 1 200px' }}>
Estimated cost only. Prices are stored in this browser and applied to recorded token counts.
</span>
</div>
)}
{loading ? (
<div style={{ display: 'flex', justifyContent: 'center', padding: 'var(--spacing-xl)' }}>
<LoadingSpinner size="lg" />
@@ -760,6 +833,9 @@ export default function Usage() {
<StatCard icon="fas fa-arrow-up" label="Prompt" value={displayTotals.prompt_tokens} />
<StatCard icon="fas fa-arrow-down" label="Completion" value={displayTotals.completion_tokens} />
<StatCard icon="fas fa-coins" label="Total" value={displayTotals.total_tokens} />
{costEnabled && (
<StatCard icon="fas fa-dollar-sign" label="Est. Cost" text={formatCost(costOf(displayTotals, pricing))} />
)}
</div>
{/* Predictions */}
@@ -789,6 +865,7 @@ export default function Usage() {
<th style={{ width: 110 }}>Prompt</th>
<th style={{ width: 110 }}>Completion</th>
<th style={{ width: 110 }}>Total</th>
{costEnabled && <th style={{ width: 100 }}>Est. Cost</th>}
<th style={{ width: 140 }}></th>
</tr>
</thead>
@@ -800,6 +877,7 @@ export default function Usage() {
<td style={monoCell}>{formatNumber(row.prompt_tokens)}</td>
<td style={monoCell}>{formatNumber(row.completion_tokens)}</td>
<td style={{ ...monoCell, fontWeight: 600 }}>{formatNumber(row.total_tokens)}</td>
{costEnabled && <td style={monoCell}>{formatCost(costOf(row, pricing))}</td>}
<td><UsageBar value={row.total_tokens} max={maxTokens} /></td>
</tr>
))}
@@ -827,6 +905,7 @@ export default function Usage() {
<th style={{ width: 110 }}>Prompt</th>
<th style={{ width: 110 }}>Completion</th>
<th style={{ width: 110 }}>Total</th>
{costEnabled && <th style={{ width: 100 }}>Est. Cost</th>}
<th style={{ width: 110 }}>Proj. Total</th>
<th style={{ width: 140 }}></th>
</tr>
@@ -849,6 +928,7 @@ export default function Usage() {
<td style={monoCell}>{formatNumber(row.prompt_tokens)}</td>
<td style={monoCell}>{formatNumber(row.completion_tokens)}</td>
<td style={{ ...monoCell, fontWeight: 600 }}>{formatNumber(row.total_tokens)}</td>
{costEnabled && <td style={monoCell}>{formatCost(costOf(row, pricing))}</td>}
<td style={{ ...monoCell, color: 'var(--color-text-muted)', fontStyle: 'italic' }}>
{up?.predictions ? `~${formatNumber(up.predictions.projectedTotals.total_tokens)}` : '-'}
</td>
@@ -856,7 +936,7 @@ export default function Usage() {
</tr>
{isExpanded && up && (
<tr>
<td colSpan={8} style={{ padding: 0, background: 'var(--color-bg-secondary)' }}>
<td colSpan={costEnabled ? 9 : 8} style={{ padding: 0, background: 'var(--color-bg-secondary)' }}>
<div style={{ padding: 'var(--spacing-md)' }}>
{up.predictions && (
<div style={{ display: 'grid', gridTemplateColumns: 'repeat(auto-fit, minmax(100px, 1fr))', gap: 'var(--spacing-xs)', marginBottom: 'var(--spacing-sm)' }}>

View File

@@ -156,7 +156,10 @@ func applyNodeHardwareDefaults(opts *pb.ModelOptions, node *BackendNode) {
VRAM: node.TotalVRAM,
}
if config.IsManagedPhysicalBatch(int(opts.NBatch)) {
opts.NBatch = int32(config.PhysicalBatch(gpu))
// Gate the raised batch on the selected node's per-device VRAM at this
// model's context, so a large context can't overflow the node's compute
// buffer (issue #10485). node.TotalVRAM is the node's reported ceiling.
opts.NBatch = int32(config.PhysicalBatchForContext(gpu, int(opts.ContextSize)))
}
// Default concurrent serving for the selected node (the frontend that built
// the options may have no GPU). Only adds when no parallel option is set.

View File

@@ -8,12 +8,19 @@ import (
)
var _ = Describe("applyNodeHardwareDefaults", func() {
It("raises a managed default batch on a Blackwell node", func() {
opts := &pb.ModelOptions{NBatch: config.DefaultPhysicalBatch}
applyNodeHardwareDefaults(opts, &BackendNode{GPUComputeCapability: "12.1"})
It("raises a managed default batch on a Blackwell node with headroom", func() {
opts := &pb.ModelOptions{NBatch: config.DefaultPhysicalBatch, ContextSize: 8192}
applyNodeHardwareDefaults(opts, &BackendNode{GPUComputeCapability: "12.1", TotalVRAM: 119 << 30})
Expect(opts.NBatch).To(BeEquivalentTo(config.BlackwellPhysicalBatch))
})
It("keeps the default batch when a large context would overflow the node", func() {
// Regression guard for issue #10485 on the distributed path.
opts := &pb.ModelOptions{NBatch: config.DefaultPhysicalBatch, ContextSize: 204800}
applyNodeHardwareDefaults(opts, &BackendNode{GPUComputeCapability: "12.0", TotalVRAM: 16 << 30})
Expect(opts.NBatch).To(BeEquivalentTo(config.DefaultPhysicalBatch))
})
It("resets a Blackwell guess on a non-Blackwell node", func() {
// frontend (Blackwell) guessed high, but the selected node is not Blackwell
opts := &pb.ModelOptions{NBatch: config.BlackwellPhysicalBatch}

View File

@@ -1,3 +1,3 @@
{
"version": "v4.4.3"
"version": "v4.5.0"
}

View File

@@ -3,24 +3,7 @@
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
urls:
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct-GGUF
description: |
Try LFM • Docs • LEAP • Discord
# LFM2.5-1.2B-Instruct
LFM2.5 is a new family of hybrid models designed for **on-device deployment**. It builds on the LFM2 architecture with extended pre-training and reinforcement learning.
- **Best-in-class performance**: A 1.2B model rivaling much larger models, bringing high-quality AI to your pocket.
- **Fast edge inference**: 239 tok/s decode on AMD CPU, 82 tok/s on mobile NPU. Runs under 1GB of memory with day-one support for llama.cpp, MLX, and vLLM.
- **Scaled training**: Extended pre-training from 10T to 28T tokens and large-scale multi-stage reinforcement learning.
Find more information about LFM2.5 in our blog post.
## 🗒️ Model Details
LFM2.5-1.2B-Instruct is a general-purpose text-only model with the following features:
...
description: "Try LFM • Docs • LEAP • Discord\n\n# LFM2.5-1.2B-Instruct\n\nLFM2.5 is a new family of hybrid models designed for **on-device deployment**. It builds on the LFM2 architecture with extended pre-training and reinforcement learning.\n\n - **Best-in-class performance**: A 1.2B model rivaling much larger models, bringing high-quality AI to your pocket.\n - **Fast edge inference**: 239 tok/s decode on AMD CPU, 82 tok/s on mobile NPU. Runs under 1GB of memory with day-one support for llama.cpp, MLX, and vLLM.\n - **Scaled training**: Extended pre-training from 10T to 28T tokens and large-scale multi-stage reinforcement learning.\n\nFind more information about LFM2.5 in our blog post.\n\n## \U0001F5D2 Model Details\n\nLFM2.5-1.2B-Instruct is a general-purpose text-only model with the following features:\n\n...\n"
license: "other"
tags:
- llm
@@ -842,8 +825,8 @@
use_tokenizer_template: true
files:
- filename: llama-cpp/models/Qwopus3.6-27B-Coder-MTP-GGUF/Qwopus3.6-27B-Coder-MTP-Q4_K_M.gguf
sha256: b2898667ed7b2388f0ab7691393833ae777f247492bbe62fdb4b2bd3e3cf3f79
uri: https://huggingface.co/Jackrong/Qwopus3.6-27B-Coder-MTP-GGUF/resolve/main/Qwopus3.6-27B-Coder-MTP-Q4_K_M.gguf
sha256: b2b9180093496da2e00439e3fa23227c591355901bfa579bc6897bbc01b755ef
- filename: llama-cpp/mmproj/Qwopus3.6-27B-Coder-MTP-GGUF/mmproj-F32.gguf
sha256: 32f7ea0600c07272547da401d460f8abbd980f3a57b69d6df87be0e2505e0b9c
uri: https://huggingface.co/Jackrong/Qwopus3.6-27B-Coder-MTP-GGUF/resolve/main/mmproj-F32.gguf

View File

@@ -129,6 +129,61 @@ func TotalAvailableVRAM() (uint64, error) {
return 0, nil
}
// MinPerGPUVRAM returns the total VRAM of the SMALLEST GPU on the host (in
// bytes), or 0 when no per-device VRAM is known. Unlike TotalAvailableVRAM
// (which sums across devices) this reports a single device's ceiling, which is
// the right figure for decisions about what must fit on one card: the compute
// buffer (sized by n_ubatch) and the parallel-slot tier. Summing a multi-GPU
// host's VRAM over-provisions those into a per-device OOM (issue #10485).
//
// Unified-memory devices (GB10, Apple) report system RAM as their single
// device's VRAM, so they are unaffected.
func MinPerGPUVRAM() (uint64, error) {
// Prefer per-device binary detection (nvidia-smi/rocm-smi report true
// per-card VRAM); ghw's per-card memory can reflect NUMA node RAM on some
// hosts, which is why TotalAvailableVRAM treats it as a sum.
if infos := GetGPUMemoryUsage(); len(infos) > 0 {
if v := minNonZeroVRAM(infos); v > 0 {
return v, nil
}
}
// Fallback: ghw per-card memory, taking the minimum non-zero card.
if gpus, err := GPUs(); err == nil {
var min uint64
for _, gpu := range gpus {
if gpu == nil || gpu.Node == nil || gpu.Node.Memory == nil {
continue
}
if b := gpu.Node.Memory.TotalUsableBytes; b > 0 {
if u := uint64(b); min == 0 || u < min {
min = u
}
}
}
if min > 0 {
return min, nil
}
}
return 0, nil
}
// minNonZeroVRAM returns the smallest non-zero TotalVRAM across the given GPUs,
// or 0 when none report VRAM.
func minNonZeroVRAM(infos []GPUMemoryInfo) uint64 {
var min uint64
for _, g := range infos {
if g.TotalVRAM == 0 {
continue
}
if min == 0 || g.TotalVRAM < min {
min = g.TotalVRAM
}
}
return min
}
func HasGPU(vendor string) bool {
gpus, err := GPUs()
if err != nil {

View File

@@ -0,0 +1,37 @@
package xsysinfo
import (
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("minNonZeroVRAM", func() {
const gib = uint64(1) << 30
It("returns the smallest device on a multi-GPU host", func() {
// Two unequal cards (e.g. RTX 5070 Ti + 5060 Ti, both 16 GiB, or a
// mixed pair): the smallest device is the per-card allocation ceiling.
infos := []GPUMemoryInfo{
{TotalVRAM: 16 * gib},
{TotalVRAM: 12 * gib},
}
Expect(minNonZeroVRAM(infos)).To(Equal(12 * gib))
})
It("ignores devices that report zero VRAM", func() {
infos := []GPUMemoryInfo{
{TotalVRAM: 0},
{TotalVRAM: 24 * gib},
}
Expect(minNonZeroVRAM(infos)).To(Equal(24 * gib))
})
It("returns the single device's VRAM on a one-GPU host", func() {
Expect(minNonZeroVRAM([]GPUMemoryInfo{{TotalVRAM: 16 * gib}})).To(Equal(16 * gib))
})
It("returns 0 when no device reports VRAM", func() {
Expect(minNonZeroVRAM([]GPUMemoryInfo{{TotalVRAM: 0}})).To(BeZero())
Expect(minNonZeroVRAM(nil)).To(BeZero())
})
})