fix(distributed): per-replica backend logs (store aggregation + UI)

The multi-replica refactor (PR #9583) changed the worker's process key from `modelID` to `modelID#replicaIndex`, but the BackendLogStore kept the bare-modelID lookup. Result: every distributed deployment lost backend logs in the Nodes UI, single-replica ones too, since even the default capacity of 1 produces a `#0` suffix.

Two changes wired together:

* pkg/model: BackendLogStore.GetLines/Subscribe now treat a modelID without `#` as a model prefix and merge across all `modelID#N` replica buffers (timestamp-sorted for GetLines; fan-in for Subscribe). Calls with a full `modelID#N` key resolve exactly. ListModels strips replica suffixes and deduplicates so the listing surfaces one entry per loaded model.
* react-ui: per-replica log streams as the default. Loaded Models table disambiguates each row with a `rep N` pill (only when the node hosts >1 replica of a model). Each row's "View logs" link routes to the per-replica process key so operators see only that replica's output. The logs page renders the replica context as a chip in the title and surfaces a segmented control (`Replica 0 / 1 / … / All merged`) when the model has multiple replicas; the merged segment uses the bare-modelID URL (delegating to the store's prefix aggregation) for the side-by-side comparison case. Single-replica deployments see no extra UI.

Tests added first (TDD): the regression set in backend_log_store_test.go reproduces the bug at the exact failure point: GetLines/ListModels/Subscribe assertions all fail against the broken code and all pass against the fix. TestSubscribe_PerReplicaFilter pins the exact-key path so a future change can't silently break it.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [Skill:critique] [Skill:audit] [Skill:polish] [Skill:distill]
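
The key-matching rule in the commit message above can be illustrated with a small shell sketch. It is not the Go code in pkg/model: the helper names `resolve_keys` and `list_models` and the sample keys are hypothetical, and the timestamp-sorted merge itself is left out.

```bash
# Illustrative only: mirrors the lookup rule from the commit message,
# not the actual BackendLogStore implementation (which is Go).

# resolve_keys QUERY KEY...
# Prints the buffer keys that a lookup for QUERY would read.
resolve_keys() {
  local query=$1 key
  shift
  for key in "$@"; do
    case $query in
      *'#'*)
        # Full "modelID#N" key: exact match only.
        if [ "$key" = "$query" ]; then printf '%s\n' "$key"; fi
        ;;
      *)
        # Bare modelID: match every "modelID#N" replica key.
        case $key in
          "$query#"*) printf '%s\n' "$key" ;;
        esac
        ;;
    esac
  done
}

# list_models KEY...
# Strips the "#N" replica suffix and deduplicates: one entry per model.
list_models() {
  local key
  for key in "$@"; do
    printf '%s\n' "${key%#*}"
  done | sort -u
}

# Hypothetical process keys held by one node.
set -- 'llama#0' 'llama#1' 'llama-guard#0'
resolve_keys 'llama' "$@"     # llama#0 and llama#1, but not llama-guard#0
resolve_keys 'llama#1' "$@"   # llama#1 only
list_models "$@"              # llama and llama-guard
```

Matching on `modelID#` rather than on the bare `modelID` prefix is what keeps a model such as `llama-guard` out of the merged view for `llama`.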
fix(ui): hide meta-dev backends in System → Backends Development toggle
2026-05-23 16:20:01 -04:00 · 2026-04-27 20:55:24 +00:00 · 2026-04-27 20:38:20 +00:00 · 2026-04-27 20:17:36 +00:00 · 2026-04-27 21:20:05 +02:00 · 2026-04-27 14:21:11 +00:00
213 changed files with 14900 additions and 2047 deletions
--- a/.agents/adding-backends.md
+++ b/.agents/adding-backends.md
@@ -43,7 +43,7 @@ If you add a new language bucket, `scripts/changed-backends.js` also needs a bra

 **Additional build types you may need:**
 - ROCm/HIP: Use `build-type: 'hipblas'` with `base-image: "rocm/dev-ubuntu-24.04:7.2.1"`
- Intel/SYCL: Use `build-type: 'intel'` or `build-type: 'sycl_f16'`/`sycl_f32` with `base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"`
+- Intel/SYCL: Use `build-type: 'intel'` or `build-type: 'sycl_f16'`/`sycl_f32` with `base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"`
 - L4T (ARM): Use `build-type: 'l4t'` with `platforms: 'linux/arm64'` and `runs-on: 'ubuntu-24.04-arm'`

 ## 3. Add Backend Metadata to `backend/index.yaml`
--- a/.agents/ci-caching.md
+++ b/.agents/ci-caching.md
@@ -0,0 +1,111 @@
+# CI Build Caching
+
+Container builds — both the root LocalAI image (`Dockerfile`) and the per-backend images (`backend/Dockerfile.*`) — share a registry-backed BuildKit cache. This file explains how that cache is laid out, what invalidates it, and how to bypass it.
+
+## Cache layout
+
+- **Cache registry**: `quay.io/go-skynet/ci-cache`
+- **One tag per matrix entry**, derived from the existing `tag-suffix`:
+  - Backend builds (`backend_build.yml`): `cache<tag-suffix>`
+    - e.g. `cache-gpu-nvidia-cuda-12-llama-cpp`, `cache-cpu-vllm`, `cache-nvidia-l4t-cuda-13-arm64-vllm`
+  - Root image builds (`image_build.yml`): `cache-localai<tag-suffix>`
+    - e.g. `cache-localai-gpu-nvidia-cuda-12`, `cache-localai-gpu-vulkan`
+- Each tag stores a multi-arch BuildKit cache manifest (`mode=max`), so every intermediate stage is re-usable, not just the final image.
+
+## Read/write semantics
+
+| Trigger | `cache-from` | `cache-to` |
+|---|---|---|
+| `push` to `master` / tag | yes | yes (`mode=max,ignore-error=true`) |
+| `pull_request` | yes | **no** |
+
+PR builds read master's warm cache but never write — this prevents PRs from polluting the shared cache with their experimental state. After merge, the master build for that matrix entry refreshes the cache.
+
+`ignore-error=true` on the write side means a transient quay push failure does not fail the build; the next master push retries.
+
+## Self-warming, no separate populator
+
+There is no cron job that pre-warms the cache. The production builds *are* the populator. The first master build of a given matrix entry pays the cold cost; subsequent same-entry master builds reuse everything that hasn't changed (apt installs, gRPC compile in `Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}`, Python wheel installs, etc.).
+
+Historically there was a `generate_grpc_cache.yaml` cron that targeted a `grpc` stage in the root Dockerfile. That stage was removed in July 2025 and the cron silently failed every night for 9 months without writing anything. It was deleted along with the registry-cache rollout.
+
+## The `DEPS_REFRESH` cache-buster (Python backends)
+
+Every Python backend goes through the shared `backend/Dockerfile.python`, which ends with:
+
+```dockerfile
+ARG DEPS_REFRESH=initial
+RUN cd /${BACKEND} && PORTABLE_PYTHON=true make
+```
+
+Most Python backends ship `requirements*.txt` files that **do not pin every transitive dep** (`torch`, `transformers`, `vllm`, `diffusers`, etc. are listed without a `==` pin, or with `>=` lower bounds only). With a warm BuildKit cache, the `make` layer hashes only on Dockerfile instructions + COPYed source — not on what `pip install` resolves at runtime. So a warm cache would ship the *first* version of `vllm` ever cached and never pick up upstream releases.
+
+`DEPS_REFRESH` defends against that:
+
+- `backend_build.yml` computes `date -u +%Y-W%V` (ISO week, e.g. `2026-W17`) before each build and passes it as a build-arg.
+- The `RUN ... make` layer's BuildKit hash now includes that string, so the layer invalidates **at most once per week**, automatically picking up newer wheels.
+- Within a week, builds stay warm.
+
+This applies only to `Dockerfile.python` because:
+- Go (`Dockerfile.golang`) pins versions in `go.mod` / `go.sum`.
+- Rust (`Dockerfile.rust`) pins via `Cargo.lock`.
+- C++ backends (`Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}`) clone gRPC at a pinned tag (`v1.65.0`) and llama.cpp at a pinned commit; their inputs don't drift between rebuilds.
+
+### Adjusting the cadence
+
+If you need a faster refresh (e.g. while debugging an upstream flake), bump the format to daily (`+%Y-%m-%d`) or hourly (`+%Y-%m-%d-%H`). If you need a one-shot rebuild for a specific backend without changing the schedule, append a marker to the tag-suffix in the matrix or temporarily delete that backend's cache tag in quay.
+
+## Manually evicting cache
+
+To force a fully cold build for one backend or the whole image:
+
+```bash
+# Delete a single tag (requires quay credentials with admin on the repo)
+curl -X DELETE \
+  -H "Authorization: Bearer ${QUAY_TOKEN}" \
+  https://quay.io/api/v1/repository/go-skynet/ci-cache/tag/cache-gpu-nvidia-cuda-12-vllm
+
+# List all tags
+curl -s -H "Authorization: Bearer ${QUAY_TOKEN}" \
+  "https://quay.io/api/v1/repository/go-skynet/ci-cache/tag/?limit=100" | jq '.tags[].name'
+```
+
+Eviction is rarely needed in normal operation: `DEPS_REFRESH` handles weekly drift, source changes invalidate naturally, and the per-entry cache tag keeps each cache scoped to its matrix entry, so a stale tag never bleeds into a different build.
+
+## What the cache **does not** cover
+
+- The "Free Disk Space" / "Release space from worker" steps run on every job — these reclaim ~6 GB on `ubuntu-latest` runners. They are runner-state cleanup, not Docker, and BuildKit caches don't apply.
+- Intermediate artifacts of `Build and push (PR)` are not pushed anywhere — PRs only build for verification.
+- Darwin builds (see below) — macOS runners have no Docker daemon, so the registry-backed BuildKit cache cannot apply.
+
+## Darwin native caches
+
+`backend_build_darwin.yml` runs natively on `macOS-14` GitHub-hosted runners — there is no Docker, no BuildKit, no cross-job registry cache. Instead, the reusable workflow uses `actions/cache@v4` for four native caches that mirror the spirit of the Linux cache (warm by default, weekly refresh for unpinned Python deps, PRs read-only).
+
+| Cache | Path(s) | Key | Scope |
+|---|---|---|---|
+| Go modules + build | `~/go/pkg/mod`, `~/Library/Caches/go-build` | `go.sum` (managed by `actions/setup-go@v5` `cache: true`) | All darwin jobs |
+| Homebrew | `~/Library/Caches/Homebrew/downloads`, selected `/opt/homebrew/Cellar/*` | hash of `backend_build_darwin.yml` | All darwin jobs |
+| ccache (llama.cpp CMake) | `~/Library/Caches/ccache` | pinned `LLAMA_VERSION` from `backend/cpp/llama-cpp/Makefile` | `inputs.backend == 'llama-cpp'` only |
+| Python wheels (uv + pip) | `~/Library/Caches/pip`, `~/Library/Caches/uv` | `inputs.backend` + ISO week (`+%Y-W%V`) + hash of that backend's `requirements*.txt` | `inputs.lang == 'python'` only |
+
+Read/write semantics match the BuildKit cache: `actions/cache/restore` runs every time, `actions/cache/save` is gated on `github.event_name != 'pull_request'`. PRs read master's warm cache but never write back.
+
+The Python wheel cache uses the same ISO-week cache-buster as the Linux `DEPS_REFRESH` build-arg — same problem (unpinned `torch`/`mlx`/`diffusers`/`transformers` resolve to fresh wheels weekly), same ~one-cold-rebuild-per-week solution.
+
+The brew Cellar cache requires `HOMEBREW_NO_AUTO_UPDATE=1` and `HOMEBREW_NO_INSTALL_CLEANUP=1` (set as job-level env). Without those, `brew install` would mutate the very directories that were just restored, defeating the cache.
+
+For ccache, the workflow exports `CMAKE_ARGS=… -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache` via `$GITHUB_ENV` before running `make build-darwin-go-backend`. The Makefile in `backend/cpp/llama-cpp/` already forwards `CMAKE_ARGS` through to each variant build (`fallback`, `grpc`, `rpc-server`), so no script changes are needed. The three variants share most TUs, so ccache dedupes object files across them.
+
+### Cache budget on Darwin
+
+GitHub Actions caches are limited to 10 GB per repo. Steady-state worst case: ~800 MB Go cache + ~2 GB brew Cellar + up to 2 GB ccache + ~1.5 GB × 5 python backends. If the cap is hit, prefer collapsing the per-backend Python keys into a shared `pyenv-darwin-shared-<week>` key (accepts more cross-backend churn for a smaller footprint) before reducing other caches.
+
+## Touching the cache pipeline
+
+When changing `image_build.yml`, `backend_build.yml`, or any of the `backend/Dockerfile.*` files:
+
+1. **Don't drop `DEPS_REFRESH=...` from the build-args** without a replacement strategy (lockfiles, pinned requirements). Otherwise master will silently freeze on whichever versions were cached at the time.
+2. **Keep `tag-suffix` unique per matrix entry** — it's the cache namespace. Two matrix entries sharing a tag-suffix would clobber each other's cache.
+3. **Keep `cache-to` gated on `github.event_name != 'pull_request'`** — PRs must not write.
+4. **Keep `ignore-error=true` on `cache-to`** — quay registry hiccups must not fail builds.
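
The cache naming and read/write rules documented in `.agents/ci-caching.md` above can be sketched in shell. This is a rough illustration, not part of the workflows: the helpers `cache_ref` and `cache_flags` are hypothetical, and they only print the `docker buildx build` cache flags instead of running a build.

```bash
# Sketch of the cache naming and gating rules from .agents/ci-caching.md.
# Helper names are hypothetical; nothing here talks to the registry.

CACHE_REPO='quay.io/go-skynet/ci-cache'

# cache_ref KIND TAG_SUFFIX   (KIND is "backend" or "image")
cache_ref() {
  local kind=$1 suffix=$2
  case $kind in
    backend) printf '%s:cache%s\n' "$CACHE_REPO" "$suffix" ;;
    image)   printf '%s:cache-localai%s\n' "$CACHE_REPO" "$suffix" ;;
    *)       echo "unknown kind: $kind" >&2; return 1 ;;
  esac
}

# cache_flags EVENT_NAME REF
# Every build reads the cache; only non-PR builds write it back.
cache_flags() {
  local event=$1 ref=$2
  printf -- '--cache-from type=registry,ref=%s\n' "$ref"
  if [ "$event" != 'pull_request' ]; then
    printf -- '--cache-to type=registry,ref=%s,mode=max,ignore-error=true\n' "$ref"
  fi
}

ref=$(cache_ref backend '-gpu-nvidia-cuda-12-vllm')
cache_flags push "$ref"           # prints --cache-from and --cache-to
cache_flags pull_request "$ref"   # prints --cache-from only
```

Because the tag is derived from `tag-suffix` alone, two matrix entries sharing a suffix would resolve to the same ref, which is why the suffix has to stay unique.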
--- a/.github/workflows/backend.yml
+++ b/.github/workflows/backend.yml
@@ -141,7 +141,7 @@ jobs:
          - build-type: ''
            cuda-major-version: ""
            cuda-minor-version: ""
-            platforms: 'linux/amd64'
+            platforms: 'linux/amd64,linux/arm64'
            tag-latest: 'auto'
            tag-suffix: '-cpu-whisperx'
            runs-on: 'ubuntu-latest'
@@ -154,7 +154,7 @@ jobs:
          - build-type: ''
            cuda-major-version: ""
            cuda-minor-version: ""
-            platforms: 'linux/amd64'
+            platforms: 'linux/amd64,linux/arm64'
            tag-latest: 'auto'
            tag-suffix: '-cpu-faster-whisper'
            runs-on: 'ubuntu-latest'
@@ -920,6 +920,32 @@ jobs:
            backend: "turboquant"
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
+          - build-type: 'cublas'
+            cuda-major-version: "13"
+            cuda-minor-version: "0"
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            tag-suffix: '-gpu-nvidia-cuda-13-vllm'
+            runs-on: 'arc-runner-set'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            backend: "vllm"
+            dockerfile: "./backend/Dockerfile.python"
+            context: "./"
+            ubuntu-version: '2404'
+          - build-type: 'cublas'
+            cuda-major-version: "13"
+            cuda-minor-version: "0"
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            tag-suffix: '-gpu-nvidia-cuda-13-vllm-omni'
+            runs-on: 'arc-runner-set'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            backend: "vllm-omni"
+            dockerfile: "./backend/Dockerfile.python"
+            context: "./"
+            ubuntu-version: '2404'
          - build-type: 'cublas'
            cuda-major-version: "13"
            cuda-minor-version: "0"
@@ -1076,6 +1102,45 @@ jobs:
            backend: "diffusers"
            dockerfile: "./backend/Dockerfile.python"
            context: "./"
+          - build-type: 'l4t'
+            cuda-major-version: "13"
+            cuda-minor-version: "0"
+            platforms: 'linux/arm64'
+            tag-latest: 'auto'
+            tag-suffix: '-nvidia-l4t-cuda-13-arm64-vllm'
+            runs-on: 'ubuntu-24.04-arm'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            ubuntu-version: '2404'
+            backend: "vllm"
+            dockerfile: "./backend/Dockerfile.python"
+            context: "./"
+          - build-type: 'l4t'
+            cuda-major-version: "13"
+            cuda-minor-version: "0"
+            platforms: 'linux/arm64'
+            tag-latest: 'auto'
+            tag-suffix: '-nvidia-l4t-cuda-13-arm64-vllm-omni'
+            runs-on: 'ubuntu-24.04-arm'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            ubuntu-version: '2404'
+            backend: "vllm-omni"
+            dockerfile: "./backend/Dockerfile.python"
+            context: "./"
+          - build-type: 'l4t'
+            cuda-major-version: "13"
+            cuda-minor-version: "0"
+            platforms: 'linux/arm64'
+            tag-latest: 'auto'
+            tag-suffix: '-nvidia-l4t-cuda-13-arm64-sglang'
+            runs-on: 'ubuntu-24.04-arm'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            ubuntu-version: '2404'
+            backend: "sglang"
+            dockerfile: "./backend/Dockerfile.python"
+            context: "./"
          - build-type: 'l4t'
            cuda-major-version: "13"
            cuda-minor-version: "0"
@@ -1671,7 +1736,7 @@ jobs:
            tag-latest: 'auto'
            tag-suffix: '-gpu-intel-rerankers'
            runs-on: 'ubuntu-latest'
-            base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
+            base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
            skip-drivers: 'false'
            backend: "rerankers"
            dockerfile: "./backend/Dockerfile.python"
@@ -1684,7 +1749,7 @@ jobs:
            tag-latest: 'auto'
            tag-suffix: '-gpu-intel-sycl-f32-llama-cpp'
            runs-on: 'ubuntu-latest'
-            base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
+            base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
            skip-drivers: 'false'
            backend: "llama-cpp"
            dockerfile: "./backend/Dockerfile.llama-cpp"
@@ -2877,6 +2942,49 @@ jobs:
            dockerfile: "./backend/Dockerfile.python"
            context: "./"
            ubuntu-version: '2404'
+          # sherpa-onnx CPU
+          - build-type: ''
+            cuda-major-version: ""
+            cuda-minor-version: ""
+            platforms: 'linux/amd64,linux/arm64'
+            tag-latest: 'auto'
+            tag-suffix: '-cpu-sherpa-onnx'
+            runs-on: 'ubuntu-latest'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            backend: "sherpa-onnx"
+            dockerfile: "./backend/Dockerfile.golang"
+            context: "./"
+            ubuntu-version: '2404'
+          # sherpa-onnx CUDA 12
+          - build-type: 'cublas'
+            cuda-major-version: "12"
+            cuda-minor-version: "8"
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            tag-suffix: '-gpu-nvidia-cuda-12-sherpa-onnx'
+            runs-on: 'ubuntu-latest'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            backend: "sherpa-onnx"
+            dockerfile: "./backend/Dockerfile.golang"
+            context: "./"
+            ubuntu-version: '2404'
+          # sherpa-onnx CUDA 13 — requires onnxruntime 1.24.x+ for the
+          # gpu_cuda13 tarball; sherpa-onnx SHERPA_COMMIT pins to v1.12.39.
+          - build-type: 'cublas'
+            cuda-major-version: "13"
+            cuda-minor-version: "0"
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            tag-suffix: '-gpu-nvidia-cuda-13-sherpa-onnx'
+            runs-on: 'ubuntu-latest'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            backend: "sherpa-onnx"
+            dockerfile: "./backend/Dockerfile.golang"
+            context: "./"
+            ubuntu-version: '2404'
  backend-jobs-darwin:
    uses: ./.github/workflows/backend_build_darwin.yml
    strategy:
--- a/.github/workflows/backend_build.yml
+++ b/.github/workflows/backend_build.yml
@@ -208,6 +208,15 @@ jobs:
          username: ${{ secrets.quayUsername }}
          password: ${{ secrets.quayPassword }}

+      # Weekly cache-buster for the per-backend `make` step. Most Python
+      # backends list unpinned deps (torch, transformers, vllm, ...), so a
+      # warm cache freezes upstream versions indefinitely. Rolling this
+      # weekly forces a re-resolve of the install layer at most once per
+      # week, picking up newer wheels without a full cold rebuild.
+      - name: Compute deps refresh key
+        id: deps_refresh
+        run: echo "key=$(date -u +%Y-W%V)" >> "$GITHUB_OUTPUT"
+
      - name: Build and push
        uses: docker/build-push-action@v7
        if: github.event_name != 'pull_request'
@@ -222,9 +231,11 @@ jobs:
            BACKEND=${{ inputs.backend }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
            AMDGPU_TARGETS=${{ inputs.amdgpu-targets }}
+            DEPS_REFRESH=${{ steps.deps_refresh.outputs.key }}
          context: ${{ inputs.context }}
          file: ${{ inputs.dockerfile }}
-          cache-from: type=gha
+          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}
+          cache-to: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }},mode=max,ignore-error=true
          platforms: ${{ inputs.platforms }}
          push: ${{ github.event_name != 'pull_request' }}
          tags: ${{ steps.meta.outputs.tags }}
@@ -244,9 +255,10 @@ jobs:
            BACKEND=${{ inputs.backend }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
            AMDGPU_TARGETS=${{ inputs.amdgpu-targets }}
+            DEPS_REFRESH=${{ steps.deps_refresh.outputs.key }}
          context: ${{ inputs.context }}
          file: ${{ inputs.dockerfile }}
-          cache-from: type=gha
+          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}
          platforms: ${{ inputs.platforms }}
          push: ${{ env.quay_username != '' }}
          tags: ${{ steps.meta_pull_request.outputs.tags }}
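
To see how the `DEPS_REFRESH` key from the "Compute deps refresh key" step behaves over time, here is a small sketch. It assumes GNU `date` (as on the Ubuntu runners); the `-d` option used to pin a date is not available in BSD/macOS `date`. The wrapper name `deps_refresh_key` is hypothetical.

```bash
# Sketch: the weekly DEPS_REFRESH key for a few sample dates.
# Assumes GNU date; deps_refresh_key is a hypothetical wrapper.

# deps_refresh_key [DATE]  -> same format string the workflow step uses
deps_refresh_key() {
  date -u -d "${1:-now}" +%Y-W%V
}

deps_refresh_key 2026-04-27   # Monday -> 2026-W18
deps_refresh_key 2026-05-03   # Sunday -> 2026-W18 (same key, layer stays warm)
deps_refresh_key 2026-05-04   # Monday -> 2026-W19 (new key, layer rebuilds)
```

One detail that may be worth checking: `%Y` is the calendar year while `%V` is the ISO week number, so the two can disagree around New Year (2025-12-29 yields `2025-W01`). `%G` is the ISO week-based year if a strictly monotonic key is ever needed.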
--- a/.github/workflows/backend_build_darwin.yml
+++ b/.github/workflows/backend_build_darwin.yml
@@ -48,6 +48,13 @@ jobs:
    strategy:
      matrix:
        go-version: ['${{ inputs.go-version }}']
+    env:
+      # Keep the brew Cellar stable across cache restores. Without these,
+      # `brew install` would auto-update brew itself and re-link formulas,
+      # mutating the very paths the cache just restored.
+      HOMEBREW_NO_AUTO_UPDATE: '1'
+      HOMEBREW_NO_INSTALL_CLEANUP: '1'
+      HOMEBREW_NO_ANALYTICS: '1'
    steps:
      - name: Clone
        uses: actions/checkout@v6
@@ -58,21 +65,141 @@ jobs:
        uses: actions/setup-go@v5
        with:
          go-version: ${{ matrix.go-version }}
-          cache: false
+          # Caches ~/go/pkg/mod and ~/Library/Caches/go-build keyed on go.sum.
+          # Shared across every darwin matrix entry — first job in a run warms
+          # it, the rest hit warm.
+          cache: true

      # You can test your matrix by printing the current Go version
      - name: Display Go version
        run: go version

+      # ---- Homebrew cache ----
+      # macOS runners have no Docker daemon, so the BuildKit registry cache used
+      # for Linux backend images (see .agents/ci-caching.md) doesn't apply here.
+      # We cache the brew downloads + Cellar entries for the formulas we install
+      # below. Read on every run, write only on master/tag pushes — same policy
+      # as the Linux registry cache.
+      - name: Restore Homebrew cache
+        id: brew-cache
+        uses: actions/cache/restore@v4
+        with:
+          path: |
+            ~/Library/Caches/Homebrew/downloads
+            /opt/homebrew/Cellar/protobuf
+            /opt/homebrew/Cellar/grpc
+            /opt/homebrew/Cellar/protoc-gen-go
+            /opt/homebrew/Cellar/protoc-gen-go-grpc
+            /opt/homebrew/Cellar/libomp
+            /opt/homebrew/Cellar/llvm
+            /opt/homebrew/Cellar/ccache
+          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}
+
      - name: Dependencies
        run: |
-          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm
+          # ccache is always installed (used by the llama-cpp variant build) so
+          # the brew cache content stays stable across every backend in the
+          # matrix — they all share one cache key.
+          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache
+
+      - name: Save Homebrew cache
+        if: github.event_name != 'pull_request' && steps.brew-cache.outputs.cache-hit != 'true'
+        uses: actions/cache/save@v4
+        with:
+          path: |
+            ~/Library/Caches/Homebrew/downloads
+            /opt/homebrew/Cellar/protobuf
+            /opt/homebrew/Cellar/grpc
+            /opt/homebrew/Cellar/protoc-gen-go
+            /opt/homebrew/Cellar/protoc-gen-go-grpc
+            /opt/homebrew/Cellar/libomp
+            /opt/homebrew/Cellar/llvm
+            /opt/homebrew/Cellar/ccache
+          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}
+
+      # ---- ccache for llama.cpp CMake builds ----
+      # Three CMake variants (fallback, grpc, rpc-server) compile the same
+      # llama.cpp source tree with overlapping flags — ccache dedupes object
+      # files across them. Key on the pinned LLAMA_VERSION so a pin bump
+      # invalidates cleanly; restore-keys fall back to the latest entry for the
+      # same pin, so unchanged TUs stay warm even when the exact key misses.
+      - name: Compute llama.cpp version
+        if: inputs.backend == 'llama-cpp'
+        id: llama-version
+        run: |
+          version=$(grep '^LLAMA_VERSION' backend/cpp/llama-cpp/Makefile | head -1 | cut -d= -f2 | cut -d'?' -f1 | tr -d ' ')
+          echo "version=${version}" >> "$GITHUB_OUTPUT"
+
+      - name: Restore ccache
+        if: inputs.backend == 'llama-cpp'
+        id: ccache-cache
+        uses: actions/cache/restore@v4
+        with:
+          path: ~/Library/Caches/ccache
+          key: ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-${{ github.run_id }}
+          restore-keys: |
+            ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-
+
+      - name: Configure ccache
+        if: inputs.backend == 'llama-cpp'
+        run: |
+          mkdir -p "$HOME/Library/Caches/ccache"
+          ccache -M 2G
+          ccache -z
+          # llama-cpp-darwin.sh reads CMAKE_ARGS / CCACHE_DIR from env.
+          {
+            echo "CMAKE_ARGS=${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache"
+            echo "CCACHE_DIR=$HOME/Library/Caches/ccache"
+          } >> "$GITHUB_ENV"
+
+      # ---- Python wheel cache (uv + pip) ----
+      # Mirrors the Linux DEPS_REFRESH cadence (see .agents/ci-caching.md): the
+      # ISO-week segment of the cache key forces at most one cold rebuild per
+      # backend per week, automatically picking up newer wheels for unpinned
+      # deps (torch, mlx, diffusers, …). Restore-keys fall back to the most
+      # recent build of the same backend so off-week PRs still hit warm.
+      - name: Compute weekly cache bucket
+        if: inputs.lang == 'python'
+        id: weekly
+        run: echo "bucket=$(date -u +%Y-W%V)" >> "$GITHUB_OUTPUT"
+
+      - name: Restore Python wheel cache
+        if: inputs.lang == 'python'
+        id: pyenv-cache
+        uses: actions/cache/restore@v4
+        with:
+          path: |
+            ~/Library/Caches/pip
+            ~/Library/Caches/uv
+          key: pyenv-darwin-${{ inputs.backend }}-${{ steps.weekly.outputs.bucket }}-${{ hashFiles(format('backend/python/{0}/requirements*.txt', inputs.backend)) }}
+          restore-keys: |
+            pyenv-darwin-${{ inputs.backend }}-

      - name: Build ${{ inputs.backend }}-darwin
        run: |
          make protogen-go
          BACKEND=${{ inputs.backend }} BUILD_TYPE=${{ inputs.build-type }} USE_PIP=${{ inputs.use-pip }} make build-darwin-${{ inputs.lang }}-backend

+      - name: ccache stats
+        if: inputs.backend == 'llama-cpp'
+        run: ccache -s
+
+      - name: Save ccache
+        if: inputs.backend == 'llama-cpp' && github.event_name != 'pull_request'
+        uses: actions/cache/save@v4
+        with:
+          path: ~/Library/Caches/ccache
+          key: ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-${{ github.run_id }}
+
+      - name: Save Python wheel cache
+        if: inputs.lang == 'python' && github.event_name != 'pull_request' && steps.pyenv-cache.outputs.cache-hit != 'true'
+        uses: actions/cache/save@v4
+        with:
+          path: |
+            ~/Library/Caches/pip
+            ~/Library/Caches/uv
+          key: pyenv-darwin-${{ inputs.backend }}-${{ steps.weekly.outputs.bucket }}-${{ hashFiles(format('backend/python/{0}/requirements*.txt', inputs.backend)) }}
+
      - name: Upload ${{ inputs.backend }}.tar
        uses: actions/upload-artifact@v7
        with:
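
The exact-key and `restore-keys` lookup that the darwin caches above rely on can be modelled roughly as follows. This is not the `actions/cache` implementation, just a sketch of its documented behaviour (exact key first, otherwise the most recently created entry matching a prefix). The helper `pick_cache` and the sample keys are hypothetical.

```bash
# Sketch of the exact-key / restore-keys lookup. Hypothetical helper and keys.

# pick_cache KEY PREFIX SAVED_KEY...   (saved keys listed oldest to newest)
# Prints the key that would be restored: the exact KEY if present, otherwise
# the newest saved key starting with PREFIX, otherwise nothing (cold build).
pick_cache() {
  local key=$1 prefix=$2 saved fallback=''
  shift 2
  for saved in "$@"; do
    if [ "$saved" = "$key" ]; then
      printf '%s\n' "$saved"
      return 0
    fi
    case $saved in
      "$prefix"*) fallback=$saved ;;
    esac
  done
  if [ -n "$fallback" ]; then printf '%s\n' "$fallback"; fi
}

# A new ISO week has no exact entry yet, so last week's wheels are restored.
pick_cache 'pyenv-darwin-diffusers-2026-W18-abc' 'pyenv-darwin-diffusers-' \
  'pyenv-darwin-diffusers-2026-W16-abc' \
  'pyenv-darwin-diffusers-2026-W17-abc' \
  'pyenv-darwin-vllm-2026-W17-def'
```

In this model a prefix fallback does not count as an exact hit, which is consistent with the save steps above being gated on `cache-hit != 'true'`: a new weekly entry still gets written on the next master build.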
--- a/.github/workflows/gallery-agent.yaml
+++ b/.github/workflows/gallery-agent.yaml
@@ -2,7 +2,7 @@ name: Gallery Agent
 on:

  schedule:
-    - cron: '0 */3 * * *'  # Run every 4 hours
+    - cron: '0 */12 * * *'  # Run every 12 hours
  workflow_dispatch:
    inputs:
      search_term:
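
As a quick check of the schedule change above, this sketch lists which hours of the day a cron hour field of the form `*/STEP` matches. The helper name `hours_matched` is hypothetical; it only reproduces the step arithmetic, not a full cron parser.

```bash
# hours_matched STEP -> hours (0-23) matched by the cron hour field "*/STEP"
hours_matched() {
  local step=$1 h=0 out=''
  while [ "$h" -lt 24 ]; do
    out="$out$h "
    h=$((h + step))
  done
  printf '%s\n' "${out% }"
}

hours_matched 3    # 0 3 6 9 12 15 18 21  (old schedule: every 3 hours)
hours_matched 12   # 0 12                 (new schedule: every 12 hours)
```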
--- a/.github/workflows/generate_grpc_cache.yaml
+++ b/.github/workflows/generate_grpc_cache.yaml
@@ -1,96 +0,0 @@
-name: 'generate and publish GRPC docker caches'
-
-on:
-  workflow_dispatch:
-
-  schedule:
-    # daily at midnight
-    - cron: '0 0 * * *'
-
-concurrency:
-  group: grpc-cache-${{ github.head_ref || github.ref }}-${{ github.repository }}
-  cancel-in-progress: true
-
-jobs:
-  generate_caches:
-    if: github.repository == 'mudler/LocalAI'
-    strategy:
-      matrix:
-        include:
-          - grpc-base-image: ubuntu:24.04
-            runs-on: 'ubuntu-latest'
-            platforms: 'linux/amd64,linux/arm64'
-    runs-on: ${{matrix.runs-on}}
-    steps:
-      - name: Release space from worker
-        if: matrix.runs-on == 'ubuntu-latest'
-        run: |
-          echo "Listing top largest packages"
-          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
-          head -n 30 <<< "${pkgs}"
-          echo
-          df -h
-          echo
-          sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
-          sudo apt-get remove --auto-remove android-sdk-platform-tools || true
-          sudo apt-get purge --auto-remove android-sdk-platform-tools || true
-          sudo rm -rf /usr/local/lib/android
-          sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
-          sudo rm -rf /usr/share/dotnet
-          sudo apt-get remove -y '^mono-.*' || true
-          sudo apt-get remove -y '^ghc-.*' || true
-          sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
-          sudo apt-get remove -y 'php.*' || true
-          sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
-          sudo apt-get remove -y '^google-.*' || true
-          sudo apt-get remove -y azure-cli || true
-          sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
-          sudo apt-get remove -y '^gfortran-.*' || true
-          sudo apt-get remove -y microsoft-edge-stable || true
-          sudo apt-get remove -y firefox || true
-          sudo apt-get remove -y powershell || true
-          sudo apt-get remove -y r-base-core || true
-          sudo apt-get autoremove -y
-          sudo apt-get clean
-          echo
-          echo "Listing top largest packages"
-          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
-          head -n 30 <<< "${pkgs}"
-          echo
-          sudo rm -rfv build || true
-          sudo rm -rf /usr/share/dotnet || true
-          sudo rm -rf /opt/ghc || true
-          sudo rm -rf "/usr/local/share/boost" || true
-          sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
-          df -h
-
-      - name: Set up QEMU
-        uses: docker/setup-qemu-action@master
-        with:
-          platforms: all
-
-      - name: Set up Docker Buildx
-        id: buildx
-        uses: docker/setup-buildx-action@master
-
-      - name: Checkout
-        uses: actions/checkout@v6
-
-      - name: Cache GRPC
-        uses: docker/build-push-action@v7
-        with:
-          builder: ${{ steps.buildx.outputs.name }}
-          # The build-args MUST be an EXACT match between the image cache and other workflow steps that want to use that cache.
-          # This means that even the MAKEFLAGS have to be an EXACT match.
-          # If the build-args are not an EXACT match, it will result in a cache miss, which will require GRPC to be built from scratch.
-          build-args: |
-            GRPC_BASE_IMAGE=${{ matrix.grpc-base-image }}
-            GRPC_MAKEFLAGS=--jobs=4 --output-sync=target
-            GRPC_VERSION=v1.65.0
-          context: .
-          file: ./Dockerfile
-          cache-to: type=gha,ignore-error=true
-          cache-from: type=gha
-          target: grpc
-          platforms: ${{ matrix.platforms }}
-          push: false
--- a/.github/workflows/generate_intel_image.yaml
+++ b/.github/workflows/generate_intel_image.yaml
@@ -16,7 +16,7 @@ jobs:
    strategy:
      matrix:
        include:
-          - base-image: intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04
+          - base-image: intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04
            runs-on: 'arc-runner-set'
            platforms: 'linux/amd64'
    runs-on: ${{matrix.runs-on}}
--- a/.github/workflows/image-pr.yml
+++ b/.github/workflows/image-pr.yml
@@ -20,7 +20,6 @@
        platforms: ${{ matrix.platforms }}
        runs-on: ${{ matrix.runs-on }}
        base-image: ${{ matrix.base-image }}
-        grpc-base-image: ${{ matrix.grpc-base-image }}
        makeflags: ${{ matrix.makeflags }}
        ubuntu-version: ${{ matrix.ubuntu-version }}
      secrets:
@@ -60,15 +59,13 @@
              tag-latest: 'false'
              tag-suffix: '-hipblas'
              base-image: "rocm/dev-ubuntu-24.04:7.2.1"
-              grpc-base-image: "ubuntu:24.04"
              runs-on: 'ubuntu-latest'
              makeflags: "--jobs=3 --output-sync=target"
              ubuntu-version: '2404'
            - build-type: 'sycl'
              platforms: 'linux/amd64'
              tag-latest: 'false'
-              base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
-              grpc-base-image: "ubuntu:24.04"
+              base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
              tag-suffix: 'sycl'
              runs-on: 'ubuntu-latest'
              makeflags: "--jobs=3 --output-sync=target"
--- a/.github/workflows/image.yml
+++ b/.github/workflows/image.yml
@@ -25,7 +25,6 @@
        platforms: ${{ matrix.platforms }}
        runs-on: ${{ matrix.runs-on }}
        base-image: ${{ matrix.base-image }}
-        grpc-base-image: ${{ matrix.grpc-base-image }}
        makeflags: ${{ matrix.makeflags }}
        ubuntu-version: ${{ matrix.ubuntu-version }}
        ubuntu-codename: ${{ matrix.ubuntu-codename }}
@@ -42,12 +41,11 @@
              tag-latest: 'auto'
              tag-suffix: '-gpu-hipblas'
              base-image: "rocm/dev-ubuntu-24.04:7.2.1"
-              grpc-base-image: "ubuntu:24.04"
              runs-on: 'ubuntu-latest'
              makeflags: "--jobs=3 --output-sync=target"
              ubuntu-version: '2404'
              ubuntu-codename: 'noble'
-  
+
    core-image-build:
      if: github.repository == 'mudler/LocalAI'
      uses: ./.github/workflows/image_build.yml
@@ -60,7 +58,6 @@
        platforms: ${{ matrix.platforms }}
        runs-on: ${{ matrix.runs-on }}
        base-image: ${{ matrix.base-image }}
-        grpc-base-image: ${{ matrix.grpc-base-image }}
        makeflags: ${{ matrix.makeflags }}
        skip-drivers: ${{ matrix.skip-drivers }}
        ubuntu-version: ${{ matrix.ubuntu-version }}
@@ -121,8 +118,7 @@
            - build-type: 'intel'
              platforms: 'linux/amd64'
              tag-latest: 'auto'
-              base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
-              grpc-base-image: "ubuntu:24.04"
+              base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
              tag-suffix: '-gpu-intel'
              runs-on: 'ubuntu-latest'
              makeflags: "--jobs=3 --output-sync=target"
@@ -141,7 +137,6 @@
        platforms: ${{ matrix.platforms }}
        runs-on: ${{ matrix.runs-on }}
        base-image: ${{ matrix.base-image }}
-        grpc-base-image: ${{ matrix.grpc-base-image }}
        makeflags: ${{ matrix.makeflags }}
        skip-drivers: ${{ matrix.skip-drivers }}
        ubuntu-version: ${{ matrix.ubuntu-version }}
--- a/.github/workflows/image_build.yml
+++ b/.github/workflows/image_build.yml
@@ -8,11 +8,6 @@ on:
        description: 'Base image'
        required: true
        type: string
-      grpc-base-image:
-        description: 'GRPC Base image, must be a compatible image with base-image'
-        required: false
-        default: ''
-        type: string
      build-type:
        description: 'Build type'
        default: ''
@@ -201,25 +196,19 @@ jobs:
        if: github.event_name != 'pull_request'
        with:
          builder: ${{ steps.buildx.outputs.name }}
-          # The build-args MUST be an EXACT match between the image cache and other workflow steps that want to use that cache.
-          # This means that even the MAKEFLAGS have to be an EXACT match.
-          # If the build-args are not an EXACT match, it will result in a cache miss, which will require GRPC to be built from scratch.
-          # This is why some build args like GRPC_VERSION and MAKEFLAGS are hardcoded
          build-args: |
            BUILD_TYPE=${{ inputs.build-type }}
            CUDA_MAJOR_VERSION=${{ inputs.cuda-major-version }}
            CUDA_MINOR_VERSION=${{ inputs.cuda-minor-version }}
            BASE_IMAGE=${{ inputs.base-image }}
-            GRPC_BASE_IMAGE=${{ inputs.grpc-base-image || inputs.base-image }}
-            GRPC_MAKEFLAGS=--jobs=4 --output-sync=target
-            GRPC_VERSION=v1.65.0
            MAKEFLAGS=${{ inputs.makeflags }}
            SKIP_DRIVERS=${{ inputs.skip-drivers }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
            UBUNTU_CODENAME=${{ inputs.ubuntu-codename }}
          context: .
          file: ./Dockerfile
-          cache-from: type=gha
+          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }}
+          cache-to: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }},mode=max,ignore-error=true
          platforms: ${{ inputs.platforms }}
          push: ${{ github.event_name != 'pull_request' }}
          tags: ${{ steps.meta.outputs.tags }}
@@ -230,25 +219,18 @@ jobs:
        if: github.event_name == 'pull_request'
        with:
          builder: ${{ steps.buildx.outputs.name }}
-          # The build-args MUST be an EXACT match between the image cache and other workflow steps that want to use that cache.
-          # This means that even the MAKEFLAGS have to be an EXACT match.
-          # If the build-args are not an EXACT match, it will result in a cache miss, which will require GRPC to be built from scratch.
-          # This is why some build args like GRPC_VERSION and MAKEFLAGS are hardcoded
          build-args: |
            BUILD_TYPE=${{ inputs.build-type }}
            CUDA_MAJOR_VERSION=${{ inputs.cuda-major-version }}
            CUDA_MINOR_VERSION=${{ inputs.cuda-minor-version }}
            BASE_IMAGE=${{ inputs.base-image }}
-            GRPC_BASE_IMAGE=${{ inputs.grpc-base-image || inputs.base-image }}
-            GRPC_MAKEFLAGS=--jobs=4 --output-sync=target
-            GRPC_VERSION=v1.65.0
            MAKEFLAGS=${{ inputs.makeflags }}
            SKIP_DRIVERS=${{ inputs.skip-drivers }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
            UBUNTU_CODENAME=${{ inputs.ubuntu-codename }}
          context: .
          file: ./Dockerfile
-          cache-from: type=gha
+          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }}
          platforms: ${{ inputs.platforms }}
          #push: true
          tags: ${{ steps.meta_pull_request.outputs.tags }}
--- a/.github/workflows/test-extra.yml
+++ b/.github/workflows/test-extra.yml
@@ -40,6 +40,7 @@ jobs:
      kokoros: ${{ steps.detect.outputs.kokoros }}
      insightface: ${{ steps.detect.outputs.insightface }}
      speaker-recognition: ${{ steps.detect.outputs.speaker-recognition }}
+      sherpa-onnx: ${{ steps.detect.outputs.sherpa-onnx }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6
@@ -506,6 +507,72 @@ jobs:
      - name: Build llama-cpp backend image and run audio transcription gRPC e2e tests
        run: |
          make test-extra-backend-llama-cpp-transcription
+  # Realtime e2e with sherpa-onnx driving VAD + STT + TTS against a mocked LLM.
+  # Builds the sherpa-onnx Docker image, extracts the rootfs so the e2e suite
+  # can discover the backend binary + shared libs, downloads the three model
+  # bundles (silero-vad, omnilingual-asr, vits-ljs) and drives the realtime
+  # websocket spec end-to-end.
+  tests-sherpa-onnx-realtime:
+    needs: detect-changes
+    if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
+    runs-on: ubuntu-latest
+    timeout-minutes: 90
+    steps:
+      - name: Clone
+        uses: actions/checkout@v6
+        with:
+          submodules: true
+      - name: Setup Go
+        uses: actions/setup-go@v5
+        with:
+          go-version: '1.25.4'
+      - name: Setup Node.js
+        uses: actions/setup-node@v6
+        with:
+          node-version: '22'
+      - name: Build sherpa-onnx backend image and run realtime e2e tests
+        run: |
+          make test-extra-e2e-realtime-sherpa
+  # Streaming ASR via the sherpa-onnx online recognizer (zipformer
+  # transducer). Exercises both AudioTranscription (buffered) and
+  # AudioTranscriptionStream (real-time deltas) on the e2e-backends
+  # harness.
+  tests-sherpa-onnx-grpc-transcription:
+    needs: detect-changes
+    if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
+    runs-on: ubuntu-latest
+    timeout-minutes: 90
+    steps:
+      - name: Clone
+        uses: actions/checkout@v6
+        with:
+          submodules: true
+      - name: Setup Go
+        uses: actions/setup-go@v5
+        with:
+          go-version: '1.25.4'
+      - name: Build sherpa-onnx backend image and run streaming ASR gRPC e2e tests
+        run: |
+          make test-extra-backend-sherpa-onnx-transcription
+  # VITS TTS via the sherpa-onnx backend. Drives both TTS (file write) and
+  # TTSStream (PCM chunks) on the e2e-backends harness.
+  tests-sherpa-onnx-grpc-tts:
+    needs: detect-changes
+    if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
+    runs-on: ubuntu-latest
+    timeout-minutes: 90
+    steps:
+      - name: Clone
+        uses: actions/checkout@v6
+        with:
+          submodules: true
+      - name: Setup Go
+        uses: actions/setup-go@v5
+        with:
+          go-version: '1.25.4'
+      - name: Build sherpa-onnx backend image and run TTS gRPC e2e tests
+        run: |
+          make test-extra-backend-sherpa-onnx-tts
  tests-ik-llama-cpp-grpc:
    needs: detect-changes
    if: needs.detect-changes.outputs.ik-llama-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -9,9 +9,6 @@ on:
    tags:
      - '*'

-env:
-  GRPC_VERSION: v1.65.0
-
 concurrency:
  group: ci-tests-${{ github.head_ref || github.ref }}-${{ github.repository }}
  cancel-in-progress: true
@@ -195,7 +192,7 @@ jobs:
        run: go version
      - name: Dependencies
        run: |
-          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm opus
+          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm opus ffmpeg
          pip install --user --no-cache-dir grpcio-tools grpcio
      - name: Setup Node.js
        uses: actions/setup-node@v6
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -19,6 +19,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
 |------|-------------|
 | [.agents/ai-coding-assistants.md](.agents/ai-coding-assistants.md) | Policy for AI-assisted contributions — licensing, DCO, attribution |
 | [.agents/building-and-testing.md](.agents/building-and-testing.md) | Building the project, running tests, Docker builds for specific platforms |
+| [.agents/ci-caching.md](.agents/ci-caching.md) | CI build cache layout (registry-backed BuildKit cache on quay.io/go-skynet/ci-cache), `DEPS_REFRESH` weekly cache-buster for unpinned Python deps, manual eviction |
 | [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist, including importer integration (the `/import-model` dropdown is server-driven from `GET /backends/known`) |
 | [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
 | [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,5 +1,4 @@
 ARG BASE_IMAGE=ubuntu:24.04
-ARG GRPC_BASE_IMAGE=${BASE_IMAGE}
 ARG INTEL_BASE_IMAGE=${BASE_IMAGE}
 ARG UBUNTU_CODENAME=noble

@@ -149,6 +148,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
        apt-get update && \
        apt-get install -y --no-install-recommends \
            hipblas-dev \
+            hipblaslt-dev \
            rocblas-dev && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/* && \
--- a/Makefile
+++ b/Makefile
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad backends/sherpa-onnx

 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -394,7 +394,13 @@ protoc:
 .PHONY: protogen-go
 protogen-go: protoc install-go-tools
 	mkdir -p pkg/grpc/proto
-	./protoc --experimental_allow_proto3_optional -Ibackend/ --go_out=pkg/grpc/proto/ --go_opt=paths=source_relative --go-grpc_out=pkg/grpc/proto/ --go-grpc_opt=paths=source_relative \
+	# install-go-tools writes protoc-gen-go and protoc-gen-go-grpc into
+	# $(shell go env GOPATH)/bin, which isn't on every dev's PATH. protoc
+	# resolves its code-gen plugins via PATH, so without this prefix the
+	# generate step fails with "protoc-gen-go: program not found". Prepend
+	# GOPATH/bin so the freshly-installed plugins win without requiring a
+	# shell-profile change.
+	PATH="$$(go env GOPATH)/bin:$$PATH" ./protoc --experimental_allow_proto3_optional -Ibackend/ --go_out=pkg/grpc/proto/ --go_opt=paths=source_relative --go-grpc_out=pkg/grpc/proto/ --go-grpc_opt=paths=source_relative \
    backend/backend.proto

 core/config/inference_defaults.json: ## Fetch inference defaults from unsloth (only if missing)
@@ -780,6 +786,44 @@ test-extra-backend-speaker-recognition-ecapa: docker-build-speaker-recognition
 test-extra-backend-speaker-recognition-all: \
 	test-extra-backend-speaker-recognition-ecapa

+## Realtime e2e with sherpa-onnx driving VAD + STT + TTS against a mocked
+## LLM. Extracts the sherpa-onnx Docker image rootfs, downloads the three
+## gallery-referenced model bundles (silero-vad, omnilingual-asr, vits-ljs),
+## writes the corresponding model config YAMLs, and runs the realtime
+## websocket spec in tests/e2e with REALTIME_* env vars wiring the sherpa
+## slots into the pipeline. The LLM slot stays on the in-repo mock-backend
+## registered unconditionally by tests/e2e/e2e_suite_test.go. See
+## tests/e2e/run-realtime-sherpa.sh for the full orchestration.
+test-extra-e2e-realtime-sherpa: build-mock-backend docker-build-sherpa-onnx protogen-go react-ui
+	bash tests/e2e/run-realtime-sherpa.sh
+
+## Streaming ASR via the sherpa-onnx online recognizer. Uses the streaming
+## zipformer English model (encoder/decoder/joiner int8 + tokens) from the
+## sherpa-onnx gallery entry. Drives both AudioTranscription and
+## AudioTranscriptionStream via the e2e-backends gRPC harness; streaming
+## emits real partial deltas during decode. Each file is renamed on download
+## to the shape sherpa-onnx's online loader expects (encoder.int8.onnx etc.).
+test-extra-backend-sherpa-onnx-transcription: docker-build-sherpa-onnx
+	BACKEND_IMAGE=local-ai-backend:sherpa-onnx \
+	BACKEND_TEST_MODEL_URL='https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/encoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx#encoder.int8.onnx' \
+	BACKEND_TEST_EXTRA_FILES='https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/decoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx#decoder.int8.onnx|https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/joiner-epoch-99-avg-1-chunk-16-left-128.int8.onnx#joiner.int8.onnx|https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/tokens.txt' \
+	BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
+	BACKEND_TEST_CAPS=health,load,transcription \
+	BACKEND_TEST_OPTIONS=subtype=online \
+	$(MAKE) test-extra-backend
+
+## VITS TTS via the sherpa-onnx backend. Pulls the individual files from
+## HuggingFace (the vits-ljs release tarball lives on the k2-fsa github
+## but is also mirrored as discrete files on HF). Exercises both
+## TTS (write-to-file) and TTSStream (PCM chunks + WAV header) via the
+## e2e-backends gRPC harness.
+test-extra-backend-sherpa-onnx-tts: docker-build-sherpa-onnx
+	BACKEND_IMAGE=local-ai-backend:sherpa-onnx \
+	BACKEND_TEST_MODEL_URL='https://huggingface.co/csukuangfj/vits-ljs/resolve/main/vits-ljs.onnx#vits-ljs.onnx' \
+	BACKEND_TEST_EXTRA_FILES='https://huggingface.co/csukuangfj/vits-ljs/resolve/main/tokens.txt|https://huggingface.co/csukuangfj/vits-ljs/resolve/main/lexicon.txt' \
+	BACKEND_TEST_CAPS=health,load,tts \
+	$(MAKE) test-extra-backend
+
 ## sglang mirrors the vllm setup: HuggingFace model id, same tiny Qwen,
 ## tool-call extraction via sglang's native qwen parser. CPU builds use
 ## sglang's upstream pyproject_cpu.toml recipe (see backend/python/sglang/install.sh).
@@ -839,7 +883,7 @@ docker-cuda12:

 docker-image-intel:
 	docker build \
-		--build-arg BASE_IMAGE=intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04 \
+		--build-arg BASE_IMAGE=intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04 \
 		--build-arg IMAGE_TYPE=$(IMAGE_TYPE) \
 		--build-arg GO_TAGS="$(GO_TAGS)" \
 		--build-arg MAKEFLAGS="$(DOCKER_MAKEFLAGS)" \
@@ -917,6 +961,7 @@ BACKEND_VOXTRAL = voxtral|golang|.|false|true
 BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
 BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
 BACKEND_OPUS = opus|golang|.|false|true
+BACKEND_SHERPA_ONNX = sherpa-onnx|golang|.|false|true

 # Python backends with root context
 BACKEND_RERANKERS = rerankers|python|.|false|true
@@ -1029,12 +1074,13 @@ $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_QUANTIZATION)))
 $(eval $(call generate-docker-build-target,$(BACKEND_TINYGRAD)))
 $(eval $(call generate-docker-build-target,$(BACKEND_KOKOROS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
+$(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))

 # Pattern rule for docker-save targets
 docker-save-%: backend-images
 	docker save local-ai-backend:$* -o backend-images/$*.tar

-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-insightface docker-build-speaker-recognition
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx

 ########################################################
 ### Mock Backend for E2E Tests
--- a/backend/Dockerfile.golang
+++ b/backend/Dockerfile.golang
@@ -147,6 +147,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
        apt-get update && \
        apt-get install -y --no-install-recommends \
            hipblas-dev \
+            hipblaslt-dev \
            rocblas-dev && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/* && \
--- a/backend/Dockerfile.ik-llama-cpp
+++ b/backend/Dockerfile.ik-llama-cpp
@@ -204,6 +204,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
        apt-get update && \
        apt-get install -y --no-install-recommends \
            hipblas-dev \
+            hipblaslt-dev \
            rocblas-dev && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/* && \
--- a/backend/Dockerfile.llama-cpp
+++ b/backend/Dockerfile.llama-cpp
@@ -206,6 +206,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
        apt-get update && \
        apt-get install -y --no-install-recommends \
            hipblas-dev \
+            hipblaslt-dev \
            rocblas-dev && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/* && \
--- a/backend/Dockerfile.python
+++ b/backend/Dockerfile.python
@@ -162,6 +162,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
        apt-get update && \
        apt-get install -y --no-install-recommends \
            hipblas-dev \
+            hipblaslt-dev \
            rocblas-dev && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/* && \
@@ -202,6 +203,13 @@ COPY scripts/build/package-gpu-libs.sh /package-gpu-libs.sh
 ARG FROM_SOURCE=""
 ENV FROM_SOURCE=${FROM_SOURCE}

+# Cache-buster for the per-backend `make` step. Most Python backends list
+# unpinned deps (torch, transformers, vllm, ...), so a warm registry cache
+# would otherwise freeze upstream versions indefinitely. CI passes a value
+# that rolls weekly so the install layer is rebuilt at most once per week
+# and picks up newer wheels from PyPI / nightly indexes.
+ARG DEPS_REFRESH=initial
+
 RUN cd /${BACKEND} && PORTABLE_PYTHON=true make

 # Package GPU libraries into the backend's lib directory
@@ -216,4 +224,4 @@ RUN if [ -f "/${BACKEND}/package.sh" ]; then \

 FROM scratch
 ARG BACKEND=rerankers
-COPY --from=builder /${BACKEND}/ /
+COPY --from=builder /${BACKEND}/ /
--- a/backend/Dockerfile.turboquant
+++ b/backend/Dockerfile.turboquant
@@ -204,6 +204,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
        apt-get update && \
        apt-get install -y --no-install-recommends \
            hipblas-dev \
+            hipblaslt-dev \
            rocblas-dev && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/* && \
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@

-IK_LLAMA_VERSION?=286ce324baed17c95faec77792eaa6bdb1c7a5f5
+IK_LLAMA_VERSION?=3a945af45d45936341a45bbf7deda56776a4af26
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/ik-llama-cpp/patches/0002-clip-ggml-quantize-chunk-user-data.patch
+++ b/backend/cpp/ik-llama-cpp/patches/0002-clip-ggml-quantize-chunk-user-data.patch
@@ -0,0 +1,11 @@
+--- a/examples/llava/clip.cpp
+++ b/examples/llava/clip.cpp
+@@ -2494,7 +2494,7 @@
+             }
+             new_data = work.data();
+
+-            new_size = ggml_quantize_chunk(new_type, f32_data, new_data, 0, n_elms/cur->ne[0], cur->ne[0], nullptr);
++            new_size = ggml_quantize_chunk(new_type, f32_data, new_data, 0, n_elms/cur->ne[0], cur->ne[0], nullptr, nullptr);
+         } else {
+             new_type = cur->type;
+             new_data = cur->data;
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@

-LLAMA_VERSION?=0d0764dfd257c0ae862525c05778207f87b99b1c
+LLAMA_VERSION?=f53577432541bb9edc1588c4ef45c66bf07e4468
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -642,6 +642,21 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            } else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
                params.no_op_offload = false;
            }
+        } else if (!strcmp(optname, "split_mode") || !strcmp(optname, "sm")) {
+            // Accepts: none | layer | row | tensor (the latter requires a llama.cpp build
+            // that includes ggml-org/llama.cpp#19378, FlashAttention enabled, and KV-cache
+            // quantization disabled).
+            if (optval != NULL) {
+                if (optval_str == "none") {
+                    params.split_mode = LLAMA_SPLIT_MODE_NONE;
+                } else if (optval_str == "layer") {
+                    params.split_mode = LLAMA_SPLIT_MODE_LAYER;
+                } else if (optval_str == "row") {
+                    params.split_mode = LLAMA_SPLIT_MODE_ROW;
+                } else if (optval_str == "tensor") {
+                    params.split_mode = LLAMA_SPLIT_MODE_TENSOR;
+                }
+            }
        } else if (!strcmp(optname, "kv_unified") || !strcmp(optname, "unified_kv")) {
            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
                params.kv_unified = true;
--- a/backend/cpp/turboquant/Makefile
+++ b/backend/cpp/turboquant/Makefile
@@ -1,7 +1,7 @@

 # Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
 # Auto-bumped nightly by .github/workflows/bump_deps.yaml.
-TURBOQUANT_VERSION?=627ebbc6e27727bd4f65422d8aa60b13404993c8
+TURBOQUANT_VERSION?=11a241d0db78a68e0a5b99fe6f36de6683100f6a
 LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant

 CMAKE_ARGS?=
--- a/backend/go/local-store/store.go
+++ b/backend/go/local-store/store.go
@@ -4,7 +4,6 @@ package main
 // It is meant to be used by the main executable that is the server for the specific backend type (falcon, gpt3, etc)
 import (
 	"container/heap"
-	"errors"
 	"fmt"
 	"math"
 	"slices"
@@ -100,9 +99,16 @@ func sortIntoKeySlicese(keys []*pb.StoresKey) [][]float32 {
 }

 func (s *Store) Load(opts *pb.ModelOptions) error {
-	if opts.Model != "" {
-		return errors.New("not implemented")
-	}
+	// local-store is an in-memory vector store with no on-disk artefact to
+	// load — opts.Model is just a namespace identifier. The old `!= ""` guard
+	// rejected any non-empty model name with "not implemented", which broke
+	// callers that pass a namespace to isolate embedding spaces (face vs.
+	// voice biometrics both go through local-store but need distinct stores
+	// so ArcFace 512-D and ECAPA-TDNN 192-D don't collide). Namespace
+	// isolation is already handled upstream: ModelLoader spawns a fresh
+	// local-store process per (backend, model) tuple, so each namespace is
+	// its own Store{} instance. Nothing to do here beyond accepting the load.
+	_ = opts
 	return nil
 }
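Note on the `Load` change above: the comment explains that isolation comes from running one store instance per (backend, model) tuple, so `Load` only has to accept the namespace. The sketch below illustrates that idea with an in-process map standing in for the separate processes the comment describes. All names here (`storeKey`, `store`, `registry`, `load`) are hypothetical and are not the ModelLoader API.

```go
package main

import "fmt"

// storeKey identifies one isolated store; the field names are hypothetical.
type storeKey struct {
	backend string
	model   string
}

// store is a toy stand-in for an in-memory vector store.
type store struct {
	vectors [][]float32
}

// registry hands out one store per (backend, model) pair. In the real
// system this role is played by separate backend processes.
type registry struct {
	stores map[storeKey]*store
}

func newRegistry() *registry {
	return &registry{stores: make(map[storeKey]*store)}
}

// load returns the store for the given pair, creating it on first use.
func (r *registry) load(backend, model string) *store {
	key := storeKey{backend: backend, model: model}
	if s, ok := r.stores[key]; ok {
		return s
	}
	s := &store{}
	r.stores[key] = s
	return s
}

func main() {
	r := newRegistry()
	faces := r.load("local-store", "faces")
	voices := r.load("local-store", "voices")
	// Embeddings of different dimensionality never share a store.
	faces.vectors = append(faces.vectors, make([]float32, 512))
	voices.vectors = append(voices.vectors, make([]float32, 192))
	fmt.Println(len(faces.vectors), len(voices.vectors), faces != voices)
}
```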

--- a/backend/go/sherpa-onnx/.gitignore
+++ b/backend/go/sherpa-onnx/.gitignore
@@ -0,0 +1,11 @@
+.cache/
+sources/
+build*/
+package/
+backend-assets/
+sherpa-onnx
+*.so
+compile_commands.json
+sherpa-onnx-whisper-*
+vits-ljs/
+streaming-zipformer-en/
--- a/backend/go/sherpa-onnx/Makefile
+++ b/backend/go/sherpa-onnx/Makefile
@@ -0,0 +1,120 @@
+CURRENT_DIR=$(abspath ./)
+GOCMD=go
+
+ONNX_VERSION?=1.24.4
+# v1.12.39 — includes upstream's onnxruntime 1.24.4 bump (#3501). Earlier
+# pinned commits only support onnxruntime 1.23.2, which has no CUDA 13
+# pre-built tarball, blocking the -gpu-nvidia-cuda-13 build matrix entry.
+SHERPA_COMMIT?=7288d15e3e31a7bd589b2ba88828d521e7a6b140
+ONNX_ARCH?=x64
+ONNX_OS?=linux
+
+ifneq (,$(findstring aarch64,$(shell uname -m)))
+	ONNX_ARCH=aarch64
+endif
+
+ifeq ($(OS),Darwin)
+	ONNX_OS=osx
+	ifneq (,$(findstring aarch64,$(shell uname -m)))
+		ONNX_ARCH=arm64
+	else ifneq (,$(findstring arm64,$(shell uname -m)))
+		ONNX_ARCH=arm64
+	else
+		ONNX_ARCH=x86_64
+	endif
+endif
+
+# Upstream onnxruntime ships CUDA 12 and CUDA 13 variants under different
+# names: -gpu-<ver>.tgz for CUDA 12, -gpu_cuda13-<ver>.tgz for CUDA 13
+# (note underscore vs dash). CUDA 13 tarballs only exist from 1.24.x onward.
+ifeq ($(BUILD_TYPE),cublas)
+	SHERPA_GPU=ON
+	ONNX_PROVIDER=cuda
+	ifeq ($(CUDA_MAJOR_VERSION),13)
+		ONNX_VARIANT=-gpu_cuda13
+	else
+		ONNX_VARIANT=-gpu
+	endif
+else
+	ONNX_VARIANT=
+	SHERPA_GPU=OFF
+	ONNX_PROVIDER=cpu
+endif
+
+JOBS?=$(shell nproc --ignore=1 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
+
+sources/onnxruntime:
+	mkdir -p sources/onnxruntime
+	curl -L https://github.com/microsoft/onnxruntime/releases/download/v$(ONNX_VERSION)/onnxruntime-$(ONNX_OS)-$(ONNX_ARCH)$(ONNX_VARIANT)-$(ONNX_VERSION).tgz \
+	  -o sources/onnxruntime/onnxruntime.tgz
+	cd sources/onnxruntime && tar -xf onnxruntime.tgz --strip-components=1 && rm onnxruntime.tgz
+
+sources/sherpa-onnx: sources/onnxruntime
+	git clone https://github.com/k2-fsa/sherpa-onnx.git sources/sherpa-onnx
+	cd sources/sherpa-onnx && git checkout $(SHERPA_COMMIT)
+	mkdir -p sources/sherpa-onnx/build
+	# sherpa-onnx's cmake detects a pre-installed onnxruntime via the
+	# SHERPA_ONNXRUNTIME_{INCLUDE,LIB}_DIR env vars (not via -D flags).
+	# Point them at our locally-downloaded Microsoft tarball — without
+	# this, sherpa-onnx falls through to download_onnxruntime() which
+	# fetches from csukuangfj/onnxruntime-libs. For the GPU 1.24.4
+	# build that release mirror publishes `-patched.zip` instead of the
+	# expected `.tgz`, so the download 404s and the build fails.
+	cd sources/sherpa-onnx/build && \
+	SHERPA_ONNXRUNTIME_INCLUDE_DIR=$(CURRENT_DIR)/sources/onnxruntime/include \
+	SHERPA_ONNXRUNTIME_LIB_DIR=$(CURRENT_DIR)/sources/onnxruntime/lib \
+	cmake \
+	  -DCMAKE_BUILD_TYPE=Release \
+	  -DCMAKE_C_FLAGS="-Wno-error=format-security" \
+	  -DCMAKE_CXX_FLAGS="-Wno-error=format-security" \
+	  -DSHERPA_ONNX_ENABLE_GPU=$(SHERPA_GPU) \
+	  -DSHERPA_ONNX_ENABLE_TTS=ON \
+	  -DSHERPA_ONNX_ENABLE_BINARY=OFF \
+	  -DSHERPA_ONNX_ENABLE_PYTHON=OFF \
+	  -DSHERPA_ONNX_ENABLE_TESTS=OFF \
+	  -DSHERPA_ONNX_ENABLE_C_API=ON \
+	  -DBUILD_SHARED_LIBS=ON \
+	  -DSHERPA_ONNX_USE_PRE_INSTALLED_ONNXRUNTIME_IF_AVAILABLE=ON \
+	  ..
+	cd sources/sherpa-onnx/build && make -j$(JOBS)
+
+backend-assets/lib: sources/sherpa-onnx sources/onnxruntime
+	mkdir -p backend-assets/lib
+	cp -rfLv sources/onnxruntime/lib/* backend-assets/lib/
+	cp -rfLv sources/sherpa-onnx/build/lib/*.so* backend-assets/lib/ 2>/dev/null || true
+	cp -rfLv sources/sherpa-onnx/build/lib/*.dylib backend-assets/lib/ 2>/dev/null || true
+
+# libsherpa-shim wraps sherpa-onnx's nested config structs and TTS
+# callback plumbing behind a purego-friendly API: opaque handles plus
+# fixed-signature setters/getters/trampoline. Plain C compile — no cgo.
+SHIM_EXT=so
+ifeq ($(OS),Darwin)
+	SHIM_EXT=dylib
+endif
+
+backend-assets/lib/libsherpa-shim.$(SHIM_EXT): csrc/shim.c csrc/shim.h backend-assets/lib
+	$(CC) -shared -fPIC -O2 \
+	  -I$(CURRENT_DIR)/sources/sherpa-onnx/sherpa-onnx/c-api \
+	  -o $@ csrc/shim.c \
+	  -L$(CURRENT_DIR)/backend-assets/lib \
+	  -lsherpa-onnx-c-api \
+	  -Wl,-rpath,'$(if $(filter Darwin,$(OS)),@loader_path,$$ORIGIN)'
+
+sherpa-onnx: backend-assets/lib backend-assets/lib/libsherpa-shim.$(SHIM_EXT)
+	CGO_ENABLED=0 $(GOCMD) build \
+	  -ldflags "$(LD_FLAGS) -X main.onnxProvider=$(ONNX_PROVIDER)" \
+	  -tags "$(GO_TAGS)" -o sherpa-onnx ./
+
+package:
+	bash package.sh
+
+build: sherpa-onnx package
+
+clean:
+	rm -rf sherpa-onnx sources/ backend-assets/ package/ vits-ljs/ sherpa-onnx-whisper-*/
+
+test: sherpa-onnx
+	LD_LIBRARY_PATH=$(CURRENT_DIR)/backend-assets/lib \
+	bash test.sh
+
+.PHONY: build package clean test
--- a/backend/go/sherpa-onnx/backend.go
+++ b/backend/go/sherpa-onnx/backend.go
--- a/backend/go/sherpa-onnx/backend_test.go
+++ b/backend/go/sherpa-onnx/backend_test.go
@@ -0,0 +1,169 @@
+package main
+
+import (
+	"os"
+	"path/filepath"
+	"testing"
+
+	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+func TestSherpaBackend(t *testing.T) {
+	RegisterFailHandler(Fail)
+	RunSpecs(t, "Sherpa-ONNX Backend Suite")
+}
+
+// Load libsherpa-shim + libsherpa-onnx-c-api via purego before any spec
+// runs — otherwise any Load/TTS/VAD/AudioTranscription call hits a nil
+// function pointer. LD_LIBRARY_PATH must contain the directory holding
+// both .so files; the Makefile `test` target sets it before running test.sh.
+var _ = BeforeSuite(func() {
+	Expect(loadSherpaLibs()).To(Succeed())
+})
+
+var _ = Describe("Sherpa-ONNX", func() {
+	Context("lifecycle", func() {
+		It("is locking (C API is not thread safe)", func() {
+			Expect((&SherpaBackend{}).Locking()).To(BeTrue())
+		})
+
+		It("errors loading a non-existent model", func() {
+			tmpDir, err := os.MkdirTemp("", "sherpa-test-nonexistent")
+			Expect(err).ToNot(HaveOccurred())
+			defer os.RemoveAll(tmpDir)
+
+			err = (&SherpaBackend{}).Load(&pb.ModelOptions{
+				ModelFile: filepath.Join(tmpDir, "non-existent-model.onnx"),
+			})
+			Expect(err).To(HaveOccurred())
+		})
+
+		It("errors loading a non-existent ASR model", func() {
+			tmpDir, err := os.MkdirTemp("", "sherpa-test-asr")
+			Expect(err).ToNot(HaveOccurred())
+			defer os.RemoveAll(tmpDir)
+
+			err = (&SherpaBackend{}).Load(&pb.ModelOptions{
+				ModelFile: filepath.Join(tmpDir, "model.onnx"),
+				Type:      "asr",
+			})
+			Expect(err).To(HaveOccurred())
+		})
+
+		It("dispatches Load by Type", func() {
+			tmpDir, err := os.MkdirTemp("", "sherpa-test-dispatch")
+			Expect(err).ToNot(HaveOccurred())
+			defer os.RemoveAll(tmpDir)
+
+			modelFile := filepath.Join(tmpDir, "model.onnx")
+			for _, typ := range []string{"", "asr", "vad"} {
+				err := (&SherpaBackend{}).Load(&pb.ModelOptions{ModelFile: modelFile, Type: typ})
+				Expect(err).To(HaveOccurred(), "Type=%q", typ)
+			}
+		})
+	})
+
+	Context("method errors without loaded model", func() {
+		It("rejects TTS", func() {
+			tmpDir, err := os.MkdirTemp("", "sherpa-test-tts")
+			Expect(err).ToNot(HaveOccurred())
+			defer os.RemoveAll(tmpDir)
+
+			err = (&SherpaBackend{}).TTS(&pb.TTSRequest{
+				Text: "should fail — no model loaded",
+				Dst:  filepath.Join(tmpDir, "output.wav"),
+			})
+			Expect(err).To(HaveOccurred())
+		})
+
+		It("rejects AudioTranscription", func() {
+			_, err := (&SherpaBackend{}).AudioTranscription(&pb.TranscriptRequest{
+				Dst: "/tmp/nonexistent.wav",
+			})
+			Expect(err).To(HaveOccurred())
+		})
+
+		It("rejects VAD", func() {
+			_, err := (&SherpaBackend{}).VAD(&pb.VADRequest{
+				Audio: []float32{0.1, 0.2, 0.3},
+			})
+			Expect(err).To(HaveOccurred())
+		})
+	})
+
+	Context("type detection", func() {
+		DescribeTable("isASRType",
+			func(input string, want bool) {
+				Expect(isASRType(input)).To(Equal(want))
+			},
+			Entry("asr", "asr", true),
+			Entry("ASR", "ASR", true),
+			Entry("Asr", "Asr", true),
+			Entry("transcription", "transcription", true),
+			Entry("Transcription", "Transcription", true),
+			Entry("transcribe", "transcribe", true),
+			Entry("Transcribe", "Transcribe", true),
+			Entry("tts", "tts", false),
+			Entry("empty", "", false),
+			Entry("other", "other", false),
+			Entry("vad", "vad", false),
+		)
+
+		DescribeTable("isVADType",
+			func(input string, want bool) {
+				Expect(isVADType(input)).To(Equal(want))
+			},
+			Entry("vad", "vad", true),
+			Entry("VAD", "VAD", true),
+			Entry("Vad", "Vad", true),
+			Entry("asr", "asr", false),
+			Entry("tts", "tts", false),
+			Entry("empty", "", false),
+			Entry("other", "other", false),
+		)
+	})
+
+	Context("option parsing", func() {
+		It("parses float options with fallback on bad input", func() {
+			opts := &pb.ModelOptions{Options: []string{
+				"vad.threshold=0.3",
+				"tts.length_scale=1.25",
+				"bad.number=not-a-float",
+			}}
+			Expect(findOptionFloat(opts, "vad.threshold=", 0.5)).To(BeNumerically("~", 0.3, 1e-6))
+			Expect(findOptionFloat(opts, "tts.length_scale=", 1.0)).To(BeNumerically("~", 1.25, 1e-6))
+			Expect(findOptionFloat(opts, "missing.key=", 0.7)).To(BeNumerically("~", 0.7, 1e-6))
+			Expect(findOptionFloat(opts, "bad.number=", 9.9)).To(BeNumerically("~", 9.9, 1e-6))
+		})
+
+		It("parses int options with fallback on bad input", func() {
+			opts := &pb.ModelOptions{Options: []string{
+				"asr.sample_rate=22050",
+				"online.chunk_samples=800",
+				"bad.int=4.2",
+			}}
+			Expect(findOptionInt(opts, "asr.sample_rate=", 16000)).To(Equal(int32(22050)))
+			Expect(findOptionInt(opts, "online.chunk_samples=", 1600)).To(Equal(int32(800)))
+			Expect(findOptionInt(opts, "missing.key=", 42)).To(Equal(int32(42)))
+			Expect(findOptionInt(opts, "bad.int=", 100)).To(Equal(int32(100)))
+		})
+
+		It("parses bool options (0/1, true/false, yes/no, on/off)", func() {
+			opts := &pb.ModelOptions{Options: []string{
+				"online.enable_endpoint=0",
+				"asr.sense_voice.use_itn=True",
+				"feature.on=yes",
+				"feature.off=Off",
+				"feature.bad=maybe",
+			}}
+			Expect(findOptionBool(opts, "online.enable_endpoint=", 1)).To(Equal(int32(0)))
+			Expect(findOptionBool(opts, "asr.sense_voice.use_itn=", 0)).To(Equal(int32(1)))
+			Expect(findOptionBool(opts, "feature.on=", 0)).To(Equal(int32(1)))
+			Expect(findOptionBool(opts, "feature.off=", 1)).To(Equal(int32(0)))
+			Expect(findOptionBool(opts, "feature.bad=", 1)).To(Equal(int32(1)))
+			Expect(findOptionBool(opts, "missing.key=", 1)).To(Equal(int32(1)))
+		})
+	})
+})
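
The option-parsing specs above pin down a contract: a key is matched by prefix (including the trailing `=`), and a missing key or a malformed value yields the caller's fallback. The real `findOptionFloat`, `findOptionInt` and `findOptionBool` live in backend.go, which is not shown in this diff, so the following is only a minimal sketch of helpers that would satisfy that contract. The names `lookup`, `optionFloat`, `optionInt` and `optionBool` are hypothetical, and the helpers take a plain `[]string` instead of `*pb.ModelOptions` so the example stays self-contained.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// lookup returns the value of the first option that starts with prefix.
// Hypothetical helper; not taken from backend.go.
func lookup(options []string, prefix string) (string, bool) {
	for _, o := range options {
		if strings.HasPrefix(o, prefix) {
			return strings.TrimPrefix(o, prefix), true
		}
	}
	return "", false
}

// optionFloat parses the value as a 32-bit float, or returns fallback.
func optionFloat(options []string, prefix string, fallback float32) float32 {
	raw, ok := lookup(options, prefix)
	if !ok {
		return fallback
	}
	v, err := strconv.ParseFloat(raw, 32)
	if err != nil {
		return fallback
	}
	return float32(v)
}

// optionInt parses the value as a base-10 int32, or returns fallback.
// A value such as "4.2" is rejected by ParseInt and falls back.
func optionInt(options []string, prefix string, fallback int32) int32 {
	raw, ok := lookup(options, prefix)
	if !ok {
		return fallback
	}
	v, err := strconv.ParseInt(raw, 10, 32)
	if err != nil {
		return fallback
	}
	return int32(v)
}

// optionBool maps 1/true/yes/on to 1 and 0/false/no/off to 0,
// case-insensitively; anything else returns fallback.
func optionBool(options []string, prefix string, fallback int32) int32 {
	raw, ok := lookup(options, prefix)
	if !ok {
		return fallback
	}
	switch strings.ToLower(raw) {
	case "1", "true", "yes", "on":
		return 1
	case "0", "false", "no", "off":
		return 0
	}
	return fallback
}

func main() {
	opts := []string{"vad.threshold=0.3", "bad.int=4.2", "feature.off=Off"}
	fmt.Println(optionFloat(opts, "vad.threshold=", 0.5))
	fmt.Println(optionInt(opts, "bad.int=", 100))
	fmt.Println(optionBool(opts, "feature.off=", 1))
}
```

Returning `int32` from the bool helper mirrors the specs, which compare against `int32(0)` and `int32(1)`; presumably that is because the shim setters take `int32_t` flags.
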
--- a/backend/go/sherpa-onnx/csrc/shim.c
+++ b/backend/go/sherpa-onnx/csrc/shim.c
@@ -0,0 +1,325 @@
+#include "shim.h"
+#include "c-api.h"
+
+#include <stdlib.h>
+#include <string.h>
+
+// Replace the char* field pointed to by `slot` with a strdup of `s`
+// (or NULL if s is NULL). Frees any prior value. If strdup fails the
+// slot is left NULL, so the caller sees a Create* failure downstream.
+static void shim_set_str(const char **slot, const char *s) {
+    free((char *)*slot);
+    *slot = s ? strdup(s) : NULL;
+}
+
+// ==================================================================
+// VAD config
+// ==================================================================
+
+void *sherpa_shim_vad_config_new(void) {
+    return calloc(1, sizeof(SherpaOnnxVadModelConfig));
+}
+
+void sherpa_shim_vad_config_free(void *h) {
+    if (!h) return;
+    SherpaOnnxVadModelConfig *c = (SherpaOnnxVadModelConfig *)h;
+    free((char *)c->silero_vad.model);
+    free((char *)c->provider);
+    free(c);
+}
+
+void sherpa_shim_vad_config_set_silero_model(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxVadModelConfig *)h)->silero_vad.model, v);
+}
+void sherpa_shim_vad_config_set_silero_threshold(void *h, float v) {
+    ((SherpaOnnxVadModelConfig *)h)->silero_vad.threshold = v;
+}
+void sherpa_shim_vad_config_set_silero_min_silence_duration(void *h, float v) {
+    ((SherpaOnnxVadModelConfig *)h)->silero_vad.min_silence_duration = v;
+}
+void sherpa_shim_vad_config_set_silero_min_speech_duration(void *h, float v) {
+    ((SherpaOnnxVadModelConfig *)h)->silero_vad.min_speech_duration = v;
+}
+void sherpa_shim_vad_config_set_silero_window_size(void *h, int32_t v) {
+    ((SherpaOnnxVadModelConfig *)h)->silero_vad.window_size = v;
+}
+void sherpa_shim_vad_config_set_silero_max_speech_duration(void *h, float v) {
+    ((SherpaOnnxVadModelConfig *)h)->silero_vad.max_speech_duration = v;
+}
+void sherpa_shim_vad_config_set_sample_rate(void *h, int32_t v) {
+    ((SherpaOnnxVadModelConfig *)h)->sample_rate = v;
+}
+void sherpa_shim_vad_config_set_num_threads(void *h, int32_t v) {
+    ((SherpaOnnxVadModelConfig *)h)->num_threads = v;
+}
+void sherpa_shim_vad_config_set_provider(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxVadModelConfig *)h)->provider, v);
+}
+void sherpa_shim_vad_config_set_debug(void *h, int32_t v) {
+    ((SherpaOnnxVadModelConfig *)h)->debug = v;
+}
+
+void *sherpa_shim_create_vad(void *h, float buffer_size_seconds) {
+    return (void *)SherpaOnnxCreateVoiceActivityDetector(
+        (const SherpaOnnxVadModelConfig *)h, buffer_size_seconds);
+}
+
+// ==================================================================
+// Offline TTS config (VITS)
+// ==================================================================
+
+void *sherpa_shim_tts_config_new(void) {
+    return calloc(1, sizeof(SherpaOnnxOfflineTtsConfig));
+}
+
+void sherpa_shim_tts_config_free(void *h) {
+    if (!h) return;
+    SherpaOnnxOfflineTtsConfig *c = (SherpaOnnxOfflineTtsConfig *)h;
+    free((char *)c->model.vits.model);
+    free((char *)c->model.vits.tokens);
+    free((char *)c->model.vits.lexicon);
+    free((char *)c->model.vits.data_dir);
+    free((char *)c->model.provider);
+    free(c);
+}
+
+void sherpa_shim_tts_config_set_vits_model(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.vits.model, v);
+}
+void sherpa_shim_tts_config_set_vits_tokens(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.vits.tokens, v);
+}
+void sherpa_shim_tts_config_set_vits_lexicon(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.vits.lexicon, v);
+}
+void sherpa_shim_tts_config_set_vits_data_dir(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.vits.data_dir, v);
+}
+void sherpa_shim_tts_config_set_vits_noise_scale(void *h, float v) {
+    ((SherpaOnnxOfflineTtsConfig *)h)->model.vits.noise_scale = v;
+}
+void sherpa_shim_tts_config_set_vits_noise_scale_w(void *h, float v) {
+    ((SherpaOnnxOfflineTtsConfig *)h)->model.vits.noise_scale_w = v;
+}
+void sherpa_shim_tts_config_set_vits_length_scale(void *h, float v) {
+    ((SherpaOnnxOfflineTtsConfig *)h)->model.vits.length_scale = v;
+}
+void sherpa_shim_tts_config_set_num_threads(void *h, int32_t v) {
+    ((SherpaOnnxOfflineTtsConfig *)h)->model.num_threads = v;
+}
+void sherpa_shim_tts_config_set_debug(void *h, int32_t v) {
+    ((SherpaOnnxOfflineTtsConfig *)h)->model.debug = v;
+}
+void sherpa_shim_tts_config_set_provider(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.provider, v);
+}
+void sherpa_shim_tts_config_set_max_num_sentences(void *h, int32_t v) {
+    ((SherpaOnnxOfflineTtsConfig *)h)->max_num_sentences = v;
+}
+
+void *sherpa_shim_create_offline_tts(void *h) {
+    return (void *)SherpaOnnxCreateOfflineTts(
+        (const SherpaOnnxOfflineTtsConfig *)h);
+}
+
+// ==================================================================
+// Offline recognizer config
+// ==================================================================
+
+void *sherpa_shim_offline_recog_config_new(void) {
+    return calloc(1, sizeof(SherpaOnnxOfflineRecognizerConfig));
+}
+
+void sherpa_shim_offline_recog_config_free(void *h) {
+    if (!h) return;
+    SherpaOnnxOfflineRecognizerConfig *c = (SherpaOnnxOfflineRecognizerConfig *)h;
+    free((char *)c->model_config.provider);
+    free((char *)c->model_config.tokens);
+    free((char *)c->model_config.whisper.encoder);
+    free((char *)c->model_config.whisper.decoder);
+    free((char *)c->model_config.whisper.language);
+    free((char *)c->model_config.whisper.task);
+    free((char *)c->model_config.paraformer.model);
+    free((char *)c->model_config.sense_voice.model);
+    free((char *)c->model_config.sense_voice.language);
+    free((char *)c->model_config.omnilingual.model);
+    free((char *)c->decoding_method);
+    free(c);
+}
+
+void sherpa_shim_offline_recog_config_set_num_threads(void *h, int32_t v) {
+    ((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.num_threads = v;
+}
+void sherpa_shim_offline_recog_config_set_debug(void *h, int32_t v) {
+    ((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.debug = v;
+}
+void sherpa_shim_offline_recog_config_set_provider(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.provider, v);
+}
+void sherpa_shim_offline_recog_config_set_tokens(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.tokens, v);
+}
+void sherpa_shim_offline_recog_config_set_feat_sample_rate(void *h, int32_t v) {
+    ((SherpaOnnxOfflineRecognizerConfig *)h)->feat_config.sample_rate = v;
+}
+void sherpa_shim_offline_recog_config_set_feat_feature_dim(void *h, int32_t v) {
+    ((SherpaOnnxOfflineRecognizerConfig *)h)->feat_config.feature_dim = v;
+}
+void sherpa_shim_offline_recog_config_set_decoding_method(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->decoding_method, v);
+}
+void sherpa_shim_offline_recog_config_set_whisper_encoder(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.encoder, v);
+}
+void sherpa_shim_offline_recog_config_set_whisper_decoder(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.decoder, v);
+}
+void sherpa_shim_offline_recog_config_set_whisper_language(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.language, v);
+}
+void sherpa_shim_offline_recog_config_set_whisper_task(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.task, v);
+}
+void sherpa_shim_offline_recog_config_set_whisper_tail_paddings(void *h, int32_t v) {
+    ((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.tail_paddings = v;
+}
+void sherpa_shim_offline_recog_config_set_paraformer_model(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.paraformer.model, v);
+}
+void sherpa_shim_offline_recog_config_set_sense_voice_model(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.sense_voice.model, v);
+}
+void sherpa_shim_offline_recog_config_set_sense_voice_language(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.sense_voice.language, v);
+}
+void sherpa_shim_offline_recog_config_set_sense_voice_use_itn(void *h, int32_t v) {
+    ((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.sense_voice.use_itn = v;
+}
+void sherpa_shim_offline_recog_config_set_omnilingual_model(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.omnilingual.model, v);
+}
+
+void *sherpa_shim_create_offline_recognizer(void *h) {
+    return (void *)SherpaOnnxCreateOfflineRecognizer(
+        (const SherpaOnnxOfflineRecognizerConfig *)h);
+}
+
+// ==================================================================
+// Online recognizer config
+// ==================================================================
+
+void *sherpa_shim_online_recog_config_new(void) {
+    return calloc(1, sizeof(SherpaOnnxOnlineRecognizerConfig));
+}
+
+void sherpa_shim_online_recog_config_free(void *h) {
+    if (!h) return;
+    SherpaOnnxOnlineRecognizerConfig *c = (SherpaOnnxOnlineRecognizerConfig *)h;
+    free((char *)c->model_config.transducer.encoder);
+    free((char *)c->model_config.transducer.decoder);
+    free((char *)c->model_config.transducer.joiner);
+    free((char *)c->model_config.tokens);
+    free((char *)c->model_config.provider);
+    free((char *)c->decoding_method);
+    free(c);
+}
+
+void sherpa_shim_online_recog_config_set_transducer_encoder(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.transducer.encoder, v);
+}
+void sherpa_shim_online_recog_config_set_transducer_decoder(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.transducer.decoder, v);
+}
+void sherpa_shim_online_recog_config_set_transducer_joiner(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.transducer.joiner, v);
+}
+void sherpa_shim_online_recog_config_set_tokens(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.tokens, v);
+}
+void sherpa_shim_online_recog_config_set_num_threads(void *h, int32_t v) {
+    ((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.num_threads = v;
+}
+void sherpa_shim_online_recog_config_set_debug(void *h, int32_t v) {
+    ((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.debug = v;
+}
+void sherpa_shim_online_recog_config_set_provider(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.provider, v);
+}
+void sherpa_shim_online_recog_config_set_feat_sample_rate(void *h, int32_t v) {
+    ((SherpaOnnxOnlineRecognizerConfig *)h)->feat_config.sample_rate = v;
+}
+void sherpa_shim_online_recog_config_set_feat_feature_dim(void *h, int32_t v) {
+    ((SherpaOnnxOnlineRecognizerConfig *)h)->feat_config.feature_dim = v;
+}
+void sherpa_shim_online_recog_config_set_decoding_method(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->decoding_method, v);
+}
+void sherpa_shim_online_recog_config_set_enable_endpoint(void *h, int32_t v) {
+    ((SherpaOnnxOnlineRecognizerConfig *)h)->enable_endpoint = v;
+}
+void sherpa_shim_online_recog_config_set_rule1_min_trailing_silence(void *h, float v) {
+    ((SherpaOnnxOnlineRecognizerConfig *)h)->rule1_min_trailing_silence = v;
+}
+void sherpa_shim_online_recog_config_set_rule2_min_trailing_silence(void *h, float v) {
+    ((SherpaOnnxOnlineRecognizerConfig *)h)->rule2_min_trailing_silence = v;
+}
+void sherpa_shim_online_recog_config_set_rule3_min_utterance_length(void *h, float v) {
+    ((SherpaOnnxOnlineRecognizerConfig *)h)->rule3_min_utterance_length = v;
+}
+
+void *sherpa_shim_create_online_recognizer(void *h) {
+    return (void *)SherpaOnnxCreateOnlineRecognizer(
+        (const SherpaOnnxOnlineRecognizerConfig *)h);
+}
+
+// ==================================================================
+// Result-struct accessors
+// ==================================================================
+
+int32_t sherpa_shim_wave_sample_rate(const void *h) {
+    return ((const SherpaOnnxWave *)h)->sample_rate;
+}
+int32_t sherpa_shim_wave_num_samples(const void *h) {
+    return ((const SherpaOnnxWave *)h)->num_samples;
+}
+const float *sherpa_shim_wave_samples(const void *h) {
+    return ((const SherpaOnnxWave *)h)->samples;
+}
+
+const char *sherpa_shim_offline_result_text(const void *h) {
+    return ((const SherpaOnnxOfflineRecognizerResult *)h)->text;
+}
+const char *sherpa_shim_online_result_text(const void *h) {
+    return ((const SherpaOnnxOnlineRecognizerResult *)h)->text;
+}
+
+int32_t sherpa_shim_generated_audio_sample_rate(const void *h) {
+    return ((const SherpaOnnxGeneratedAudio *)h)->sample_rate;
+}
+int32_t sherpa_shim_generated_audio_n(const void *h) {
+    return ((const SherpaOnnxGeneratedAudio *)h)->n;
+}
+const float *sherpa_shim_generated_audio_samples(const void *h) {
+    return ((const SherpaOnnxGeneratedAudio *)h)->samples;
+}
+
+int32_t sherpa_shim_speech_segment_start(const void *h) {
+    return ((const SherpaOnnxSpeechSegment *)h)->start;
+}
+int32_t sherpa_shim_speech_segment_n(const void *h) {
+    return ((const SherpaOnnxSpeechSegment *)h)->n;
+}
+
+// ==================================================================
+// TTS streaming callback trampoline
+// ==================================================================
+
+void *sherpa_shim_tts_generate_with_callback(
+    void *tts, const char *text, int32_t sid, float speed,
+    uintptr_t callback_ptr, uintptr_t user_data) {
+    SherpaOnnxGeneratedAudioCallbackWithArg cb =
+        (SherpaOnnxGeneratedAudioCallbackWithArg)callback_ptr;
+    return (void *)SherpaOnnxOfflineTtsGenerateWithCallbackWithArg(
+        (const SherpaOnnxOfflineTts *)tts, text, sid, speed, cb,
+        (void *)user_data);
+}
--- a/backend/go/sherpa-onnx/csrc/shim.h
+++ b/backend/go/sherpa-onnx/csrc/shim.h
@@ -0,0 +1,129 @@
+#ifndef LOCALAI_SHERPA_ONNX_SHIM_H
+#define LOCALAI_SHERPA_ONNX_SHIM_H
+
+#include <stdint.h>
+
+// libsherpa-shim: purego-friendly wrapper around sherpa-onnx's C API.
+// Purego can't access C struct fields and can't route C callbacks to Go
+// funcs directly. Every function here is a fixed-signature trampoline
+// that replaces one field read/write or callback handoff that the Go
+// backend would otherwise have to do through cgo.
+//
+// String lifetime: setters strdup; _free walks every owned string and
+// frees it. Callers may discard their input buffers the moment a setter
+// returns.
+//
+// Opaque handles are `void *` in both directions. Nothing here holds a
+// reference across calls except config handles (freed via _free) and
+// sherpa-allocated results (freed via sherpa's own Destroy* entry
+// points, which Go calls through purego pass-through).
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+// --- VAD config -----------------------------------------------------
+void *sherpa_shim_vad_config_new(void);
+void  sherpa_shim_vad_config_free(void *cfg);
+void  sherpa_shim_vad_config_set_silero_model(void *cfg, const char *path);
+void  sherpa_shim_vad_config_set_silero_threshold(void *cfg, float v);
+void  sherpa_shim_vad_config_set_silero_min_silence_duration(void *cfg, float v);
+void  sherpa_shim_vad_config_set_silero_min_speech_duration(void *cfg, float v);
+void  sherpa_shim_vad_config_set_silero_window_size(void *cfg, int32_t v);
+void  sherpa_shim_vad_config_set_silero_max_speech_duration(void *cfg, float v);
+void  sherpa_shim_vad_config_set_sample_rate(void *cfg, int32_t v);
+void  sherpa_shim_vad_config_set_num_threads(void *cfg, int32_t v);
+void  sherpa_shim_vad_config_set_provider(void *cfg, const char *v);
+void  sherpa_shim_vad_config_set_debug(void *cfg, int32_t v);
+void *sherpa_shim_create_vad(void *cfg, float buffer_size_seconds);
+
+// --- Offline TTS config (VITS path — the only TTS family the backend uses) ---
+void *sherpa_shim_tts_config_new(void);
+void  sherpa_shim_tts_config_free(void *cfg);
+void  sherpa_shim_tts_config_set_vits_model(void *cfg, const char *v);
+void  sherpa_shim_tts_config_set_vits_tokens(void *cfg, const char *v);
+void  sherpa_shim_tts_config_set_vits_lexicon(void *cfg, const char *v);
+void  sherpa_shim_tts_config_set_vits_data_dir(void *cfg, const char *v);
+void  sherpa_shim_tts_config_set_vits_noise_scale(void *cfg, float v);
+void  sherpa_shim_tts_config_set_vits_noise_scale_w(void *cfg, float v);
+void  sherpa_shim_tts_config_set_vits_length_scale(void *cfg, float v);
+void  sherpa_shim_tts_config_set_num_threads(void *cfg, int32_t v);
+void  sherpa_shim_tts_config_set_debug(void *cfg, int32_t v);
+void  sherpa_shim_tts_config_set_provider(void *cfg, const char *v);
+void  sherpa_shim_tts_config_set_max_num_sentences(void *cfg, int32_t v);
+void *sherpa_shim_create_offline_tts(void *cfg);
+
+// --- Offline recognizer config (Whisper / Paraformer / SenseVoice / Omnilingual) ---
+void *sherpa_shim_offline_recog_config_new(void);
+void  sherpa_shim_offline_recog_config_free(void *cfg);
+void  sherpa_shim_offline_recog_config_set_num_threads(void *cfg, int32_t v);
+void  sherpa_shim_offline_recog_config_set_debug(void *cfg, int32_t v);
+void  sherpa_shim_offline_recog_config_set_provider(void *cfg, const char *v);
+void  sherpa_shim_offline_recog_config_set_tokens(void *cfg, const char *v);
+void  sherpa_shim_offline_recog_config_set_feat_sample_rate(void *cfg, int32_t v);
+void  sherpa_shim_offline_recog_config_set_feat_feature_dim(void *cfg, int32_t v);
+void  sherpa_shim_offline_recog_config_set_decoding_method(void *cfg, const char *v);
+void  sherpa_shim_offline_recog_config_set_whisper_encoder(void *cfg, const char *v);
+void  sherpa_shim_offline_recog_config_set_whisper_decoder(void *cfg, const char *v);
+void  sherpa_shim_offline_recog_config_set_whisper_language(void *cfg, const char *v);
+void  sherpa_shim_offline_recog_config_set_whisper_task(void *cfg, const char *v);
+void  sherpa_shim_offline_recog_config_set_whisper_tail_paddings(void *cfg, int32_t v);
+void  sherpa_shim_offline_recog_config_set_paraformer_model(void *cfg, const char *v);
+void  sherpa_shim_offline_recog_config_set_sense_voice_model(void *cfg, const char *v);
+void  sherpa_shim_offline_recog_config_set_sense_voice_language(void *cfg, const char *v);
+void  sherpa_shim_offline_recog_config_set_sense_voice_use_itn(void *cfg, int32_t v);
+void  sherpa_shim_offline_recog_config_set_omnilingual_model(void *cfg, const char *v);
+void *sherpa_shim_create_offline_recognizer(void *cfg);
+
+// --- Online recognizer config (streaming zipformer transducer) ---
+void *sherpa_shim_online_recog_config_new(void);
+void  sherpa_shim_online_recog_config_free(void *cfg);
+void  sherpa_shim_online_recog_config_set_transducer_encoder(void *cfg, const char *v);
+void  sherpa_shim_online_recog_config_set_transducer_decoder(void *cfg, const char *v);
+void  sherpa_shim_online_recog_config_set_transducer_joiner(void *cfg, const char *v);
+void  sherpa_shim_online_recog_config_set_tokens(void *cfg, const char *v);
+void  sherpa_shim_online_recog_config_set_num_threads(void *cfg, int32_t v);
+void  sherpa_shim_online_recog_config_set_debug(void *cfg, int32_t v);
+void  sherpa_shim_online_recog_config_set_provider(void *cfg, const char *v);
+void  sherpa_shim_online_recog_config_set_feat_sample_rate(void *cfg, int32_t v);
+void  sherpa_shim_online_recog_config_set_feat_feature_dim(void *cfg, int32_t v);
+void  sherpa_shim_online_recog_config_set_decoding_method(void *cfg, const char *v);
+void  sherpa_shim_online_recog_config_set_enable_endpoint(void *cfg, int32_t v);
+void  sherpa_shim_online_recog_config_set_rule1_min_trailing_silence(void *cfg, float v);
+void  sherpa_shim_online_recog_config_set_rule2_min_trailing_silence(void *cfg, float v);
+void  sherpa_shim_online_recog_config_set_rule3_min_utterance_length(void *cfg, float v);
+void *sherpa_shim_create_online_recognizer(void *cfg);
+
+// --- Result accessors (sherpa-allocated; caller destroys via sherpa's own Destroy*) ---
+int32_t      sherpa_shim_wave_sample_rate(const void *wave);
+int32_t      sherpa_shim_wave_num_samples(const void *wave);
+const float *sherpa_shim_wave_samples(const void *wave);
+
+const char *sherpa_shim_offline_result_text(const void *result);
+const char *sherpa_shim_online_result_text(const void *result);
+
+int32_t      sherpa_shim_generated_audio_sample_rate(const void *audio);
+int32_t      sherpa_shim_generated_audio_n(const void *audio);
+const float *sherpa_shim_generated_audio_samples(const void *audio);
+
+int32_t sherpa_shim_speech_segment_start(const void *seg);
+int32_t sherpa_shim_speech_segment_n(const void *seg);
+
+// --- TTS streaming callback trampoline -----------------------------
+// Replaces the //export sherpaTtsGoCallback + callbacks.c bridge pattern.
+// `callback_ptr` is the C-callable function pointer returned by
+// purego.NewCallback. `user_data` is an integer the Go side uses to
+// look up its state (sync.Map keyed by uint64).
+//
+// Returns the sherpa-allocated SherpaOnnxGeneratedAudio. Destroy with
+// SherpaOnnxDestroyOfflineTtsGeneratedAudio (callable directly from
+// Go via purego).
+void *sherpa_shim_tts_generate_with_callback(
+    void *tts, const char *text, int32_t sid, float speed,
+    uintptr_t callback_ptr, uintptr_t user_data);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
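
The trampoline comment in shim.h says `user_data` is an integer the Go side uses to look up its state in a `sync.Map` keyed by `uint64`. The sketch below illustrates that handle-registry pattern in isolation. It is not the backend's code: `ttsState`, `registerState`, `releaseState` and `onChunk` are hypothetical names, and `onChunk` takes a Go slice where the real callback registered through `purego.NewCallback` would receive a C pointer and a length. The return convention (1 to continue, 0 to stop) is likewise an assumption of this sketch.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// ttsState is a hypothetical per-call state: it collects the sample
// chunks delivered by the streaming callback.
type ttsState struct {
	mu     sync.Mutex
	chunks [][]float32
}

var (
	states sync.Map // uint64 handle -> *ttsState
	nextID uint64   // last handle issued; updated atomically
)

// registerState stores s and returns the integer handle that would be
// passed to the shim as user_data.
func registerState(s *ttsState) uint64 {
	id := atomic.AddUint64(&nextID, 1)
	states.Store(id, s)
	return id
}

// releaseState drops the handle once generation has finished.
func releaseState(id uint64) { states.Delete(id) }

// onChunk stands in for the callback body. It resolves the handle,
// copies the samples (the native buffer would not outlive the call),
// and returns 1 to continue or 0 when the handle is unknown.
func onChunk(samples []float32, userData uintptr) int32 {
	v, ok := states.Load(uint64(userData))
	if !ok {
		return 0
	}
	s := v.(*ttsState)
	s.mu.Lock()
	s.chunks = append(s.chunks, append([]float32(nil), samples...))
	s.mu.Unlock()
	return 1
}

func main() {
	s := &ttsState{}
	id := registerState(s)
	defer releaseState(id)

	onChunk([]float32{0.1, 0.2}, uintptr(id))
	onChunk([]float32{0.3}, uintptr(id))
	fmt.Println("chunks received:", len(s.chunks))
}
```

Passing an integer handle rather than a Go pointer keeps Go-managed memory out of the C side entirely; the C code only ever carries an opaque number back to the callback.
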
--- a/backend/go/sherpa-onnx/main.go
+++ b/backend/go/sherpa-onnx/main.go
@@ -0,0 +1,23 @@
+package main
+
+import (
+	"flag"
+
+	grpc "github.com/mudler/LocalAI/pkg/grpc"
+)
+
+var (
+	addr = flag.String("addr", "localhost:50051", "the address to listen on")
+)
+
+func main() {
+	flag.Parse()
+
+	if err := loadSherpaLibs(); err != nil {
+		panic(err)
+	}
+
+	if err := grpc.StartServer(*addr, &SherpaBackend{}); err != nil {
+		panic(err)
+	}
+}
--- a/backend/go/sherpa-onnx/package.sh
+++ b/backend/go/sherpa-onnx/package.sh
@@ -0,0 +1,51 @@
+#!/bin/bash
+set -e
+
+CURDIR=$(dirname "$(realpath "$0")")
+REPO_ROOT="${CURDIR}/../../.."
+
+mkdir -p "$CURDIR/package/lib"
+
+cp -avf "$CURDIR/sherpa-onnx" "$CURDIR/package/"
+cp -avf "$CURDIR/run.sh" "$CURDIR/package/"
+cp -rfLv "$CURDIR"/backend-assets/lib/* "$CURDIR/package/lib/"
+
+if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
+    echo "Detected x86_64 architecture, copying x86_64 libraries..."
+    cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
+    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
+    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
+    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
+    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
+    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
+    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
+    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
+    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
+elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
+    echo "Detected ARM64 architecture, copying ARM64 libraries..."
+    cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
+    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
+    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
+    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
+    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
+    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
+    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
+    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
+    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
+elif [ "$(uname -s)" = "Darwin" ]; then
+    echo "Detected Darwin"
+else
+    echo "Error: Could not detect architecture" >&2
+    exit 1
+fi
+
+GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
+if [ -f "$GPU_LIB_SCRIPT" ]; then
+    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
+    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
+    package_gpu_libs
+fi
+
+echo "Packaging completed successfully"
+ls -liah $CURDIR/package/
+ls -liah $CURDIR/package/lib/
--- a/backend/go/sherpa-onnx/run.sh
+++ b/backend/go/sherpa-onnx/run.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+set -ex
+
+CURDIR=$(dirname "$(realpath $0)")
+
+export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
+
+if [ -f $CURDIR/lib/ld.so ]; then
+	echo "Using lib/ld.so"
+	exec $CURDIR/lib/ld.so $CURDIR/sherpa-onnx "$@"
+fi
+
+exec $CURDIR/sherpa-onnx "$@"
--- a/backend/go/sherpa-onnx/test.sh
+++ b/backend/go/sherpa-onnx/test.sh
@@ -0,0 +1,12 @@
+#!/bin/bash
+# Unit tests for the sherpa-onnx backend. Exercises error-path and
+# dispatch logic via SherpaBackend directly (no gRPC). Integration
+# coverage (gRPC TTS / streaming ASR / realtime pipeline) lives in
+# tests/e2e-backends and tests/e2e and runs against the Docker image.
+set -e
+
+CURDIR=$(dirname "$(realpath $0)")
+cd "$CURDIR"
+
+PACKAGES=$(go list ./... | grep -v /sources/)
+go test -v -timeout 60s $PACKAGES
--- a/backend/go/stablediffusion-ggml/Makefile
+++ b/backend/go/stablediffusion-ggml/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # stablediffusion.cpp (ggml)
 STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
-STABLEDIFFUSION_GGML_VERSION?=c97702e1057c2fe13a7074cd9069cb9dd6edc1bf
+STABLEDIFFUSION_GGML_VERSION?=b8bdffc19962be7e5a84bfefeb2e31bd885b571a

 CMAKE_ARGS+=-DGGML_MAX_NAME=128

--- a/backend/go/whisper/gowhisper.go
+++ b/backend/go/whisper/gowhisper.go
@@ -139,7 +139,10 @@ func (w *Whisper) AudioTranscription(opts *pb.TranscriptRequest) (pb.TranscriptR
 		// segment start/end conversion factor taken from https://github.com/ggml-org/whisper.cpp/blob/master/examples/cli/cli.cpp#L895
 		s := CppGetSegmentStart(i) * (10000000)
 		t := CppGetSegmentEnd(i) * (10000000)
-		txt := strings.Clone(CppGetSegmentText(i))
+		// whisper.cpp can emit bytes that aren't valid UTF-8 (e.g. a multibyte
+		// codepoint split across token boundaries); protobuf string fields
+		// reject those at marshal time. Scrub before the value escapes cgo.
+		txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "\uFFFD")
 		tokens := make([]int32, CppNTokens(i))

 		if opts.Diarize && CppGetSegmentSpeakerTurnNext(i) {
--- a/backend/index.yaml
+++ b/backend/index.yaml
@@ -263,6 +263,8 @@
    amd: "rocm-vllm"
    intel: "intel-vllm"
    nvidia-cuda-12: "cuda12-vllm"
+    nvidia-cuda-13: "cuda13-vllm"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm"
    cpu: "cpu-vllm"
 - &sglang
  name: "sglang"
@@ -285,6 +287,7 @@
    amd: "rocm-sglang"
    intel: "intel-sglang"
    nvidia-cuda-12: "cuda12-sglang"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-sglang"
    cpu: "cpu-sglang"
 - &vllm-omni
  name: "vllm-omni"
@@ -311,6 +314,8 @@
    nvidia: "cuda12-vllm-omni"
    amd: "rocm-vllm-omni"
    nvidia-cuda-12: "cuda12-vllm-omni"
+    nvidia-cuda-13: "cuda13-vllm-omni"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm-omni"
 - &mlx
  name: "mlx"
  uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-mlx"
@@ -1006,6 +1011,23 @@
    nvidia: "cuda12-neutts"
    amd: "rocm-neutts"
    nvidia-cuda-12: "cuda12-neutts"
+- &sherpa-onnx
+  name: "sherpa-onnx"
+  alias: "sherpa-onnx"
+  urls:
+    - https://k2-fsa.github.io/sherpa/onnx/
+  description: |
+    Sherpa-ONNX backend for text-to-speech (VITS, Matcha, Kokoro), speech-to-text (Whisper, Paraformer, SenseVoice, Omnilingual ASR CTC), and voice activity detection via ONNX Runtime.
+    Supports multi-speaker voices, 1600+ language ASR, and GPU acceleration.
+  tags:
+    - text-to-speech
+    - TTS
+    - speech-to-text
+    - ASR
+  capabilities:
+    default: "cpu-sherpa-onnx"
+    nvidia: "cuda12-sherpa-onnx"
+    nvidia-cuda-12: "cuda12-sherpa-onnx"
 - !!merge <<: *neutts
  name: "neutts-development"
  capabilities:
@@ -1591,6 +1613,20 @@
  mirrors:
    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-turboquant
 ## whisper
+- !!merge <<: *whispercpp
+  name: "whisper-development"
+  capabilities:
+    default: "cpu-whisper-development"
+    nvidia: "cuda12-whisper-development"
+    intel: "intel-sycl-f16-whisper-development"
+    metal: "metal-whisper-development"
+    amd: "rocm-whisper-development"
+    vulkan: "vulkan-whisper-development"
+    nvidia-l4t: "nvidia-l4t-arm64-whisper-development"
+    nvidia-cuda-13: "cuda13-whisper-development"
+    nvidia-cuda-12: "cuda12-whisper-development"
+    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-whisper-development"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-whisper-development"
 - !!merge <<: *whispercpp
  name: "nvidia-l4t-arm64-whisper"
  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-whisper"
@@ -1797,12 +1833,25 @@
    nvidia: "cuda12-vllm-development"
    amd: "rocm-vllm-development"
    intel: "intel-vllm-development"
+    nvidia-cuda-12: "cuda12-vllm-development"
+    nvidia-cuda-13: "cuda13-vllm-development"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm-development"
    cpu: "cpu-vllm-development"
 - !!merge <<: *vllm
  name: "cuda12-vllm"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-vllm"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-12-vllm
+- !!merge <<: *vllm
+  name: "cuda13-vllm"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-vllm"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-13-vllm
+- !!merge <<: *vllm
+  name: "cuda13-nvidia-l4t-arm64-vllm"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-vllm"
+  mirrors:
+    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-vllm
 - !!merge <<: *vllm
  name: "rocm-vllm"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-vllm"
@@ -1823,6 +1872,16 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-vllm"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-vllm
+- !!merge <<: *vllm
+  name: "cuda13-vllm-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-vllm"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-13-vllm
+- !!merge <<: *vllm
+  name: "cuda13-nvidia-l4t-arm64-vllm-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-vllm"
+  mirrors:
+    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-vllm
 - !!merge <<: *vllm
  name: "rocm-vllm-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-vllm"
@@ -1845,12 +1904,19 @@
    nvidia: "cuda12-sglang-development"
    amd: "rocm-sglang-development"
    intel: "intel-sglang-development"
+    nvidia-cuda-12: "cuda12-sglang-development"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-sglang-development"
    cpu: "cpu-sglang-development"
 - !!merge <<: *sglang
  name: "cuda12-sglang"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-sglang"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-12-sglang
+- !!merge <<: *sglang
+  name: "cuda13-nvidia-l4t-arm64-sglang"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-sglang"
+  mirrors:
+    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-sglang
 - !!merge <<: *sglang
  name: "rocm-sglang"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-sglang"
@@ -1871,6 +1937,11 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-sglang"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-sglang
+- !!merge <<: *sglang
+  name: "cuda13-nvidia-l4t-arm64-sglang-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-sglang"
+  mirrors:
+    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-sglang
 - !!merge <<: *sglang
  name: "rocm-sglang-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-sglang"
@@ -1893,11 +1964,23 @@
    nvidia: "cuda12-vllm-omni-development"
    amd: "rocm-vllm-omni-development"
    nvidia-cuda-12: "cuda12-vllm-omni-development"
+    nvidia-cuda-13: "cuda13-vllm-omni-development"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm-omni-development"
 - !!merge <<: *vllm-omni
  name: "cuda12-vllm-omni"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-vllm-omni"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-12-vllm-omni
+- !!merge <<: *vllm-omni
+  name: "cuda13-vllm-omni"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-vllm-omni"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-13-vllm-omni
+- !!merge <<: *vllm-omni
+  name: "cuda13-nvidia-l4t-arm64-vllm-omni"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-vllm-omni"
+  mirrors:
+    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-vllm-omni
 - !!merge <<: *vllm-omni
  name: "rocm-vllm-omni"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-vllm-omni"
@@ -1908,6 +1991,16 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-vllm-omni"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-vllm-omni
+- !!merge <<: *vllm-omni
+  name: "cuda13-vllm-omni-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-vllm-omni"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-13-vllm-omni
+- !!merge <<: *vllm-omni
+  name: "cuda13-nvidia-l4t-arm64-vllm-omni-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-vllm-omni"
+  mirrors:
+    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-vllm-omni
 - !!merge <<: *vllm-omni
  name: "rocm-vllm-omni-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-vllm-omni"
@@ -3834,3 +3927,30 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-speaker-recognition"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-speaker-recognition
+## sherpa-onnx
+- !!merge <<: *sherpa-onnx
+  name: "sherpa-onnx-development"
+  capabilities:
+    default: "cpu-sherpa-onnx-development"
+    nvidia: "cuda12-sherpa-onnx-development"
+    nvidia-cuda-12: "cuda12-sherpa-onnx-development"
+- !!merge <<: *sherpa-onnx
+  name: "cpu-sherpa-onnx"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-sherpa-onnx"
+  mirrors:
+    - localai/localai-backends:latest-cpu-sherpa-onnx
+- !!merge <<: *sherpa-onnx
+  name: "cpu-sherpa-onnx-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-sherpa-onnx"
+  mirrors:
+    - localai/localai-backends:master-cpu-sherpa-onnx
+- !!merge <<: *sherpa-onnx
+  name: "cuda12-sherpa-onnx"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-sherpa-onnx"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-12-sherpa-onnx
+- !!merge <<: *sherpa-onnx
+  name: "cuda12-sherpa-onnx-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-sherpa-onnx"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-12-sherpa-onnx
--- a/backend/python/insightface/engines.py
+++ b/backend/python/insightface/engines.py
@@ -173,6 +173,30 @@ def _build_antispoofer(options: dict[str, str], model_dir: str | None) -> Antisp

 # ─── InsightFaceEngine ────────────────────────────────────────────────

+# Canonical ONNX manifest for each upstream insightface pack (v0.7 release
+# at github.com/deepinsight/insightface/releases). LocalAI's gallery extracts
+# these zips flat into the models directory, so when multiple packs or other
+# backends drop their own ONNX files alongside, the glob-the-directory
+# approach picks up foreign files and insightface's model_zoo.get_model()
+# raises IndexError trying to index `input_shape[2]` on a tensor that isn't
+# shaped like a face model. The manifest lets us pre-filter to only the
+# files that actually belong to the requested pack — deterministic, correct
+# pack choice, no crashes on neighbour ONNX files.
+_KNOWN_PACK_MANIFESTS: dict[str, frozenset[str]] = {
+    "buffalo_l": frozenset({
+        "det_10g.onnx",
+        "w600k_r50.onnx",
+        "genderage.onnx",
+        "2d106det.onnx",
+        "1k3d68.onnx",
+    }),
+    "buffalo_sc": frozenset({
+        "det_500m.onnx",
+        "w600k_mbf.onnx",
+    }),
+}
+
+
 class InsightFaceEngine:
    """Drives insightface's model_zoo directly — no FaceAnalysis wrapper.

@@ -222,6 +246,21 @@ class InsightFaceEngine:
            )

        onnx_files = sorted(glob.glob(os.path.join(pack_dir, "*.onnx")))
+        # When the pack extracts flat into a shared models directory it
+        # mixes with ONNX files from other backends (opencv face engine,
+        # MiniFASNet antispoof, WeSpeaker voice embedding, other buffalo
+        # packs installed earlier). Feeding those into model_zoo.get_model()
+        # blows up inside insightface's router — it assumes a 4-D NCHW
+        # input and indexes `input_shape[2]` on tensors that aren't shaped
+        # like a face model, raising IndexError. For the upstream packs we
+        # know the exact ONNX manifest; scoping to it makes the load
+        # deterministic (without it, det_10g.onnx from buffalo_l sorts
+        # before det_500m.onnx from buffalo_sc and silently wins).
+        manifest = _KNOWN_PACK_MANIFESTS.get(self.model_pack)
+        if manifest is not None:
+            scoped = [f for f in onnx_files if os.path.basename(f) in manifest]
+            if scoped:
+                onnx_files = scoped
        if not onnx_files:
            raise ValueError(f"no ONNX files in pack directory: {pack_dir}")

@@ -231,14 +270,31 @@ class InsightFaceEngine:
        self._providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]

        self.models = {}
+        skipped: list[tuple[str, str]] = []
        for onnx_file in onnx_files:
-            m = model_zoo.get_model(onnx_file, providers=self._providers)
+            try:
+                m = model_zoo.get_model(onnx_file, providers=self._providers)
+            except Exception as err:
+                # Foreign ONNX (wrong rank/shape, non-insightface model) —
+                # older insightface versions raise IndexError / ValueError
+                # instead of returning None. Keep loading the rest.
+                skipped.append((os.path.basename(onnx_file), str(err)))
+                continue
            if m is None:
+                skipped.append((os.path.basename(onnx_file), "unknown taskname"))
                continue
            # First occurrence of each taskname wins (matches FaceAnalysis).
            if m.taskname not in self.models:
                self.models[m.taskname] = m

+        if skipped:
+            import sys
+            print(
+                f"[insightface] skipped {len(skipped)} non-pack ONNX file(s) in {pack_dir}: "
+                + ", ".join(f"{n} ({why})" for n, why in skipped),
+                file=sys.stderr,
+            )
+
        if "detection" not in self.models:
            raise ValueError(f"no detector (taskname='detection') found in {pack_dir}")
        self.det_model = self.models["detection"]
--- a/backend/python/mlx-vlm/requirements-cpu.txt
+++ b/backend/python/mlx-vlm/requirements-cpu.txt
@@ -1,2 +1,2 @@
-git+https://github.com/Blaizzy/mlx-vlm
+git+https://github.com/Blaizzy/mlx-vlm@v0.4.4
 mlx[cpu]
--- a/backend/python/mlx-vlm/requirements-cublas12.txt
+++ b/backend/python/mlx-vlm/requirements-cublas12.txt
@@ -1,2 +1,2 @@
-git+https://github.com/Blaizzy/mlx-vlm
+git+https://github.com/Blaizzy/mlx-vlm@v0.4.4
 mlx[cuda12]
--- a/backend/python/mlx-vlm/requirements-cublas13.txt
+++ b/backend/python/mlx-vlm/requirements-cublas13.txt
@@ -1,2 +1,2 @@
-git+https://github.com/Blaizzy/mlx-vlm
+git+https://github.com/Blaizzy/mlx-vlm@v0.4.4
 mlx[cuda13]
--- a/backend/python/mlx-vlm/requirements-l4t12.txt
+++ b/backend/python/mlx-vlm/requirements-l4t12.txt
@@ -1,2 +1,2 @@
-git+https://github.com/Blaizzy/mlx-vlm
+git+https://github.com/Blaizzy/mlx-vlm@v0.4.4
 mlx[cuda12]
--- a/backend/python/mlx-vlm/requirements-l4t13.txt
+++ b/backend/python/mlx-vlm/requirements-l4t13.txt
@@ -1,2 +1,2 @@
-git+https://github.com/Blaizzy/mlx-vlm
+git+https://github.com/Blaizzy/mlx-vlm@v0.4.4
 mlx[cuda13]
--- a/backend/python/mlx-vlm/requirements-mps.txt
+++ b/backend/python/mlx-vlm/requirements-mps.txt
@@ -1 +1 @@
-git+https://github.com/Blaizzy/mlx-vlm
+git+https://github.com/Blaizzy/mlx-vlm@v0.4.4
--- a/backend/python/sglang/install.sh
+++ b/backend/python/sglang/install.sh
@@ -23,6 +23,19 @@ if [ "x${BUILD_PROFILE}" == "xcpu" ]; then
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
 fi

+# JetPack 7 / L4T arm64 wheels are built for cp312 and shipped via
+# pypi.jetson-ai-lab.io. Bump the venv Python so the prebuilt sglang
+# wheel resolves cleanly. unsafe-best-match is required because the
+# jetson-ai-lab index lists transitive deps (e.g. decord) at older
+# versions only — without it uv refuses to fall through to PyPI for a
+# compatible wheel and resolution fails.
+if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
+    PYTHON_VERSION="3.12"
+    PYTHON_PATCH="12"
+    PY_STANDALONE_TAG="20251120"
+    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
+fi
+
 # sglang's CPU path has no prebuilt wheel on PyPI — upstream publishes
 # a separate pyproject_cpu.toml that must be swapped in before `pip install`.
 # Reference: docker/xeon.Dockerfile in the sglang upstream repo.
--- a/backend/python/sglang/requirements-l4t13.txt
+++ b/backend/python/sglang/requirements-l4t13.txt
@@ -0,0 +1,12 @@
+--extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
+accelerate
+torch
+torchvision
+torchaudio
+transformers
+# Drop the [all] extra: it pulls outlines/decord, and decord has no
+# aarch64 cp312 wheel on PyPI or the jetson-ai-lab index (only legacy
+# cp35-cp37 wheels exist). With [all], uv backtracks through versions
+# trying to satisfy decord and lands on sglang==0.1.16. Floor at 0.5.0
+# so uv can't silently downgrade if a future resolution misfires.
+sglang>=0.5.0
--- a/backend/python/speaker-recognition/engines.py
+++ b/backend/python/speaker-recognition/engines.py
@@ -317,8 +317,23 @@ class OnnxDirectEngine:
        else:
            provider_list = ["CPUExecutionProvider"]
        self._session = ort.InferenceSession(onnx_path, providers=provider_list)
-        self._input_name = self._session.get_inputs()[0].name
+        input_meta = self._session.get_inputs()[0]
+        self._input_name = input_meta.name
+        # Pre-exported speaker encoders come in two shapes:
+        #   rank-2  [batch, samples]          — some 3D-Speaker exports feed raw waveform.
+        #   rank-3  [batch, frames, n_mels]   — WeSpeaker and most Kaldi-lineage encoders
+        #                                        expect pre-computed Kaldi FBank features.
+        # We detect this at load time and branch in embed(), because feeding raw audio
+        # into a rank-3 graph is exactly what triggered
+        # "Invalid rank for input: feats Got: 2 Expected: 3".
+        self._input_rank = len(input_meta.shape) if input_meta.shape is not None else 2
        self._expected_sr = int(options.get("sample_rate", "16000"))
+        self._fbank_mels = int(options.get("fbank_num_mel_bins", "80"))
+        self._fbank_frame_length_ms = float(options.get("fbank_frame_length_ms", "25"))
+        self._fbank_frame_shift_ms = float(options.get("fbank_frame_shift_ms", "10"))
+        # Per-utterance cepstral mean normalisation — on for WeSpeaker by default,
+        # toggleable for encoders that expect raw FBank.
+        self._fbank_cmn = options.get("fbank_cmn", "true").lower() in ("1", "true", "yes")
        self._analysis = AnalysisHead(options)

    def _load_waveform(self, path: str):
@@ -344,11 +359,37 @@ class OnnxDirectEngine:
        import numpy as np

        audio = self._load_waveform(audio_path)
-        feed = audio.reshape(1, -1)
+        if self._input_rank >= 3:
+            feats = self._extract_fbank(audio)        # [frames, n_mels]
+            feed = feats[np.newaxis, :, :]             # [1, frames, n_mels]
+        else:
+            feed = audio.reshape(1, -1)                # [1, samples]
        out = self._session.run(None, {self._input_name: feed})
        vec = np.asarray(out[0]).reshape(-1)
        return [float(x) for x in vec]

+    def _extract_fbank(self, audio):
+        """Compute Kaldi-style FBank features (80 mel bins by default) for
+        speaker encoders that expect pre-featurised input (WeSpeaker, most
+        3D-Speaker exports). torchaudio is already a backend dependency for
+        SpeechBrain, so no new package is required."""
+        import numpy as np
+        import torch  # type: ignore
+        import torchaudio.compliance.kaldi as kaldi  # type: ignore
+
+        tensor = torch.from_numpy(audio).unsqueeze(0)  # [1, samples]
+        feats = kaldi.fbank(
+            tensor,
+            sample_frequency=self._expected_sr,
+            num_mel_bins=self._fbank_mels,
+            frame_length=self._fbank_frame_length_ms,
+            frame_shift=self._fbank_frame_shift_ms,
+            dither=0.0,
+        )  # [frames, n_mels]
+        if self._fbank_cmn:
+            feats = feats - feats.mean(dim=0, keepdim=True)
+        return feats.numpy().astype(np.float32)
+
    def compare(self, audio1: str, audio2: str) -> float:
        return _cosine_distance(self.embed(audio1), self.embed(audio2))

--- a/backend/python/vllm-omni/install.sh
+++ b/backend/python/vllm-omni/install.sh
@@ -12,11 +12,15 @@ else
    source $backend_dir/../common/libbackend.sh
 fi

-# Handle l4t build profiles (Python 3.12, pip fallback) if needed
+# Handle l4t build profiles (Python 3.12, pip fallback) if needed.
+# unsafe-best-match is required on l4t13 because the jetson-ai-lab index
+# lists transitive deps at limited versions — without it uv pins to the
+# first matching index and fails to resolve a compatible wheel from PyPI.
 if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
  PYTHON_VERSION="3.12"
  PYTHON_PATCH="12"
  PY_STANDALONE_TAG="20251120"
+  EXTRA_PIP_INSTALL_FLAGS="${EXTRA_PIP_INSTALL_FLAGS:-} --index-strategy=unsafe-best-match"
 fi

 if [ "x${BUILD_PROFILE}" == "xl4t12" ]; then
@@ -26,7 +30,11 @@ fi
 # Install base requirements first
 installRequirements

-# Install vllm based on build type
+# Install vllm based on build type. vllm-omni tracks vllm master from
+# source (cloned below) so we leave the upstream vllm dependency unpinned
+# — vllm 0.19+ ships cu130 wheels by default, which is what we want for
+# cublas13. Older cuda12/rocm/cpu paths still resolve a compatible wheel
+# from the relevant channel.
 if [ "x${BUILD_TYPE}" == "xhipblas" ]; then
    # ROCm
    if [ "x${USE_PIP}" == "xtrue" ]; then
@@ -34,8 +42,26 @@ if [ "x${BUILD_TYPE}" == "xhipblas" ]; then
    else
        uv pip install vllm==0.14.0 --extra-index-url https://wheels.vllm.ai/rocm/0.14.0/rocm700
    fi
+elif [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
+    # JetPack 7 / L4T arm64 cu130 — vllm comes from the prebuilt SBSA wheel
+    # at jetson-ai-lab. Version is unpinned: the index ships whatever build
+    # matches the cu130/cp312 ABI. unsafe-best-match lets uv fall through
+    # to PyPI for transitive deps not present on the jetson-ai-lab index.
+    if [ "x${USE_PIP}" == "xtrue" ]; then
+        pip install vllm --extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
+    else
+        uv pip install --index-strategy=unsafe-best-match vllm --extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
+    fi
+elif [ "x${BUILD_PROFILE}" == "xcublas13" ]; then
+    # vllm 0.19+ defaults to cu130 wheels on PyPI, no extra index needed.
+    if [ "x${USE_PIP}" == "xtrue" ]; then
+        pip install vllm --torch-backend=auto
+    else
+        uv pip install vllm --torch-backend=auto
+    fi
 elif [ "x${BUILD_TYPE}" == "xcublas" ] || [ "x${BUILD_TYPE}" == "x" ]; then
-    # CUDA (default) or CPU
+    # cuda12 / CPU — keep the 0.14.0 pin for compatibility with the existing
+    # cuda12 vllm-omni image; bumping should be its own change.
    if [ "x${USE_PIP}" == "xtrue" ]; then
        pip install vllm==0.14.0 --torch-backend=auto
    else
--- a/backend/python/vllm-omni/requirements-cublas13.txt
+++ b/backend/python/vllm-omni/requirements-cublas13.txt
@@ -0,0 +1,5 @@
+--extra-index-url https://download.pytorch.org/whl/cu130
+accelerate
+torch
+transformers
+bitsandbytes
--- a/backend/python/vllm-omni/requirements-l4t13.txt
+++ b/backend/python/vllm-omni/requirements-l4t13.txt
@@ -0,0 +1,13 @@
+--extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
+accelerate
+torch
+torchvision
+torchaudio
+transformers
+bitsandbytes
+flash-attn
+diffusers
+librosa
+soundfile
+pillow
+numpy
--- a/backend/python/vllm/install.sh
+++ b/backend/python/vllm/install.sh
@@ -32,6 +32,22 @@ if [ "x${BUILD_PROFILE}" == "xcpu" ]; then
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
 fi

+# JetPack 7 / L4T arm64 wheels (torch, vllm, flash-attn) live on
+# pypi.jetson-ai-lab.io and are built for cp312, so bump the venv Python
+# accordingly. JetPack 6 keeps cp310 + USE_PIP=true. unsafe-best-match
+# is required because the jetson-ai-lab index lists transitive deps at
+# limited versions — without it uv pins to the first matching index and
+# fails to resolve a compatible wheel from PyPI.
+if [ "x${BUILD_PROFILE}" == "xl4t12" ]; then
+    USE_PIP=true
+fi
+if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
+    PYTHON_VERSION="3.12"
+    PYTHON_PATCH="12"
+    PY_STANDALONE_TAG="20251120"
+    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
+fi
+
 # FROM_SOURCE=true on a CPU build skips the prebuilt vllm wheel in
 # requirements-cpu-after.txt and compiles vllm locally against the host's
 # actual CPU. Not used by default because it takes ~30-40 minutes, but
--- a/backend/python/vllm/requirements-cublas12-after.txt
+++ b/backend/python/vllm/requirements-cublas12-after.txt
@@ -1,2 +1,9 @@
-https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
+# flash-attn wheels are ABI-tied to a specific torch version. vllm forces
+# torch==2.10.0 as a hard dep, but flash-attn 2.8.3 (latest) only ships
+# prebuilt wheels up to torch 2.8 — any wheel we pin here gets silently
+# broken when vllm upgrades torch during install, producing an undefined
+# libc10_cuda symbol at import time. FlashInfer (required by vllm) covers
+# attention, and rotary_embedding/common.py guards the flash_attn import
+# with find_spec(), so skipping flash-attn is safe and the only stable
+# choice until upstream ships a torch-2.10 wheel.
 vllm
--- a/backend/python/vllm/requirements-cublas12.txt
+++ b/backend/python/vllm/requirements-cublas12.txt
@@ -1,4 +1,4 @@
 accelerate
-torch==2.7.0
+torch
 transformers
 bitsandbytes
--- a/backend/python/vllm/requirements-cublas13-after.txt
+++ b/backend/python/vllm/requirements-cublas13-after.txt
@@ -0,0 +1,2 @@
+--extra-index-url https://download.pytorch.org/whl/cu130
+vllm
--- a/backend/python/vllm/requirements-cublas13.txt
+++ b/backend/python/vllm/requirements-cublas13.txt
@@ -0,0 +1,5 @@
+--extra-index-url https://download.pytorch.org/whl/cu130
+accelerate
+torch
+transformers
+bitsandbytes
--- a/backend/python/vllm/requirements-l4t13-after.txt
+++ b/backend/python/vllm/requirements-l4t13-after.txt
@@ -0,0 +1,2 @@
+--extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
+vllm
--- a/backend/python/vllm/requirements-l4t13.txt
+++ b/backend/python/vllm/requirements-l4t13.txt
@@ -0,0 +1,8 @@
+--extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
+accelerate
+torch
+torchvision
+torchaudio
+transformers
+bitsandbytes
+flash-attn
--- a/backend/rust/kokoros/Cargo.lock
+++ b/backend/rust/kokoros/Cargo.lock
@@ -1867,9 +1867,9 @@ dependencies = [

 [[package]]
 name = "rustls-webpki"
-version = "0.103.10"
+version = "0.103.13"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "df33b2b81ac578cabaf06b89b0631153a3f416b0a886e8a7a1707fb51abbd1ef"
+checksum = "61c429a8649f110dddef65e2a5ad240f747e85f7758a6bccc7e5777bd33f756e"
 dependencies = [
 "ring",
 "rustls-pki-types",
--- a/core/application/application.go
+++ b/core/application/application.go
@@ -81,18 +81,30 @@ func newApplication(appConfig *config.ApplicationConfig) *Application {
 	// The resolver closes over the ModelLoader so the Registry stays
 	// decoupled from loader plumbing; swapping in a postgres-backed
 	// implementation later is a single construction change here.
+	//
+	// `faceStoreName` is the default namespace passed to StoreBackend when
+	// the request doesn't override it. Face and voice MUST use distinct
+	// namespaces — the local-store gRPC surface rejects mixed dimensions
+	// inside one namespace ("Try to add key with length N when existing
+	// length is M"). ArcFace buffalo_l produces 512-dim embeddings while
+	// ECAPA-TDNN produces 192-dim; enrolling one after the other into a
+	// shared namespace is exactly how we hit that error.
+	const (
+		faceStoreName  = "localai-face-biometrics"
+		voiceStoreName = "localai-voice-biometrics"
+	)
 	faceStoreResolver := func(_ context.Context, storeName string) (pkggrpc.Backend, error) {
 		return corebackend.StoreBackend(ml, appConfig, storeName, "")
 	}
-	app.faceRegistry = facerecognition.NewStoreRegistry(faceStoreResolver, "", faceEmbeddingDim)
+	app.faceRegistry = facerecognition.NewStoreRegistry(faceStoreResolver, faceStoreName, faceEmbeddingDim)

 	// Voice (speaker) recognition registry — same plumbing, separate
-	// registry so embedding spaces stay isolated (a face vector and a
-	// speaker vector are not comparable).
+	// namespace so embedding spaces stay isolated (a face vector and a
+	// speaker vector are not comparable and differ in dimensionality).
 	voiceStoreResolver := func(_ context.Context, storeName string) (pkggrpc.Backend, error) {
 		return corebackend.StoreBackend(ml, appConfig, storeName, "")
 	}
-	app.voiceRegistry = voicerecognition.NewStoreRegistry(voiceStoreResolver, "", voiceEmbeddingDim)
+	app.voiceRegistry = voicerecognition.NewStoreRegistry(voiceStoreResolver, voiceStoreName, voiceEmbeddingDim)

 	return app
 }
--- a/core/application/startup.go
+++ b/core/application/startup.go
@@ -242,6 +242,12 @@ func New(opts ...config.AppOption) (*Application, error) {
 		bmFn := func() galleryop.BackendManager { return application.GalleryService().BackendManager() }
 		uc := NewUpgradeChecker(options, application.ModelLoader(), application.distributedDB(), bmFn)
 		application.upgradeChecker = uc
+		// Refresh the upgrade cache the moment a backend op finishes — otherwise
+		// the UI keeps showing a just-upgraded backend as upgradeable until the
+		// next 6-hour tick. TriggerCheck is non-blocking.
+		if gs := application.GalleryService(); gs != nil {
+			gs.OnBackendOpCompleted = uc.TriggerCheck
+		}
 		go uc.Run(options.Context)
 	}

--- a/core/backend/stores.go
+++ b/core/backend/stores.go
@@ -11,8 +11,17 @@ func StoreBackend(sl *model.ModelLoader, appConfig *config.ApplicationConfig, st
 	if backend == "" {
 		backend = model.LocalStoreBackend
 	}
+	// ModelLoader caches backend processes by `modelID`, not by the `model`
+	// passed via WithModel. Without a distinct modelID, every StoreBackend
+	// call collapses to the same `modelID=""` cache slot — face (512-D) and
+	// voice (192-D) biometrics would then share the same local-store process
+	// and the second enrollment would fail with
+	//   Try to add key with length N when existing length is M
+	// Use the store namespace as modelID so each namespace gets its own
+	// process instance and its own in-memory Store{}.
 	sc := []model.Option{
 		model.WithBackendString(backend),
+		model.WithModelID(storeName),
 		model.WithModel(storeName),
 	}

--- a/core/cli/worker.go
+++ b/core/cli/worker.go
@@ -90,6 +90,14 @@ type WorkerCMD struct {
 	RegistrationToken string `env:"LOCALAI_REGISTRATION_TOKEN" help:"Token for authenticating with the frontend" group:"registration"`
 	HeartbeatInterval string `env:"LOCALAI_HEARTBEAT_INTERVAL" default:"10s" help:"Interval between heartbeats" group:"registration"`
 	NodeLabels        string `env:"LOCALAI_NODE_LABELS" help:"Comma-separated key=value labels for this node (e.g. tier=fast,gpu=a100)" group:"registration"`
+	// MaxReplicasPerModel caps how many replicas of any one model can run on
+	// this worker concurrently. Default 1 = historical single-replica
+	// behavior. Set higher when a node has enough VRAM to host multiple
+	// copies of the same model (e.g. a 128 GiB box running four replicas of
+	// a 24 GiB model for throughput). The auto-label `node.replica-slots=N`
+	// is published so model schedulers can target high-capacity nodes via
+	// the existing label selector.
+	MaxReplicasPerModel int `env:"LOCALAI_MAX_REPLICAS_PER_MODEL" default:"1" help:"Max replicas of any single model on this worker. Default 1 preserves single-replica behavior; set higher to allow stacking replicas on a fat node." group:"registration"`

 	// NATS (required)
 	NatsURL string `env:"LOCALAI_NATS_URL" required:"" help:"NATS server URL" group:"distributed"`
@@ -567,22 +575,35 @@ func (s *backendSupervisor) getAddr(backend string) string {
 	return ""
 }

+// buildProcessKey is the supervisor's stable identifier for a backend gRPC
+// process. It includes the replica index so the same model can run multiple
+// processes on a worker simultaneously without colliding on the same map slot
+// or port. The "#N" suffix is purely internal — the controller never reads it.
+func buildProcessKey(modelID, backend string, replicaIndex int) string {
+	base := modelID
+	if base == "" {
+		base = backend
+	}
+	return fmt.Sprintf("%s#%d", base, replicaIndex)
+}
+
 // installBackend handles the backend.install flow:
-// 1. If already running for this model, return existing address
+// 1. If already running for this (model, replica) slot, return existing address
 // 2. Install backend from gallery (if not already installed)
 // 3. Find backend binary
 // 4. Start gRPC process on a new port
 // Returns the gRPC address of the backend process.
+//
+// ProcessKey includes the replica index so a worker with MaxReplicasPerModel>1
+// can host multiple processes for the same model on distinct ports. Old
+// controllers (no replica_index in the request) implicitly target replica 0,
+// which preserves single-replica behavior.
 func (s *backendSupervisor) installBackend(req messaging.BackendInstallRequest) (string, error) {
-	// Process key: use ModelID if provided (per-model process), else backend name
-	processKey := req.ModelID
-	if processKey == "" {
-		processKey = req.Backend
-	}
+	processKey := buildProcessKey(req.ModelID, req.Backend, int(req.ReplicaIndex))

-	// If already running for this model, return its address
+	// If already running for this model+replica, return its address
 	if addr := s.getAddr(processKey); addr != "" {
-		xlog.Info("Backend already running for model", "backend", req.Backend, "model", req.ModelID, "addr", addr)
+		xlog.Info("Backend already running for model replica", "backend", req.Backend, "model", req.ModelID, "replica", req.ReplicaIndex, "addr", addr)
 		return addr, nil
 	}

@@ -886,13 +907,18 @@ func (cmd *WorkerCMD) registrationBody() map[string]any {
 	totalVRAM, _ := xsysinfo.TotalAvailableVRAM()
 	gpuVendor, _ := xsysinfo.DetectGPUVendor()

+	maxReplicas := cmd.MaxReplicasPerModel
+	if maxReplicas < 1 {
+		maxReplicas = 1
+	}
 	body := map[string]any{
-		"name":           nodeName,
-		"address":        cmd.advertiseAddr(),
-		"http_address":   cmd.advertiseHTTPAddr(),
-		"total_vram":     totalVRAM,
-		"available_vram": totalVRAM, // initially all VRAM is available
-		"gpu_vendor":     gpuVendor,
+		"name":                   nodeName,
+		"address":                cmd.advertiseAddr(),
+		"http_address":           cmd.advertiseHTTPAddr(),
+		"total_vram":             totalVRAM,
+		"available_vram":         totalVRAM, // initially all VRAM is available
+		"gpu_vendor":             gpuVendor,
+		"max_replicas_per_model": maxReplicas,
 	}

 	// If no GPU detected, report system RAM so the scheduler/UI has capacity info
@@ -906,39 +932,40 @@ func (cmd *WorkerCMD) registrationBody() map[string]any {
 		body["token"] = cmd.RegistrationToken
 	}

-	// Parse and add static node labels
+	// Parse and add static node labels. Always include the auto-label
+	// `node.replica-slots=N` so AND-selectors in ModelSchedulingConfig can
+	// target high-capacity nodes (e.g. {"node.replica-slots":"4"}).
+	labels := make(map[string]string)
 	if cmd.NodeLabels != "" {
-		labels := make(map[string]string)
 		for _, pair := range strings.Split(cmd.NodeLabels, ",") {
 			pair = strings.TrimSpace(pair)
 			if k, v, ok := strings.Cut(pair, "="); ok {
 				labels[strings.TrimSpace(k)] = strings.TrimSpace(v)
 			}
 		}
-		if len(labels) > 0 {
-			body["labels"] = labels
-		}
 	}
+	labels["node.replica-slots"] = strconv.Itoa(maxReplicas)
+	body["labels"] = labels

 	return body
 }

 // heartbeatBody returns the current VRAM/RAM stats for heartbeat payloads.
+//
+// When aggregate VRAM usage is unknown (no GPU, or temporary detection
+// failure), we deliberately OMIT available_vram so the frontend keeps its
+// last good value — overwriting with 0 makes the UI show the node as "fully
+// used", while reporting total-as-available lies to the scheduler about
+// free capacity.
 func (cmd *WorkerCMD) heartbeatBody() map[string]any {
-	var availVRAM uint64
+	body := map[string]any{}
 	aggregate := xsysinfo.GetGPUAggregateInfo()
 	if aggregate.TotalVRAM > 0 {
-		availVRAM = aggregate.FreeVRAM
-	} else {
-		// Fallback: report total as available (no usage tracking possible)
-		availVRAM, _ = xsysinfo.TotalAvailableVRAM()
+		body["available_vram"] = aggregate.FreeVRAM
 	}

-	body := map[string]any{
-		"available_vram": availVRAM,
-	}
-
-	// If no GPU, report system RAM usage instead
+	// CPU-only workers (or workers that lost GPU visibility momentarily):
+	// report system RAM so the scheduler still has capacity info.
 	if aggregate.TotalVRAM == 0 {
 		if ramInfo, err := xsysinfo.GetSystemRAMInfo(); err == nil {
 			body["available_ram"] = ramInfo.Available
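The label handling in `registrationBody` above can be sketched in isolation. This is a minimal sketch, assuming a hypothetical stand-alone helper named `buildLabels` (it does not exist in the patch); it only mirrors the parsing and auto-label logic shown in the hunk.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// buildLabels is a hypothetical helper (not part of the patch) that mirrors
// how registrationBody merges user labels with the auto-label.
func buildLabels(nodeLabels string, maxReplicas int) map[string]string {
	// Same coercion as the patch: anything below 1 means single-replica.
	if maxReplicas < 1 {
		maxReplicas = 1
	}
	labels := make(map[string]string)
	for _, pair := range strings.Split(nodeLabels, ",") {
		pair = strings.TrimSpace(pair)
		// Pairs without "=" (including the empty string) are skipped.
		if k, v, ok := strings.Cut(pair, "="); ok {
			labels[strings.TrimSpace(k)] = strings.TrimSpace(v)
		}
	}
	// Assigned last, so the auto-label wins over a user-supplied value.
	labels["node.replica-slots"] = strconv.Itoa(maxReplicas)
	return labels
}

func main() {
	fmt.Println(buildLabels("tier=fast, gpu=a100", 2))
}
```

Because the auto-label is written after the user labels are parsed, a `node.replica-slots` entry passed through `LOCALAI_NODE_LABELS` would be overwritten by the value derived from `MaxReplicasPerModel`.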
--- a/core/cli/worker_replica_test.go
+++ b/core/cli/worker_replica_test.go
@@ -0,0 +1,70 @@
+package cli
+
+import (
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+var _ = Describe("Worker per-replica process keying", func() {
+	Describe("buildProcessKey", func() {
+		// Pin the supervisor's keying contract: distinct replica indexes for
+		// the same modelID produce distinct process keys, so the supervisor
+		// map can hold multiple processes for one model. Dropping the suffix
+		// would re-introduce the original flap (one model, one slot, churn).
+		DescribeTable("produces stable, distinct keys",
+			func(modelID, backend string, replica int, want string) {
+				Expect(buildProcessKey(modelID, backend, replica)).To(Equal(want))
+			},
+			Entry("modelID present, replica 0", "Qwen3-35B", "llama-cpp", 0, "Qwen3-35B#0"),
+			Entry("modelID present, replica 1", "Qwen3-35B", "llama-cpp", 1, "Qwen3-35B#1"),
+			Entry("falls back to backend when modelID empty", "", "llama-cpp", 0, "llama-cpp#0"),
+			Entry("backend fallback with replica 2", "", "llama-cpp", 2, "llama-cpp#2"),
+		)
+
+		It("makes replicas distinguishable", func() {
+			r0 := buildProcessKey("model-a", "llama-cpp", 0)
+			r1 := buildProcessKey("model-a", "llama-cpp", 1)
+			Expect(r0).ToNot(Equal(r1), "replicas of the same model must produce distinct keys")
+		})
+	})
+
+	Describe("registrationBody", func() {
+		It("includes max_replicas_per_model and the auto-label", func() {
+			cmd := &WorkerCMD{
+				Addr:                "worker.example.com:50051",
+				MaxReplicasPerModel: 4,
+			}
+			body := cmd.registrationBody()
+
+			Expect(body).To(HaveKey("max_replicas_per_model"))
+			Expect(body["max_replicas_per_model"]).To(Equal(4))
+
+			labels, ok := body["labels"].(map[string]string)
+			Expect(ok).To(BeTrue(), "labels must be present so selectors can target the slot count")
+			Expect(labels).To(HaveKeyWithValue("node.replica-slots", "4"))
+		})
+
+		It("coerces zero/unset MaxReplicasPerModel to 1", func() {
+			cmd := &WorkerCMD{Addr: "worker.example.com:50051"}
+			body := cmd.registrationBody()
+			Expect(body["max_replicas_per_model"]).To(Equal(1),
+				"unset must default to single-replica behavior, not capacity 0")
+
+			labels := body["labels"].(map[string]string)
+			Expect(labels).To(HaveKeyWithValue("node.replica-slots", "1"))
+		})
+
+		It("preserves user-provided labels alongside the auto-label", func() {
+			cmd := &WorkerCMD{
+				Addr:                "worker.example.com:50051",
+				MaxReplicasPerModel: 2,
+				NodeLabels:          "tier=fast,gpu=a100",
+			}
+			body := cmd.registrationBody()
+			labels := body["labels"].(map[string]string)
+			Expect(labels).To(HaveKeyWithValue("tier", "fast"))
+			Expect(labels).To(HaveKeyWithValue("gpu", "a100"))
+			Expect(labels).To(HaveKeyWithValue("node.replica-slots", "2"))
+		})
+	})
+})
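The keying contract pinned by these specs can be exercised outside Ginkgo. The sketch below copies `buildProcessKey` from the patch and adds `splitProcessKey`, a hypothetical inverse that is an assumption for illustration only; the patch itself does not define it, and the log store's actual suffix handling may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// buildProcessKey is copied from the patch: "<modelID or backend>#<replica>".
func buildProcessKey(modelID, backend string, replicaIndex int) string {
	base := modelID
	if base == "" {
		base = backend
	}
	return fmt.Sprintf("%s#%d", base, replicaIndex)
}

// splitProcessKey is a hypothetical inverse (not in the patch). It reports
// ok=false for a bare model ID, i.e. a key without a valid "#N" suffix.
func splitProcessKey(key string) (base string, replica int, ok bool) {
	i := strings.LastIndex(key, "#")
	if i < 0 {
		return key, 0, false
	}
	n, err := strconv.Atoi(key[i+1:])
	if err != nil || n < 0 {
		return key, 0, false
	}
	return key[:i], n, true
}

func main() {
	key := buildProcessKey("Qwen3-35B", "llama-cpp", 1)
	base, replica, ok := splitProcessKey(key)
	fmt.Println(key, base, replica, ok)
}
```

Splitting on the last `#` keeps the sketch tolerant of model IDs that themselves contain `#`, which is a design choice of this example rather than something the patch states.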
--- a/core/config/model_config.go
+++ b/core/config/model_config.go
@@ -767,7 +767,7 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
 	}

 	if (u & FLAG_VAD) == FLAG_VAD {
-		if c.Backend != "silero-vad" && !(c.Backend == "whisper" && slices.Contains(c.Options, "vad_only")) {
+		if c.Backend != "silero-vad" && c.Backend != "sherpa-onnx" && !(c.Backend == "whisper" && slices.Contains(c.Options, "vad_only")) {
 			return false
 		}
 	}
--- a/core/gallery/backends.go
+++ b/core/gallery/backends.go
@@ -194,6 +194,20 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL

 	name := config.Name
 	backendPath := filepath.Join(systemState.Backend.BackendsPath, name)
+	// Clean up legacy flat-layout artefacts: earlier dev builds of the
+	// golang backends dropped the compiled binary directly at
+	// `<backendsPath>/<name>` (a plain file) instead of
+	// `<backendsPath>/<name>/<name>` (the nested layout the current code
+	// expects). MkdirAll below returns ENOTDIR when such a stale file
+	// exists, permanently blocking any reinstall or upgrade. Remove the
+	// file first so the install can proceed; the new install will write
+	// the correct nested layout, including metadata.json + run.sh.
+	if fi, statErr := os.Lstat(backendPath); statErr == nil && !fi.IsDir() {
+		xlog.Warn("removing stale non-directory backend artefact to make room for fresh install", "path", backendPath)
+		if rmErr := os.Remove(backendPath); rmErr != nil {
+			return fmt.Errorf("failed to remove stale backend artefact at %s: %w", backendPath, rmErr)
+		}
+	}
 	err = os.MkdirAll(backendPath, 0750)
 	if err != nil {
 		return fmt.Errorf("failed to create base path: %v", err)
--- a/core/http/endpoints/anthropic/messages.go
+++ b/core/http/endpoints/anthropic/messages.go
@@ -880,7 +880,7 @@ func convertAnthropicTools(input *schema.AnthropicRequest, cfg *config.ModelConf
 			if tcType, ok := tc["type"].(string); ok && tcType == "tool" {
 				if name, ok := tc["name"].(string); ok {
 					// Force specific tool
-					cfg.SetFunctionCallString(name)
+					cfg.SetFunctionCallNameString(name)
 				}
 			}
 		}
--- a/core/http/endpoints/localai/audio.go
+++ b/core/http/endpoints/localai/audio.go
@@ -14,7 +14,13 @@ import (
 	"github.com/mudler/LocalAI/pkg/utils"
 )

-var audioDataURIPattern = regexp.MustCompile(`^data:([^;]+);base64,`)
+// Match `data:<mime>[;param=value...];base64,` — MediaRecorder in the browser
+// produces data URIs like `data:audio/webm;codecs=opus;base64,...`, so the
+// pre-`;base64,` section can contain zero or more parameter segments. The
+// old `([^;]+)` form only matched exactly one segment and left recordings
+// from the React UI's live-capture tab unparsed, which then failed base64
+// decoding on the leading `data:` bytes.
+var audioDataURIPattern = regexp.MustCompile(`^data:[^,]+?;base64,`)

 var audioDownloadClient = http.Client{Timeout: 30 * time.Second}

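The difference between the old and new patterns can be checked with a small program. This is a sketch only: `stripPrefix` is a hypothetical helper written for the comparison, and the payload `AAAA` is a placeholder rather than real audio data.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Pattern removed by the patch: exactly one segment before ";base64,".
	oldPattern = regexp.MustCompile(`^data:([^;]+);base64,`)
	// Pattern added by the patch: any non-comma run before ";base64,".
	newPattern = regexp.MustCompile(`^data:[^,]+?;base64,`)
)

// stripPrefix is a hypothetical helper: it removes the data-URI header when
// the pattern matches and reports whether it did.
func stripPrefix(re *regexp.Regexp, s string) (string, bool) {
	loc := re.FindStringIndex(s)
	if loc == nil {
		return s, false
	}
	return s[loc[1]:], true
}

func main() {
	in := "data:audio/webm;codecs=opus;base64,AAAA"
	_, okOld := stripPrefix(oldPattern, in)
	payload, okNew := stripPrefix(newPattern, in)
	fmt.Println(okOld, okNew, payload) // prints: false true AAAA
}
```

Both patterns still accept the single-segment form such as `data:audio/wav;base64,...`, so the change only widens what is matched.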
--- a/core/http/endpoints/localai/backend.go
+++ b/core/http/endpoints/localai/backend.go
@@ -98,7 +98,7 @@ func (mgs *BackendEndpointService) GetAllStatusEndpoint() echo.HandlerFunc {
 // @Param request body GalleryBackend true "query params"
 // @Success 200 {object} schema.BackendResponse "Response"
 // @Router /backends/apply [post]
-func (mgs *BackendEndpointService) ApplyBackendEndpoint() echo.HandlerFunc {
+func (mgs *BackendEndpointService) ApplyBackendEndpoint(systemState *system.SystemState) echo.HandlerFunc {
 	return func(c echo.Context) error {
 		input := new(GalleryBackend)
 		// Get input data from the request body
@@ -106,6 +106,18 @@ func (mgs *BackendEndpointService) ApplyBackendEndpoint() echo.HandlerFunc {
 			return err
 		}

+		// In distributed mode, refuse to fan out a hardware-specific build to
+		// every node — a CPU build landing on a GPU cluster is almost always
+		// wrong, and the silent footgun is exactly what this guard exists for.
+		// Auto-resolving (meta) backends are fine because each node picks its
+		// own variant. Tooling can recover by hitting
+		// POST /api/nodes/{id}/backends/install per target node.
+		if mgs.backendApplier.BackendManager().IsDistributed() && input.ID != "" {
+			if guard := concreteFanOutGuard(c, mgs.galleries, systemState, input.ID); guard != nil {
+				return guard
+			}
+		}
+
 		uuid, err := uuid.NewUUID()
 		if err != nil {
 			return err
@@ -120,6 +132,66 @@ func (mgs *BackendEndpointService) ApplyBackendEndpoint() echo.HandlerFunc {
 	}
 }

+// concreteFanOutGuard returns a 409 response if the requested backend is a
+// hardware-specific build (not auto-resolving / meta) and we are in
+// distributed mode. It looks up the backend in the configured galleries; if
+// the lookup itself fails (gallery unreachable, name not found), the guard
+// stays out of the way and lets the install enqueue normally — a missing
+// name will surface from the worker as a clearer error than the guard could
+// produce here. The `error` message is written for human readers; `code` and
+// `meta_alternative` are the programmatic contract for tooling.
+func concreteFanOutGuard(c echo.Context, galleries []config.Gallery, systemState *system.SystemState, backendID string) error {
+	// Use the unfiltered listing because in distributed mode the frontend's
+	// hardware is irrelevant — the install targets workers, not us — and the
+	// filtered list would hide variants that don't match the frontend host
+	// (e.g. a CUDA build on a CPU-only frontend), preventing the guard from
+	// firing for exactly the cases it's meant to protect against.
+	available, err := gallery.AvailableBackendsUnfiltered(galleries, systemState)
+	if err != nil {
+		return nil
+	}
+	requested := available.FindByName(backendID)
+	if requested == nil || requested.IsMeta() {
+		return nil
+	}
+
+	// Try to find an auto-resolving (meta) backend that has this concrete
+	// variant in its CapabilitiesMap, so we can suggest it as a one-shot
+	// alternative. Optional — empty string is fine if no parent exists.
+	metaAlternative := ""
+	for _, b := range available {
+		if !b.IsMeta() {
+			continue
+		}
+		for _, concrete := range b.CapabilitiesMap {
+			if concrete == backendID {
+				metaAlternative = b.Name
+				break
+			}
+		}
+		if metaAlternative != "" {
+			break
+		}
+	}
+
+	msg := fmt.Sprintf(
+		"Backend %q is a hardware-specific build and won't run correctly on every node in this cluster. In distributed mode, install it on specific nodes:\n\n  POST /api/nodes/{node_id}/backends/install\n  {\"backend\": %q}",
+		backendID, backendID,
+	)
+	if metaAlternative != "" {
+		msg += fmt.Sprintf(
+			"\n\nTo install across all nodes, use the auto-resolving backend %q — each node picks its own variant based on its hardware.",
+			metaAlternative,
+		)
+	}
+
+	return c.JSON(409, map[string]any{
+		"error":            msg,
+		"code":             "concrete_backend_requires_target",
+		"meta_alternative": metaAlternative,
+	})
+}
+
 // DeleteBackendEndpoint lets delete backends from a LocalAI instance
 // @Summary delete backends from LocalAI.
 // @Tags backends
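Tooling that calls `POST /backends/apply` could consume the guard's 409 body roughly as follows. This is a sketch under assumptions: the struct and the `guardSuggestion` function are hypothetical names, and the backend names in `main` are made up for illustration. Only the three JSON keys and the `code` value come from the patch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// fanOutGuardBody mirrors the JSON keys emitted by concreteFanOutGuard.
type fanOutGuardBody struct {
	Error           string `json:"error"`
	Code            string `json:"code"`
	MetaAlternative string `json:"meta_alternative"`
}

// guardSuggestion is a hypothetical client-side helper. It reports whether
// the response is the fan-out guard and, if so, the suggested auto-resolving
// backend (which may be empty when no parent backend exists).
func guardSuggestion(status int, body []byte) (string, bool) {
	if status != http.StatusConflict {
		return "", false
	}
	var g fanOutGuardBody
	if err := json.Unmarshal(body, &g); err != nil {
		return "", false
	}
	if g.Code != "concrete_backend_requires_target" {
		return "", false
	}
	return g.MetaAlternative, true
}

func main() {
	// Example payload with made-up backend names.
	body := []byte(`{"error":"...","code":"concrete_backend_requires_target","meta_alternative":"vllm"}`)
	alt, guarded := guardSuggestion(http.StatusConflict, body)
	fmt.Println(guarded, alt)
}
```

Matching on `code` rather than on the `error` text keeps a client like this stable if the human-readable message is reworded later.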
--- a/core/http/endpoints/localai/nodes.go
+++ b/core/http/endpoints/localai/nodes.go
@@ -73,6 +73,10 @@ type RegisterNodeRequest struct {
 	AvailableRAM  uint64 `json:"available_ram,omitempty"`
 	GPUVendor     string            `json:"gpu_vendor,omitempty"`
 	Labels        map[string]string `json:"labels,omitempty"`
+	// MaxReplicasPerModel is the per-node cap on replicas of any single model.
+	// Workers older than this field omit it; we coerce 0 → 1 below to preserve
+	// historical single-replica behavior.
+	MaxReplicasPerModel int `json:"max_replicas_per_model,omitempty"`
 }

 // RegisterNodeEndpoint registers a new backend node.
@@ -131,17 +135,26 @@ func RegisterNodeEndpoint(registry *nodes.NodeRegistry, expectedToken string, au
 			tokenHash = hex.EncodeToString(h[:])
 		}

+		// Coerce 0 → 1 for backward compat with workers that don't send the field.
+		// GORM's `default:1` only fires for a missing column; once Go zero-values
+		// reach the struct field they're written as 0 unless explicitly set here.
+		maxReplicasPerModel := req.MaxReplicasPerModel
+		if maxReplicasPerModel < 1 {
+			maxReplicasPerModel = 1
+		}
+
 		node := &nodes.BackendNode{
-			Name:          req.Name,
-			NodeType:      nodeType,
-			Address:       req.Address,
-			HTTPAddress:   req.HTTPAddress,
-			TokenHash:     tokenHash,
-			TotalVRAM:     req.TotalVRAM,
-			AvailableVRAM: req.AvailableVRAM,
-			TotalRAM:      req.TotalRAM,
-			AvailableRAM:  req.AvailableRAM,
-			GPUVendor:     req.GPUVendor,
+			Name:                req.Name,
+			NodeType:            nodeType,
+			Address:             req.Address,
+			HTTPAddress:         req.HTTPAddress,
+			TokenHash:           tokenHash,
+			TotalVRAM:           req.TotalVRAM,
+			AvailableVRAM:       req.AvailableVRAM,
+			TotalRAM:            req.TotalRAM,
+			AvailableRAM:        req.AvailableRAM,
+			GPUVendor:           req.GPUVendor,
+			MaxReplicasPerModel: maxReplicasPerModel,
 		}

 		ctx := c.Request().Context()
@@ -363,6 +376,9 @@ func ResumeNodeEndpoint(registry *nodes.NodeRegistry) echo.HandlerFunc {
 }

 // InstallBackendOnNodeEndpoint triggers backend installation on a worker node via NATS.
+// Backend can be either a gallery ID (resolved against BackendGalleries) or a
+// direct URI install (URI + Name + optional Alias) — same shape as the
+// standalone /api/backends/install-external path, just scoped to one node.
 func InstallBackendOnNodeEndpoint(unloader nodes.NodeCommandSender) echo.HandlerFunc {
 	return func(c echo.Context) error {
 		if unloader == nil {
@@ -372,17 +388,27 @@ func InstallBackendOnNodeEndpoint(unloader nodes.NodeCommandSender) echo.Handler
 		var req struct {
 			Backend          string `json:"backend"`
 			BackendGalleries string `json:"backend_galleries,omitempty"`
+			URI              string `json:"uri,omitempty"`
+			Name             string `json:"name,omitempty"`
+			Alias            string `json:"alias,omitempty"`
 		}
-		if err := c.Bind(&req); err != nil || req.Backend == "" {
-			return c.JSON(http.StatusBadRequest, nodeError(http.StatusBadRequest, "backend name required"))
+		if err := c.Bind(&req); err != nil {
+			return c.JSON(http.StatusBadRequest, nodeError(http.StatusBadRequest, "invalid request body"))
 		}
-		reply, err := unloader.InstallBackend(nodeID, req.Backend, "", req.BackendGalleries, "", "", "")
+		// Either a gallery backend name or a direct URI must be supplied.
+		if req.Backend == "" && req.URI == "" {
+			return c.JSON(http.StatusBadRequest, nodeError(http.StatusBadRequest, "backend name or uri required"))
+		}
+		// Admin-driven backend install: not tied to a specific replica slot
+		// (no model is being loaded). Pass replica 0 to match the worker's
+		// admin process-key convention (`backend#0`).
+		reply, err := unloader.InstallBackend(nodeID, req.Backend, "", req.BackendGalleries, req.URI, req.Name, req.Alias, 0)
 		if err != nil {
-			xlog.Error("Failed to install backend on node", "node", nodeID, "backend", req.Backend, "error", err)
+			xlog.Error("Failed to install backend on node", "node", nodeID, "backend", req.Backend, "uri", req.URI, "error", err)
 			return c.JSON(http.StatusInternalServerError, nodeError(http.StatusInternalServerError, "failed to install backend on node"))
 		}
 		if !reply.Success {
-			xlog.Error("Backend install failed on node", "node", nodeID, "backend", req.Backend, "error", reply.Error)
+			xlog.Error("Backend install failed on node", "node", nodeID, "backend", req.Backend, "uri", req.URI, "error", reply.Error)
 			return c.JSON(http.StatusInternalServerError, nodeError(http.StatusInternalServerError, "backend installation failed"))
 		}
 		return c.JSON(http.StatusOK, map[string]string{"message": "backend installed"})
@@ -457,8 +483,8 @@ func UnloadModelOnNodeEndpoint(unloader nodes.NodeCommandSender, registry *nodes
 			xlog.Error("Failed to stop backend after model unload", "node", nodeID, "model", req.ModelName, "error", err)
 			return c.JSON(http.StatusInternalServerError, nodeError(http.StatusInternalServerError, "model unloaded but backend stop failed"))
 		}
-		// Remove from registry
-		registry.RemoveNodeModel(c.Request().Context(), nodeID, req.ModelName)
+		// Remove every replica of this model on the node from the registry.
+		registry.RemoveAllNodeModelReplicas(c.Request().Context(), nodeID, req.ModelName)
 		return c.JSON(http.StatusOK, map[string]string{"message": "model unloaded"})
 	}
 }
@@ -484,7 +510,7 @@ func DeleteModelOnNodeEndpoint(unloader nodes.NodeCommandSender, registry *nodes
 			// Non-fatal — backend process may not be running
 			xlog.Warn("StopBackend failed during model deletion (non-fatal)", "node", nodeID, "model", req.ModelName, "error", err)
 		}
-		registry.RemoveNodeModel(c.Request().Context(), nodeID, req.ModelName)
+		registry.RemoveAllNodeModelReplicas(c.Request().Context(), nodeID, req.ModelName)
 		return c.JSON(http.StatusOK, map[string]string{"message": "model deleted from node"})
 	}
 }
@@ -659,6 +685,78 @@ func GetNodeLabelsEndpoint(registry *nodes.NodeRegistry) echo.HandlerFunc {
 	}
 }

+// UpdateMaxReplicasPerModelRequest is the body for the per-node replica cap endpoint.
+type UpdateMaxReplicasPerModelRequest struct {
+	// Value is the new per-model replica cap on this node. Must be >= 1.
+	Value int `json:"value"`
+}
+
+// UpdateMaxReplicasPerModelEndpoint sets the per-node cap on how many replicas
+// of any one model can be loaded concurrently. The corresponding
+// `node.replica-slots` auto-label is refreshed so existing AND-selectors keep
+// matching, and any unsatisfiable scheduling cooldowns are cleared so the
+// reconciler retries on the next tick.
+//
+// This is a transient admin override — a worker re-registration restores the
+// value the worker was started with (--max-replicas-per-model). For permanent
+// fleet changes, change the worker flag.
+//
+// @Summary Update a node's max replicas per model
+// @Tags Nodes
+// @Param id path string true "Node ID"
+// @Param request body UpdateMaxReplicasPerModelRequest true "New value"
+// @Success 200 {object} map[string]int
+// @Failure 400 {object} map[string]any "value must be >= 1"
+// @Failure 404 {object} map[string]any "node not found"
+// @Router /api/nodes/{id}/max-replicas-per-model [put]
+func UpdateMaxReplicasPerModelEndpoint(registry *nodes.NodeRegistry) echo.HandlerFunc {
+	return func(c echo.Context) error {
+		ctx := c.Request().Context()
+		nodeID := c.Param("id")
+		if _, err := registry.Get(ctx, nodeID); err != nil {
+			return c.JSON(http.StatusNotFound, nodeError(http.StatusNotFound, "node not found"))
+		}
+		var req UpdateMaxReplicasPerModelRequest
+		if err := c.Bind(&req); err != nil {
+			return c.JSON(http.StatusBadRequest, nodeError(http.StatusBadRequest, "invalid request body"))
+		}
+		if req.Value < 1 {
+			return c.JSON(http.StatusBadRequest, nodeError(http.StatusBadRequest, "value must be >= 1"))
+		}
+		if err := registry.UpdateMaxReplicasPerModel(ctx, nodeID, req.Value); err != nil {
+			xlog.Error("Failed to update max_replicas_per_model", "node", nodeID, "value", req.Value, "error", err)
+			return c.JSON(http.StatusInternalServerError, nodeError(http.StatusInternalServerError, "failed to update max replicas per model"))
+		}
+		return c.JSON(http.StatusOK, map[string]int{"max_replicas_per_model": req.Value})
+	}
+}
+
+// ResetMaxReplicasPerModelEndpoint clears the admin override on a node, so
+// the next worker re-registration is allowed to update the value from its
+// CLI flag again. The current value stays in place until the worker
+// re-registers.
+//
+// @Summary Reset a node's max replicas per model to the worker default
+// @Tags Nodes
+// @Param id path string true "Node ID"
+// @Success 200 {object} map[string]bool
+// @Failure 404 {object} map[string]any "node not found"
+// @Router /api/nodes/{id}/max-replicas-per-model [delete]
+func ResetMaxReplicasPerModelEndpoint(registry *nodes.NodeRegistry) echo.HandlerFunc {
+	return func(c echo.Context) error {
+		ctx := c.Request().Context()
+		nodeID := c.Param("id")
+		if _, err := registry.Get(ctx, nodeID); err != nil {
+			return c.JSON(http.StatusNotFound, nodeError(http.StatusNotFound, "node not found"))
+		}
+		if err := registry.ResetMaxReplicasPerModel(ctx, nodeID); err != nil {
+			xlog.Error("Failed to reset max_replicas_per_model override", "node", nodeID, "error", err)
+			return c.JSON(http.StatusInternalServerError, nodeError(http.StatusInternalServerError, "failed to reset override"))
+		}
+		return c.JSON(http.StatusOK, map[string]bool{"reset": true})
+	}
+}
+
 // SetNodeLabelsEndpoint replaces all labels for a node.
 func SetNodeLabelsEndpoint(registry *nodes.NodeRegistry) echo.HandlerFunc {
 	return func(c echo.Context) error {
--- a/core/http/endpoints/openai/realtime.go
+++ b/core/http/endpoints/openai/realtime.go
@@ -1315,13 +1315,35 @@ func triggerResponse(ctx context.Context, session *Session, conv *Conversation,
 	}
 	thinkingStartToken := reasoning.DetectThinkingStartToken(template, &config.ReasoningConfig)

-	reasoningText, responseWithoutReasoning := reasoning.ExtractReasoningWithConfig(rawResponse, thinkingStartToken, config.ReasoningConfig)
+	// When the C++ autoparser emitted ChatDeltas with actionable data,
+	// prefer them — the backend clears Reply.Message in that path and
+	// delivers parsed content/reasoning/tool-calls via the delta stream
+	// (see pkg/functions/chat_deltas.go, mirrored from chat.go's non-SSE
+	// handling). Without this, Response is empty and realtime would
+	// synthesize silence for replies that actually produced tokens.
+	var reasoningText, responseWithoutReasoning, textContent, cleanedResponse string
+	var toolCalls []functions.FuncCallResults
+	deltaToolCalls := functions.ToolCallsFromChatDeltas(pred.ChatDeltas)
+	deltaContent := functions.ContentFromChatDeltas(pred.ChatDeltas)
+	deltaReasoning := functions.ReasoningFromChatDeltas(pred.ChatDeltas)
+	if len(deltaToolCalls) > 0 || deltaContent != "" {
+		xlog.Debug("[ChatDeltas] realtime: using C++ autoparser deltas",
+			"tool_calls", len(deltaToolCalls),
+			"content_len", len(deltaContent),
+			"reasoning_len", len(deltaReasoning))
+		reasoningText = deltaReasoning
+		responseWithoutReasoning = deltaContent
+		textContent = deltaContent
+		cleanedResponse = deltaContent
+		toolCalls = deltaToolCalls
+	} else {
+		reasoningText, responseWithoutReasoning = reasoning.ExtractReasoningWithConfig(rawResponse, thinkingStartToken, config.ReasoningConfig)
+		textContent = functions.ParseTextContent(responseWithoutReasoning, config.FunctionsConfig)
+		cleanedResponse = functions.CleanupLLMResult(responseWithoutReasoning, config.FunctionsConfig)
+		toolCalls = functions.ParseFunctionCall(cleanedResponse, config.FunctionsConfig)
+	}
 	xlog.Debug("LLM Response", "reasoning", reasoningText, "response_without_reasoning", responseWithoutReasoning)

-	textContent := functions.ParseTextContent(responseWithoutReasoning, config.FunctionsConfig)
-	cleanedResponse := functions.CleanupLLMResult(responseWithoutReasoning, config.FunctionsConfig)
-	toolCalls := functions.ParseFunctionCall(cleanedResponse, config.FunctionsConfig)
-
 	xlog.Debug("Function call parsing", "textContent", textContent, "cleanedResponse", cleanedResponse, "toolCallsCount", len(toolCalls))

 	noActionName := "answer"
--- a/core/http/endpoints/openai/realtime_model.go
+++ b/core/http/endpoints/openai/realtime_model.go
@@ -168,7 +168,7 @@ func (m *wrappedModel) Predict(ctx context.Context, messages schema.Messages, im
 			}
 		} else if toolChoice.Function != nil {
 			// Specific function specified
-			m.LLMConfig.SetFunctionCallString(toolChoice.Function.Name)
+			m.LLMConfig.SetFunctionCallNameString(toolChoice.Function.Name)
 		}
 	}

--- a/core/http/endpoints/openresponses/responses.go
+++ b/core/http/endpoints/openresponses/responses.go
@@ -773,7 +773,7 @@ func convertORToolsToFunctions(input *schema.OpenResponsesRequest, cfg *config.M
 		case map[string]any:
 			if tcType, ok := tc["type"].(string); ok && tcType == "function" {
 				if name, ok := tc["name"].(string); ok {
-					cfg.SetFunctionCallString(name)
+					cfg.SetFunctionCallNameString(name)
 				}
 			}
 		}
--- a/core/http/react-ui/e2e/manage-logs-link.spec.js
+++ b/core/http/react-ui/e2e/manage-logs-link.spec.js
@@ -1,29 +1,32 @@
 import { test, expect } from '@playwright/test'

 test.describe('Manage Page - Backend Logs Link', () => {
-  test('models table shows terminal icon for logs', async ({ page }) => {
+  test('row action menu exposes Backend logs entry with terminal icon', async ({ page }) => {
    await page.goto('/app/manage')
-    // Wait for models to load
    await expect(page.locator('.table')).toBeVisible({ timeout: 10_000 })

-    // Check for terminal icon (backend logs link)
-    const terminalIcon = page.locator('a[title="Backend logs"] i.fa-terminal')
-    await expect(terminalIcon.first()).toBeVisible()
+    // Row actions live behind the kebab (ActionMenu) — open the first row's menu.
+    const trigger = page.locator('button.action-menu__trigger').first()
+    await expect(trigger).toBeVisible()
+    await trigger.click()
+
+    const logsItem = page.getByRole('menuitem', { name: 'Backend logs' })
+    await expect(logsItem).toBeVisible()
+    await expect(logsItem.locator('i.fa-terminal')).toBeVisible()
  })

-  test('terminal icon links to backend-logs page', async ({ page }) => {
+  test('Backend logs menu item navigates to backend-logs page', async ({ page }) => {
    await page.goto('/app/manage')
    await expect(page.locator('.table')).toBeVisible({ timeout: 10_000 })

-    const logsLink = page.locator('a[title="Backend logs"]').first()
-    await expect(logsLink).toBeVisible()
+    const trigger = page.locator('button.action-menu__trigger').first()
+    await expect(trigger).toBeVisible()
+    await trigger.click()

-    // Link uses href="#" with onClick for navigation
-    const href = await logsLink.getAttribute('href')
-    expect(href).toBe('#')
+    const logsItem = page.getByRole('menuitem', { name: 'Backend logs' })
+    await expect(logsItem).toBeVisible()
+    await logsItem.click()

-    // Click and verify navigation
-    await logsLink.click()
    await expect(page).toHaveURL(/\/app\/backend-logs\//)
  })
 })
--- a/core/http/react-ui/e2e/nodes-per-node-backend-actions.spec.js
+++ b/core/http/react-ui/e2e/nodes-per-node-backend-actions.spec.js
@@ -0,0 +1,166 @@
+import { test, expect } from '@playwright/test'
+
+// These specs cover the per-node backend row in the Nodes page:
+//   - the upgrade affordance is self-explanatory (icon + tooltip)
+//   - a delete affordance is present and goes through ConfirmDialog
+//
+// We mock the distributed-mode API so the tests can run against the
+// standalone ui-test-server without spinning up workers/NATS.
+
+const NODE_ID = 'test-node-1'
+const NODE_NAME = 'worker-test'
+const BACKEND_NAME = 'cuda12-vllm-development'
+
+async function mockDistributedNodes(page, { onDelete } = {}) {
+  await page.route('**/api/nodes', (route) => {
+    route.fulfill({
+      status: 200,
+      contentType: 'application/json',
+      body: JSON.stringify([
+        {
+          id: NODE_ID,
+          name: NODE_NAME,
+          node_type: 'backend',
+          address: '10.0.0.1:50051',
+          http_address: '10.0.0.1:8090',
+          status: 'healthy',
+          total_vram: 0,
+          available_vram: 0,
+          total_ram: 8_000_000_000,
+          available_ram: 4_000_000_000,
+          gpu_vendor: '',
+          last_heartbeat: new Date().toISOString(),
+          created_at: new Date().toISOString(),
+          updated_at: new Date().toISOString(),
+        },
+      ]),
+    })
+  })
+
+  await page.route('**/api/nodes/scheduling', (route) => {
+    route.fulfill({
+      status: 200,
+      contentType: 'application/json',
+      body: '[]',
+    })
+  })
+
+  await page.route(`**/api/nodes/${NODE_ID}/models`, (route) => {
+    route.fulfill({
+      status: 200,
+      contentType: 'application/json',
+      body: '[]',
+    })
+  })
+
+  await page.route(`**/api/nodes/${NODE_ID}/backends`, (route) => {
+    route.fulfill({
+      status: 200,
+      contentType: 'application/json',
+      body: JSON.stringify([
+        {
+          name: BACKEND_NAME,
+          is_system: false,
+          is_meta: false,
+          installed_at: new Date().toISOString(),
+        },
+      ]),
+    })
+  })
+
+  await page.route(`**/api/nodes/${NODE_ID}/backends/delete`, async (route) => {
+    if (onDelete) {
+      await onDelete(route)
+    }
+    await route.fulfill({
+      status: 200,
+      contentType: 'application/json',
+      body: JSON.stringify({ message: 'backend deleted' }),
+    })
+  })
+}
+
+async function expandNodeAndWaitForBackends(page) {
+  await page.goto('/app/nodes')
+  // Click the row to expand it. The chevron toggle and the row both work,
+  // but clicking the name cell is the most user-like.
+  await page.getByText(NODE_NAME).first().click()
+  // Backends, Capacity and Labels live behind a "Manage" <details>
+  // disclosure (the drawer was distilled to keep at-a-glance content
+  // lean — see distill refactor in the multi-replica branch). Open it
+  // by clicking the summary inside the .node-manage scope so the
+  // per-node backend table is in the DOM before assertions run.
+  await page.locator('.node-manage > summary').first().click()
+  await expect(page.getByRole('cell', { name: BACKEND_NAME, exact: true })).toBeVisible({ timeout: 10_000 })
+}
+
+test.describe('Nodes page — per-node backend actions', () => {
+  test('upgrade affordance is self-explanatory (not "Reinstall backend" with a sync icon)', async ({ page }) => {
+    await mockDistributedNodes(page)
+    await expandNodeAndWaitForBackends(page)
+
+    // Negative: the old, ambiguous wording must not be used.
+    await expect(page.locator('button[title="Reinstall backend"]')).toHaveCount(0)
+    await expect(page.locator('button[title="Reinstall backend"] i.fa-sync-alt')).toHaveCount(0)
+
+    // Positive: a self-explanatory upgrade affordance is rendered next to the
+    // backend row. We accept either an arrow-up or arrows-rotate glyph; both
+    // map to "upgrade" semantics in FontAwesome 6 unambiguously.
+    const upgradeBtn = page.locator('button[title="Upgrade backend on this node"]')
+    await expect(upgradeBtn).toBeVisible()
+    const iconClass = await upgradeBtn.locator('i').getAttribute('class')
+    expect(iconClass).toMatch(/fa-(arrow-up|arrows-rotate|up-long)/)
+  })
+
+  test('per-node backend row shows a delete (trash) button next to upgrade', async ({ page }) => {
+    await mockDistributedNodes(page)
+    await expandNodeAndWaitForBackends(page)
+
+    const deleteBtn = page.locator('button[title="Delete backend from this node"]')
+    await expect(deleteBtn).toBeVisible()
+    await expect(deleteBtn.locator('i.fa-trash')).toBeVisible()
+  })
+
+  test('clicking delete opens the confirm dialog and POSTs to the per-node delete endpoint', async ({ page }) => {
+    let postedBody = null
+    await mockDistributedNodes(page, {
+      onDelete: async (route) => {
+        postedBody = route.request().postDataJSON()
+      },
+    })
+    await expandNodeAndWaitForBackends(page)
+
+    await page.locator('button[title="Delete backend from this node"]').click()
+
+    // ConfirmDialog uses role="alertdialog" and a danger confirm button.
+    const dialog = page.getByRole('alertdialog')
+    await expect(dialog).toBeVisible()
+    const confirmBtn = dialog.locator('button.btn-danger')
+    await expect(confirmBtn).toBeVisible()
+    await confirmBtn.click()
+
+    // Wait until the POST landed.
+    await expect.poll(() => postedBody, { timeout: 5_000 }).toEqual({ backend: BACKEND_NAME })
+  })
+
+  test('clicking delete and cancelling does not POST', async ({ page }) => {
+    let deleteCalls = 0
+    await mockDistributedNodes(page, {
+      onDelete: () => {
+        deleteCalls += 1
+      },
+    })
+    await expandNodeAndWaitForBackends(page)
+
+    await page.locator('button[title="Delete backend from this node"]').click()
+
+    const dialog = page.getByRole('alertdialog')
+    await expect(dialog).toBeVisible()
+    await dialog.getByRole('button', { name: /cancel/i }).click()
+    await expect(dialog).toBeHidden()
+
+    // Give any errant request a moment to fire so a regression would be caught.
+    await page.waitForTimeout(500)
+    expect(deleteCalls).toBe(0)
+  })
+})
--- a/core/http/react-ui/index.html
+++ b/core/http/react-ui/index.html
@@ -7,7 +7,7 @@
    <link rel="icon" type="image/svg+xml" href="/favicon.svg" />
    <link rel="preconnect" href="https://fonts.googleapis.com" />
    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
-    <link href="https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;600;700&family=JetBrains+Mono:wght@400;500;600&display=swap" rel="stylesheet" />
+    <link href="https://fonts.googleapis.com/css2?family=Geist:wght@300..700&family=Geist+Mono:wght@300..700&display=swap" rel="stylesheet" />
  </head>
  <body>
    <div id="root"></div>
--- a/core/http/react-ui/package-lock.json
+++ b/core/http/react-ui/package-lock.json
@@ -3258,9 +3258,9 @@
      }
    },
    "node_modules/postcss": {
-      "version": "8.5.8",
-      "resolved": "https://registry.npmjs.org/postcss/-/postcss-8.5.8.tgz",
-      "integrity": "sha512-OW/rX8O/jXnm82Ey1k44pObPtdblfiuWnrd8X7GJ7emImCOstunGbXUpp7HdBrFQX6rJzn3sPT397Wp5aCwCHg==",
+      "version": "8.5.10",
+      "resolved": "https://registry.npmjs.org/postcss/-/postcss-8.5.10.tgz",
+      "integrity": "sha512-pMMHxBOZKFU6HgAZ4eyGnwXF/EvPGGqUr0MnZ5+99485wwW41kW91A4LOGxSHhgugZmSChL5AlElNdwlNgcnLQ==",
      "dev": true,
      "funding": [
        {
@@ -3276,6 +3276,7 @@
          "url": "https://github.com/sponsors/ai"
        }
      ],
+      "license": "MIT",
      "dependencies": {
        "nanoid": "^3.3.11",
        "picocolors": "^1.1.1",
--- a/core/http/react-ui/src/App.css
+++ b/core/http/react-ui/src/App.css
--- a/core/http/react-ui/src/App.jsx
+++ b/core/http/react-ui/src/App.jsx
@@ -1,9 +1,11 @@
-import { useState, useEffect } from 'react'
-import { Outlet, useLocation } from 'react-router-dom'
+import { useState, useEffect, useRef } from 'react'
+import { Outlet, useLocation, useNavigate } from 'react-router-dom'
 import Sidebar from './components/Sidebar'
 import OperationsBar from './components/OperationsBar'
 import { ToastContainer, useToast } from './components/Toast'
 import { systemApi } from './utils/api'
+import { useTheme } from './contexts/ThemeContext'
+import { useAuth } from './context/AuthContext'

 const COLLAPSED_KEY = 'localai_sidebar_collapsed'

@@ -15,6 +17,10 @@ export default function App() {
  const { toasts, addToast, removeToast } = useToast()
  const [version, setVersion] = useState('')
  const location = useLocation()
+  const navigate = useNavigate()
+  const { theme, toggleTheme } = useTheme()
+  const { authEnabled, user } = useAuth()
+  const hamburgerRef = useRef(null)
  const isChatRoute = location.pathname.match(/\/chat(\/|$)/) || location.pathname.match(/\/agents\/[^/]+\/chat/)

  useEffect(() => {
@@ -34,26 +40,80 @@ export default function App() {
    window.scrollTo(0, 0)
  }, [location.pathname])

+  // Drawer polish: lock body scroll, close on Escape, return focus to the
+  // hamburger when the drawer closes. Only engages when the drawer is open;
+  // desktop and tablet rail mode are unaffected.
+  useEffect(() => {
+    if (!sidebarOpen) return
+    const prevOverflow = document.body.style.overflow
+    document.body.style.overflow = 'hidden'
+    const onKey = (e) => { if (e.key === 'Escape') setSidebarOpen(false) }
+    window.addEventListener('keydown', onKey)
+    return () => {
+      document.body.style.overflow = prevOverflow
+      window.removeEventListener('keydown', onKey)
+      // Restore focus to the trigger so keyboard users land back where
+      // they invoked the drawer from.
+      hamburgerRef.current?.focus()
+    }
+  }, [sidebarOpen])
+
  const layoutClasses = [
    'app-layout',
    isChatRoute ? 'app-layout-chat' : '',
    sidebarCollapsed ? 'sidebar-is-collapsed' : '',
  ].filter(Boolean).join(' ')

+  const showAvatar = authEnabled && user
+  const accountLabel = user?.name || user?.email || 'Account'
+
  return (
    <div className={layoutClasses}>
      <Sidebar isOpen={sidebarOpen} onClose={() => setSidebarOpen(false)} />
-      <main className="main-content">
+      <main className="main-content" {...(sidebarOpen ? { 'aria-hidden': 'true', inert: '' } : {})}>
        <OperationsBar />
-        {/* Mobile header */}
+        {/* Mobile header — primary actions reachable without opening the
+            drawer. Hamburger is the only way to expand the nav on phones;
+            theme toggle and account avatar are mirrored from the sidebar
+            footer so they remain one tap away. */}
        <header className="mobile-header">
          <button
+            ref={hamburgerRef}
            className="hamburger-btn"
            onClick={() => setSidebarOpen(true)}
+            aria-label="Open menu"
+            aria-expanded={sidebarOpen}
+            aria-controls="app-sidebar"
          >
-            <i className="fas fa-bars" />
+            <i className="fas fa-bars" aria-hidden="true" />
          </button>
          <span className="mobile-title">LocalAI</span>
+          <div className="mobile-header-actions">
+            <button
+              type="button"
+              className="mobile-header-btn"
+              onClick={toggleTheme}
+              aria-label={`Switch to ${theme === 'dark' ? 'light' : 'dark'} mode`}
+              title={`Switch to ${theme === 'dark' ? 'light' : 'dark'} mode`}
+            >
+              <i className={`fas ${theme === 'dark' ? 'fa-sun' : 'fa-moon'}`} aria-hidden="true" />
+            </button>
+            {showAvatar && (
+              <button
+                type="button"
+                className="mobile-header-btn mobile-header-avatar"
+                onClick={() => navigate('/app/account')}
+                aria-label={`Account: ${accountLabel}`}
+                title={accountLabel}
+              >
+                {user.avatarUrl ? (
+                  <img src={user.avatarUrl} alt="" />
+                ) : (
+                  <i className="fas fa-user-circle" aria-hidden="true" />
+                )}
+              </button>
+            )}
+          </div>
        </header>
        <div className="main-content-inner">
          <div className="page-transition" key={location.pathname}>
--- a/core/http/react-ui/src/components/ActionMenu.jsx
+++ b/core/http/react-ui/src/components/ActionMenu.jsx
@@ -0,0 +1,141 @@
+import { useRef, useState, useEffect, useCallback } from 'react'
+import Popover from './Popover'
+
+// ActionMenu renders a kebab (three-dot) button that opens a popover with a
+// list of row actions. Replaces the inline cluster of icon buttons that made
+// dense tables feel like a control panel — actions stay out of the way until
+// the user reaches for them, the way Linear/Vercel/Notion handle row menus.
+//
+// Items shape:
+//   { key, icon?, label, onClick, danger?, disabled?, hidden?, shortcut? }
+//   { divider: true }                       // visual separator
+//   { type: 'badge', icon?, label }         // non-interactive badge row
+//
+// Hidden items are filtered out so callers can write conditional menus
+// inline (`{ key: 'stop', hidden: !isRunning, ... }` style) without ternaries.
+//
+// Keyboard:
+//   ArrowUp / ArrowDown / Home / End — move highlight (skipping dividers + badges)
+//   Enter / Space        — activate
+//   Escape               — close, return focus to trigger
+export default function ActionMenu({ items, ariaLabel = 'Actions', triggerLabel, compact = false }) {
+  const triggerRef = useRef(null)
+  const [open, setOpen] = useState(false)
+  const [activeIdx, setActiveIdx] = useState(-1)
+
+  const interactive = (Array.isArray(items) ? items : []).filter(it => it && !it.divider && it.type !== 'badge' && !it.hidden)
+  const visible = (Array.isArray(items) ? items : []).filter(it => it && !it.hidden)
+
+  const close = useCallback(() => {
+    setOpen(false)
+    setActiveIdx(-1)
+  }, [])
+
+  // Move highlight to the first interactive item when opening, so keyboard
+  // users land somewhere meaningful instead of having to arrow into the menu.
+  useEffect(() => {
+    if (open && activeIdx === -1 && interactive.length > 0) {
+      setActiveIdx(0)
+    }
+  }, [open, activeIdx, interactive.length])
+
+  const handleTriggerKeyDown = (e) => {
+    if (e.key === 'ArrowDown' || e.key === 'Enter' || e.key === ' ') {
+      e.preventDefault()
+      e.stopPropagation()
+      setOpen(true)
+    }
+  }
+
+  const handleMenuKeyDown = (e) => {
+    if (e.key === 'ArrowDown') {
+      e.preventDefault()
+      setActiveIdx(i => Math.min(interactive.length - 1, (i < 0 ? -1 : i) + 1))
+    } else if (e.key === 'ArrowUp') {
+      e.preventDefault()
+      setActiveIdx(i => Math.max(0, (i < 0 ? interactive.length : i) - 1))
+    } else if (e.key === 'Home') {
+      e.preventDefault()
+      setActiveIdx(0)
+    } else if (e.key === 'End') {
+      e.preventDefault()
+      setActiveIdx(interactive.length - 1)
+    } else if (e.key === 'Enter' || e.key === ' ') {
+      e.preventDefault()
+      const item = interactive[activeIdx]
+      if (item && !item.disabled) {
+        close()
+        item.onClick?.()
+      }
+    }
+  }
+
+  if (interactive.length === 0 && !visible.some(it => it.type === 'badge')) {
+    return null
+  }
+
+  return (
+    <>
+      <button
+        ref={triggerRef}
+        type="button"
+        className={`action-menu__trigger${compact ? ' action-menu__trigger--compact' : ''}${open ? ' is-open' : ''}`}
+        aria-haspopup="menu"
+        aria-expanded={open}
+        aria-label={triggerLabel || ariaLabel}
+        onClick={(e) => { e.stopPropagation(); setOpen(v => !v) }}
+        onKeyDown={handleTriggerKeyDown}
+      >
+        <i className="fas fa-ellipsis-vertical" />
+      </button>
+      <Popover anchor={triggerRef} open={open} onClose={close} ariaLabel={ariaLabel}>
+        <div
+          role="menu"
+          aria-label={ariaLabel}
+          className="action-menu"
+          onKeyDown={handleMenuKeyDown}
+          // Capture focus when the menu opens so arrow keys work without the
+          // user clicking inside first.
+          tabIndex={-1}
+          ref={el => { if (el && open) el.focus() }}
+        >
+          {visible.map((item, i) => {
+            if (item.divider) {
+              return <div key={`d-${i}`} className="action-menu__divider" role="separator" />
+            }
+            if (item.type === 'badge') {
+              return (
+                <div key={item.key || `b-${i}`} className="action-menu__badge" role="presentation">
+                  {item.icon && <i className={`fas ${item.icon}`} aria-hidden="true" />}
+                  <span>{item.label}</span>
+                </div>
+              )
+            }
+            const idx = interactive.indexOf(item)
+            const active = idx === activeIdx
+            return (
+              <button
+                key={item.key}
+                type="button"
+                role="menuitem"
+                disabled={item.disabled}
+                className={`action-menu__item${item.danger ? ' is-danger' : ''}${active ? ' is-active' : ''}`}
+                onMouseEnter={() => setActiveIdx(idx)}
+                onClick={(e) => {
+                  e.stopPropagation()
+                  if (item.disabled) return
+                  close()
+                  item.onClick?.()
+                }}
+              >
+                {item.icon && <i className={`fas ${item.icon} action-menu__icon`} aria-hidden="true" />}
+                <span className="action-menu__label">{item.label}</span>
+                {item.shortcut && <span className="action-menu__shortcut">{item.shortcut}</span>}
+              </button>
+            )
+          })}
+        </div>
+      </Popover>
+    </>
+  )
+}
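Reviewer note on the ActionMenu items contract: the header comment describes three item shapes (action, divider, badge) and a `hidden` flag. The sketch below mirrors the two filters the component applies, as plain JavaScript with no React. `partitionItems` and the sample `items` array are hypothetical names used only for illustration; they are not part of this patch.

```javascript
// Hypothetical helper: reproduces the `visible` / `interactive` split that
// ActionMenu derives from its `items` prop.
function partitionItems(items) {
  const list = Array.isArray(items) ? items : []
  // Everything that is rendered: actions, dividers and badges, minus hidden entries.
  const visible = list.filter(it => it && !it.hidden)
  // Only rows the keyboard highlight can land on.
  const interactive = visible.filter(it => !it.divider && it.type !== 'badge')
  return { visible, interactive }
}

// Sample menu, assuming a row whose model is not currently running.
const isRunning = false
const items = [
  { type: 'badge', icon: 'fa-shield-alt', label: 'System' },
  { key: 'logs', icon: 'fa-terminal', label: 'Backend logs', onClick: () => {} },
  { key: 'stop', icon: 'fa-stop', label: 'Stop', hidden: !isRunning, onClick: () => {} },
  { divider: true },
  { key: 'delete', icon: 'fa-trash', label: 'Delete', danger: true, onClick: () => {} },
]

const { visible, interactive } = partitionItems(items)
console.log(visible.length)                    // rendered rows
console.log(interactive.map(it => it.key))     // keyboard-reachable actions
```

With `isRunning` false, the `stop` entry drops out of both lists, while the badge and divider are still rendered but are skipped by arrow-key navigation.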
--- a/core/http/react-ui/src/components/ClientMCPDropdown.jsx
+++ b/core/http/react-ui/src/components/ClientMCPDropdown.jsx
@@ -80,7 +80,7 @@ export default function ClientMCPDropdown({
                placeholder="Server URL (e.g. https://mcp.example.com/sse)"
                value={url}
                onChange={e => setUrl(e.target.value)}
-                style={{ width: '100%', marginBottom: '4px' }}
+                style={{ width: '100%', marginBottom: 'var(--spacing-xs)' }}
              />
              <input
                type="text"
@@ -88,7 +88,7 @@ export default function ClientMCPDropdown({
                placeholder="Name (optional)"
                value={name}
                onChange={e => setName(e.target.value)}
-                style={{ width: '100%', marginBottom: '4px' }}
+                style={{ width: '100%', marginBottom: 'var(--spacing-xs)' }}
              />
              <input
                type="password"
@@ -96,13 +96,13 @@ export default function ClientMCPDropdown({
                placeholder="Auth token (optional)"
                value={authToken}
                onChange={e => setAuthToken(e.target.value)}
-                style={{ width: '100%', marginBottom: '4px' }}
+                style={{ width: '100%', marginBottom: 'var(--spacing-xs)' }}
              />
              <label style={{ display: 'flex', alignItems: 'center', gap: '6px', fontSize: '0.8rem', marginBottom: '6px' }}>
                <input type="checkbox" checked={useProxy} onChange={e => setUseProxy(e.target.checked)} />
                Use CORS proxy
              </label>
-              <div style={{ display: 'flex', gap: '4px', justifyContent: 'flex-end' }}>
+              <div style={{ display: 'flex', gap: 'var(--spacing-xs)', justifyContent: 'flex-end' }}>
                <button type="button" className="btn btn-sm btn-secondary" onClick={() => setAddDialog(false)}>Cancel</button>
                <button type="button" className="btn btn-sm btn-primary" onClick={handleAdd} disabled={!url.trim()}>Add</button>
              </div>
--- a/core/http/react-ui/src/components/ConfigFieldRenderer.jsx
+++ b/core/http/react-ui/src/components/ConfigFieldRenderer.jsx
@@ -135,7 +135,7 @@ function JsonEditor({ value, onChange }) {
        className="input"
        value={text}
        onChange={e => handleChange(e.target.value)}
-        style={{ width: '100%', minHeight: 80, fontFamily: 'monospace', fontSize: '0.8125rem', resize: 'vertical' }}
+        style={{ width: '100%', minHeight: 80, fontFamily: 'var(--font-mono)', fontSize: '0.8125rem', resize: 'vertical' }}
      />
      {parseError && <div style={{ color: 'var(--color-error)', fontSize: '0.75rem', marginTop: 2 }}>{parseError}</div>}
    </div>
--- a/core/http/react-ui/src/components/FieldBrowser.jsx
+++ b/core/http/react-ui/src/components/FieldBrowser.jsx
@@ -158,7 +158,7 @@ export default function FieldBrowser({ fields, activeFieldPaths, onAddField }) {
                      {field.description}
                    </div>
                  )}
-                  <div style={{ fontSize: '0.6875rem', color: 'var(--color-text-muted)', marginTop: 1, fontFamily: 'monospace' }}>
+                  <div style={{ fontSize: '0.6875rem', color: 'var(--color-text-muted)', marginTop: 1, fontFamily: 'var(--font-mono)' }}>
                    {field.path}
                  </div>
                </div>
--- a/core/http/react-ui/src/components/GalleryLoader.jsx
+++ b/core/http/react-ui/src/components/GalleryLoader.jsx
@@ -0,0 +1,79 @@
+import { useState, useEffect } from 'react'
+
+const LOADING_PHRASES = [
+  { text: 'Loading models...', icon: 'fa-brain' },
+  { text: 'Fetching gallery...', icon: 'fa-download' },
+  { text: 'Checking availability...', icon: 'fa-circle-check' },
+  { text: 'Almost ready...', icon: 'fa-hourglass-half' },
+  { text: 'Preparing gallery...', icon: 'fa-store' },
+]
+
+// GalleryLoader is the animated skeleton used while the gallery list loads.
+// Used by Models, Backends, and (now) the Manage page so an empty fetch state
+// reads the same everywhere instead of one tab showing pulsing dots and the
+// other showing "Loading...".
+export default function GalleryLoader() {
+  const [idx, setIdx] = useState(() => Math.floor(Math.random() * LOADING_PHRASES.length))
+  const [fade, setFade] = useState(true)
+
+  useEffect(() => {
+    const interval = setInterval(() => {
+      setFade(false)
+      setTimeout(() => {
+        setIdx(prev => (prev + 1) % LOADING_PHRASES.length)
+        setFade(true)
+      }, 300)
+    }, 2800)
+    return () => clearInterval(interval)
+  }, [])
+
+  const phrase = LOADING_PHRASES[idx]
+
+  return (
+    <div style={{
+      display: 'flex', flexDirection: 'column', alignItems: 'center',
+      justifyContent: 'center', padding: 'var(--spacing-xl) var(--spacing-md)',
+      minHeight: '280px', gap: 'var(--spacing-lg)',
+    }}>
+      <div style={{ display: 'flex', gap: 'var(--spacing-sm)' }}>
+        {[0, 1, 2, 3, 4].map(i => (
+          <div key={i} style={{
+            width: 10, height: 10, borderRadius: '50%',
+            background: 'var(--color-primary)',
+            animation: `galleryDot 1.4s ease-in-out ${i * 0.15}s infinite`,
+          }} />
+        ))}
+      </div>
+      <div style={{
+        display: 'flex', alignItems: 'center', gap: 'var(--spacing-sm)',
+        opacity: fade ? 1 : 0,
+        transition: 'opacity 300ms ease',
+        color: 'var(--color-text-secondary)',
+        fontSize: '0.9375rem',
+        fontWeight: 500,
+      }}>
+        <i className={`fas ${phrase.icon}`} style={{ color: 'var(--color-accent)', fontSize: '1.125rem' }} />
+        {phrase.text}
+      </div>
+      <div style={{ width: '100%', maxWidth: '700px', display: 'flex', flexDirection: 'column', gap: '12px' }}>
+        {[0.9, 0.7, 0.5].map((opacity, i) => (
+          <div key={i} style={{
+            height: '48px', borderRadius: 'var(--radius-md)',
+            background: 'var(--color-bg-tertiary)', opacity,
+            animation: `galleryShimmer 1.8s ease-in-out ${i * 0.2}s infinite`,
+          }} />
+        ))}
+      </div>
+      <style>{`
+        @keyframes galleryDot {
+          0%, 80%, 100% { transform: scale(0.4); opacity: 0.3; }
+          40% { transform: scale(1); opacity: 1; }
+        }
+        @keyframes galleryShimmer {
+          0%, 100% { opacity: var(--shimmer-base, 0.15); }
+          50% { opacity: var(--shimmer-peak, 0.3); }
+        }
+      `}</style>
+    </div>
+  )
+}
--- a/core/http/react-ui/src/components/ManageSummary.jsx
+++ b/core/http/react-ui/src/components/ManageSummary.jsx
@@ -0,0 +1,47 @@
+import StatCard from './StatCard'
+
+// ManageSummary anchors the Manage page with the same StatCard pattern the
+// Nodes dashboard uses, so the page reads as a real overview rather than
+// "two tabs in a hat". Counts are derived in-memory by the parent — this
+// component is purely presentational. Cards are clickable and route the
+// user to the relevant tab + filter.
+export default function ManageSummary({
+  modelsCount,
+  backendsCount,
+  runningCount,
+  updatesCount,
+  onCardClick,
+}) {
+  const click = (tab, filter) => onCardClick && onCardClick(tab, filter)
+
+  return (
+    <div className="stat-grid manage-summary">
+      <StatCard
+        icon="fas fa-brain"
+        label="Models Installed"
+        value={modelsCount}
+        onClick={() => click('models', 'all')}
+      />
+      <StatCard
+        icon="fas fa-server"
+        label="Backends Installed"
+        value={backendsCount}
+        onClick={() => click('backends', 'all')}
+      />
+      <StatCard
+        icon="fas fa-circle-play"
+        label="Currently Running"
+        value={runningCount}
+        accentVar={runningCount > 0 ? '--color-success' : undefined}
+        onClick={() => click('models', 'running')}
+      />
+      <StatCard
+        icon="fas fa-arrow-up"
+        label="Updates Available"
+        value={updatesCount}
+        accentVar={updatesCount > 0 ? '--color-warning' : undefined}
+        onClick={() => click('backends', updatesCount > 0 ? 'upgradable' : 'all')}
+      />
+    </div>
+  )
+}
--- a/core/http/react-ui/src/components/MetaBadgeRow.jsx
+++ b/core/http/react-ui/src/components/MetaBadgeRow.jsx
@@ -0,0 +1,30 @@
+// MetaBadgeRow renders the System / User / Meta / Dev badge cluster the same
+// way everywhere — Manage tabs and (in future) Install gallery. The badges
+// already exist as classes; this component locks down the icons + labels so
+// the same backend type doesn't read "User" in one tab and "downloaded" in
+// another.
+export default function MetaBadgeRow({ isSystem, isMeta, isDevelopment }) {
+  return (
+    <div className="badge-row">
+      {isSystem ? (
+        <span className="badge badge-info" title="Bundled with the LocalAI runtime">
+          <i className="fas fa-shield-alt" /> System
+        </span>
+      ) : (
+        <span className="badge badge-success" title="Installed from the gallery or external source">
+          <i className="fas fa-download" /> User
+        </span>
+      )}
+      {isMeta && (
+        <span className="badge badge-accent" title="Meta backend — selects a concrete variant per node">
+          <i className="fas fa-layer-group" /> Meta
+        </span>
+      )}
+      {isDevelopment && (
+        <span className="badge badge-warning" title="Marked as development / pre-release by the gallery">
+          <i className="fas fa-flask" /> Dev
+        </span>
+      )}
+    </div>
+  )
+}
--- a/core/http/react-ui/src/components/NodeInstallPicker.jsx
+++ b/core/http/react-ui/src/components/NodeInstallPicker.jsx
@@ -0,0 +1,668 @@
+import { useState, useMemo, useEffect, useRef } from 'react'
+import Modal from './Modal'
+import SearchableSelect from './SearchableSelect'
+import { nodesApi } from '../utils/api'
+
+// NodeInstallPicker is the single multi-node install surface used both from
+// the Backends gallery split-button and from the "Install on more nodes" `+`
+// affordance in the Nodes column. Submit fires N parallel per-node install
+// calls; rows transition inline so the user sees per-node success/failure
+// without leaving the modal.
+//
+// Props:
+//   open               — controls visibility
+//   onClose            — close handler (header X / Cancel / Esc / backdrop)
+//   onComplete         — fired after at least one node install succeeded;
+//                        gallery uses this to refetch and update the Nodes
+//                        column without a manual reload
+//   backend            — { name, isMeta, capabilities, metaBackendFor }
+//   nodes              — BackendNode[] from /api/nodes
+//   installedNodeIds   — Set/array of node IDs that already have this backend
+//   initialSelection   — optional pre-selected node IDs (e.g. "missing nodes"
+//                        when opened from the Nodes column `+` affordance)
+
+const STATUS_LABELS = { healthy: 'Healthy', draining: 'Draining', unhealthy: 'Unhealthy', offline: 'Offline' }
+
+function formatVRAM(bytes) {
+  if (!bytes || bytes === 0) return null
+  const gb = bytes / (1024 * 1024 * 1024)
+  return gb >= 1 ? `${gb.toFixed(1)} GB` : `${(bytes / (1024 * 1024)).toFixed(0)} MB`
+}
+
+function gpuVendorLabel(vendor) {
+  const labels = { nvidia: 'NVIDIA', amd: 'AMD', intel: 'Intel', vulkan: 'Vulkan' }
+  return labels[vendor] || null
+}
+
+// hardwareTargetOf parses the capability key that points to a concrete
+// variant in the parent meta's CapabilitiesMap. e.g. cpu-llama-cpp comes
+// from {"cpu": "cpu-llama-cpp"} → "cpu". Falls back to "" when the parent
+// is unknown (the gallery list payload still gives us metaBackendFor).
+function hardwareTargetOf(backend, allBackends) {
+  if (!backend || !backend.name || backend.isMeta) return ''
+  const parentName = backend.metaBackendFor
+  if (!parentName) return ''
+  const parent = (allBackends || []).find(b => b.name === parentName || b.id === parentName)
+  if (!parent || !parent.capabilities) return ''
+  for (const [cap, concreteName] of Object.entries(parent.capabilities)) {
+    if (concreteName === backend.name) return cap
+  }
+  return ''
+}
+
+// humanTargetLabel turns a capability key into a user-facing phrase used in
+// the picker header note: "CPU build", "CUDA 12 build", etc. Keep it
+// concrete and product-recognisable, not the raw token from the gallery.
+function humanTargetLabel(target) {
+  if (!target) return 'hardware-specific build'
+  const t = target.toLowerCase()
+  if (t.startsWith('cpu') || t === 'default') return 'CPU build'
+  if (t.includes('cuda-13') || t.includes('cuda13')) return 'CUDA 13 build'
+  if (t.includes('cuda-12') || t.includes('cuda12')) return 'CUDA 12 build'
+  if (t.includes('cuda')) return 'NVIDIA CUDA build'
+  if (t.includes('l4t')) return 'NVIDIA Jetson (L4T) build'
+  if (t.includes('nvidia')) return 'NVIDIA build'
+  if (t.includes('rocm') || t.includes('amd')) return 'AMD ROCm build'
+  if (t.includes('metal')) return 'Apple Metal build'
+  if (t.includes('sycl') || t.includes('intel')) return 'Intel SYCL build'
+  if (t.includes('vulkan')) return 'Vulkan build'
+  if (t.includes('darwin-x86')) return 'macOS x86 build'
+  return 'hardware-specific build'
+}
+
+// suitabilityFor returns the picker's per-row suitability state for the
+// requested backend. Already-installed wins over compatible/override so
+// the user sees a single signal per row.
+function suitabilityFor({ node, backend, hardwareTarget, alreadyInstalled }) {
+  if (alreadyInstalled) return 'installed'
+  // backend can be null on the first render before pickerBackend is set —
+  // this function is invoked from useMemo, which runs regardless of the
+  // outer open guard. Treat missing data as "compatible" so the placeholder
+  // render doesn't blow up; the picker won't actually paint anything until
+  // the early-return below the hooks fires.
+  if (!backend || backend.isMeta || !hardwareTarget) return 'compatible'
+  const vendor = (node.gpu_vendor || '').toLowerCase()
+  const t = hardwareTarget.toLowerCase()
+  if (t.startsWith('cpu') || t === 'default') {
+    // CPU builds always run; they're never marked Override (running CPU on a
+    // GPU node is the headline use case the user is choosing intentionally).
+    return 'compatible'
+  }
+  if (t.includes('nvidia') || t.includes('cuda') || t.includes('l4t')) {
+    return vendor === 'nvidia' ? 'compatible' : 'override'
+  }
+  if (t.includes('amd') || t.includes('rocm') || t.includes('hip')) {
+    return vendor === 'amd' ? 'compatible' : 'override'
+  }
+  if (t.includes('intel') || t.includes('sycl')) {
+    return vendor === 'intel' ? 'compatible' : 'override'
+  }
+  if (t.includes('metal') || t.includes('darwin')) {
+    // No vendor reporting for Metal; trust the user.
+    return 'compatible'
+  }
+  return 'compatible'
+}
+
+export default function NodeInstallPicker({
+  open, onClose, onComplete,
+  backend,
+  nodes = [],
+  allBackends = [],
+  installedNodeIds = [],
+  initialSelection,
+  addToast,
+}) {
+  const [search, setSearch] = useState('')
+  const [showHealthy, setShowHealthy] = useState(true)
+  const [showDraining, setShowDraining] = useState(false)
+  const [selected, setSelected] = useState(() => new Set())
+  const [overrideVariant, setOverrideVariant] = useState('') // chosen concrete name
+  const [overrideExpanded, setOverrideExpanded] = useState(false)
+  const [submitting, setSubmitting] = useState(false)
+  const [showMismatchConfirm, setShowMismatchConfirm] = useState(false)
+  // Per-node submission state: { [nodeId]: { status: 'pending'|'installing'|'done'|'error', error?, version? } }
+  const [perNode, setPerNode] = useState({})
+  const headerInputRef = useRef(null)
+
+  // Backend-derived metadata used throughout the picker.
+  const hardwareTarget = useMemo(() => hardwareTargetOf(backend, allBackends), [backend, allBackends])
+  const targetLabel = humanTargetLabel(hardwareTarget)
+  const concreteVariants = useMemo(() => {
+    if (!backend?.isMeta || !backend.capabilities) return []
+    return Object.entries(backend.capabilities).map(([cap, concrete]) => ({
+      value: concrete,
+      label: `${concrete}  ·  ${cap}`,
+    }))
+  }, [backend])
+
+  // Pending nodes are surgically removed from the list — they can't accept
+  // installs until approved. Surface the count instead of dead-disabled rows.
+  const pendingCount = nodes.filter(n => n.status === 'pending').length
+  const backendNodes = nodes.filter(n =>
+    (!n.node_type || n.node_type === 'backend') && n.status !== 'pending'
+  )
+
+  const installedSet = useMemo(() => {
+    // Accept either an array or a Set-like of node IDs; both expose forEach.
+    const s = new Set()
+    if (Array.isArray(installedNodeIds) || typeof installedNodeIds?.has === 'function') {
+      installedNodeIds.forEach(id => s.add(id))
+    }
+    return s
+  }, [installedNodeIds])
+
+  const filteredNodes = useMemo(() => {
+    let list = backendNodes
+    if (!showHealthy) list = list.filter(n => n.status !== 'healthy')
+    if (!showDraining) list = list.filter(n => n.status !== 'draining')
+    if (search.trim()) {
+      const q = search.toLowerCase()
+      list = list.filter(n =>
+        (n.name || '').toLowerCase().includes(q) ||
+        Object.entries(n.labels || {}).some(([k, v]) => `${k}=${v}`.toLowerCase().includes(q))
+      )
+    }
+    return list
+  }, [backendNodes, showHealthy, showDraining, search])
+
+  // Pre-seed selection on open. Reset all transient state so reopening
+  // doesn't surface ghost progress from the prior submit.
+  useEffect(() => {
+    if (!open) return
+    const initial = new Set()
+    if (Array.isArray(initialSelection)) initialSelection.forEach(id => initial.add(id))
+    setSelected(initial)
+    setSearch('')
+    setOverrideVariant('')
+    setOverrideExpanded(false)
+    setPerNode({})
+    setSubmitting(false)
+    setShowMismatchConfirm(false)
+  }, [open, initialSelection])
+
+  // Auto-expand the variant override disclosure when at least one selected
+  // node lacks a working GPU. This is the headline use case the feature
+  // exists for, so surface it instead of hiding it behind a click.
+  useEffect(() => {
+    if (!backend?.isMeta) return
+    const someGPUMissing = Array.from(selected).some(id => {
+      const n = backendNodes.find(x => x.id === id)
+      return n && (!n.gpu_vendor || n.gpu_vendor === 'unknown')
+    })
+    if (someGPUMissing && !overrideExpanded) setOverrideExpanded(true)
+  }, [selected, backend, backendNodes]) // eslint-disable-line react-hooks/exhaustive-deps
+
+  // The effective backend that gets installed on each node. For
+  // hardware-specific backends this is just backend.name. For meta backends
+  // with no override, the worker picks per-node — we pass backend.name and
+  // the worker resolves. With an override set, the picker installs that
+  // exact concrete variant on every selected node.
+  const effectiveBackendName = overrideVariant || backend?.name
+
+  const counts = useMemo(() => {
+    let already = 0, overrides = 0
+    selected.forEach(id => {
+      const n = backendNodes.find(x => x.id === id)
+      if (!n) return
+      if (installedSet.has(id)) { already++; return }
+      const eff = overrideVariant
+        ? { name: overrideVariant, isMeta: false, metaBackendFor: backend?.name }
+        : backend
+      const target = overrideVariant ? hardwareTargetOf(eff, allBackends) : hardwareTarget
+      const s = suitabilityFor({ node: n, backend: eff, hardwareTarget: target, alreadyInstalled: false })
+      if (s === 'override') overrides++
+    })
+    return { already, overrides, selected: selected.size }
+  }, [selected, backendNodes, installedSet, overrideVariant, backend, hardwareTarget, allBackends])
+
+  const toggle = (nodeId) => {
+    setSelected(prev => {
+      const next = new Set(prev)
+      next.has(nodeId) ? next.delete(nodeId) : next.add(nodeId)
+      return next
+    })
+  }
+
+  const selectAllHealthy = () => {
+    setSelected(new Set(filteredNodes.filter(n => n.status === 'healthy').map(n => n.id)))
+  }
+  const selectCompatible = () => {
+    const eff = overrideVariant
+      ? { name: overrideVariant, isMeta: false, metaBackendFor: backend?.name }
+      : backend
+    const target = overrideVariant ? hardwareTargetOf(eff, allBackends) : hardwareTarget
+    setSelected(new Set(
+      filteredNodes
+        .filter(n => suitabilityFor({ node: n, backend: eff, hardwareTarget: target, alreadyInstalled: false }) === 'compatible')
+        .map(n => n.id)
+    ))
+  }
+  const clearSelection = () => setSelected(new Set())
+
+  const submit = async () => {
+    if (selected.size === 0 || submitting) return
+    if (counts.overrides > 0 && !showMismatchConfirm) {
+      setShowMismatchConfirm(true)
+      return
+    }
+    setShowMismatchConfirm(false)
+    setSubmitting(true)
+    const ids = Array.from(selected)
+    setPerNode(prev => {
+      const next = { ...prev }
+      ids.forEach(id => { next[id] = { status: 'installing' } })
+      return next
+    })
+
+    const results = await Promise.allSettled(ids.map(id =>
+      nodesApi.installBackend(id, effectiveBackendName)
+        .then(r => ({ id, ok: true, message: r?.message }))
+        .catch(err => ({ id, ok: false, error: err?.message || 'install failed' }))
+    ))
+
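+    // Tally outcomes outside the state updater. React may defer an updater or
+    // invoke it twice (StrictMode), so counters mutated inside are unreliable.
+    const outcomes = results
+      .filter(r => r.status === 'fulfilled')
+      .map(r => r.value)
+    const successCount = outcomes.filter(v => v.ok).length
+    const failCount = outcomes.length - successCount
+    setPerNode(prev => {
+      const next = { ...prev }
+      for (const v of outcomes) {
+        next[v.id] = v.ok
+          ? { status: 'done' }
+          : { status: 'error', error: v.error }
+      }
+      return next
+    })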
+    setSubmitting(false)
+
+    if (successCount > 0 && onComplete) onComplete()
+
+    if (failCount === 0) {
+      addToast?.(`Installed on ${successCount} node${successCount === 1 ? '' : 's'}`, 'success')
+      setTimeout(() => onClose?.(), 800)
+    } else if (successCount === 0) {
+      addToast?.(`Install failed on all ${failCount} node${failCount === 1 ? '' : 's'}`, 'error')
+    } else {
+      addToast?.(`Installed on ${successCount}, failed on ${failCount}`, 'warning')
+    }
+  }
+
+  const retryFailed = async () => {
+    const failedIds = Object.entries(perNode)
+      .filter(([, v]) => v.status === 'error')
+      .map(([id]) => id)
+    if (failedIds.length === 0) return
+    setSelected(new Set(failedIds))
+    // Replace state for failed rows so they show "installing" again, not stale errors.
+    setPerNode(prev => {
+      const next = { ...prev }
+      failedIds.forEach(id => { next[id] = { status: 'installing' } })
+      return next
+    })
+    setSubmitting(true)
+    const results = await Promise.allSettled(failedIds.map(id =>
+      nodesApi.installBackend(id, effectiveBackendName)
+        .then(r => ({ id, ok: true, message: r?.message }))
+        .catch(err => ({ id, ok: false, error: err?.message || 'install failed' }))
+    ))
+    // Tally outside the updater: React may defer or double-invoke updaters.
+    const outcomes = results.filter(r => r.status === 'fulfilled').map(r => r.value)
+    const successCount = outcomes.filter(v => v.ok).length
+    const failCount = outcomes.length - successCount
+    setPerNode(prev => {
+      const next = { ...prev }
+      for (const v of outcomes) {
+        next[v.id] = v.ok ? { status: 'done' } : { status: 'error', error: v.error }
+      }
+      return next
+    })
+    setSubmitting(false)
+    if (successCount > 0 && onComplete) onComplete()
+    if (failCount === 0) {
+      addToast?.(`Installed on ${successCount} node${successCount === 1 ? '' : 's'}`, 'success')
+      setTimeout(() => onClose?.(), 800)
+    }
+  }
+
+  const doneCount = Object.values(perNode).filter(v => v.status === 'done').length
+  const errorCount = Object.values(perNode).filter(v => v.status === 'error').length
+  const totalAttempted = Object.keys(perNode).length
+
+  if (!open || !backend) return null
+
+  const noNodes = backendNodes.length === 0
+
+  return (
+    <Modal onClose={onClose} maxWidth="780px">
+      <div style={{
+        padding: 'var(--spacing-md) var(--spacing-lg)',
+        borderBottom: '1px solid var(--color-border-subtle)',
+        display: 'flex',
+        alignItems: 'center',
+        justifyContent: 'space-between',
+        gap: 'var(--spacing-sm)',
+      }}>
+        <h2 style={{ margin: 0, fontSize: '1rem', display: 'flex', alignItems: 'center', gap: 'var(--spacing-sm)' }}>
+          <i className="fas fa-cog" style={{ color: 'var(--color-primary)' }} />
+          Install <span style={{ fontFamily: 'var(--font-mono)' }}>{backend.name}</span>
+          {backend.isMeta ? (
+            <span className="badge badge-info" style={{ fontSize: '0.6875rem' }}>Auto-resolving</span>
+          ) : (
+            <span className="badge badge-warning" style={{ fontSize: '0.6875rem' }}>Hardware-specific</span>
+          )}
+        </h2>
+        <button
+          type="button"
+          className="btn btn-ghost btn-sm"
+          onClick={onClose}
+          aria-label="Close"
+          style={{ fontSize: '1.125rem', lineHeight: 1, padding: '4px 10px' }}
+        >×</button>
+      </div>
+
+      <div style={{ padding: 'var(--spacing-md) var(--spacing-lg)' }}>
+        {!backend.isMeta && (
+          <div className="card" style={{
+            marginBottom: 'var(--spacing-md)',
+            padding: 'var(--spacing-sm) var(--spacing-md)',
+            background: 'var(--color-warning-light)',
+            border: '1px solid var(--color-warning-border)',
+            borderRadius: 'var(--radius-md)',
+            display: 'flex',
+            alignItems: 'center',
+            gap: 'var(--spacing-sm)',
+          }}>
+            <i className="fas fa-microchip" style={{ color: 'var(--color-warning)' }} />
+            <span style={{ color: 'var(--color-warning)', fontSize: '0.8125rem' }}>
+              {targetLabel}. Install only on nodes where you want this build to run.
+              {hardwareTarget && ` Targets: ${targetLabel.replace(' build', '')}.`}
+            </span>
+          </div>
+        )}
+
+        {noNodes ? (
+          <div className="empty-state" style={{ padding: 'var(--spacing-xl) 0' }}>
+            <div className="empty-state-icon"><i className="fas fa-server" /></div>
+            <h3 className="empty-state-title">No backend nodes available</h3>
+            <p className="empty-state-text">
+              Approve pending workers or register new ones.
+              {pendingCount > 0 && ` (${pendingCount} awaiting approval.)`}
+            </p>
+            <a className="btn btn-secondary btn-sm" href="/app/nodes">
+              <i className="fas fa-network-wired" /> Manage nodes
+            </a>
+          </div>
+        ) : (
+          <>
+            {/* Filter row */}
+            <div style={{ display: 'flex', gap: 'var(--spacing-sm)', alignItems: 'center', marginBottom: 'var(--spacing-sm)', flexWrap: 'wrap' }}>
+              <div className="search-bar" style={{ flex: 1, minWidth: 180 }}>
+                <i className="fas fa-search search-icon" />
+                <input
+                  ref={headerInputRef}
+                  className="input"
+                  placeholder="Filter nodes by name or label..."
+                  value={search}
+                  onChange={e => setSearch(e.target.value)}
+                />
+              </div>
+              <button className="btn btn-secondary btn-sm" onClick={selectAllHealthy} type="button">
+                Select all healthy
+              </button>
+              <button className="btn btn-secondary btn-sm" onClick={selectCompatible} type="button">
+                Select compatible nodes
+              </button>
+              {selected.size > 0 && (
+                <button className="btn btn-ghost btn-sm" onClick={clearSelection} type="button">
+                  Clear
+                </button>
+              )}
+            </div>
+
+            {/* Variant override (auto-resolving only) */}
+            {backend.isMeta && concreteVariants.length > 0 && (
+              <div style={{ marginBottom: 'var(--spacing-sm)' }}>
+                <button
+                  type="button"
+                  className="btn btn-ghost btn-sm"
+                  onClick={() => setOverrideExpanded(v => !v)}
+                  aria-expanded={overrideExpanded}
+                  style={{ padding: '4px 8px' }}
+                >
+                  <i className={`fas fa-chevron-${overrideExpanded ? 'down' : 'right'}`} style={{ marginRight: 4, fontSize: '0.625rem' }} />
+                  Override variant for selected nodes…
+                </button>
+                {overrideExpanded && (
+                  <div className="card" style={{ marginTop: 4, padding: 'var(--spacing-sm) var(--spacing-md)' }}>
+                    <p style={{ fontSize: '0.75rem', color: 'var(--color-text-secondary)', marginTop: 0, marginBottom: 'var(--spacing-xs)' }}>
+                      By default each node picks its own variant. Override to install one specific variant on every selected node — useful when GPU detection fails on a node and you want the CPU build there instead.
+                    </p>
+                    <SearchableSelect
+                      value={overrideVariant}
+                      onChange={setOverrideVariant}
+                      options={concreteVariants}
+                      placeholder="Per-node auto-resolve (default)"
+                      allOption={{ value: '', label: 'Per-node auto-resolve (default)' }}
+                    />
+                  </div>
+                )}
+              </div>
+            )}
+
+            {/* Node table */}
+            <div className="table-container" style={{ marginBottom: 'var(--spacing-sm)', maxHeight: '40vh', overflowY: 'auto' }}>
+              <table className="table" style={{ margin: 0 }}>
+                <thead>
+                  <tr>
+                    <th style={{ width: 28 }}>
+                      <input
+                        type="checkbox"
+                        aria-label="Select all visible"
+                        checked={filteredNodes.length > 0 && filteredNodes.every(n => selected.has(n.id))}
+                        onChange={(e) => {
+                          setSelected(prev => {
+                            const next = new Set(prev)
+                            if (e.target.checked) filteredNodes.forEach(n => next.add(n.id))
+                            else filteredNodes.forEach(n => next.delete(n.id))
+                            return next
+                          })
+                        }}
+                      />
+                    </th>
+                    <th>Node</th>
+                    <th>Status</th>
+                    <th>Hardware</th>
+                    <th>Suitability</th>
+                  </tr>
+                </thead>
+                <tbody>
+                  {filteredNodes.map(node => {
+                    const installed = installedSet.has(node.id)
+                    const eff = overrideVariant
+                      ? { name: overrideVariant, isMeta: false, metaBackendFor: backend.name }
+                      : backend
+                    const target = overrideVariant ? hardwareTargetOf(eff, allBackends) : hardwareTarget
+                    const suit = suitabilityFor({ node, backend: eff, hardwareTarget: target, alreadyInstalled: installed })
+                    const isSel = selected.has(node.id)
+                    const rowState = perNode[node.id]
+                    const vendor = gpuVendorLabel(node.gpu_vendor)
+                    const totalVRAM = formatVRAM(node.total_vram)
+                    const totalRAM = formatVRAM(node.total_ram)
+                    return (
+                      <tr key={node.id}>
+                        <td>
+                          <input
+                            type="checkbox"
+                            aria-label={`Select ${node.name}`}
+                            aria-disabled={rowState?.status === 'installing'}
+                            checked={isSel}
+                            onChange={() => toggle(node.id)}
+                          />
+                        </td>
+                        <td>
+                          <div style={{ display: 'flex', flexDirection: 'column', gap: 2 }}>
+                            <span style={{ fontWeight: 500, fontSize: '0.875rem' }}>{node.name}</span>
+                            {node.labels && Object.keys(node.labels).length > 0 && (
+                              <div style={{ display: 'flex', flexWrap: 'wrap', gap: 3 }}>
+                                {Object.entries(node.labels).slice(0, 3).map(([k, v]) => (
+                                  <span key={k} className="cell-mono" style={{
+                                    padding: '1px 5px', borderRadius: 'var(--radius-sm)', fontSize: '0.6875rem',
+                                    background: 'var(--color-bg-tertiary)', border: '1px solid var(--color-border-subtle)',
+                                  }}>{k}={v}</span>
+                                ))}
+                                {Object.keys(node.labels).length > 3 && (
+                                  <span className="cell-muted" style={{ fontSize: '0.6875rem' }}>
+                                    +{Object.keys(node.labels).length - 3}
+                                  </span>
+                                )}
+                              </div>
+                            )}
+                          </div>
+                        </td>
+                        <td>
+                          <span style={{ fontSize: '0.8125rem' }}>
+                            {STATUS_LABELS[node.status] || node.status}
+                          </span>
+                        </td>
+                        <td style={{ fontSize: '0.8125rem', fontFamily: 'var(--font-mono)', color: 'var(--color-text-secondary)' }}>
+                          {totalVRAM ? (
+                            <>{vendor && <span style={{ marginRight: 4 }}>{vendor}</span>}{totalVRAM}</>
+                          ) : totalRAM ? (
+                            <span>CPU · {totalRAM}</span>
+                          ) : <span className="cell-muted">—</span>}
+                        </td>
+                        <td>
+                          {rowState?.status === 'installing' ? (
+                            <span className="badge badge-info">
+                              <i className="fas fa-spinner fa-spin" style={{ marginRight: 4 }} />Installing
+                            </span>
+                          ) : rowState?.status === 'done' ? (
+                            <span className="badge badge-success">
+                              <i className="fas fa-check" style={{ marginRight: 4 }} />Installed
+                            </span>
+                          ) : rowState?.status === 'error' ? (
+                            <button
+                              type="button"
+                              className="badge badge-error"
+                              title={rowState.error}
+                              aria-describedby={`err-${node.id}`}
+                              style={{ border: 'none', cursor: 'help' }}
+                            >
+                              <i className="fas fa-exclamation-triangle" style={{ marginRight: 4 }} />Failed
+                              <span id={`err-${node.id}`} style={{ position: 'absolute', left: -9999 }}>{rowState.error}</span>
+                            </button>
+                          ) : suit === 'installed' ? (
+                            <span className="badge" style={{ background: 'var(--color-bg-tertiary)', color: 'var(--color-text-muted)' }}>
+                              Installed
+                            </span>
+                          ) : suit === 'override' ? (
+                            <span className="badge badge-warning">
+                              <i className="fas fa-exclamation-circle" style={{ marginRight: 4 }} />Override
+                            </span>
+                          ) : (
+                            <span className="badge badge-success" style={{ background: 'var(--color-success-light)', color: 'var(--color-success)' }}>
+                              Compatible
+                            </span>
+                          )}
+                        </td>
+                      </tr>
+                    )
+                  })}
+                  {filteredNodes.length === 0 && (
+                    <tr>
+                      <td colSpan={5} style={{ textAlign: 'center', padding: 'var(--spacing-md)', color: 'var(--color-text-muted)' }}>
+                        No nodes match the current filters.
+                      </td>
+                    </tr>
+                  )}
+                </tbody>
+              </table>
+            </div>
+
+            {pendingCount > 0 && (
+              <p style={{ fontSize: '0.75rem', color: 'var(--color-text-muted)', marginTop: 0, marginBottom: 'var(--spacing-sm)' }}>
+                +{pendingCount} awaiting approval — <a href="/app/nodes" style={{ color: 'var(--color-primary)' }}>approve from Nodes</a>.
+              </p>
+            )}
+
+            {/* Mismatch confirm */}
+            {showMismatchConfirm && (
+              <div className="card" style={{
+                marginBottom: 'var(--spacing-sm)',
+                padding: 'var(--spacing-md)',
+                background: 'var(--color-warning-light)',
+                border: '1px solid var(--color-warning-border)',
+                borderRadius: 'var(--radius-md)',
+              }}>
+                <p style={{ marginTop: 0, marginBottom: 'var(--spacing-sm)', color: 'var(--color-warning)', fontSize: '0.875rem' }}>
+                  Installing {targetLabel.toLowerCase()} on {counts.overrides} node{counts.overrides === 1 ? '' : 's'} that don't match. Those nodes will run inference on the chosen build, not their native GPU. Continue?
+                </p>
+                <div style={{ display: 'flex', gap: 'var(--spacing-sm)', justifyContent: 'flex-end' }}>
+                  <button className="btn btn-secondary btn-sm" type="button" onClick={() => setShowMismatchConfirm(false)}>
+                    Cancel
+                  </button>
+                  <button className="btn btn-primary btn-sm" type="button" onClick={submit}
+                    style={{ background: 'var(--color-warning)', borderColor: 'var(--color-warning)' }}>
+                    Install on {targetLabel.replace(' build', '')}
+                  </button>
+                </div>
+              </div>
+            )}
+          </>
+        )}
+      </div>
+
+      {!noNodes && (
+        <div style={{
+          padding: 'var(--spacing-md) var(--spacing-lg)',
+          borderTop: '1px solid var(--color-border-subtle)',
+          display: 'flex',
+          alignItems: 'center',
+          gap: 'var(--spacing-sm)',
+          flexWrap: 'wrap',
+        }}>
+          <div style={{ flex: 1, fontSize: '0.8125rem', color: 'var(--color-text-secondary)' }}>
+            {totalAttempted > 0 ? (
+              <>
+                {doneCount} of {totalAttempted} done
+                {errorCount > 0 && (
+                  <> · <span className="badge badge-error" style={{ fontSize: '0.6875rem' }}>{errorCount} failed</span></>
+                )}
+              </>
+            ) : (
+              <>
+                {counts.selected} {counts.selected === 1 ? 'node' : 'nodes'} selected
+                {counts.already > 0 && <> · {counts.already} already installed</>}
+                {counts.overrides > 0 && <> · {counts.overrides} override{counts.overrides === 1 ? '' : 's'}</>}
+              </>
+            )}
+          </div>
+          {errorCount > 0 && !submitting && (
+            <button className="btn btn-secondary btn-sm" type="button" onClick={retryFailed}>
+              <i className="fas fa-redo" /> Retry failed nodes
+            </button>
+          )}
+          <button className="btn btn-secondary btn-sm" type="button" onClick={onClose} disabled={submitting}>
+            {totalAttempted > 0 && doneCount > 0 ? 'Close' : 'Cancel'}
+          </button>
+          <button
+            className="btn btn-primary btn-sm"
+            type="button"
+            onClick={submit}
+            disabled={submitting || counts.selected === 0 || showMismatchConfirm}
+          >
+            {submitting ? (
+              <><i className="fas fa-spinner fa-spin" /> Installing…</>
+            ) : (
+              <>Install on {counts.selected} {counts.selected === 1 ? 'node' : 'nodes'}</>
+            )}
+          </button>
+        </div>
+      )}
+    </Modal>
+  )
+}
--- a/core/http/react-ui/src/components/ResourceActions.jsx
+++ b/core/http/react-ui/src/components/ResourceActions.jsx
@@ -0,0 +1,29 @@
+// ResourceActions groups row-level buttons into a lifecycle cluster (start,
+// stop, pin, reinstall, upgrade) and a destructive cluster (delete) with a
+// thin divider between them, so a destructive intent visually separates from
+// a routine one. Replaces the old 4px-gap row of buttons in the Manage page
+// where Stop / Pin / Delete sat shoulder-to-shoulder with no visual cue
+// telling apart "click to fiddle" from "click to throw away".
+//
+// `lifecycle` and `destructive` accept any ReactNode — typically one or more
+// <button>s. The wrapping div stops click propagation so action clicks don't
+// also expand the row.
+export default function ResourceActions({ lifecycle, destructive }) {
+  const hasLifecycle = !!lifecycle
+  const hasDestructive = !!destructive
+  if (!hasLifecycle && !hasDestructive) return null
+
+  return (
+    <div className="resource-actions" onClick={e => e.stopPropagation()}>
+      {hasLifecycle && (
+        <div className="resource-actions__group">{lifecycle}</div>
+      )}
+      {hasLifecycle && hasDestructive && (
+        <span className="resource-actions__divider" aria-hidden="true" />
+      )}
+      {hasDestructive && (
+        <div className="resource-actions__group">{destructive}</div>
+      )}
+    </div>
+  )
+}
--- a/core/http/react-ui/src/components/ResourceMonitor.jsx
+++ b/core/http/react-ui/src/components/ResourceMonitor.jsx
@@ -51,7 +51,7 @@ export default function ResourceMonitor() {
                  <div className="resource-bar-container" style={{ flex: 1 }}>
                    <div className="resource-bar" style={{ width: `${pct}%`, background: color }} />
                  </div>
-                  <span style={{ fontSize: '0.8125rem', fontWeight: 600, fontFamily: "'JetBrains Mono', monospace", color, minWidth: '3em', textAlign: 'right' }}>
+                  <span style={{ fontSize: '0.8125rem', fontWeight: 600, fontFamily: 'var(--font-mono)', color, minWidth: '3em', textAlign: 'right' }}>
                    {pct.toFixed(0)}%
                  </span>
                </div>
@@ -76,7 +76,7 @@ export default function ResourceMonitor() {
            <div className="resource-bar-container" style={{ flex: 1 }}>
              <div className="resource-bar" style={{ width: `${ram.usage_percent || 0}%`, background: percentColor(ram.usage_percent || 0) }} />
            </div>
-            <span style={{ fontSize: '0.8125rem', fontWeight: 600, fontFamily: "'JetBrains Mono', monospace", color: percentColor(ram.usage_percent || 0), minWidth: '3em', textAlign: 'right' }}>
+            <span style={{ fontSize: '0.8125rem', fontWeight: 600, fontFamily: 'var(--font-mono)', color: percentColor(ram.usage_percent || 0), minWidth: '3em', textAlign: 'right' }}>
              {(ram.usage_percent || 0).toFixed(0)}%
            </span>
          </div>
@@ -91,7 +91,7 @@ export default function ResourceMonitor() {
      {isGpu && aggregate.gpu_count > 1 && (
        <div style={{ fontSize: '0.75rem', color: 'var(--color-text-secondary)', marginTop: 'var(--spacing-sm)', display: 'flex', justifyContent: 'space-between' }}>
          <span>Total VRAM</span>
-          <span style={{ fontFamily: "'JetBrains Mono', monospace" }}>
+          <span style={{ fontFamily: 'var(--font-mono)' }}>
            {formatBytes(aggregate.used_memory)} / {formatBytes(aggregate.total_memory)} ({aggregate.usage_percent?.toFixed(1)}%)
          </span>
        </div>
@@ -101,7 +101,7 @@ export default function ResourceMonitor() {
      {resources.storage_size != null && (
        <div style={{ fontSize: '0.75rem', color: 'var(--color-text-secondary)', marginTop: 'var(--spacing-sm)', display: 'flex', justifyContent: 'space-between' }}>
          <span>Models storage</span>
-          <span style={{ fontFamily: "'JetBrains Mono', monospace", color: 'var(--color-text-primary)' }}>
+          <span style={{ fontFamily: 'var(--font-mono)', color: 'var(--color-text-primary)' }}>
            {formatBytes(resources.storage_size)}
          </span>
        </div>
--- a/core/http/react-ui/src/components/ResourceRow.jsx
+++ b/core/http/react-ui/src/components/ResourceRow.jsx
@@ -0,0 +1,81 @@
+import { Fragment } from 'react'
+
+// ResourceRow renders the visible row + its conditional detail row as a pair
+// of <tr>s, so the existing .table styling keeps applying and the Manage page
+// can re-use the gallery's expand-to-detail interaction without inventing a
+// new table system. The consumer owns the cells (which pass through as
+// children) — this component only manages the click-to-expand handler, the
+// dimmed state for disabled rows, and the colSpan'd detail row beneath.
+//
+// `onToggleExpand` fires on row click only. Buttons / toggles inside cells
+// must call e.stopPropagation() (or be wrapped in an .actions-stop wrapper)
+// to avoid double-triggering the expand.
+export default function ResourceRow({
+  expanded,
+  onToggleExpand,
+  detail,
+  colSpan,
+  dimmed,
+  className = '',
+  children,
+}) {
+  return (
+    <Fragment>
+      <tr
+        className={`resource-row${dimmed ? ' is-dimmed' : ''}${expanded ? ' is-expanded' : ''} ${className}`.trim()}
+        onClick={onToggleExpand}
+        style={{ cursor: onToggleExpand ? 'pointer' : 'default' }}
+      >
+        {children}
+      </tr>
+      {expanded && detail && (
+        <tr className="resource-row__detail-row">
+          <td colSpan={colSpan} className="resource-row__detail-cell">
+            {detail}
+          </td>
+        </tr>
+      )}
+    </Fragment>
+  )
+}
+
+// ChevronCell is the small rotating chevron used as the leftmost cell of an
+// expandable row. Mirrors the Nodes/Models/Backends gallery affordance so
+// users see the same "click to expand" cue everywhere.
+export function ChevronCell({ expanded }) {
+  return (
+    <td className="resource-row__chevron-cell">
+      <span className={`row-chevron${expanded ? ' is-expanded' : ''}`} aria-hidden="true">
+        <i className="fas fa-chevron-right" />
+      </span>
+    </td>
+  )
+}
+
+// IconCell renders the 48px brand icon shell — the same one the Install
+// gallery uses. `icon` is the image URL (from gallery metadata); when absent
+// or broken we fall back to a FontAwesome glyph so custom-imported items
+// still get a placeholder instead of an empty square.
+export function IconCell({ icon, fallback = 'fa-cube', alt = '' }) {
+  return (
+    <td className="resource-row__icon-cell">
+      <div className="resource-row__icon">
+        {icon ? (
+          <img src={icon} alt={alt} loading="lazy" />
+        ) : (
+          <i className={`fas ${fallback}`} />
+        )}
+      </div>
+    </td>
+  )
+}
+
+// StopPropagationCell wraps cell contents that contain interactive controls
+// (Toggle, action buttons) so a click on them doesn't also expand the row.
+export function StopPropagationCell({ children, ...props }) {
+  return (
+    <td {...props} onClick={e => e.stopPropagation()}>
+      {children}
+    </td>
+  )
+}
--- a/core/http/react-ui/src/components/SearchableSelect.jsx
+++ b/core/http/react-ui/src/components/SearchableSelect.jsx
@@ -116,7 +116,7 @@ export default function SearchableSelect({
        aria-expanded={open}
        onClick={() => { if (!disabled) { setOpen(!open); setQuery(''); setFocusIndex(-1) } }}
        style={{
-          width: '100%', padding: '4px 8px', fontSize: '0.8125rem',
+          width: '100%', padding: 'var(--spacing-xs) var(--spacing-sm)', fontSize: '0.8125rem',
          cursor: disabled ? 'not-allowed' : 'pointer',
          display: 'flex', alignItems: 'center', gap: '6px',
          background: 'var(--color-bg-primary)', border: '1px solid var(--color-border)',
@@ -145,7 +145,7 @@ export default function SearchableSelect({
              value={query}
              onChange={(e) => { setQuery(e.target.value); setFocusIndex(-1) }}
              onKeyDown={handleKeyDown}
-              style={{ width: '100%', padding: '4px 8px', fontSize: '0.8125rem' }}
+              style={{ width: '100%', padding: 'var(--spacing-xs) var(--spacing-sm)', fontSize: '0.8125rem' }}
            />
          </div>
          <div ref={listRef} role="listbox" style={{ overflowY: 'auto', maxHeight: 'min(200px, 50vh)' }}>
--- a/core/http/react-ui/src/components/Sidebar.jsx
+++ b/core/http/react-ui/src/components/Sidebar.jsx
@@ -1,4 +1,4 @@
-import { useState, useEffect } from 'react'
+import { useState, useEffect, useRef } from 'react'
 import { NavLink, useNavigate, useLocation } from 'react-router-dom'
 import ThemeToggle from './ThemeToggle'
 import { useAuth } from '../context/AuthContext'
@@ -24,6 +24,18 @@ const sections = [
      { path: '/app/quantize', icon: 'fas fa-compress', label: 'Quantize (Experimental)', feature: 'quantization' },
    ],
  },
+  {
+    id: 'biometrics',
+    title: 'Biometrics',
+    featureMap: {
+      '/app/face': 'face_recognition',
+      '/app/voice': 'voice_recognition',
+    },
+    items: [
+      { path: '/app/face', icon: 'fas fa-face-smile', label: 'Face Recognition', feature: 'face_recognition' },
+      { path: '/app/voice', icon: 'fas fa-microphone-lines', label: 'Voice Recognition', feature: 'voice_recognition' },
+    ],
+  },
  {
    id: 'agents',
    title: 'Agents',
@@ -95,11 +107,22 @@ export default function Sidebar({ isOpen, onClose }) {
  const { isAdmin, authEnabled, user, logout, hasFeature } = useAuth()
  const navigate = useNavigate()
  const location = useLocation()
+  const closeBtnRef = useRef(null)

  useEffect(() => {
    fetch(apiUrl('/api/features')).then(r => r.json()).then(setFeatures).catch(() => {})
  }, [])

+  // Move focus into the drawer when opened on mobile/tablet so keyboard
+  // and screen-reader users land inside the dialog. Targeting the close
+  // button avoids hijacking the visual focus to a nav item the user may
+  // not have meant to activate.
+  useEffect(() => {
+    if (!isOpen) return
+    const id = window.requestAnimationFrame(() => closeBtnRef.current?.focus())
+    return () => window.cancelAnimationFrame(id)
+  }, [isOpen])
+
  // Auto-expand section containing the active route
  useEffect(() => {
    for (const section of sections) {
@@ -156,7 +179,11 @@ export default function Sidebar({ isOpen, onClose }) {
    <>
      {isOpen && <div className="sidebar-overlay" onClick={onClose} />}

-      <aside className={`sidebar ${isOpen ? 'open' : ''} ${collapsed ? 'collapsed' : ''}`}>
+      <aside
+        id="app-sidebar"
+        className={`sidebar ${isOpen ? 'open' : ''} ${collapsed ? 'collapsed' : ''}`}
+        aria-label="Primary navigation"
+      >
        {/* Logo */}
        <div className="sidebar-header">
          <a href="./" className="sidebar-logo-link">
@@ -165,8 +192,13 @@ export default function Sidebar({ isOpen, onClose }) {
          <a href="./" className="sidebar-logo-icon" title="LocalAI">
            <img src={apiUrl('/static/logo.png')} alt="LocalAI" className="sidebar-logo-icon-img" />
          </a>
-          <button className="sidebar-close-btn" onClick={onClose} aria-label="Close menu">
-            <i className="fas fa-times" />
+          <button
+            ref={closeBtnRef}
+            className="sidebar-close-btn"
+            onClick={onClose}
+            aria-label="Close menu"
+          >
+            <i className="fas fa-times" aria-hidden="true" />
          </button>
        </div>

--- a/core/http/react-ui/src/components/StatCard.jsx
+++ b/core/http/react-ui/src/components/StatCard.jsx
@@ -0,0 +1,39 @@
+// StatCard renders a single cluster/dashboard metric card. The left accent
+// bar + icon chip color is driven by `accentVar` (a CSS custom property name,
+// e.g. "--color-success") so the card reads as semantic without the caller
+// having to reach into colors directly. `onClick` upgrades the card to a
+// keyboard-focusable button — used by the Manage page so cards double as
+// shortcuts to the relevant tab + filter.
+export default function StatCard({ icon, label, value, color, accentVar, onClick }) {
+  const accent = color || (accentVar ? `var(${accentVar})` : 'var(--color-text-primary)')
+  const interactive = typeof onClick === 'function'
+
+  const handleKeyDown = interactive
+    ? (e) => {
+        if (e.key === 'Enter' || e.key === ' ') {
+          e.preventDefault()
+          onClick(e)
+        }
+      }
+    : undefined
+
+  return (
+    <div
+      className="stat-card"
+      data-clickable={interactive ? 'true' : undefined}
+      role={interactive ? 'button' : undefined}
+      tabIndex={interactive ? 0 : undefined}
+      onClick={interactive ? onClick : undefined}
+      onKeyDown={handleKeyDown}
+      style={accentVar ? { ['--stat-accent']: `var(${accentVar})` } : undefined}
+    >
+      <div className="stat-card__body">
+        <div className="stat-card__label">{label}</div>
+        <div className="stat-card__value" style={{ color: accent }}>{value}</div>
+      </div>
+      <div className="stat-card__icon" style={accentVar ? { color: accent } : undefined}>
+        <i className={icon} />
+      </div>
+    </div>
+  )
+}
--- a/core/http/react-ui/src/components/TemplateSelector.jsx
+++ b/core/http/react-ui/src/components/TemplateSelector.jsx
@@ -24,7 +24,7 @@ export default function TemplateSelector({ onSelect }) {
            <p style={{ fontSize: '0.8125rem', color: 'var(--color-text-secondary)', lineHeight: 1.5, margin: 0 }}>
              {t.description}
            </p>
-            <div style={{ display: 'flex', flexWrap: 'wrap', gap: '4px', marginTop: 'var(--spacing-xs)' }}>
+            <div style={{ display: 'flex', flexWrap: 'wrap', gap: 'var(--spacing-xs)', marginTop: 'var(--spacing-xs)' }}>
              {Object.keys(t.fields).filter(k => k !== 'name').map(k => (
                <span key={k} className="badge" style={{
                  fontSize: '0.6875rem', background: 'var(--color-bg-tertiary)',
--- a/core/http/react-ui/src/components/UnifiedMCPDropdown.jsx
+++ b/core/http/react-ui/src/components/UnifiedMCPDropdown.jsx
@@ -187,7 +187,7 @@ export default function UnifiedMCPDropdown({
                    placeholder="Server URL (e.g. https://mcp.example.com/sse)"
                    value={url}
                    onChange={e => setUrl(e.target.value)}
-                    style={{ width: '100%', marginBottom: '4px' }}
+                    style={{ width: '100%', marginBottom: 'var(--spacing-xs)' }}
                  />
                  <input
                    type="text"
@@ -195,7 +195,7 @@ export default function UnifiedMCPDropdown({
                    placeholder="Name (optional)"
                    value={name}
                    onChange={e => setName(e.target.value)}
-                    style={{ width: '100%', marginBottom: '4px' }}
+                    style={{ width: '100%', marginBottom: 'var(--spacing-xs)' }}
                  />
                  <input
                    type="password"
@@ -203,13 +203,13 @@ export default function UnifiedMCPDropdown({
                    placeholder="Auth token (optional)"
                    value={authToken}
                    onChange={e => setAuthToken(e.target.value)}
-                    style={{ width: '100%', marginBottom: '4px' }}
+                    style={{ width: '100%', marginBottom: 'var(--spacing-xs)' }}
                  />
                  <label style={{ display: 'flex', alignItems: 'center', gap: '6px', fontSize: '0.8rem', marginBottom: '6px' }}>
                    <input type="checkbox" checked={useProxy} onChange={e => setUseProxy(e.target.checked)} />
                    Use CORS proxy
                  </label>
-                  <div style={{ display: 'flex', gap: '4px', justifyContent: 'flex-end' }}>
+                  <div style={{ display: 'flex', gap: 'var(--spacing-xs)', justifyContent: 'flex-end' }}>
                    <button type="button" className="btn btn-sm btn-secondary" onClick={() => setAddDialog(false)}>Cancel</button>
                    <button type="button" className="btn btn-sm btn-primary" onClick={handleAddClient} disabled={!url.trim()}>Add</button>
                  </div>
--- a/core/http/react-ui/src/components/biometrics/BoundingBoxCanvas.jsx
+++ b/core/http/react-ui/src/components/biometrics/BoundingBoxCanvas.jsx
@@ -0,0 +1,63 @@
+import { useEffect, useRef, useState } from 'react'
+
+// BoundingBoxCanvas — overlay face-detection rectangles on the user-supplied image.
+// boxes: [{ x, y, w, h, label?, sublabel?, tone? }]
+// tone: 'default' | 'success' | 'warning' | 'error' | 'accent'
+export default function BoundingBoxCanvas({ src, boxes = [], alt = '' }) {
+  const wrapRef = useRef(null)
+  const imgRef = useRef(null)
+  const [dims, setDims] = useState({ w: 0, h: 0, natW: 0, natH: 0 })
+
+  useEffect(() => {
+    const update = () => {
+      if (!wrapRef.current || !imgRef.current) return
+      const rect = imgRef.current.getBoundingClientRect()
+      setDims({
+        w: rect.width,
+        h: rect.height,
+        natW: imgRef.current.naturalWidth || 1,
+        natH: imgRef.current.naturalHeight || 1,
+      })
+    }
+    update()
+    const ro = new ResizeObserver(update)
+    if (imgRef.current) ro.observe(imgRef.current)
+    window.addEventListener('resize', update)
+    return () => {
+      ro.disconnect()
+      window.removeEventListener('resize', update)
+    }
+  }, [src])
+
+  const sx = dims.natW ? dims.w / dims.natW : 1
+  const sy = dims.natH ? dims.h / dims.natH : 1
+
+  return (
+    <div ref={wrapRef} className="biometrics-bbox">
+      {src && <img ref={imgRef} src={src} alt={alt} onLoad={(e) => {
+        setDims({
+          w: e.target.getBoundingClientRect().width,
+          h: e.target.getBoundingClientRect().height,
+          natW: e.target.naturalWidth,
+          natH: e.target.naturalHeight,
+        })
+      }} />}
+      {boxes.map((b, i) => (
+        <div key={i} className={`biometrics-bbox__box tone-${b.tone || 'accent'}`}
+          style={{
+            left: `${b.x * sx}px`,
+            top: `${b.y * sy}px`,
+            width: `${b.w * sx}px`,
+            height: `${b.h * sy}px`,
+          }}>
+          {(b.label || b.sublabel) && (
+            <div className="biometrics-bbox__tag">
+              {b.label && <strong>{b.label}</strong>}
+              {b.sublabel && <span>{b.sublabel}</span>}
+            </div>
+          )}
+        </div>
+      ))}
+    </div>
+  )
+}
--- a/core/http/react-ui/src/components/biometrics/DistributionBars.jsx
+++ b/core/http/react-ui/src/components/biometrics/DistributionBars.jsx
@@ -0,0 +1,33 @@
+// DistributionBars — one horizontal bar per label, width proportional to value.
+// distribution: Record<string, number> (values are probabilities 0..1 or any positive scale).
+// dominant: string — highlighted row.
+export default function DistributionBars({ title, distribution, dominant, icon }) {
+  if (!distribution || Object.keys(distribution).length === 0) return null
+  const entries = Object.entries(distribution).sort((a, b) => b[1] - a[1])
+  const max = entries.reduce((m, [, v]) => Math.max(m, v), 0) || 1
+
+  return (
+    <div className="biometrics-dist card">
+      <div className="biometrics-dist__head">
+        {icon && <i className={icon} aria-hidden="true" />}
+        <h3>{title}</h3>
+        {dominant && <span className="biometrics-dist__dominant">{dominant}</span>}
+      </div>
+      <ul className="biometrics-dist__rows">
+        {entries.map(([label, value]) => {
+          const pct = (value / max) * 100
+          const isDominant = label === dominant
+          return (
+            <li key={label} className={`biometrics-dist__row ${isDominant ? 'dominant' : ''}`}>
+              <span className="biometrics-dist__label">{label}</span>
+              <div className="biometrics-dist__bar-wrap" aria-hidden="true">
+                <div className="biometrics-dist__bar" style={{ width: `${pct}%` }} />
+              </div>
+              <span className="biometrics-dist__value">{(value * 100).toFixed(1)}%</span>
+            </li>
+          )
+        })}
+      </ul>
+    </div>
+  )
+}
--- a/core/http/react-ui/src/components/biometrics/EmbeddingInspector.jsx
+++ b/core/http/react-ui/src/components/biometrics/EmbeddingInspector.jsx
@@ -0,0 +1,89 @@
+import { useMemo, useRef, useEffect, useState } from 'react'
+
+// EmbeddingInspector — compact visualization of a raw vector returned by /v1/face|voice/embed.
+// embedding: number[] (can be large). dim: int. model: string.
+export default function EmbeddingInspector({ embedding, dim, model, elapsedMs }) {
+  const canvasRef = useRef(null)
+  const [copied, setCopied] = useState(false)
+
+  const summary = useMemo(() => {
+    if (!embedding || !embedding.length) return null
+    let sum = 0, sumSq = 0, min = Infinity, max = -Infinity
+    for (const v of embedding) {
+      sum += v
+      sumSq += v * v
+      if (v < min) min = v
+      if (v > max) max = v
+    }
+    const mean = sum / embedding.length
+    const norm = Math.sqrt(sumSq)
+    return { mean, norm, min, max }
+  }, [embedding])
+
+  useEffect(() => {
+    if (!canvasRef.current || !embedding?.length) return
+    const canvas = canvasRef.current
+    const dpr = window.devicePixelRatio || 1
+    const cssW = canvas.clientWidth
+    const cssH = 60
+    canvas.width = Math.floor(cssW * dpr)
+    canvas.height = Math.floor(cssH * dpr)
+    const ctx = canvas.getContext('2d')
+    ctx.scale(dpr, dpr)
+    ctx.clearRect(0, 0, cssW, cssH)
+
+    const COUNT = Math.min(embedding.length, 128)
+    const values = embedding.slice(0, COUNT)
+    const max = Math.max(...values.map(Math.abs)) || 1
+    const mid = cssH / 2
+    const barW = cssW / COUNT
+    const accent = getComputedStyle(canvas).getPropertyValue('--color-accent').trim() || '#e8a87c'
+    const accentMuted = getComputedStyle(canvas).getPropertyValue('--color-text-muted').trim() || '#6c7084'
+    ctx.strokeStyle = accentMuted
+    ctx.beginPath()
+    ctx.moveTo(0, mid + 0.5)
+    ctx.lineTo(cssW, mid + 0.5)
+    ctx.stroke()
+    ctx.fillStyle = accent
+    for (let i = 0; i < COUNT; i++) {
+      const v = values[i]
+      const h = (Math.abs(v) / max) * (cssH * 0.45)
+      if (v >= 0) ctx.fillRect(i * barW, mid - h, Math.max(0.5, barW - 0.5), h)
+      else ctx.fillRect(i * barW, mid, Math.max(0.5, barW - 0.5), h)
+    }
+  }, [embedding])
+
+  if (!embedding) return null
+
+  const copy = async () => {
+    try {
+      await navigator.clipboard.writeText(JSON.stringify(embedding))
+      setCopied(true)
+      setTimeout(() => setCopied(false), 1500)
+    } catch (_) {
+      /* clipboard gated */
+    }
+  }
+
+  return (
+    <div className="biometrics-embed card">
+      <div className="biometrics-embed__head">
+        <div>
+          <div className="biometrics-embed__title">Embedding vector</div>
+          <div className="biometrics-embed__meta">
+            {dim != null && <span><strong>{dim}</strong> dims</span>}
+            {summary && <span>L2 <strong>{summary.norm.toFixed(3)}</strong></span>}
+            {summary && <span>range <strong>[{summary.min.toFixed(3)}, {summary.max.toFixed(3)}]</strong></span>}
+            {model && <span>model <code>{model}</code></span>}
+            {elapsedMs != null && <span>{elapsedMs.toFixed(0)} ms</span>}
+          </div>
+        </div>
+        <button type="button" className="btn btn-secondary btn-sm" onClick={copy}>
+          <i className={`fas ${copied ? 'fa-check' : 'fa-copy'}`} aria-hidden="true" />
+          {copied ? ' Copied' : ' Copy JSON'}
+        </button>
+      </div>
+      <canvas ref={canvasRef} style={{ width: '100%', height: 60 }} aria-label="Embedding sparkline (first 128 dimensions)" />
+    </div>
+  )
+}