fix(distributed): per-replica backend logs (store aggregation + UI)

The multi-replica refactor (PR #9583) changed the worker's process key from `modelID` to `modelID#replicaIndex`, but the BackendLogStore kept the bare-modelID lookup. Result: every distributed deployment lost backend logs in the Nodes UI, single-replica too, since even the default capacity of 1 produces a `#0` suffix.

Two changes wired together:

* pkg/model: BackendLogStore.GetLines/Subscribe now treat a modelID without `#` as a model prefix and merge across all `modelID#N` replica buffers (timestamp-sorted for GetLines; fan-in for Subscribe). Calls with a full `modelID#N` key resolve exactly. ListModels strips replica suffixes and deduplicates so the listing surfaces one entry per loaded model.

* react-ui: per-replica log streams as the default. Loaded Models table disambiguates each row with a `rep N` pill (only when the node hosts >1 replica of a model). Each row's "View logs" link routes to the per-replica process key so operators see only that replica's output. The logs page renders the replica context as a chip in the title and surfaces a segmented control (`Replica 0 / 1 / … / All merged`) when the model has multiple replicas; the merged segment uses the bare-modelID URL (delegating to the store's prefix aggregation) for the side-by-side comparison case. Single-replica deployments see no extra UI.

Tests added first (TDD): the regression set in backend_log_store_test.go reproduces the bug at the exact failure point. GetLines/ListModels/Subscribe assertions all fail against the broken code, all pass against the fix. TestSubscribe_PerReplicaFilter pins the exact-key path so a future change can't silently break it.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [Skill:critique] [Skill:audit] [Skill:polish] [Skill:distill]
fix(ui): hide meta-dev backends in System → Backends Development toggle
2026-05-20 06:35:41 -04:00 · 2026-04-27 20:55:24 +00:00 · 2026-04-27 20:38:20 +00:00 · 2026-04-27 20:17:36 +00:00 · 2026-04-27 21:20:05 +02:00 · 2026-04-27 14:21:11 +00:00
177 changed files with 8260 additions and 3046 deletions
--- a/.agents/adding-backends.md
+++ b/.agents/adding-backends.md
@@ -43,7 +43,7 @@ If you add a new language bucket, `scripts/changed-backends.js` also needs a bra

 **Additional build types you may need:**
 - ROCm/HIP: Use `build-type: 'hipblas'` with `base-image: "rocm/dev-ubuntu-24.04:7.2.1"`
- Intel/SYCL: Use `build-type: 'intel'` or `build-type: 'sycl_f16'`/`sycl_f32` with `base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"`
+- Intel/SYCL: Use `build-type: 'intel'` or `build-type: 'sycl_f16'`/`sycl_f32` with `base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"`
 - L4T (ARM): Use `build-type: 'l4t'` with `platforms: 'linux/arm64'` and `runs-on: 'ubuntu-24.04-arm'`

 ## 3. Add Backend Metadata to `backend/index.yaml`
--- a/.agents/ai-coding-assistants.md
+++ b/.agents/ai-coding-assistants.md
@@ -35,33 +35,19 @@ All contributions must comply with LocalAI's licensing requirements:

 ## Signed-off-by and Developer Certificate of Origin

-Only humans can certify the Developer Certificate of Origin (DCO). AI
-agents MUST NOT invent or guess a human identity for `Signed-off-by` —
-doing so forges the DCO certification.
+**AI agents MUST NOT add `Signed-off-by` tags.** Only humans can legally
+certify the Developer Certificate of Origin (DCO). The human submitter
+is responsible for:

-However, when a human operator explicitly directs the AI to commit on
-their behalf, the AI is acting as a typing tool — no different from an
-editor macro or `git commit -s`. In that case the AI SHOULD add
-`Signed-off-by:` using the **configured `user.name` / `user.email`** of
-the current git repository (i.e. the operator's own identity). The
-resulting trailer is the operator's signature; they take responsibility
-for it by reviewing and pushing the commit. The AI MUST NOT use any
-other identity and MUST NOT add its own name to the sign-off.
-
-When running `git commit`, prefer `git commit --signoff` (or `-s`) so
-the trailer is emitted by git itself from the configured identity,
-rather than hand-writing it in a heredoc — this guarantees the sign-off
-matches whatever identity the operator is currently using.
-
-The human submitter remains responsible for:
-
- Reviewing all AI-generated code before it's pushed or merged
+- Reviewing all AI-generated code
 - Ensuring compliance with licensing requirements
+- Adding their own `Signed-off-by` tag (when the project requires DCO)
+  to certify the contribution
 - Taking full responsibility for the contribution

-AI agents MUST NOT add `Co-Authored-By` trailers for themselves. A human
-reviewer owns the contribution; the AI's involvement is recorded via
-`Assisted-by` (see below).
+AI agents MUST NOT add `Co-Authored-By` trailers for themselves either.
+A human reviewer owns the contribution; the AI's involvement is recorded
+via `Assisted-by` (see below).

 ## Attribution

@@ -98,12 +84,6 @@ Assisted-by: Claude:claude-opus-4-7 golangci-lint
 Signed-off-by: Jane Developer <jane@example.com>
 ```

-The `Signed-off-by` line uses Jane's own identity because Jane is the
-submitter operating the AI. If Jane asks Claude to create the commit via
-`git commit -s`, git emits that exact trailer from Jane's configured
-identity — no separate human step is needed beyond Jane reviewing the
-diff before pushing.
-
 ## Scope and Responsibility

 Using an AI assistant does not reduce the contributor's responsibility.
--- a/.agents/ci-caching.md
+++ b/.agents/ci-caching.md
@@ -0,0 +1,111 @@
+# CI Build Caching
+
+Container builds — both the root LocalAI image (`Dockerfile`) and the per-backend images (`backend/Dockerfile.*`) — share a registry-backed BuildKit cache. This file explains how that cache is laid out, what invalidates it, and how to bypass it.
+
+## Cache layout
+
+- **Cache registry**: `quay.io/go-skynet/ci-cache`
+- **One tag per matrix entry**, derived from the existing `tag-suffix`:
+  - Backend builds (`backend_build.yml`): `cache<tag-suffix>`
+    - e.g. `cache-gpu-nvidia-cuda-12-llama-cpp`, `cache-cpu-vllm`, `cache-nvidia-l4t-cuda-13-arm64-vllm`
+  - Root image builds (`image_build.yml`): `cache-localai<tag-suffix>`
+    - e.g. `cache-localai-gpu-nvidia-cuda-12`, `cache-localai-gpu-vulkan`
+- Each tag stores a multi-arch BuildKit cache manifest (`mode=max`), so every intermediate stage is re-usable, not just the final image.
+
+## Read/write semantics
+
+| Trigger | `cache-from` | `cache-to` |
+|---|---|---|
+| `push` to `master` / tag | yes | yes (`mode=max,ignore-error=true`) |
+| `pull_request` | yes | **no** |
+
+PR builds read master's warm cache but never write — this prevents PRs from polluting the shared cache with their experimental state. After merge, the master build for that matrix entry refreshes the cache.
+
+`ignore-error=true` on the write side means a transient quay push failure does not fail the build; the next master push retries.
+
+## Self-warming, no separate populator
+
+There is no cron job that pre-warms the cache. The production builds *are* the populator. The first master build of a given matrix entry pays the cold cost; subsequent same-entry master builds reuse everything that hasn't changed (apt installs, gRPC compile in `Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}`, Python wheel installs, etc.).
+
+Historically there was a `generate_grpc_cache.yaml` cron that targeted a `grpc` stage in the root Dockerfile. That stage was removed in July 2025 and the cron silently failed every night for 9 months without writing anything. It was deleted along with the registry-cache rollout.
+
+## The `DEPS_REFRESH` cache-buster (Python backends)
+
+Every Python backend goes through the shared `backend/Dockerfile.python`, which ends with:
+
+```dockerfile
+ARG DEPS_REFRESH=initial
+RUN cd /${BACKEND} && PORTABLE_PYTHON=true make
+```
+
+Most Python backends ship `requirements*.txt` files that **do not pin every transitive dep** (`torch`, `transformers`, `vllm`, `diffusers`, etc. are listed without a `==` pin, or with `>=` lower bounds only). With a warm BuildKit cache, the `make` layer hashes only on Dockerfile instructions + COPYed source — not on what `pip install` resolves at runtime. So a warm cache would ship the *first* version of `vllm` ever cached and never pick up upstream releases.
+
+`DEPS_REFRESH` defends against that:
+
+- `backend_build.yml` computes `date -u +%Y-W%V` (ISO week, e.g. `2026-W17`) before each build and passes it as a build-arg.
+- The `RUN ... make` layer's BuildKit hash now includes that string, so the layer invalidates **at most once per week**, automatically picking up newer wheels.
+- Within a week, builds stay warm.
+
+This applies only to `Dockerfile.python` because:
+- Go (`Dockerfile.golang`) pins versions in `go.mod` / `go.sum`.
+- Rust (`Dockerfile.rust`) pins via `Cargo.lock`.
+- C++ backends (`Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}`) clone gRPC at a pinned tag (`v1.65.0`) and llama.cpp at a pinned commit; their inputs don't drift between rebuilds.
+
+### Adjusting the cadence
+
+If you need a faster refresh (e.g. while debugging an upstream flake), bump the format to daily (`+%Y-%m-%d`) or hourly (`+%Y-%m-%d-%H`). If you need a one-shot rebuild for a specific backend without changing the schedule, append a marker to the tag-suffix in the matrix or temporarily delete that backend's cache tag in quay.
+
+## Manually evicting cache
+
+To force a fully cold build for one backend or the whole image:
+
+```bash
+# Delete a single tag (requires quay credentials with admin on the repo)
+curl -X DELETE \
+  -H "Authorization: Bearer ${QUAY_TOKEN}" \
+  https://quay.io/api/v1/repository/go-skynet/ci-cache/tag/cache-gpu-nvidia-cuda-12-vllm
+
+# List all tags
+curl -s -H "Authorization: Bearer ${QUAY_TOKEN}" \
+  "https://quay.io/api/v1/repository/go-skynet/ci-cache/tag/?limit=100" | jq '.tags[].name'
+```
+
+Eviction is rarely needed in normal operation — `DEPS_REFRESH` handles weekly drift, source changes invalidate naturally, and `mode=max` keeps the cache scoped per matrix entry so a stale tag never bleeds into a different build.
+
+## What the cache **does not** cover
+
+- The "Free Disk Space" / "Release space from worker" steps run on every job — these reclaim ~6 GB on `ubuntu-latest` runners. They are runner-state cleanup, not Docker, and BuildKit caches don't apply.
+- Intermediate artifacts of `Build and push (PR)` are not pushed anywhere — PRs only build for verification.
+- Darwin builds (see below) — macOS runners have no Docker daemon, so the registry-backed BuildKit cache cannot apply.
+
+## Darwin native caches
+
+`backend_build_darwin.yml` runs natively on `macOS-14` GitHub-hosted runners — there is no Docker, no BuildKit, no cross-job registry cache. Instead, the reusable workflow uses `actions/cache@v4` for four native caches that mirror the spirit of the Linux cache (warm by default, weekly refresh for unpinned Python deps, PRs read-only).
+
+| Cache | Path(s) | Key | Scope |
+|---|---|---|---|
+| Go modules + build | `~/go/pkg/mod`, `~/Library/Caches/go-build` | `go.sum` (managed by `actions/setup-go@v5` `cache: true`) | All darwin jobs |
+| Homebrew | `~/Library/Caches/Homebrew/downloads`, selected `/opt/homebrew/Cellar/*` | hash of `backend_build_darwin.yml` | All darwin jobs |
+| ccache (llama.cpp CMake) | `~/Library/Caches/ccache` | pinned `LLAMA_VERSION` from `backend/cpp/llama-cpp/Makefile` | `inputs.backend == 'llama-cpp'` only |
+| Python wheels (uv + pip) | `~/Library/Caches/pip`, `~/Library/Caches/uv` | `inputs.backend` + ISO week (`+%Y-W%V`) + hash of that backend's `requirements*.txt` | `inputs.lang == 'python'` only |
+
+Read/write semantics match the BuildKit cache: `actions/cache/restore` runs every time, `actions/cache/save` is gated on `github.event_name != 'pull_request'`. PRs read master's warm cache but never write back.
+
+The Python wheel cache uses the same ISO-week cache-buster as the Linux `DEPS_REFRESH` build-arg — same problem (unpinned `torch`/`mlx`/`diffusers`/`transformers` resolve to fresh wheels weekly), same ~one-cold-rebuild-per-week solution.
+
+The brew Cellar cache requires `HOMEBREW_NO_AUTO_UPDATE=1` and `HOMEBREW_NO_INSTALL_CLEANUP=1` (set as job-level env). Without those, `brew install` would mutate the very directories that were just restored, defeating the cache.
+
+For ccache, the workflow exports `CMAKE_ARGS=… -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache` via `$GITHUB_ENV` before running `make build-darwin-go-backend`. The Makefile in `backend/cpp/llama-cpp/` already forwards `CMAKE_ARGS` through to each variant build (`fallback`, `grpc`, `rpc-server`), so no script changes are needed. The three variants share most TUs, so ccache dedupes object files across them.
+
+### Cache budget on Darwin
+
+GitHub Actions caches are limited to 10 GB per repo. Steady-state worst case: ~800 MB Go cache + ~2 GB brew Cellar + up to 2 GB ccache + ~1.5 GB × 5 python backends. If the cap is hit, prefer collapsing the per-backend Python keys into a shared `pyenv-darwin-shared-<week>` key (accepts more cross-backend churn for a smaller footprint) before reducing other caches.
+
+## Touching the cache pipeline
+
+When changing `image_build.yml`, `backend_build.yml`, or any of the `backend/Dockerfile.*` files:
+
+1. **Don't drop `DEPS_REFRESH=...` from the build-args** without a replacement strategy (lockfiles, pinned requirements). Otherwise master will silently freeze on whichever versions were cached at the time.
+2. **Keep `tag-suffix` unique per matrix entry** — it's the cache namespace. Two matrix entries sharing a tag-suffix would clobber each other's cache.
+3. **Keep `cache-to` gated on `github.event_name != 'pull_request'`** — PRs must not write.
+4. **Keep `ignore-error=true` on `cache-to`** — quay registry hiccups must not fail builds.
--- a/.github/workflows/backend.yml
+++ b/.github/workflows/backend.yml
@@ -141,7 +141,7 @@ jobs:
          - build-type: ''
            cuda-major-version: ""
            cuda-minor-version: ""
-            platforms: 'linux/amd64'
+            platforms: 'linux/amd64,linux/arm64'
            tag-latest: 'auto'
            tag-suffix: '-cpu-whisperx'
            runs-on: 'ubuntu-latest'
@@ -154,7 +154,7 @@ jobs:
          - build-type: ''
            cuda-major-version: ""
            cuda-minor-version: ""
-            platforms: 'linux/amd64'
+            platforms: 'linux/amd64,linux/arm64'
            tag-latest: 'auto'
            tag-suffix: '-cpu-faster-whisper'
            runs-on: 'ubuntu-latest'
@@ -399,19 +399,6 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
-          - build-type: 'cublas'
-            cuda-major-version: "12"
-            cuda-minor-version: "8"
-            platforms: 'linux/amd64'
-            tag-latest: 'auto'
-            tag-suffix: '-gpu-nvidia-cuda-12-buun-llama-cpp'
-            runs-on: 'bigger-runner'
-            base-image: "ubuntu:24.04"
-            skip-drivers: 'false'
-            backend: "buun-llama-cpp"
-            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
-            context: "./"
-            ubuntu-version: '2404'
          - build-type: 'cublas'
            cuda-major-version: "12"
            cuda-minor-version: "8"
@@ -907,19 +894,6 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
-          - build-type: 'cublas'
-            cuda-major-version: "13"
-            cuda-minor-version: "0"
-            platforms: 'linux/amd64'
-            tag-latest: 'auto'
-            tag-suffix: '-gpu-nvidia-cuda-13-buun-llama-cpp'
-            runs-on: 'ubuntu-latest'
-            base-image: "ubuntu:24.04"
-            skip-drivers: 'false'
-            backend: "buun-llama-cpp"
-            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
-            context: "./"
-            ubuntu-version: '2404'
          - build-type: 'cublas'
            cuda-major-version: "13"
            cuda-minor-version: "0"
@@ -949,16 +923,29 @@ jobs:
          - build-type: 'cublas'
            cuda-major-version: "13"
            cuda-minor-version: "0"
-            platforms: 'linux/arm64'
-            skip-drivers: 'false'
+            platforms: 'linux/amd64'
            tag-latest: 'auto'
-            tag-suffix: '-nvidia-l4t-cuda-13-arm64-buun-llama-cpp'
+            tag-suffix: '-gpu-nvidia-cuda-13-vllm'
+            runs-on: 'arc-runner-set'
            base-image: "ubuntu:24.04"
-            runs-on: 'ubuntu-24.04-arm'
-            ubuntu-version: '2404'
-            backend: "buun-llama-cpp"
-            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
+            skip-drivers: 'false'
+            backend: "vllm"
+            dockerfile: "./backend/Dockerfile.python"
            context: "./"
+            ubuntu-version: '2404'
+          - build-type: 'cublas'
+            cuda-major-version: "13"
+            cuda-minor-version: "0"
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            tag-suffix: '-gpu-nvidia-cuda-13-vllm-omni'
+            runs-on: 'arc-runner-set'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            backend: "vllm-omni"
+            dockerfile: "./backend/Dockerfile.python"
+            context: "./"
+            ubuntu-version: '2404'
          - build-type: 'cublas'
            cuda-major-version: "13"
            cuda-minor-version: "0"
@@ -1115,6 +1102,45 @@ jobs:
            backend: "diffusers"
            dockerfile: "./backend/Dockerfile.python"
            context: "./"
+          - build-type: 'l4t'
+            cuda-major-version: "13"
+            cuda-minor-version: "0"
+            platforms: 'linux/arm64'
+            tag-latest: 'auto'
+            tag-suffix: '-nvidia-l4t-cuda-13-arm64-vllm'
+            runs-on: 'ubuntu-24.04-arm'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            ubuntu-version: '2404'
+            backend: "vllm"
+            dockerfile: "./backend/Dockerfile.python"
+            context: "./"
+          - build-type: 'l4t'
+            cuda-major-version: "13"
+            cuda-minor-version: "0"
+            platforms: 'linux/arm64'
+            tag-latest: 'auto'
+            tag-suffix: '-nvidia-l4t-cuda-13-arm64-vllm-omni'
+            runs-on: 'ubuntu-24.04-arm'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            ubuntu-version: '2404'
+            backend: "vllm-omni"
+            dockerfile: "./backend/Dockerfile.python"
+            context: "./"
+          - build-type: 'l4t'
+            cuda-major-version: "13"
+            cuda-minor-version: "0"
+            platforms: 'linux/arm64'
+            tag-latest: 'auto'
+            tag-suffix: '-nvidia-l4t-cuda-13-arm64-sglang'
+            runs-on: 'ubuntu-24.04-arm'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            ubuntu-version: '2404'
+            backend: "sglang"
+            dockerfile: "./backend/Dockerfile.python"
+            context: "./"
          - build-type: 'l4t'
            cuda-major-version: "13"
            cuda-minor-version: "0"
@@ -1493,19 +1519,6 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
-          - build-type: 'hipblas'
-            cuda-major-version: ""
-            cuda-minor-version: ""
-            platforms: 'linux/amd64'
-            tag-latest: 'auto'
-            tag-suffix: '-gpu-rocm-hipblas-buun-llama-cpp'
-            runs-on: 'ubuntu-latest'
-            base-image: "rocm/dev-ubuntu-24.04:7.2.1"
-            skip-drivers: 'false'
-            backend: "buun-llama-cpp"
-            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
-            context: "./"
-            ubuntu-version: '2404'
          - build-type: 'hipblas'
            cuda-major-version: ""
            cuda-minor-version: ""
@@ -1723,7 +1736,7 @@ jobs:
            tag-latest: 'auto'
            tag-suffix: '-gpu-intel-rerankers'
            runs-on: 'ubuntu-latest'
-            base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
+            base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
            skip-drivers: 'false'
            backend: "rerankers"
            dockerfile: "./backend/Dockerfile.python"
@@ -1736,7 +1749,7 @@ jobs:
            tag-latest: 'auto'
            tag-suffix: '-gpu-intel-sycl-f32-llama-cpp'
            runs-on: 'ubuntu-latest'
-            base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
+            base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
            skip-drivers: 'false'
            backend: "llama-cpp"
            dockerfile: "./backend/Dockerfile.llama-cpp"
@@ -1755,19 +1768,6 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
-          - build-type: 'sycl_f32'
-            cuda-major-version: ""
-            cuda-minor-version: ""
-            platforms: 'linux/amd64'
-            tag-latest: 'auto'
-            tag-suffix: '-gpu-intel-sycl-f32-buun-llama-cpp'
-            runs-on: 'ubuntu-latest'
-            base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
-            skip-drivers: 'false'
-            backend: "buun-llama-cpp"
-            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
-            context: "./"
-            ubuntu-version: '2404'
          - build-type: 'sycl_f16'
            cuda-major-version: ""
            cuda-minor-version: ""
@@ -1794,19 +1794,6 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
-          - build-type: 'sycl_f16'
-            cuda-major-version: ""
-            cuda-minor-version: ""
-            platforms: 'linux/amd64'
-            tag-latest: 'auto'
-            tag-suffix: '-gpu-intel-sycl-f16-buun-llama-cpp'
-            runs-on: 'ubuntu-latest'
-            base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
-            skip-drivers: 'false'
-            backend: "buun-llama-cpp"
-            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
-            context: "./"
-            ubuntu-version: '2404'
          - build-type: 'intel'
            cuda-major-version: ""
            cuda-minor-version: ""
@@ -2212,19 +2199,6 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
-          - build-type: ''
-            cuda-major-version: ""
-            cuda-minor-version: ""
-            platforms: 'linux/amd64,linux/arm64'
-            tag-latest: 'auto'
-            tag-suffix: '-cpu-buun-llama-cpp'
-            runs-on: 'bigger-runner'
-            base-image: "ubuntu:24.04"
-            skip-drivers: 'false'
-            backend: "buun-llama-cpp"
-            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
-            context: "./"
-            ubuntu-version: '2404'
          - build-type: ''
            cuda-major-version: ""
            cuda-minor-version: ""
@@ -2264,19 +2238,6 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2204'
-          - build-type: 'cublas'
-            cuda-major-version: "12"
-            cuda-minor-version: "0"
-            platforms: 'linux/arm64'
-            skip-drivers: 'false'
-            tag-latest: 'auto'
-            tag-suffix: '-nvidia-l4t-arm64-buun-llama-cpp'
-            base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
-            runs-on: 'ubuntu-24.04-arm'
-            backend: "buun-llama-cpp"
-            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
-            context: "./"
-            ubuntu-version: '2204'
          - build-type: 'vulkan'
            cuda-major-version: ""
            cuda-minor-version: ""
@@ -2303,19 +2264,6 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
-          - build-type: 'vulkan'
-            cuda-major-version: ""
-            cuda-minor-version: ""
-            platforms: 'linux/amd64,linux/arm64'
-            tag-latest: 'auto'
-            tag-suffix: '-gpu-vulkan-buun-llama-cpp'
-            runs-on: 'bigger-runner'
-            base-image: "ubuntu:24.04"
-            skip-drivers: 'false'
-            backend: "buun-llama-cpp"
-            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
-            context: "./"
-            ubuntu-version: '2404'
          # Stablediffusion-ggml
          - build-type: ''
            cuda-major-version: ""
--- a/.github/workflows/backend_build.yml
+++ b/.github/workflows/backend_build.yml
@@ -208,6 +208,15 @@ jobs:
          username: ${{ secrets.quayUsername }}
          password: ${{ secrets.quayPassword }}

+      # Weekly cache-buster for the per-backend `make` step. Most Python
+      # backends list unpinned deps (torch, transformers, vllm, ...), so a
+      # warm cache freezes upstream versions indefinitely. Rolling this
+      # weekly forces a re-resolve of the install layer at most once per
+      # week, picking up newer wheels without a full cold rebuild.
+      - name: Compute deps refresh key
+        id: deps_refresh
+        run: echo "key=$(date -u +%Y-W%V)" >> "$GITHUB_OUTPUT"
+
      - name: Build and push
        uses: docker/build-push-action@v7
        if: github.event_name != 'pull_request'
@@ -222,9 +231,11 @@ jobs:
            BACKEND=${{ inputs.backend }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
            AMDGPU_TARGETS=${{ inputs.amdgpu-targets }}
+            DEPS_REFRESH=${{ steps.deps_refresh.outputs.key }}
          context: ${{ inputs.context }}
          file: ${{ inputs.dockerfile }}
-          cache-from: type=gha
+          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}
+          cache-to: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }},mode=max,ignore-error=true
          platforms: ${{ inputs.platforms }}
          push: ${{ github.event_name != 'pull_request' }}
          tags: ${{ steps.meta.outputs.tags }}
@@ -244,9 +255,10 @@ jobs:
            BACKEND=${{ inputs.backend }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
            AMDGPU_TARGETS=${{ inputs.amdgpu-targets }}
+            DEPS_REFRESH=${{ steps.deps_refresh.outputs.key }}
          context: ${{ inputs.context }}
          file: ${{ inputs.dockerfile }}
-          cache-from: type=gha
+          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}
          platforms: ${{ inputs.platforms }}
          push: ${{ env.quay_username != '' }}
          tags: ${{ steps.meta_pull_request.outputs.tags }}
--- a/.github/workflows/backend_build_darwin.yml
+++ b/.github/workflows/backend_build_darwin.yml
@@ -48,6 +48,13 @@ jobs:
    strategy:
      matrix:
        go-version: ['${{ inputs.go-version }}']
+    env:
+      # Keep the brew Cellar stable across cache restores. Without these,
+      # `brew install` would auto-update brew itself and re-link formulas,
+      # mutating the very paths the cache just restored.
+      HOMEBREW_NO_AUTO_UPDATE: '1'
+      HOMEBREW_NO_INSTALL_CLEANUP: '1'
+      HOMEBREW_NO_ANALYTICS: '1'
    steps:
      - name: Clone
        uses: actions/checkout@v6
@@ -58,21 +65,141 @@ jobs:
        uses: actions/setup-go@v5
        with:
          go-version: ${{ matrix.go-version }}
-          cache: false
+          # Caches ~/go/pkg/mod and ~/Library/Caches/go-build keyed on go.sum.
+          # Shared across every darwin matrix entry — first job in a run warms
+          # it, the rest hit warm.
+          cache: true

      # You can test your matrix by printing the current Go version
      - name: Display Go version
        run: go version

+      # ---- Homebrew cache ----
+      # macOS runners have no Docker daemon, so the BuildKit registry cache used
+      # for Linux backend images (see .agents/ci-caching.md) doesn't apply here.
+      # We cache the brew downloads + Cellar entries for the formulas we install
+      # below. Read on every run, write only on master/tag pushes — same policy
+      # as the Linux registry cache.
+      - name: Restore Homebrew cache
+        id: brew-cache
+        uses: actions/cache/restore@v4
+        with:
+          path: |
+            ~/Library/Caches/Homebrew/downloads
+            /opt/homebrew/Cellar/protobuf
+            /opt/homebrew/Cellar/grpc
+            /opt/homebrew/Cellar/protoc-gen-go
+            /opt/homebrew/Cellar/protoc-gen-go-grpc
+            /opt/homebrew/Cellar/libomp
+            /opt/homebrew/Cellar/llvm
+            /opt/homebrew/Cellar/ccache
+          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}
+
      - name: Dependencies
        run: |
-          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm
+          # ccache is always installed (used by the llama-cpp variant build) so
+          # the brew cache content stays stable across every backend in the
+          # matrix — they all share one cache key.
+          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache
+
+      - name: Save Homebrew cache
+        if: github.event_name != 'pull_request' && steps.brew-cache.outputs.cache-hit != 'true'
+        uses: actions/cache/save@v4
+        with:
+          path: |
+            ~/Library/Caches/Homebrew/downloads
+            /opt/homebrew/Cellar/protobuf
+            /opt/homebrew/Cellar/grpc
+            /opt/homebrew/Cellar/protoc-gen-go
+            /opt/homebrew/Cellar/protoc-gen-go-grpc
+            /opt/homebrew/Cellar/libomp
+            /opt/homebrew/Cellar/llvm
+            /opt/homebrew/Cellar/ccache
+          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}
+
+      # ---- ccache for llama.cpp CMake builds ----
+      # Three CMake variants (fallback, grpc, rpc-server) compile the same
+      # llama.cpp source tree with overlapping flags — ccache dedupes object
+      # files across them. Key on the pinned LLAMA_VERSION so a pin bump
+      # invalidates cleanly; restore-keys fall back to the latest entry for the
+      # same pin so unchanged TUs stay warm even when the cache is fresh.
+      - name: Compute llama.cpp version
+        if: inputs.backend == 'llama-cpp'
+        id: llama-version
+        run: |
+          version=$(grep '^LLAMA_VERSION' backend/cpp/llama-cpp/Makefile | head -1 | cut -d= -f2 | cut -d'?' -f1 | tr -d ' ')
+          echo "version=${version}" >> "$GITHUB_OUTPUT"
+
+      - name: Restore ccache
+        if: inputs.backend == 'llama-cpp'
+        id: ccache-cache
+        uses: actions/cache/restore@v4
+        with:
+          path: ~/Library/Caches/ccache
+          key: ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-${{ github.run_id }}
+          restore-keys: |
+            ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-
+
+      - name: Configure ccache
+        if: inputs.backend == 'llama-cpp'
+        run: |
+          mkdir -p "$HOME/Library/Caches/ccache"
+          ccache -M 2G
+          ccache -z
+          # llama-cpp-darwin.sh reads CMAKE_ARGS / CCACHE_DIR from env.
+          {
+            echo "CMAKE_ARGS=${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache"
+            echo "CCACHE_DIR=$HOME/Library/Caches/ccache"
+          } >> "$GITHUB_ENV"
+
+      # ---- Python wheel cache (uv + pip) ----
+      # Mirrors the Linux DEPS_REFRESH cadence (see .agents/ci-caching.md): the
+      # ISO-week segment of the cache key forces at most one cold rebuild per
+      # backend per week, automatically picking up newer wheels for unpinned
+      # deps (torch, mlx, diffusers, …). Restore-keys fall back to the most
+      # recent build of the same backend so off-week PRs still hit warm.
+      - name: Compute weekly cache bucket
+        if: inputs.lang == 'python'
+        id: weekly
+        run: echo "bucket=$(date -u +%G-W%V)" >> "$GITHUB_OUTPUT"
+
+      - name: Restore Python wheel cache
+        if: inputs.lang == 'python'
+        id: pyenv-cache
+        uses: actions/cache/restore@v4
+        with:
+          path: |
+            ~/Library/Caches/pip
+            ~/Library/Caches/uv
+          key: pyenv-darwin-${{ inputs.backend }}-${{ steps.weekly.outputs.bucket }}-${{ hashFiles(format('backend/python/{0}/requirements*.txt', inputs.backend)) }}
+          restore-keys: |
+            pyenv-darwin-${{ inputs.backend }}-

      - name: Build ${{ inputs.backend }}-darwin
        run: |
          make protogen-go
          BACKEND=${{ inputs.backend }} BUILD_TYPE=${{ inputs.build-type }} USE_PIP=${{ inputs.use-pip }} make build-darwin-${{ inputs.lang }}-backend

+      - name: ccache stats
+        if: inputs.backend == 'llama-cpp'
+        run: ccache -s
+
+      - name: Save ccache
+        if: inputs.backend == 'llama-cpp' && github.event_name != 'pull_request'
+        uses: actions/cache/save@v4
+        with:
+          path: ~/Library/Caches/ccache
+          key: ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-${{ github.run_id }}
+
+      - name: Save Python wheel cache
+        if: inputs.lang == 'python' && github.event_name != 'pull_request' && steps.pyenv-cache.outputs.cache-hit != 'true'
+        uses: actions/cache/save@v4
+        with:
+          path: |
+            ~/Library/Caches/pip
+            ~/Library/Caches/uv
+          key: pyenv-darwin-${{ inputs.backend }}-${{ steps.weekly.outputs.bucket }}-${{ hashFiles(format('backend/python/{0}/requirements*.txt', inputs.backend)) }}
+
      - name: Upload ${{ inputs.backend }}.tar
        uses: actions/upload-artifact@v7
        with:
--- a/.github/workflows/gallery-agent.yaml
+++ b/.github/workflows/gallery-agent.yaml
@@ -2,7 +2,7 @@ name: Gallery Agent
 on:

  schedule:
-    - cron: '0 */3 * * *'  # Run every 4 hours
+    - cron: '0 */12 * * *'  # Run every 12 hours
  workflow_dispatch:
    inputs:
      search_term:
--- a/.github/workflows/generate_grpc_cache.yaml
+++ b/.github/workflows/generate_grpc_cache.yaml
@@ -1,96 +0,0 @@
-name: 'generate and publish GRPC docker caches'
-
-on:
-  workflow_dispatch:
-
-  schedule:
-    # daily at midnight
-    - cron: '0 0 * * *'
-
-concurrency:
-  group: grpc-cache-${{ github.head_ref || github.ref }}-${{ github.repository }}
-  cancel-in-progress: true
-
-jobs:
-  generate_caches:
-    if: github.repository == 'mudler/LocalAI'
-    strategy:
-      matrix:
-        include:
-          - grpc-base-image: ubuntu:24.04
-            runs-on: 'ubuntu-latest'
-            platforms: 'linux/amd64,linux/arm64'
-    runs-on: ${{matrix.runs-on}}
-    steps:
-      - name: Release space from worker
-        if: matrix.runs-on == 'ubuntu-latest'
-        run: |
-          echo "Listing top largest packages"
-          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
-          head -n 30 <<< "${pkgs}"
-          echo
-          df -h
-          echo
-          sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
-          sudo apt-get remove --auto-remove android-sdk-platform-tools || true
-          sudo apt-get purge --auto-remove android-sdk-platform-tools || true
-          sudo rm -rf /usr/local/lib/android
-          sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
-          sudo rm -rf /usr/share/dotnet
-          sudo apt-get remove -y '^mono-.*' || true
-          sudo apt-get remove -y '^ghc-.*' || true
-          sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
-          sudo apt-get remove -y 'php.*' || true
-          sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
-          sudo apt-get remove -y '^google-.*' || true
-          sudo apt-get remove -y azure-cli || true
-          sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
-          sudo apt-get remove -y '^gfortran-.*' || true
-          sudo apt-get remove -y microsoft-edge-stable || true
-          sudo apt-get remove -y firefox || true
-          sudo apt-get remove -y powershell || true
-          sudo apt-get remove -y r-base-core || true
-          sudo apt-get autoremove -y
-          sudo apt-get clean
-          echo
-          echo "Listing top largest packages"
-          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
-          head -n 30 <<< "${pkgs}"
-          echo
-          sudo rm -rfv build || true
-          sudo rm -rf /usr/share/dotnet || true
-          sudo rm -rf /opt/ghc || true
-          sudo rm -rf "/usr/local/share/boost" || true
-          sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
-          df -h
-
-      - name: Set up QEMU
-        uses: docker/setup-qemu-action@master
-        with:
-          platforms: all
-
-      - name: Set up Docker Buildx
-        id: buildx
-        uses: docker/setup-buildx-action@master
-
-      - name: Checkout
-        uses: actions/checkout@v6
-
-      - name: Cache GRPC
-        uses: docker/build-push-action@v7
-        with:
-          builder: ${{ steps.buildx.outputs.name }}
-          # The build-args MUST be an EXACT match between the image cache and other workflow steps that want to use that cache.
-          # This means that even the MAKEFLAGS have to be an EXACT match.
-          # If the build-args are not an EXACT match, it will result in a cache miss, which will require GRPC to be built from scratch.
-          build-args: |
-            GRPC_BASE_IMAGE=${{ matrix.grpc-base-image }}
-            GRPC_MAKEFLAGS=--jobs=4 --output-sync=target
-            GRPC_VERSION=v1.65.0
-          context: .
-          file: ./Dockerfile
-          cache-to: type=gha,ignore-error=true
-          cache-from: type=gha
-          target: grpc
-          platforms: ${{ matrix.platforms }}
-          push: false
--- a/.github/workflows/generate_intel_image.yaml
+++ b/.github/workflows/generate_intel_image.yaml
@@ -16,7 +16,7 @@ jobs:
    strategy:
      matrix:
        include:
-          - base-image: intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04
+          - base-image: intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04
            runs-on: 'arc-runner-set'
            platforms: 'linux/amd64'
    runs-on: ${{matrix.runs-on}}
--- a/.github/workflows/image-pr.yml
+++ b/.github/workflows/image-pr.yml
@@ -20,7 +20,6 @@
        platforms: ${{ matrix.platforms }}
        runs-on: ${{ matrix.runs-on }}
        base-image: ${{ matrix.base-image }}
-        grpc-base-image: ${{ matrix.grpc-base-image }}
        makeflags: ${{ matrix.makeflags }}
        ubuntu-version: ${{ matrix.ubuntu-version }}
      secrets:
@@ -60,15 +59,13 @@
              tag-latest: 'false'
              tag-suffix: '-hipblas'
              base-image: "rocm/dev-ubuntu-24.04:7.2.1"
-              grpc-base-image: "ubuntu:24.04"
              runs-on: 'ubuntu-latest'
              makeflags: "--jobs=3 --output-sync=target"
              ubuntu-version: '2404'
            - build-type: 'sycl'
              platforms: 'linux/amd64'
              tag-latest: 'false'
-              base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
-              grpc-base-image: "ubuntu:24.04"
+              base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
              tag-suffix: 'sycl'
              runs-on: 'ubuntu-latest'
              makeflags: "--jobs=3 --output-sync=target"
--- a/.github/workflows/image.yml
+++ b/.github/workflows/image.yml
@@ -25,7 +25,6 @@
        platforms: ${{ matrix.platforms }}
        runs-on: ${{ matrix.runs-on }}
        base-image: ${{ matrix.base-image }}
-        grpc-base-image: ${{ matrix.grpc-base-image }}
        makeflags: ${{ matrix.makeflags }}
        ubuntu-version: ${{ matrix.ubuntu-version }}
        ubuntu-codename: ${{ matrix.ubuntu-codename }}
@@ -42,12 +41,11 @@
              tag-latest: 'auto'
              tag-suffix: '-gpu-hipblas'
              base-image: "rocm/dev-ubuntu-24.04:7.2.1"
-              grpc-base-image: "ubuntu:24.04"
              runs-on: 'ubuntu-latest'
              makeflags: "--jobs=3 --output-sync=target"
              ubuntu-version: '2404'
              ubuntu-codename: 'noble'
-  
+
    core-image-build:
      if: github.repository == 'mudler/LocalAI'
      uses: ./.github/workflows/image_build.yml
@@ -60,7 +58,6 @@
        platforms: ${{ matrix.platforms }}
        runs-on: ${{ matrix.runs-on }}
        base-image: ${{ matrix.base-image }}
-        grpc-base-image: ${{ matrix.grpc-base-image }}
        makeflags: ${{ matrix.makeflags }}
        skip-drivers: ${{ matrix.skip-drivers }}
        ubuntu-version: ${{ matrix.ubuntu-version }}
@@ -121,8 +118,7 @@
            - build-type: 'intel'
              platforms: 'linux/amd64'
              tag-latest: 'auto'
-              base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
-              grpc-base-image: "ubuntu:24.04"
+              base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
              tag-suffix: '-gpu-intel'
              runs-on: 'ubuntu-latest'
              makeflags: "--jobs=3 --output-sync=target"
@@ -141,7 +137,6 @@
        platforms: ${{ matrix.platforms }}
        runs-on: ${{ matrix.runs-on }}
        base-image: ${{ matrix.base-image }}
-        grpc-base-image: ${{ matrix.grpc-base-image }}
        makeflags: ${{ matrix.makeflags }}
        skip-drivers: ${{ matrix.skip-drivers }}
        ubuntu-version: ${{ matrix.ubuntu-version }}
--- a/.github/workflows/image_build.yml
+++ b/.github/workflows/image_build.yml
@@ -8,11 +8,6 @@ on:
        description: 'Base image'
        required: true
        type: string
-      grpc-base-image:
-        description: 'GRPC Base image, must be a compatible image with base-image'
-        required: false
-        default: ''
-        type: string
      build-type:
        description: 'Build type'
        default: ''
@@ -201,25 +196,19 @@ jobs:
        if: github.event_name != 'pull_request'
        with:
          builder: ${{ steps.buildx.outputs.name }}
-          # The build-args MUST be an EXACT match between the image cache and other workflow steps that want to use that cache.
-          # This means that even the MAKEFLAGS have to be an EXACT match.
-          # If the build-args are not an EXACT match, it will result in a cache miss, which will require GRPC to be built from scratch.
-          # This is why some build args like GRPC_VERSION and MAKEFLAGS are hardcoded
          build-args: |
            BUILD_TYPE=${{ inputs.build-type }}
            CUDA_MAJOR_VERSION=${{ inputs.cuda-major-version }}
            CUDA_MINOR_VERSION=${{ inputs.cuda-minor-version }}
            BASE_IMAGE=${{ inputs.base-image }}
-            GRPC_BASE_IMAGE=${{ inputs.grpc-base-image || inputs.base-image }}
-            GRPC_MAKEFLAGS=--jobs=4 --output-sync=target
-            GRPC_VERSION=v1.65.0
            MAKEFLAGS=${{ inputs.makeflags }}
            SKIP_DRIVERS=${{ inputs.skip-drivers }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
            UBUNTU_CODENAME=${{ inputs.ubuntu-codename }}
          context: .
          file: ./Dockerfile
-          cache-from: type=gha
+          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }}
+          cache-to: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }},mode=max,ignore-error=true
          platforms: ${{ inputs.platforms }}
          push: ${{ github.event_name != 'pull_request' }}
          tags: ${{ steps.meta.outputs.tags }}
@@ -230,25 +219,18 @@ jobs:
        if: github.event_name == 'pull_request'
        with:
          builder: ${{ steps.buildx.outputs.name }}
-          # The build-args MUST be an EXACT match between the image cache and other workflow steps that want to use that cache.
-          # This means that even the MAKEFLAGS have to be an EXACT match.
-          # If the build-args are not an EXACT match, it will result in a cache miss, which will require GRPC to be built from scratch.
-          # This is why some build args like GRPC_VERSION and MAKEFLAGS are hardcoded
          build-args: |
            BUILD_TYPE=${{ inputs.build-type }}
            CUDA_MAJOR_VERSION=${{ inputs.cuda-major-version }}
            CUDA_MINOR_VERSION=${{ inputs.cuda-minor-version }}
            BASE_IMAGE=${{ inputs.base-image }}
-            GRPC_BASE_IMAGE=${{ inputs.grpc-base-image || inputs.base-image }}
-            GRPC_MAKEFLAGS=--jobs=4 --output-sync=target
-            GRPC_VERSION=v1.65.0
            MAKEFLAGS=${{ inputs.makeflags }}
            SKIP_DRIVERS=${{ inputs.skip-drivers }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
            UBUNTU_CODENAME=${{ inputs.ubuntu-codename }}
          context: .
          file: ./Dockerfile
-          cache-from: type=gha
+          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }}
          platforms: ${{ inputs.platforms }}
          #push: true
          tags: ${{ steps.meta_pull_request.outputs.tags }}
--- a/.github/workflows/test-extra.yml
+++ b/.github/workflows/test-extra.yml
@@ -32,7 +32,6 @@ jobs:
      llama-cpp: ${{ steps.detect.outputs.llama-cpp }}
      ik-llama-cpp: ${{ steps.detect.outputs.ik-llama-cpp }}
      turboquant: ${{ steps.detect.outputs.turboquant }}
-      buun-llama-cpp: ${{ steps.detect.outputs['buun-llama-cpp'] }}
      vllm: ${{ steps.detect.outputs.vllm }}
      sglang: ${{ steps.detect.outputs.sglang }}
      acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
@@ -614,30 +613,6 @@ jobs:
      - name: Build turboquant backend image and run gRPC e2e tests
        run: |
          make test-extra-backend-turboquant
-  tests-buun-llama-cpp-grpc:
-    needs: detect-changes
-    if: needs.detect-changes.outputs['buun-llama-cpp'] == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    timeout-minutes: 90
-    steps:
-      - name: Clone
-        uses: actions/checkout@v6
-        with:
-          submodules: true
-      - name: Setup Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: '1.25.4'
-      # Exercises the buun-llama-cpp (fork-of-a-fork) backend with the
-      # fork-specific TurboQuant/TCQ KV-cache types. BACKEND_TEST_CACHE_TYPE_V
-      # is set to turbo3 so the test round-trips through the fork's KV
-      # allow-list — picking a stock llama.cpp type would only re-test the
-      # shared code path. DFlash speculative decoding is not exercised here
-      # because the one known public target/drafter pair (Qwen3.5-27B) is too
-      # large for CI.
-      - name: Build buun-llama-cpp backend image and run gRPC e2e tests
-        run: |
-          make test-extra-backend-buun-llama-cpp
  # tests-vllm-grpc is currently disabled in CI.
  #
  # The prebuilt vllm CPU wheel is compiled with AVX-512 VNNI/BF16
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -9,9 +9,6 @@ on:
    tags:
      - '*'

-env:
-  GRPC_VERSION: v1.65.0
-
 concurrency:
  group: ci-tests-${{ github.head_ref || github.ref }}-${{ github.repository }}
  cancel-in-progress: true
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -19,6 +19,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
 |------|-------------|
 | [.agents/ai-coding-assistants.md](.agents/ai-coding-assistants.md) | Policy for AI-assisted contributions — licensing, DCO, attribution |
 | [.agents/building-and-testing.md](.agents/building-and-testing.md) | Building the project, running tests, Docker builds for specific platforms |
+| [.agents/ci-caching.md](.agents/ci-caching.md) | CI build cache layout (registry-backed BuildKit cache on quay.io/go-skynet/ci-cache), `DEPS_REFRESH` weekly cache-buster for unpinned Python deps, manual eviction |
 | [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist, including importer integration (the `/import-model` dropdown is server-driven from `GET /backends/known`) |
 | [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
 | [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
--- a/2
+++ b/2
@@ -1,5 +1,4 @@
 ARG BASE_IMAGE=ubuntu:24.04
-ARG GRPC_BASE_IMAGE=${BASE_IMAGE}
 ARG INTEL_BASE_IMAGE=${BASE_IMAGE}
 ARG UBUNTU_CODENAME=noble

@@ -149,6 +148,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
        apt-get update && \
        apt-get install -y --no-install-recommends \
            hipblas-dev \
+            hipblaslt-dev \
            rocblas-dev && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/* && \
--- a/25
+++ b/25
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/buun-llama-cpp backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad backends/sherpa-onnx
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad backends/sherpa-onnx

 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -545,19 +545,6 @@ test-extra-backend-turboquant: docker-build-turboquant
 	BACKEND_TEST_CACHE_TYPE_V=turbo3 \
 	$(MAKE) test-extra-backend

-## buun-llama-cpp: exercises the fork-of-a-fork backend (spiritbuun/buun-llama-cpp)
-## with the *TurboQuant/TCQ-specific* KV-cache types (turbo3 for V). Same rationale
-## as turboquant above: picking a standard llama.cpp type would only re-test the
-## shared code path. buun inherits turboquant's turbo2/turbo3/turbo4 and adds
-## turbo2_tcq / turbo3_tcq on top. DFlash speculative decoding is not exercised
-## here because no small DFlash drafter model exists (the known public pair is
-## Qwen3.5-27B, ~54 GB).
-test-extra-backend-buun-llama-cpp: docker-build-buun-llama-cpp
-	BACKEND_IMAGE=local-ai-backend:buun-llama-cpp \
-	BACKEND_TEST_CACHE_TYPE_K=q8_0 \
-	BACKEND_TEST_CACHE_TYPE_V=turbo3 \
-	$(MAKE) test-extra-backend
-
 ## Audio transcription wrapper for the llama-cpp backend.
 ## Drives the new AudioTranscription / AudioTranscriptionStream RPCs against
 ## ggml-org/Qwen3-ASR-0.6B-GGUF (a small ASR model that requires its mmproj
@@ -896,7 +883,7 @@ docker-cuda12:

 docker-image-intel:
 	docker build \
-		--build-arg BASE_IMAGE=intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04 \
+		--build-arg BASE_IMAGE=intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04 \
 		--build-arg IMAGE_TYPE=$(IMAGE_TYPE) \
 		--build-arg GO_TAGS="$(GO_TAGS)" \
 		--build-arg MAKEFLAGS="$(DOCKER_MAKEFLAGS)" \
@@ -962,11 +949,6 @@ BACKEND_IK_LLAMA_CPP = ik-llama-cpp|ik-llama-cpp|.|false|false
 # turboquant is a llama.cpp fork with TurboQuant KV-cache quantization.
 # Reuses backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile.
 BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
-# buun-llama-cpp is a fork-of-a-fork (spiritbuun/buun-llama-cpp forks
-# TheTom/llama-cpp-turboquant) that adds DFlash block-diffusion speculative
-# decoding and extra TCQ KV-cache variants on top of TurboQuant. Same thin
-# wrapper pattern as turboquant — reuses backend/cpp/llama-cpp grpc-server.
-BACKEND_BUUN_LLAMA_CPP = buun-llama-cpp|buun-llama-cpp|.|false|false

 # Golang backends
 BACKEND_PIPER = piper|golang|.|false|true
@@ -1047,7 +1029,6 @@ endef
 $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
-$(eval $(call generate-docker-build-target,$(BACKEND_BUUN_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
@@ -1099,7 +1080,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
 docker-save-%: backend-images
 	docker save local-ai-backend:$* -o backend-images/$*.tar

-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-buun-llama-cpp docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx

 ########################################################
 ### Mock Backend for E2E Tests
--- a/backend/Dockerfile.buun-llama-cpp
+++ b/backend/Dockerfile.buun-llama-cpp
@@ -1,290 +0,0 @@
-ARG BASE_IMAGE=ubuntu:24.04
-ARG GRPC_BASE_IMAGE=${BASE_IMAGE}
-
-
-# The grpc target does one thing, it builds and installs GRPC.  This is in it's own layer so that it can be effectively cached by CI.
-# You probably don't need to change anything here, and if you do, make sure that CI is adjusted so that the cache continues to work.
-FROM ${GRPC_BASE_IMAGE} AS grpc
-
-# This is a bit of a hack, but it's required in order to be able to effectively cache this layer in CI
-ARG GRPC_MAKEFLAGS="-j4 -Otarget"
-ARG GRPC_VERSION=v1.65.0
-ARG CMAKE_FROM_SOURCE=false
-# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues
-ARG CMAKE_VERSION=3.31.10
-
-ENV MAKEFLAGS=${GRPC_MAKEFLAGS}
-
-WORKDIR /build
-
-RUN apt-get update && \
-    apt-get install -y --no-install-recommends \
-        ca-certificates \
-        build-essential curl libssl-dev \
-        git wget && \
-    apt-get clean && \
-    rm -rf /var/lib/apt/lists/*
-
-# Install CMake (the version in 22.04 is too old)
-RUN <<EOT bash
-    if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
-        curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
-    else
-        apt-get update && \
-        apt-get install -y \
-            cmake && \
-        apt-get clean && \
-        rm -rf /var/lib/apt/lists/*
-    fi
-EOT
-
-# We install GRPC to a different prefix here so that we can copy in only the build artifacts later
-# saves several hundred MB on the final docker image size vs copying in the entire GRPC source tree
-# and running make install in the target container
-RUN git clone --recurse-submodules --jobs 4 -b ${GRPC_VERSION} --depth 1 --shallow-submodules https://github.com/grpc/grpc && \
-    mkdir -p /build/grpc/cmake/build && \
-    cd /build/grpc/cmake/build && \
-    sed -i "216i\  TESTONLY" "../../third_party/abseil-cpp/absl/container/CMakeLists.txt" && \
-    cmake -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX:PATH=/opt/grpc ../.. && \
-    make && \
-    make install && \
-    rm -rf /build
-
-FROM ${BASE_IMAGE} AS builder
-ARG CMAKE_FROM_SOURCE=false
-ARG CMAKE_VERSION=3.31.10
-# We can target specific CUDA ARCHITECTURES like --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
-ARG CUDA_DOCKER_ARCH
-ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
-ARG CMAKE_ARGS
-ENV CMAKE_ARGS=${CMAKE_ARGS}
-ARG BACKEND=rerankers
-ARG BUILD_TYPE
-ENV BUILD_TYPE=${BUILD_TYPE}
-ARG CUDA_MAJOR_VERSION
-ARG CUDA_MINOR_VERSION
-ARG SKIP_DRIVERS=false
-ENV CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION}
-ENV CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION}
-ENV DEBIAN_FRONTEND=noninteractive
-ARG TARGETARCH
-ARG TARGETVARIANT
-ARG GO_VERSION=1.25.4
-ARG UBUNTU_VERSION=2404
-
-RUN apt-get update && \
-    apt-get install -y --no-install-recommends \
-        build-essential \
-        ccache git \
-        ca-certificates \
-        make \
-        pkg-config libcurl4-openssl-dev \
-        curl unzip \
-        libssl-dev wget && \
-    apt-get clean && \
-    rm -rf /var/lib/apt/lists/*
-
-# Cuda
-ENV PATH=/usr/local/cuda/bin:${PATH}
-
-# HipBLAS requirements
-ENV PATH=/opt/rocm/bin:${PATH}
-
-
-# Vulkan requirements
-RUN <<EOT bash
-    if [ "${BUILD_TYPE}" = "vulkan" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
-        apt-get update && \
-        apt-get install -y  --no-install-recommends \
-            software-properties-common pciutils wget gpg-agent && \
-        apt-get install -y libglm-dev cmake libxcb-dri3-0 libxcb-present0 libpciaccess0 \
-            libpng-dev libxcb-keysyms1-dev libxcb-dri3-dev libx11-dev g++ gcc \
-            libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
-            git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
-            ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
-            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
-        if [ "amd64" = "$TARGETARCH" ]; then
-            wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
-            tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
-            rm vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
-            mkdir -p /opt/vulkan-sdk && \
-            mv 1.4.335.0 /opt/vulkan-sdk/ && \
-            cd /opt/vulkan-sdk/1.4.335.0 && \
-            ./vulkansdk --no-deps --maxjobs \
-                vulkan-loader \
-                vulkan-validationlayers \
-                vulkan-extensionlayer \
-                vulkan-tools \
-                shaderc && \
-            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/bin/* /usr/bin/ && \
-            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/lib/* /usr/lib/x86_64-linux-gnu/ && \
-            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/include/* /usr/include/ && \
-            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/share/* /usr/share/ && \
-            rm -rf /opt/vulkan-sdk
-        fi
-        if [ "arm64" = "$TARGETARCH" ]; then
-            mkdir vulkan && cd vulkan && \
-            curl -L -o vulkan-sdk.tar.xz https://github.com/mudler/vulkan-sdk-arm/releases/download/1.4.335.0/vulkansdk-ubuntu-24.04-arm-1.4.335.0.tar.xz && \
-            tar -xvf vulkan-sdk.tar.xz && \
-            rm vulkan-sdk.tar.xz && \
-            cd 1.4.335.0 && \
-            cp -rfv aarch64/bin/* /usr/bin/ && \
-            cp -rfv aarch64/lib/* /usr/lib/aarch64-linux-gnu/ && \
-            cp -rfv aarch64/include/* /usr/include/ && \
-            cp -rfv aarch64/share/* /usr/share/ && \
-            cd ../.. && \
-            rm -rf vulkan
-        fi
-        ldconfig && \
-        apt-get clean && \
-        rm -rf /var/lib/apt/lists/*
-    fi
-EOT
-
-# CuBLAS requirements
-RUN <<EOT bash
-    if ( [ "${BUILD_TYPE}" = "cublas" ] || [ "${BUILD_TYPE}" = "l4t" ] ) && [ "${SKIP_DRIVERS}" = "false" ]; then
-        apt-get update && \
-        apt-get install -y  --no-install-recommends \
-            software-properties-common pciutils
-        if [ "amd64" = "$TARGETARCH" ]; then
-            curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/x86_64/cuda-keyring_1.1-1_all.deb
-        fi
-        if [ "arm64" = "$TARGETARCH" ]; then
-            if [ "${CUDA_MAJOR_VERSION}" = "13" ]; then
-                curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/sbsa/cuda-keyring_1.1-1_all.deb
-            else
-                curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/arm64/cuda-keyring_1.1-1_all.deb
-            fi
-        fi
-        dpkg -i cuda-keyring_1.1-1_all.deb && \
-        rm -f cuda-keyring_1.1-1_all.deb && \
-        apt-get update && \
-        apt-get install -y --no-install-recommends \
-            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
-            libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
-            libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
-            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
-            libcusparse-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
-            libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
-        if [ "${CUDA_MAJOR_VERSION}" = "13" ] && [ "arm64" = "$TARGETARCH" ]; then
-            apt-get install -y --no-install-recommends \
-            libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libcudnn9-cuda-${CUDA_MAJOR_VERSION} cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
-        fi
-        apt-get clean && \
-        rm -rf /var/lib/apt/lists/*
-    fi
-EOT
-
-
-# https://github.com/NVIDIA/Isaac-GR00T/issues/343
-RUN <<EOT bash
-    if [ "${BUILD_TYPE}" = "cublas" ] && [ "${TARGETARCH}" = "arm64" ]; then
-        wget https://developer.download.nvidia.com/compute/cudss/0.6.0/local_installers/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
-        dpkg -i cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
-        cp /var/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0/cudss-*-keyring.gpg /usr/share/keyrings/ && \
-        apt-get update && apt-get -y install cudss cudss-cuda-${CUDA_MAJOR_VERSION} && \
-        wget https://developer.download.nvidia.com/compute/nvpl/25.5/local_installers/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
-        dpkg -i nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
-        cp /var/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5/nvpl-*-keyring.gpg /usr/share/keyrings/ && \
-        apt-get update && apt-get install -y nvpl
-    fi
-EOT
-
-# If we are building with clblas support, we need the libraries for the builds
-RUN if [ "${BUILD_TYPE}" = "clblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
-        apt-get update && \
-        apt-get install -y --no-install-recommends \
-            libclblast-dev && \
-        apt-get clean && \
-        rm -rf /var/lib/apt/lists/* \
-    ; fi
-
-RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
-        apt-get update && \
-        apt-get install -y --no-install-recommends \
-            hipblas-dev \
-            rocblas-dev && \
-        apt-get clean && \
-        rm -rf /var/lib/apt/lists/* && \
-        # I have no idea why, but the ROCM lib packages don't trigger ldconfig after they install, which results in local-ai and others not being able
-        # to locate the libraries. We run ldconfig ourselves to work around this packaging deficiency
-        ldconfig && \
-        # Log which GPU architectures have rocBLAS kernel support
-        echo "rocBLAS library data architectures:" && \
-        (ls /opt/rocm*/lib/rocblas/library/Kernels* 2>/dev/null || ls /opt/rocm*/lib64/rocblas/library/Kernels* 2>/dev/null) | grep -oP 'gfx[0-9a-z+-]+' | sort -u || \
-        echo "WARNING: No rocBLAS kernel data found" \
-    ; fi
-
-RUN echo "TARGETARCH: $TARGETARCH"
-
-# We need protoc installed, and the version in 22.04 is too old.  We will create one as part installing the GRPC build below
-# but that will also being in a newer version of absl which stablediffusion cannot compile with.  This version of protoc is only
-# here so that we can generate the grpc code for the stablediffusion build
-RUN <<EOT bash
-    if [ "amd64" = "$TARGETARCH" ]; then
-        curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-x86_64.zip -o protoc.zip && \
-        unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
-        rm protoc.zip
-    fi
-    if [ "arm64" = "$TARGETARCH" ]; then
-        curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-aarch_64.zip -o protoc.zip && \
-        unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
-        rm protoc.zip
-    fi
-EOT
-
-# Install CMake (the version in 22.04 is too old)
-RUN <<EOT bash
-    if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
-        curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
-    else
-        apt-get update && \
-        apt-get install -y \
-            cmake && \
-        apt-get clean && \
-        rm -rf /var/lib/apt/lists/*
-    fi
-EOT
-
-COPY --from=grpc /opt/grpc /usr/local
-
-
-COPY . /LocalAI
-
-RUN <<'EOT' bash
-set -euxo pipefail
-
-if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
-  CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
-  export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
-  echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
-  rm -rf /LocalAI/backend/cpp/buun-llama-cpp-*-build
-fi
-
-cd /LocalAI/backend/cpp/buun-llama-cpp
-
-if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
-  make buun-llama-cpp-fallback
-  make buun-llama-cpp-grpc
-  make buun-llama-cpp-rpc-server
-else
-  make buun-llama-cpp-avx
-  make buun-llama-cpp-avx2
-  make buun-llama-cpp-avx512
-  make buun-llama-cpp-fallback
-  make buun-llama-cpp-grpc
-  make buun-llama-cpp-rpc-server
-fi
-EOT
-
-
-# Copy libraries using a script to handle architecture differences
-RUN make -BC /LocalAI/backend/cpp/buun-llama-cpp package
-
-
-FROM scratch
-
-
-# Copy all available binaries (the build process only creates the appropriate ones for the target architecture)
-COPY --from=builder /LocalAI/backend/cpp/buun-llama-cpp/package/. ./
--- a/backend/Dockerfile.golang
+++ b/backend/Dockerfile.golang
@@ -147,6 +147,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
        apt-get update && \
        apt-get install -y --no-install-recommends \
            hipblas-dev \
+            hipblaslt-dev \
            rocblas-dev && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/* && \
--- a/backend/Dockerfile.ik-llama-cpp
+++ b/backend/Dockerfile.ik-llama-cpp
@@ -204,6 +204,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
        apt-get update && \
        apt-get install -y --no-install-recommends \
            hipblas-dev \
+            hipblaslt-dev \
            rocblas-dev && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/* && \
--- a/backend/Dockerfile.llama-cpp
+++ b/backend/Dockerfile.llama-cpp
@@ -206,6 +206,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
        apt-get update && \
        apt-get install -y --no-install-recommends \
            hipblas-dev \
+            hipblaslt-dev \
            rocblas-dev && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/* && \
--- a/backend/Dockerfile.python
+++ b/backend/Dockerfile.python
@@ -162,6 +162,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
        apt-get update && \
        apt-get install -y --no-install-recommends \
            hipblas-dev \
+            hipblaslt-dev \
            rocblas-dev && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/* && \
@@ -202,6 +203,13 @@ COPY scripts/build/package-gpu-libs.sh /package-gpu-libs.sh
 ARG FROM_SOURCE=""
 ENV FROM_SOURCE=${FROM_SOURCE}

+# Cache-buster for the per-backend `make` step. Most Python backends list
+# unpinned deps (torch, transformers, vllm, ...), so a warm registry cache
+# would otherwise freeze upstream versions indefinitely. CI passes a value
+# that rolls weekly so the install layer is rebuilt at most once per week
+# and picks up newer wheels from PyPI / nightly indexes.
+ARG DEPS_REFRESH=initial
+
 RUN cd /${BACKEND} && PORTABLE_PYTHON=true make

 # Package GPU libraries into the backend's lib directory
@@ -216,4 +224,4 @@ RUN if [ -f "/${BACKEND}/package.sh" ]; then \

 FROM scratch
 ARG BACKEND=rerankers
-COPY --from=builder /${BACKEND}/ /
+COPY --from=builder /${BACKEND}/ /
--- a/backend/Dockerfile.turboquant
+++ b/backend/Dockerfile.turboquant
@@ -204,6 +204,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
        apt-get update && \
        apt-get install -y --no-install-recommends \
            hipblas-dev \
+            hipblaslt-dev \
            rocblas-dev && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/* && \
--- a/backend/cpp/buun-llama-cpp/Makefile
+++ b/backend/cpp/buun-llama-cpp/Makefile
@@ -1,85 +0,0 @@
-
-# Pinned to the HEAD of master on https://github.com/spiritbuun/buun-llama-cpp.
-# Auto-bumped nightly by .github/workflows/bump_deps.yaml.
-BUUN_LLAMA_VERSION?=22464d0848b87c5d56b52fdf6af2e5da46bf803e
-LLAMA_REPO?=https://github.com/spiritbuun/buun-llama-cpp
-
-CMAKE_ARGS?=
-BUILD_TYPE?=
-NATIVE?=false
-ONEAPI_VARS?=/opt/intel/oneapi/setvars.sh
-TARGET?=--target grpc-server
-JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1)
-ARCH?=$(shell uname -m)
-
-CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
-LLAMA_CPP_DIR := $(CURRENT_MAKEFILE_DIR)/../llama-cpp
-
-GREEN := \033[0;32m
-RESET := \033[0m
-
-# buun-llama-cpp is a llama.cpp fork-of-a-fork (spiritbuun/buun-llama-cpp forked
-# TheTom/llama-cpp-turboquant, which itself forked ggml-org/llama.cpp). Rather
-# than duplicating grpc-server.cpp / CMakeLists.txt / prepare.sh we reuse the
-# ones in backend/cpp/llama-cpp, and only swap which repo+sha the fetch step
-# pulls. Each flavor target copies ../llama-cpp into a sibling
-# ../buun-llama-cpp-<flavor>-build directory, then invokes llama-cpp's own
-# build-llama-cpp-grpc-server with LLAMA_REPO/LLAMA_VERSION overridden to point
-# at the fork.
-PATCHES_DIR := $(CURRENT_MAKEFILE_DIR)/patches
-
-# Each flavor target:
-#   1. copies backend/cpp/llama-cpp/ (grpc-server.cpp + prepare.sh + CMakeLists.txt + Makefile)
-#      into a sibling buun-llama-cpp-<flavor>-build directory;
-#   2. clones the buun fork into buun-llama-cpp-<flavor>-build/llama.cpp via the
-#      copy's own `llama.cpp` target, overriding LLAMA_REPO/LLAMA_VERSION;
-#   3. applies patches from backend/cpp/buun-llama-cpp/patches/ to the cloned
-#      fork sources (for backporting upstream commits the fork hasn't pulled);
-#   4. runs the copy's `grpc-server` target, which produces the binary we copy
-#      up as buun-llama-cpp-<flavor>.
-define buun-llama-cpp-build
-	rm -rf $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build
-	cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build
-	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build purge
-	# Augment the copied grpc-server.cpp's KV-cache allow-list with the
-	# fork's turbo2/turbo3/turbo4/turbo2_tcq/turbo3_tcq types and wire up the
-	# DFlash-specific option handlers (tree_budget / draft_topk). We patch the
-	# *copy*, never the original under backend/cpp/llama-cpp/, so the stock
-	# llama-cpp build stays compiling against vanilla upstream.
-	bash $(CURRENT_MAKEFILE_DIR)/patch-grpc-server.sh $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build/grpc-server.cpp
-	$(info $(GREEN)I buun-llama-cpp build info:$(1)$(RESET))
-	LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(BUUN_LLAMA_VERSION) \
-	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build llama.cpp
-	bash $(CURRENT_MAKEFILE_DIR)/apply-patches.sh $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build/llama.cpp $(PATCHES_DIR)
-	CMAKE_ARGS="$(CMAKE_ARGS) $(2)" TARGET="$(3)" \
-	LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(BUUN_LLAMA_VERSION) \
-	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build grpc-server
-	cp -rfv $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build/grpc-server buun-llama-cpp-$(1)
-endef
-
-buun-llama-cpp-avx2:
-	$(call buun-llama-cpp-build,avx2,-DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on,--target grpc-server)
-
-buun-llama-cpp-avx512:
-	$(call buun-llama-cpp-build,avx512,-DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on,--target grpc-server)
-
-buun-llama-cpp-avx:
-	$(call buun-llama-cpp-build,avx,-DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
-
-buun-llama-cpp-fallback:
-	$(call buun-llama-cpp-build,fallback,-DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
-
-buun-llama-cpp-grpc:
-	$(call buun-llama-cpp-build,grpc,-DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server --target rpc-server)
-
-buun-llama-cpp-rpc-server: buun-llama-cpp-grpc
-	cp -rf $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-grpc-build/llama.cpp/build/bin/rpc-server buun-llama-cpp-rpc-server
-
-package:
-	bash package.sh
-
-purge:
-	rm -rf $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-*-build
-	rm -rf buun-llama-cpp-* package
-
-clean: purge
--- a/backend/cpp/buun-llama-cpp/apply-patches.sh
+++ b/backend/cpp/buun-llama-cpp/apply-patches.sh
@@ -1,50 +0,0 @@
-#!/bin/bash
-# Apply the buun-llama-cpp patch series to a cloned buun-llama-cpp checkout.
-#
-# buun-llama-cpp is a fork-of-a-fork that branched off upstream llama.cpp
-# before some API changes the shared backend/cpp/llama-cpp/grpc-server.cpp
-# depends on. We carry those upstream commits as patch files under
-# backend/cpp/buun-llama-cpp/patches/ and apply them here so the reused
-# grpc-server source compiles against the fork unmodified.
-#
-# Drop the corresponding patch from patches/ whenever the fork catches up with
-# upstream — the build will fail fast if a patch stops applying, which is the
-# signal to retire it.
-
-set -euo pipefail
-
-if [[ $# -ne 2 ]]; then
-    echo "usage: $0 <llama.cpp-src-dir> <patches-dir>" >&2
-    exit 2
-fi
-
-SRC_DIR=$1
-PATCHES_DIR=$2
-
-if [[ ! -d "$SRC_DIR" ]]; then
-    echo "source dir does not exist: $SRC_DIR" >&2
-    exit 2
-fi
-
-if [[ ! -d "$PATCHES_DIR" ]]; then
-    echo "no patches dir at $PATCHES_DIR, nothing to apply"
-    exit 0
-fi
-
-shopt -s nullglob
-patches=("$PATCHES_DIR"/*.patch)
-shopt -u nullglob
-
-if [[ ${#patches[@]} -eq 0 ]]; then
-    echo "no .patch files in $PATCHES_DIR, nothing to apply"
-    exit 0
-fi
-
-cd "$SRC_DIR"
-
-for patch in "${patches[@]}"; do
-    echo "==> applying $patch"
-    git apply --verbose "$patch"
-done
-
-echo "all buun-llama-cpp patches applied successfully"
--- a/backend/cpp/buun-llama-cpp/package.sh
+++ b/backend/cpp/buun-llama-cpp/package.sh
@@ -1,57 +0,0 @@
-#!/bin/bash
-
-# Script to copy the appropriate libraries based on architecture
-# This script is used in the final stage of the Dockerfile
-
-set -e
-
-CURDIR=$(dirname "$(realpath $0)")
-REPO_ROOT="${CURDIR}/../../.."
-
-# Create lib directory
-mkdir -p $CURDIR/package/lib
-
-cp -avrf $CURDIR/buun-llama-cpp-* $CURDIR/package/
-cp -rfv $CURDIR/run.sh $CURDIR/package/
-
-# Detect architecture and copy appropriate libraries
-if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
-    # x86_64 architecture
-    echo "Detected x86_64 architecture, copying x86_64 libraries..."
-    cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
-    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
-    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
-    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
-    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
-    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
-    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
-    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
-    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
-elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
-    # ARM64 architecture
-    echo "Detected ARM64 architecture, copying ARM64 libraries..."
-    cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
-    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
-    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
-    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
-    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
-    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
-    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
-    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
-    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
-else
-    echo "Error: Could not detect architecture"
-    exit 1
-fi
-
-# Package GPU libraries based on BUILD_TYPE
-GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
-if [ -f "$GPU_LIB_SCRIPT" ]; then
-    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
-    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
-    package_gpu_libs
-fi
-
-echo "Packaging completed successfully"
-ls -liah $CURDIR/package/
-ls -liah $CURDIR/package/lib/
--- a/backend/cpp/buun-llama-cpp/patch-grpc-server.sh
+++ b/backend/cpp/buun-llama-cpp/patch-grpc-server.sh
@@ -1,162 +0,0 @@
-#!/bin/bash
-# Patch the shared backend/cpp/llama-cpp/grpc-server.cpp *copy* used by the
-# buun-llama-cpp build to account for three gaps between upstream and the fork:
-#
-#   1. Augment the kv_cache_types[] allow-list so `LoadModel` accepts the
-#      fork-specific `turbo2` / `turbo3` / `turbo4` cache types plus the buun
-#      additions `turbo2_tcq` / `turbo3_tcq`.
-#
-#   2. Wire up buun-exclusive speculative-decoding option handlers
-#      (tree_budget / draft_topk) alongside the existing spec_* handlers.
-#      These reference struct fields (common_params.speculative.tree_budget
-#      and .draft_topk) that only exist in buun's common/common.h — adding
-#      them to the shared backend/cpp/llama-cpp/grpc-server.cpp would break
-#      the stock llama-cpp build, so we inject them only into the buun copy.
-#
-#   3. Replace `get_media_marker()` (added upstream in ggml-org/llama.cpp#21962,
-#      server-side random per-instance marker) with the legacy "<__media__>"
-#      literal. The fork branched before that PR, so server-common.cpp has no
-#      get_media_marker symbol. The fork's mtmd_default_marker() still returns
-#      "<__media__>", and Go-side tooling falls back to that sentinel when the
-#      backend does not expose media_marker, so substituting the literal keeps
-#      behavior identical on the buun path.
-#
-# We patch the *copy* sitting in buun-llama-cpp-<flavor>-build/, never the
-# original under backend/cpp/llama-cpp/, so the stock llama-cpp build keeps
-# compiling against vanilla upstream.
-#
-# Idempotent: skips each insertion if its marker is already present (so re-runs
-# of the same build dir don't double-insert).
-
-set -euo pipefail
-
-if [[ $# -ne 1 ]]; then
-    echo "usage: $0 <grpc-server.cpp>" >&2
-    exit 2
-fi
-
-SRC=$1
-
-if [[ ! -f "$SRC" ]]; then
-    echo "grpc-server.cpp not found at $SRC" >&2
-    exit 2
-fi
-
-if grep -q 'GGML_TYPE_TURBO2_TCQ' "$SRC"; then
-    echo "==> $SRC already has buun cache types, skipping KV allow-list patch"
-else
-    echo "==> patching $SRC to allow turbo2/turbo3/turbo4/turbo2_tcq/turbo3_tcq KV-cache types"
-
-    # Insert the five TURBO entries right after the first `    GGML_TYPE_Q5_1,`
-    # line (the kv_cache_types[] allow-list). Using awk because the builder
-    # image does not ship python3, and GNU sed's multi-line `a\` quoting is
-    # awkward.
-    awk '
-        /^    GGML_TYPE_Q5_1,$/ && !done {
-            print
-            print "    // buun-llama-cpp fork extras — added by patch-grpc-server.sh"
-            print "    GGML_TYPE_TURBO2_0,"
-            print "    GGML_TYPE_TURBO3_0,"
-            print "    GGML_TYPE_TURBO4_0,"
-            print "    GGML_TYPE_TURBO2_TCQ,"
-            print "    GGML_TYPE_TURBO3_TCQ,"
-            done = 1
-            next
-        }
-        { print }
-        END {
-            if (!done) {
-                print "patch-grpc-server.sh: anchor `    GGML_TYPE_Q5_1,` not found" > "/dev/stderr"
-                exit 1
-            }
-        }
-    ' "$SRC" > "$SRC.tmp"
-    mv "$SRC.tmp" "$SRC"
-
-    echo "==> KV allow-list patch OK"
-fi
-
-if grep -q 'optname, "tree_budget"' "$SRC"; then
-    echo "==> $SRC already has DFlash option handlers, skipping"
-else
-    echo "==> patching $SRC to add tree_budget / draft_topk option handlers"
-
-    # Insert two new `else if` handlers between the inner close-brace of the
-    # `spec_p_split` block and the next `} else if (…spec_ngram_size_n…)` line.
-    # Upstream writes each `} else if` as a single physical line, so we don't
-    # emit an outer `}` ourselves — the existing next line provides both the
-    # close of our `draft_topk` block and the open of `spec_ngram_size_n`.
-    # Anchor on the exact 3-line body of spec_p_split so we can't drift.
-    awk '
-        prev2 == "        } else if (!strcmp(optname, \"spec_p_split\")) {" &&
-        prev1 ~ /^ +if \(optval != NULL\) \{$/ &&
-        $0    ~ /^ +try \{ params\.speculative\.p_split = std::stof\(optval_str\); \} catch \(\.\.\.\) \{\}$/ &&
-        !done {
-            print                        # print the try-line itself
-            getline inner_close          # read "            }" closing the inner if
-            print inner_close            # print it — this closes spec_p_split body
-            print "        // buun-llama-cpp DFlash options — added by patch-grpc-server.sh"
-            print "        } else if (!strcmp(optname, \"tree_budget\")) {"
-            print "            if (optval != NULL) {"
-            print "                try { params.speculative.tree_budget = std::stoi(optval_str); } catch (...) {}"
-            print "            }"
-            print "        } else if (!strcmp(optname, \"draft_topk\")) {"
-            print "            if (optval != NULL) {"
-            print "                try { params.speculative.draft_topk = std::stoi(optval_str); } catch (...) {}"
-            print "            }"
-            # The next source line (`} else if (…spec_ngram_size_n…) {`) closes
-            # our draft_topk block and continues the chain naturally; fall back
-            # into the main loop to emit it and everything after.
-            done = 1
-            prev2 = prev1
-            prev1 = inner_close
-            next
-        }
-        { print; prev2 = prev1; prev1 = $0 }
-        END {
-            if (!done) {
-                print "patch-grpc-server.sh: spec_p_split anchor not found" > "/dev/stderr"
-                exit 1
-            }
-        }
-    ' "$SRC" > "$SRC.tmp"
-    mv "$SRC.tmp" "$SRC"
-
-    echo "==> DFlash option-handler patch OK"
-fi
-
-if grep -qE 'ctx_server\.get_meta\(\)\.logit_bias_eog|params_base\.sampling\.logit_bias_eog,' "$SRC"; then
-    echo "==> patching $SRC to drop the logit_bias_eog arg from params_from_json_cmpl() callsites (buun still uses the pre-refactor 4-arg signature)"
-    # Upstream llama.cpp refactored params_from_json_cmpl to take a precomputed
-    # logit_bias_eog vector after buun's 2026-04-05 fork-point — simultaneously
-    # adding server_context_meta::logit_bias_eog as the supplier. Buun carries
-    # neither change: its params_from_json_cmpl is still 4-arg, and internally
-    # derives logit_bias_eog from the common_params it's passed. So we just
-    # delete the argument line entirely — the remaining 4 args match buun's
-    # signature and the resulting behavior matches upstream bit-for-bit
-    # (upstream's 5th arg is the same data buun derives internally).
-    #
-    # Guard is broad so this works whether the line has been run through this
-    # block before (leaving params_base.sampling.logit_bias_eog,) or not
-    # (leaving the original ctx_server.get_meta().logit_bias_eog,).
-    sed -E '/^[[:space:]]+(ctx_server\.get_meta\(\)\.logit_bias_eog|params_base\.sampling\.logit_bias_eog),$/d' "$SRC" > "$SRC.tmp"
-    mv "$SRC.tmp" "$SRC"
-    echo "==> logit_bias_eog arg drop OK"
-else
-    echo "==> $SRC has no logit_bias_eog arg line, skipping"
-fi
-
-if grep -q 'get_media_marker()' "$SRC"; then
-    echo "==> patching $SRC to replace get_media_marker() with legacy \"<__media__>\" literal"
-    # Only one call site today (ModelMetadata), but replace all occurrences to
-    # stay robust if upstream adds more. Use a temp file to avoid relying on
-    # sed -i portability (the builder image uses GNU sed, but keeping this
-    # consistent with the awk block above).
-    sed 's/get_media_marker()/"<__media__>"/g' "$SRC" > "$SRC.tmp"
-    mv "$SRC.tmp" "$SRC"
-    echo "==> get_media_marker() substitution OK"
-else
-    echo "==> $SRC has no get_media_marker() call, skipping media-marker patch"
-fi
-
-echo "==> all patches applied"
--- a/backend/cpp/buun-llama-cpp/patches/0001-fattn-atomicAdd-double-shim.patch
+++ b/backend/cpp/buun-llama-cpp/patches/0001-fattn-atomicAdd-double-shim.patch
@@ -1,46 +0,0 @@
-Subject: [PATCH] ggml-cuda/fattn: provide atomicAdd(double*,double) shim for pre-sm_60
-
-Buun's Q² calibration path in ggml_cuda_turbo_scale_q calls
-  atomicAdd(&d_q_channel_sq_fattn[threadIdx.x], (double)(val * val));
-but native double atomicAdd is only available on compute capability 6.0
-and newer. Compiling against a CUDA arch list that includes older
-architectures (LocalAI's CUDA 12 Docker image builds for the full
-published arch range) fails with:
-
-    fattn.cu(812): error: no instance of overloaded function "atomicAdd"
-      matches the argument list, argument types are: (double *, double)
-
-Add the canonical CUDA-programming-guide shim at the top of fattn.cu so
-pre-sm_60 codegen has a definition to call. On sm_60+ the native CUDA
-intrinsic is used and the shim is elided via __CUDA_ARCH__.
-
---- a/ggml/src/ggml-cuda/fattn.cu
-+++ b/ggml/src/ggml-cuda/fattn.cu
-@@ -7,6 +7,27 @@
-
- #include <atomic>
-
-+// Pre-sm_60 double atomicAdd shim. Native double atomicAdd(double*,double)
-+// is only available on CUDA compute capability 6.0+ (see CUDA C Programming
-+// Guide, B.15 Atomic Functions). Buun's Q² calibration path below calls
-+// atomicAdd with a double*; without this definition, nvcc fails to find a
-+// matching overload whenever the compile target list includes pre-sm_60
-+// architectures. The standard CAS loop implementation below matches the
-+// semantics of the native intrinsic.
-+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 600
-+static __device__ double atomicAdd(double * address, double val) {
-+    unsigned long long int * address_as_ull = (unsigned long long int *)address;
-+    unsigned long long int old = *address_as_ull;
-+    unsigned long long int assumed;
-+    do {
-+        assumed = old;
-+        old = atomicCAS(address_as_ull, assumed,
-+                        __double_as_longlong(val + __longlong_as_double(assumed)));
-+    } while (assumed != old);
-+    return __longlong_as_double(old);
-+}
-+#endif
-+
- // InnerQ: update the fattn-side inverse scale array from host (all devices)
- void turbo_innerq_update_fattn_scales(const float * scale_inv) {
-     int cur_device;
--- a/backend/cpp/buun-llama-cpp/patches/0002-argmax-shfl-xor-sync-add-width.patch
+++ b/backend/cpp/buun-llama-cpp/patches/0002-argmax-shfl-xor-sync-add-width.patch
@@ -1,32 +0,0 @@
-Subject: [PATCH] ggml-cuda/argmax: pass WARP_SIZE to the top-K __shfl_xor_sync calls
-
-Two __shfl_xor_sync calls in the top-K intra-warp merge drop the `width`
-argument and rely on the CUDA default (warpSize). Every other call in
-the same file already passes WARP_SIZE explicitly, and the HIP/ROCm
-compatibility shim at ggml/src/ggml-cuda/vendors/hip.h:33 is a 4-arg
-function-like macro — so the 3-arg form fails to preprocess when
-building with hipcc against ROCm:
-
-    argmax.cu:265: error: too few arguments provided to function-like
-      macro invocation
-    note: macro '__shfl_xor_sync' defined here:
-      #define __shfl_xor_sync(mask, var, laneMask, width) \
-              __shfl_xor(var, laneMask, width)
-
-Align the two call sites with the rest of the file by passing WARP_SIZE
-explicitly. On CUDA the generated code is unchanged (warpSize is the
-default); on HIP it now matches the macro's arity.
-
---- a/ggml/src/ggml-cuda/argmax.cu
-+++ b/ggml/src/ggml-cuda/argmax.cu
-@@ -262,8 +262,8 @@
-     // Each step: lane gets partner's min element, if it beats our min, replace and re-heapify
-     for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1) {
-         for (int i = 0; i < K; i++) {
--            float partner_val = __shfl_xor_sync(0xFFFFFFFF, heap_val[i], offset);
--            int partner_idx = __shfl_xor_sync(0xFFFFFFFF, heap_idx[i], offset);
-+            float partner_val = __shfl_xor_sync(0xFFFFFFFF, heap_val[i], offset, WARP_SIZE);
-+            int partner_idx = __shfl_xor_sync(0xFFFFFFFF, heap_idx[i], offset, WARP_SIZE);
-             if (partner_val > heap_val[0]) {
-                 heap_val[0] = partner_val;
-                 heap_idx[0] = partner_idx;
--- a/backend/cpp/buun-llama-cpp/patches/0003-hip-add-memcpy-symbol-aliases.patch
+++ b/backend/cpp/buun-llama-cpp/patches/0003-hip-add-memcpy-symbol-aliases.patch
@@ -1,24 +0,0 @@
-Subject: [PATCH] ggml-cuda/vendors/hip: alias cudaMemcpy{To,From}Symbol to hip counterparts
-
-Buun's Q² calibration + TCQ codebook upload paths in fattn.cu use
-cudaMemcpyToSymbol / cudaMemcpyFromSymbol. The HIP-compat header in
-ggml/src/ggml-cuda/vendors/hip.h already aliases the scalar cudaMemcpy
-family (cudaMemcpy, cudaMemcpyAsync, cudaMemcpy2DAsync, …) but is
-missing the symbol variants. Building with hipcc therefore fails with
-15+ "use of undeclared identifier 'cudaMemcpyToSymbol'" errors.
-
-Add the two missing aliases alongside the existing memcpy block. HIP
-provides hipMemcpy{To,From}Symbol with the same signature as CUDA's
-equivalents, so this is a straight name substitution.
-
---- a/ggml/src/ggml-cuda/vendors/hip.h
-+++ b/ggml/src/ggml-cuda/vendors/hip.h
-@@ -85,6 +85,8 @@
- #define cudaMemcpyDeviceToDevice hipMemcpyDeviceToDevice
- #define cudaMemcpyDeviceToHost hipMemcpyDeviceToHost
- #define cudaMemcpyHostToDevice hipMemcpyHostToDevice
-+#define cudaMemcpyToSymbol hipMemcpyToSymbol
-+#define cudaMemcpyFromSymbol hipMemcpyFromSymbol
- #define cudaMemcpyKind hipMemcpyKind
- #define cudaMemset hipMemset
- #define cudaMemsetAsync hipMemsetAsync
--- a/backend/cpp/buun-llama-cpp/patches/0004-fattn-fwht128-shfl-xor-sync-add-width.patch
+++ b/backend/cpp/buun-llama-cpp/patches/0004-fattn-fwht128-shfl-xor-sync-add-width.patch
@@ -1,36 +0,0 @@
-Subject: [PATCH] ggml-cuda/fattn: pass WARP_SIZE to fwht128 __shfl_xor_sync calls
-
-Same issue as the argmax top-K fix: two __shfl_xor_sync call sites in
-the FWHT-128 butterfly kernels (ggml_cuda_fwht128 and fwht128_store_half)
-use the 3-arg CUDA form and omit the `width` argument that the HIP
-function-like macro in vendors/hip.h:33 requires. Hipcc fails with:
-
-    fattn.cu:512: too few arguments provided to function-like macro
-      invocation
-    note: macro '__shfl_xor_sync' defined here:
-      #define __shfl_xor_sync(mask, var, laneMask, width) \
-              __shfl_xor(var, laneMask, width)
-
-Add WARP_SIZE to both calls. CUDA codegen is unchanged (warpSize is the
-default); HIP now matches the macro arity.
-
---- a/ggml/src/ggml-cuda/fattn.cu
-+++ b/ggml/src/ggml-cuda/fattn.cu
-@@ -509,7 +509,7 @@
-     // Intra-warp passes: shuffle xor with stride h, no smem, no sync.
-     #pragma unroll
-     for (int h = 1; h <= 16; h *= 2) {
--        const float other = __shfl_xor_sync(0xFFFFFFFF, val, h);
-+        const float other = __shfl_xor_sync(0xFFFFFFFF, val, h, WARP_SIZE);
-         val = (tid & h) ? (other - val) : (val + other);
-     }
-
-@@ -533,7 +533,7 @@
- static __device__ __forceinline__ void fwht128_store_half(
-         float val, half * dst_base) {
-     const int tid = threadIdx.x;
--    const float neighbor = __shfl_xor_sync(0xFFFFFFFF, val, 1);
-+    const float neighbor = __shfl_xor_sync(0xFFFFFFFF, val, 1, WARP_SIZE);
-     if ((tid & 1) == 0) {
-         const half2 packed = __floats2half2_rn(val, neighbor);
-         *((half2 *)(dst_base + tid)) = packed;
--- a/backend/cpp/buun-llama-cpp/run.sh
+++ b/backend/cpp/buun-llama-cpp/run.sh
@@ -1,65 +0,0 @@
-#!/bin/bash
-set -ex
-
-# Get the absolute current dir where the script is located
-CURDIR=$(dirname "$(realpath $0)")
-
-cd /
-
-echo "CPU info:"
-grep -e "model\sname" /proc/cpuinfo | head -1
-grep -e "flags" /proc/cpuinfo | head -1
-
-BINARY=buun-llama-cpp-fallback
-
-if grep -q -e "\savx\s" /proc/cpuinfo ; then
-	echo "CPU:    AVX    found OK"
-	if [ -e $CURDIR/buun-llama-cpp-avx ]; then
-		BINARY=buun-llama-cpp-avx
-	fi
-fi
-
-if grep -q -e "\savx2\s" /proc/cpuinfo ; then
-	echo "CPU:    AVX2   found OK"
-	if [ -e $CURDIR/buun-llama-cpp-avx2 ]; then
-		BINARY=buun-llama-cpp-avx2
-	fi
-fi
-
-# Check avx 512
-if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
-	echo "CPU:    AVX512F found OK"
-	if [ -e $CURDIR/buun-llama-cpp-avx512 ]; then
-		BINARY=buun-llama-cpp-avx512
-	fi
-fi
-
-if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then
-	if [ -e $CURDIR/buun-llama-cpp-grpc ]; then
-		BINARY=buun-llama-cpp-grpc
-	fi
-fi
-
-# Extend ld library path with the dir where this script is located/lib
-if [ "$(uname)" == "Darwin" ]; then
-	export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
-else
-	export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
-	# Tell rocBLAS where to find TensileLibrary data (GPU kernel tuning files)
-	if [ -d "$CURDIR/lib/rocblas/library" ]; then
-		export ROCBLAS_TENSILE_LIBPATH=$CURDIR/lib/rocblas/library
-	fi
-fi
-
-# If there is a lib/ld.so, use it
-if [ -f $CURDIR/lib/ld.so ]; then
-	echo "Using lib/ld.so"
-	echo "Using binary: $BINARY"
-	exec $CURDIR/lib/ld.so $CURDIR/$BINARY "$@"
-fi
-
-echo "Using binary: $BINARY"
-exec $CURDIR/$BINARY "$@"
-
-# We should never reach this point, however just in case we do, run fallback
-exec $CURDIR/buun-llama-cpp-fallback "$@"
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@

-IK_LLAMA_VERSION?=16996aeab772c69b6473597038b2ef0b85297e8b
+IK_LLAMA_VERSION?=3a945af45d45936341a45bbf7deda56776a4af26
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@

-LLAMA_VERSION?=187a45637054881ecacf17f8e2f6f8f2ba7df1c7
+LLAMA_VERSION?=f53577432541bb9edc1588c4ef45c66bf07e4468
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -642,6 +642,21 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            } else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
                params.no_op_offload = false;
            }
+        } else if (!strcmp(optname, "split_mode") || !strcmp(optname, "sm")) {
+            // Accepts: none | layer | row | tensor (the latter requires a llama.cpp build
+            // that includes ggml-org/llama.cpp#19378, FlashAttention enabled, and KV-cache
+            // quantization disabled).
+            if (optval != NULL) {
+                if (optval_str == "none") {
+                    params.split_mode = LLAMA_SPLIT_MODE_NONE;
+                } else if (optval_str == "layer") {
+                    params.split_mode = LLAMA_SPLIT_MODE_LAYER;
+                } else if (optval_str == "row") {
+                    params.split_mode = LLAMA_SPLIT_MODE_ROW;
+                } else if (optval_str == "tensor") {
+                    params.split_mode = LLAMA_SPLIT_MODE_TENSOR;
+                }
+            }
        } else if (!strcmp(optname, "kv_unified") || !strcmp(optname, "unified_kv")) {
            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
                params.kv_unified = true;
--- a/backend/cpp/turboquant/Makefile
+++ b/backend/cpp/turboquant/Makefile
@@ -1,7 +1,7 @@

 # Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
 # Auto-bumped nightly by .github/workflows/bump_deps.yaml.
-TURBOQUANT_VERSION?=627ebbc6e27727bd4f65422d8aa60b13404993c8
+TURBOQUANT_VERSION?=11a241d0db78a68e0a5b99fe6f36de6683100f6a
 LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant

 CMAKE_ARGS?=
--- a/backend/go/stablediffusion-ggml/Makefile
+++ b/backend/go/stablediffusion-ggml/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # stablediffusion.cpp (ggml)
 STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
-STABLEDIFFUSION_GGML_VERSION?=c97702e1057c2fe13a7074cd9069cb9dd6edc1bf
+STABLEDIFFUSION_GGML_VERSION?=b8bdffc19962be7e5a84bfefeb2e31bd885b571a

 CMAKE_ARGS+=-DGGML_MAX_NAME=128

--- a/backend/go/whisper/gowhisper.go
+++ b/backend/go/whisper/gowhisper.go
@@ -139,7 +139,10 @@ func (w *Whisper) AudioTranscription(opts *pb.TranscriptRequest) (pb.TranscriptR
 		// segment start/end conversion factor taken from https://github.com/ggml-org/whisper.cpp/blob/master/examples/cli/cli.cpp#L895
 		s := CppGetSegmentStart(i) * (10000000)
 		t := CppGetSegmentEnd(i) * (10000000)
-		txt := strings.Clone(CppGetSegmentText(i))
+		// whisper.cpp can emit bytes that aren't valid UTF-8 (e.g. a multibyte
+		// codepoint split across token boundaries); protobuf string fields
+		// reject those at marshal time. Scrub before the value escapes cgo.
+		txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "\uFFFD")
 		tokens := make([]int32, CppNTokens(i))

 		if opts.Diarize && CppGetSegmentSpeakerTurnNext(i) {
--- a/backend/index.yaml
+++ b/backend/index.yaml
@@ -263,6 +263,8 @@
    amd: "rocm-vllm"
    intel: "intel-vllm"
    nvidia-cuda-12: "cuda12-vllm"
+    nvidia-cuda-13: "cuda13-vllm"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm"
    cpu: "cpu-vllm"
 - &sglang
  name: "sglang"
@@ -285,6 +287,7 @@
    amd: "rocm-sglang"
    intel: "intel-sglang"
    nvidia-cuda-12: "cuda12-sglang"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-sglang"
    cpu: "cpu-sglang"
 - &vllm-omni
  name: "vllm-omni"
@@ -311,6 +314,8 @@
    nvidia: "cuda12-vllm-omni"
    amd: "rocm-vllm-omni"
    nvidia-cuda-12: "cuda12-vllm-omni"
+    nvidia-cuda-13: "cuda13-vllm-omni"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm-omni"
 - &mlx
  name: "mlx"
  uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-mlx"
@@ -1608,6 +1613,20 @@
  mirrors:
    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-turboquant
 ## whisper
+- !!merge <<: *whispercpp
+  name: "whisper-development"
+  capabilities:
+    default: "cpu-whisper-development"
+    nvidia: "cuda12-whisper-development"
+    intel: "intel-sycl-f16-whisper-development"
+    metal: "metal-whisper-development"
+    amd: "rocm-whisper-development"
+    vulkan: "vulkan-whisper-development"
+    nvidia-l4t: "nvidia-l4t-arm64-whisper-development"
+    nvidia-cuda-13: "cuda13-whisper-development"
+    nvidia-cuda-12: "cuda12-whisper-development"
+    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-whisper-development"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-whisper-development"
 - !!merge <<: *whispercpp
  name: "nvidia-l4t-arm64-whisper"
  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-whisper"
@@ -1814,12 +1833,25 @@
    nvidia: "cuda12-vllm-development"
    amd: "rocm-vllm-development"
    intel: "intel-vllm-development"
+    nvidia-cuda-12: "cuda12-vllm-development"
+    nvidia-cuda-13: "cuda13-vllm-development"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm-development"
    cpu: "cpu-vllm-development"
 - !!merge <<: *vllm
  name: "cuda12-vllm"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-vllm"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-12-vllm
+- !!merge <<: *vllm
+  name: "cuda13-vllm"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-vllm"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-13-vllm
+- !!merge <<: *vllm
+  name: "cuda13-nvidia-l4t-arm64-vllm"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-vllm"
+  mirrors:
+    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-vllm
 - !!merge <<: *vllm
  name: "rocm-vllm"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-vllm"
@@ -1840,6 +1872,16 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-vllm"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-vllm
+- !!merge <<: *vllm
+  name: "cuda13-vllm-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-vllm"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-13-vllm
+- !!merge <<: *vllm
+  name: "cuda13-nvidia-l4t-arm64-vllm-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-vllm"
+  mirrors:
+    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-vllm
 - !!merge <<: *vllm
  name: "rocm-vllm-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-vllm"
@@ -1862,12 +1904,19 @@
    nvidia: "cuda12-sglang-development"
    amd: "rocm-sglang-development"
    intel: "intel-sglang-development"
+    nvidia-cuda-12: "cuda12-sglang-development"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-sglang-development"
    cpu: "cpu-sglang-development"
 - !!merge <<: *sglang
  name: "cuda12-sglang"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-sglang"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-12-sglang
+- !!merge <<: *sglang
+  name: "cuda13-nvidia-l4t-arm64-sglang"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-sglang"
+  mirrors:
+    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-sglang
 - !!merge <<: *sglang
  name: "rocm-sglang"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-sglang"
@@ -1888,6 +1937,11 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-sglang"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-sglang
+- !!merge <<: *sglang
+  name: "cuda13-nvidia-l4t-arm64-sglang-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-sglang"
+  mirrors:
+    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-sglang
 - !!merge <<: *sglang
  name: "rocm-sglang-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-sglang"
@@ -1910,11 +1964,23 @@
    nvidia: "cuda12-vllm-omni-development"
    amd: "rocm-vllm-omni-development"
    nvidia-cuda-12: "cuda12-vllm-omni-development"
+    nvidia-cuda-13: "cuda13-vllm-omni-development"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm-omni-development"
 - !!merge <<: *vllm-omni
  name: "cuda12-vllm-omni"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-vllm-omni"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-12-vllm-omni
+- !!merge <<: *vllm-omni
+  name: "cuda13-vllm-omni"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-vllm-omni"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-13-vllm-omni
+- !!merge <<: *vllm-omni
+  name: "cuda13-nvidia-l4t-arm64-vllm-omni"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-vllm-omni"
+  mirrors:
+    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-vllm-omni
 - !!merge <<: *vllm-omni
  name: "rocm-vllm-omni"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-vllm-omni"
@@ -1925,6 +1991,16 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-vllm-omni"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-vllm-omni
+- !!merge <<: *vllm-omni
+  name: "cuda13-vllm-omni-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-vllm-omni"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-13-vllm-omni
+- !!merge <<: *vllm-omni
+  name: "cuda13-nvidia-l4t-arm64-vllm-omni-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-vllm-omni"
+  mirrors:
+    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-vllm-omni
 - !!merge <<: *vllm-omni
  name: "rocm-vllm-omni-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-vllm-omni"
--- a/backend/python/mlx-vlm/requirements-cpu.txt
+++ b/backend/python/mlx-vlm/requirements-cpu.txt
@@ -1,2 +1,2 @@
-git+https://github.com/Blaizzy/mlx-vlm
+git+https://github.com/Blaizzy/mlx-vlm@v0.4.4
 mlx[cpu]
--- a/backend/python/mlx-vlm/requirements-cublas12.txt
+++ b/backend/python/mlx-vlm/requirements-cublas12.txt
@@ -1,2 +1,2 @@
-git+https://github.com/Blaizzy/mlx-vlm
+git+https://github.com/Blaizzy/mlx-vlm@v0.4.4
 mlx[cuda12]
--- a/backend/python/mlx-vlm/requirements-cublas13.txt
+++ b/backend/python/mlx-vlm/requirements-cublas13.txt
@@ -1,2 +1,2 @@
-git+https://github.com/Blaizzy/mlx-vlm
+git+https://github.com/Blaizzy/mlx-vlm@v0.4.4
 mlx[cuda13]
--- a/backend/python/mlx-vlm/requirements-l4t12.txt
+++ b/backend/python/mlx-vlm/requirements-l4t12.txt
@@ -1,2 +1,2 @@
-git+https://github.com/Blaizzy/mlx-vlm
+git+https://github.com/Blaizzy/mlx-vlm@v0.4.4
 mlx[cuda12]
--- a/backend/python/mlx-vlm/requirements-l4t13.txt
+++ b/backend/python/mlx-vlm/requirements-l4t13.txt
@@ -1,2 +1,2 @@
-git+https://github.com/Blaizzy/mlx-vlm
+git+https://github.com/Blaizzy/mlx-vlm@v0.4.4
 mlx[cuda13]
--- a/backend/python/mlx-vlm/requirements-mps.txt
+++ b/backend/python/mlx-vlm/requirements-mps.txt
@@ -1 +1 @@
-git+https://github.com/Blaizzy/mlx-vlm
+git+https://github.com/Blaizzy/mlx-vlm@v0.4.4
--- a/backend/python/sglang/install.sh
+++ b/backend/python/sglang/install.sh
@@ -23,6 +23,19 @@ if [ "x${BUILD_PROFILE}" == "xcpu" ]; then
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
 fi

+# JetPack 7 / L4T arm64 wheels are built for cp312 and shipped via
+# pypi.jetson-ai-lab.io. Bump the venv Python so the prebuilt sglang
+# wheel resolves cleanly. unsafe-best-match is required because the
+# jetson-ai-lab index lists transitive deps (e.g. decord) at older
+# versions only — without it uv refuses to fall through to PyPI for a
+# compatible wheel and resolution fails.
+if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
+    PYTHON_VERSION="3.12"
+    PYTHON_PATCH="12"
+    PY_STANDALONE_TAG="20251120"
+    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
+fi
+
 # sglang's CPU path has no prebuilt wheel on PyPI — upstream publishes
 # a separate pyproject_cpu.toml that must be swapped in before `pip install`.
 # Reference: docker/xeon.Dockerfile in the sglang upstream repo.
--- a/backend/python/sglang/requirements-l4t13.txt
+++ b/backend/python/sglang/requirements-l4t13.txt
@@ -0,0 +1,12 @@
+--extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
+accelerate
+torch
+torchvision
+torchaudio
+transformers
+# Drop the [all] extra: it pulls outlines/decord, and decord has no aarch64
+# cp312 wheel anywhere (neither PyPI nor the jetson-ai-lab index has one;
+# only legacy cp35-cp37 wheels exist). With [all], uv backtracks through
+# versions trying to satisfy decord and lands on sglang==0.1.16. Floor at
+# 0.5.0 so uv can't silently downgrade if a future resolution misfires.
+sglang>=0.5.0
--- a/backend/python/vllm-omni/install.sh
+++ b/backend/python/vllm-omni/install.sh
@@ -12,11 +12,15 @@ else
    source $backend_dir/../common/libbackend.sh
 fi

-# Handle l4t build profiles (Python 3.12, pip fallback) if needed
+# Handle l4t build profiles (Python 3.12, pip fallback) if needed.
+# unsafe-best-match is required on l4t13 because the jetson-ai-lab index
+# lists transitive deps at limited versions — without it uv pins to the
+# first matching index and fails to resolve a compatible wheel from PyPI.
 if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
  PYTHON_VERSION="3.12"
  PYTHON_PATCH="12"
  PY_STANDALONE_TAG="20251120"
+  EXTRA_PIP_INSTALL_FLAGS="${EXTRA_PIP_INSTALL_FLAGS:-} --index-strategy=unsafe-best-match"
 fi

 if [ "x${BUILD_PROFILE}" == "xl4t12" ]; then
@@ -26,7 +30,11 @@ fi
 # Install base requirements first
 installRequirements

-# Install vllm based on build type
+# Install vllm based on build type. vllm-omni tracks vllm master from
+# source (cloned below) so we leave the upstream vllm dependency unpinned
+# — vllm 0.19+ ships cu130 wheels by default, which is what we want for
+# cublas13. Older cuda12/rocm/cpu paths still resolve a compatible wheel
+# from the relevant channel.
 if [ "x${BUILD_TYPE}" == "xhipblas" ]; then
    # ROCm
    if [ "x${USE_PIP}" == "xtrue" ]; then
@@ -34,8 +42,26 @@ if [ "x${BUILD_TYPE}" == "xhipblas" ]; then
    else
        uv pip install vllm==0.14.0 --extra-index-url https://wheels.vllm.ai/rocm/0.14.0/rocm700
    fi
+elif [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
+    # JetPack 7 / L4T arm64 cu130 — vllm comes from the prebuilt SBSA wheel
+    # at jetson-ai-lab. Version is unpinned: the index ships whatever build
+    # matches the cu130/cp312 ABI. unsafe-best-match lets uv fall through
+    # to PyPI for transitive deps not present on the jetson-ai-lab index.
+    if [ "x${USE_PIP}" == "xtrue" ]; then
+        pip install vllm --extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
+    else
+        uv pip install --index-strategy=unsafe-best-match vllm --extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
+    fi
+elif [ "x${BUILD_PROFILE}" == "xcublas13" ]; then
+    # vllm 0.19+ defaults to cu130 wheels on PyPI, no extra index needed.
+    if [ "x${USE_PIP}" == "xtrue" ]; then
+        pip install vllm --torch-backend=auto
+    else
+        uv pip install vllm --torch-backend=auto
+    fi
 elif [ "x${BUILD_TYPE}" == "xcublas" ] || [ "x${BUILD_TYPE}" == "x" ]; then
-    # CUDA (default) or CPU
+    # cuda12 / CPU — keep the 0.14.0 pin for compatibility with the existing
+    # cuda12 vllm-omni image; bumping should be its own change.
    if [ "x${USE_PIP}" == "xtrue" ]; then
        pip install vllm==0.14.0 --torch-backend=auto
    else
--- a/backend/python/vllm-omni/requirements-cublas13.txt
+++ b/backend/python/vllm-omni/requirements-cublas13.txt
@@ -0,0 +1,5 @@
+--extra-index-url https://download.pytorch.org/whl/cu130
+accelerate
+torch
+transformers
+bitsandbytes
--- a/backend/python/vllm-omni/requirements-l4t13.txt
+++ b/backend/python/vllm-omni/requirements-l4t13.txt
@@ -0,0 +1,13 @@
+--extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
+accelerate
+torch
+torchvision
+torchaudio
+transformers
+bitsandbytes
+flash-attn
+diffusers
+librosa
+soundfile
+pillow
+numpy
--- a/backend/python/vllm/install.sh
+++ b/backend/python/vllm/install.sh
@@ -32,6 +32,22 @@ if [ "x${BUILD_PROFILE}" == "xcpu" ]; then
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
 fi

+# JetPack 7 / L4T arm64 wheels (torch, vllm, flash-attn) live on
+# pypi.jetson-ai-lab.io and are built for cp312, so bump the venv Python
+# accordingly. JetPack 6 keeps cp310 + USE_PIP=true. unsafe-best-match
+# is required because the jetson-ai-lab index lists transitive deps at
+# limited versions — without it uv pins to the first matching index and
+# fails to resolve a compatible wheel from PyPI.
+if [ "x${BUILD_PROFILE}" == "xl4t12" ]; then
+    USE_PIP=true
+fi
+if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
+    PYTHON_VERSION="3.12"
+    PYTHON_PATCH="12"
+    PY_STANDALONE_TAG="20251120"
+    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
+fi
+
 # FROM_SOURCE=true on a CPU build skips the prebuilt vllm wheel in
 # requirements-cpu-after.txt and compiles vllm locally against the host's
 # actual CPU. Not used by default because it takes ~30-40 minutes, but
--- a/backend/python/vllm/requirements-cublas12-after.txt
+++ b/backend/python/vllm/requirements-cublas12-after.txt
@@ -1,2 +1,9 @@
-https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
+# flash-attn wheels are ABI-tied to a specific torch version. vllm forces
+# torch==2.10.0 as a hard dep, but flash-attn 2.8.3 (latest) only ships
+# prebuilt wheels up to torch 2.8 — any wheel we pin here gets silently
+# broken when vllm upgrades torch during install, producing an undefined
+# libc10_cuda symbol at import time. FlashInfer (required by vllm) covers
+# attention, and rotary_embedding/common.py guards the flash_attn import
+# with find_spec(), so skipping flash-attn is safe and the only stable
+# choice until upstream ships a torch-2.10 wheel.
 vllm
--- a/backend/python/vllm/requirements-cublas12.txt
+++ b/backend/python/vllm/requirements-cublas12.txt
@@ -1,4 +1,4 @@
 accelerate
-torch==2.7.0
+torch
 transformers
 bitsandbytes
--- a/backend/python/vllm/requirements-cublas13-after.txt
+++ b/backend/python/vllm/requirements-cublas13-after.txt
@@ -0,0 +1,2 @@
+--extra-index-url https://download.pytorch.org/whl/cu130
+vllm
--- a/backend/python/vllm/requirements-cublas13.txt
+++ b/backend/python/vllm/requirements-cublas13.txt
@@ -0,0 +1,5 @@
+--extra-index-url https://download.pytorch.org/whl/cu130
+accelerate
+torch
+transformers
+bitsandbytes
--- a/backend/python/vllm/requirements-l4t13-after.txt
+++ b/backend/python/vllm/requirements-l4t13-after.txt
@@ -0,0 +1,2 @@
+--extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
+vllm
--- a/backend/python/vllm/requirements-l4t13.txt
+++ b/backend/python/vllm/requirements-l4t13.txt
@@ -0,0 +1,8 @@
+--extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
+accelerate
+torch
+torchvision
+torchaudio
+transformers
+bitsandbytes
+flash-attn
--- a/backend/rust/kokoros/Cargo.lock
+++ b/backend/rust/kokoros/Cargo.lock
@@ -1867,9 +1867,9 @@ dependencies = [

 [[package]]
 name = "rustls-webpki"
-version = "0.103.10"
+version = "0.103.13"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "df33b2b81ac578cabaf06b89b0631153a3f416b0a886e8a7a1707fb51abbd1ef"
+checksum = "61c429a8649f110dddef65e2a5ad240f747e85f7758a6bccc7e5777bd33f756e"
 dependencies = [
 "ring",
 "rustls-pki-types",
--- a/core/cli/worker.go
+++ b/core/cli/worker.go
@@ -90,6 +90,14 @@ type WorkerCMD struct {
 	RegistrationToken string `env:"LOCALAI_REGISTRATION_TOKEN" help:"Token for authenticating with the frontend" group:"registration"`
 	HeartbeatInterval string `env:"LOCALAI_HEARTBEAT_INTERVAL" default:"10s" help:"Interval between heartbeats" group:"registration"`
 	NodeLabels        string `env:"LOCALAI_NODE_LABELS" help:"Comma-separated key=value labels for this node (e.g. tier=fast,gpu=a100)" group:"registration"`
+	// MaxReplicasPerModel caps how many replicas of any one model can run on
+	// this worker concurrently. Default 1 = historical single-replica
+	// behavior. Set higher when a node has enough VRAM to host multiple
+	// copies of the same model (e.g. a fat 128 GiB box running 4× of a
+	// 24 GiB model for throughput). The auto-label `node.replica-slots=N`
+	// is published so model schedulers can target high-capacity nodes via
+	// the existing label selector.
+	MaxReplicasPerModel int `env:"LOCALAI_MAX_REPLICAS_PER_MODEL" default:"1" help:"Max replicas of any single model on this worker. Default 1 preserves single-replica behavior; set higher to allow stacking replicas on a fat node." group:"registration"`

 	// NATS (required)
 	NatsURL string `env:"LOCALAI_NATS_URL" required:"" help:"NATS server URL" group:"distributed"`
@@ -567,22 +575,35 @@ func (s *backendSupervisor) getAddr(backend string) string {
 	return ""
 }

+// buildProcessKey is the supervisor's stable identifier for a backend gRPC
+// process. It includes the replica index so the same model can run multiple
+// processes on a worker simultaneously without colliding on the same map slot
+// or port. The "#N" suffix is purely internal — the controller never reads it.
+func buildProcessKey(modelID, backend string, replicaIndex int) string {
+	base := modelID
+	if base == "" {
+		base = backend
+	}
+	return fmt.Sprintf("%s#%d", base, replicaIndex)
+}
+
 // installBackend handles the backend.install flow:
-// 1. If already running for this model, return existing address
+// 1. If already running for this (model, replica) slot, return existing address
 // 2. Install backend from gallery (if not already installed)
 // 3. Find backend binary
 // 4. Start gRPC process on a new port
 // Returns the gRPC address of the backend process.
+//
+// ProcessKey includes the replica index so a worker with MaxReplicasPerModel>1
+// can host multiple processes for the same model on distinct ports. Old
+// controllers (no replica_index in the request) implicitly target replica 0,
+// which preserves single-replica behavior.
 func (s *backendSupervisor) installBackend(req messaging.BackendInstallRequest) (string, error) {
-	// Process key: use ModelID if provided (per-model process), else backend name
-	processKey := req.ModelID
-	if processKey == "" {
-		processKey = req.Backend
-	}
+	processKey := buildProcessKey(req.ModelID, req.Backend, int(req.ReplicaIndex))

-	// If already running for this model, return its address
+	// If already running for this model+replica, return its address
 	if addr := s.getAddr(processKey); addr != "" {
-		xlog.Info("Backend already running for model", "backend", req.Backend, "model", req.ModelID, "addr", addr)
+		xlog.Info("Backend already running for model replica", "backend", req.Backend, "model", req.ModelID, "replica", req.ReplicaIndex, "addr", addr)
 		return addr, nil
 	}

@@ -886,13 +907,18 @@ func (cmd *WorkerCMD) registrationBody() map[string]any {
 	totalVRAM, _ := xsysinfo.TotalAvailableVRAM()
 	gpuVendor, _ := xsysinfo.DetectGPUVendor()

+	maxReplicas := cmd.MaxReplicasPerModel
+	if maxReplicas < 1 {
+		maxReplicas = 1
+	}
 	body := map[string]any{
-		"name":           nodeName,
-		"address":        cmd.advertiseAddr(),
-		"http_address":   cmd.advertiseHTTPAddr(),
-		"total_vram":     totalVRAM,
-		"available_vram": totalVRAM, // initially all VRAM is available
-		"gpu_vendor":     gpuVendor,
+		"name":                   nodeName,
+		"address":                cmd.advertiseAddr(),
+		"http_address":           cmd.advertiseHTTPAddr(),
+		"total_vram":             totalVRAM,
+		"available_vram":         totalVRAM, // initially all VRAM is available
+		"gpu_vendor":             gpuVendor,
+		"max_replicas_per_model": maxReplicas,
 	}

 	// If no GPU detected, report system RAM so the scheduler/UI has capacity info
@@ -906,39 +932,40 @@ func (cmd *WorkerCMD) registrationBody() map[string]any {
 		body["token"] = cmd.RegistrationToken
 	}

-	// Parse and add static node labels
+	// Parse and add static node labels. Always include the auto-label
+	// `node.replica-slots=N` so AND-selectors in ModelSchedulingConfig can
+	// target high-capacity nodes (e.g. {"node.replica-slots":"4"}).
+	labels := make(map[string]string)
 	if cmd.NodeLabels != "" {
-		labels := make(map[string]string)
 		for _, pair := range strings.Split(cmd.NodeLabels, ",") {
 			pair = strings.TrimSpace(pair)
 			if k, v, ok := strings.Cut(pair, "="); ok {
 				labels[strings.TrimSpace(k)] = strings.TrimSpace(v)
 			}
 		}
-		if len(labels) > 0 {
-			body["labels"] = labels
-		}
 	}
+	labels["node.replica-slots"] = strconv.Itoa(maxReplicas)
+	body["labels"] = labels

 	return body
 }

 // heartbeatBody returns the current VRAM/RAM stats for heartbeat payloads.
+//
+// When aggregate VRAM usage is unknown (no GPU, or temporary detection
+// failure), we deliberately OMIT available_vram so the frontend keeps its
+// last good value — overwriting with 0 makes the UI show the node as "fully
+// used", while reporting total-as-available lies to the scheduler about
+// free capacity.
 func (cmd *WorkerCMD) heartbeatBody() map[string]any {
-	var availVRAM uint64
+	body := map[string]any{}
 	aggregate := xsysinfo.GetGPUAggregateInfo()
 	if aggregate.TotalVRAM > 0 {
-		availVRAM = aggregate.FreeVRAM
-	} else {
-		// Fallback: report total as available (no usage tracking possible)
-		availVRAM, _ = xsysinfo.TotalAvailableVRAM()
+		body["available_vram"] = aggregate.FreeVRAM
 	}

-	body := map[string]any{
-		"available_vram": availVRAM,
-	}
-
-	// If no GPU, report system RAM usage instead
+	// CPU-only workers (or workers that lost GPU visibility momentarily):
+	// report system RAM so the scheduler still has capacity info.
 	if aggregate.TotalVRAM == 0 {
 		if ramInfo, err := xsysinfo.GetSystemRAMInfo(); err == nil {
 			body["available_ram"] = ramInfo.Available
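The label handling in registrationBody can be exercised standalone. This is a minimal sketch, assuming a hypothetical `buildLabels` function that takes the raw label string and the replica cap as arguments instead of reading them from WorkerCMD:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// buildLabels is a hypothetical extraction of the label logic: it parses
// comma-separated key=value pairs, then always adds node.replica-slots.
func buildLabels(nodeLabels string, maxReplicas int) map[string]string {
	// Zero or negative values are coerced to the single-replica default.
	if maxReplicas < 1 {
		maxReplicas = 1
	}
	labels := make(map[string]string)
	if nodeLabels != "" {
		for _, pair := range strings.Split(nodeLabels, ",") {
			pair = strings.TrimSpace(pair)
			// Pairs without "=" are skipped silently.
			if k, v, ok := strings.Cut(pair, "="); ok {
				labels[strings.TrimSpace(k)] = strings.TrimSpace(v)
			}
		}
	}
	// Written last, so it overrides a user-supplied label of the same name.
	labels["node.replica-slots"] = strconv.Itoa(maxReplicas)
	return labels
}

func main() {
	fmt.Println(buildLabels("tier=fast, gpu=a100", 0))
}
```

Because the auto-label is assigned after parsing, a `node.replica-slots` entry passed through LOCALAI_NODE_LABELS would be overwritten by the computed value.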
--- a/core/cli/worker_replica_test.go
+++ b/core/cli/worker_replica_test.go
@@ -0,0 +1,70 @@
+package cli
+
+import (
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+var _ = Describe("Worker per-replica process keying", func() {
+	Describe("buildProcessKey", func() {
+		// Pin the supervisor's keying contract: distinct replica indexes for
+		// the same modelID produce distinct process keys, so the supervisor
+		// map can hold multiple processes for one model. Dropping the suffix
+		// would re-introduce the original flap (one model, one slot, churn).
+		DescribeTable("produces stable, distinct keys",
+			func(modelID, backend string, replica int, want string) {
+				Expect(buildProcessKey(modelID, backend, replica)).To(Equal(want))
+			},
+			Entry("modelID present, replica 0", "Qwen3-35B", "llama-cpp", 0, "Qwen3-35B#0"),
+			Entry("modelID present, replica 1", "Qwen3-35B", "llama-cpp", 1, "Qwen3-35B#1"),
+			Entry("falls back to backend when modelID empty", "", "llama-cpp", 0, "llama-cpp#0"),
+			Entry("backend fallback with replica 2", "", "llama-cpp", 2, "llama-cpp#2"),
+		)
+
+		It("makes replicas distinguishable", func() {
+			r0 := buildProcessKey("model-a", "llama-cpp", 0)
+			r1 := buildProcessKey("model-a", "llama-cpp", 1)
+			Expect(r0).ToNot(Equal(r1), "replicas of the same model must produce distinct keys")
+		})
+	})
+
+	Describe("registrationBody", func() {
+		It("includes max_replicas_per_model and the auto-label", func() {
+			cmd := &WorkerCMD{
+				Addr:                "worker.example.com:50051",
+				MaxReplicasPerModel: 4,
+			}
+			body := cmd.registrationBody()
+
+			Expect(body).To(HaveKey("max_replicas_per_model"))
+			Expect(body["max_replicas_per_model"]).To(Equal(4))
+
+			labels, ok := body["labels"].(map[string]string)
+			Expect(ok).To(BeTrue(), "labels must be present so selectors can target the slot count")
+			Expect(labels).To(HaveKeyWithValue("node.replica-slots", "4"))
+		})
+
+		It("coerces zero/unset MaxReplicasPerModel to 1", func() {
+			cmd := &WorkerCMD{Addr: "worker.example.com:50051"}
+			body := cmd.registrationBody()
+			Expect(body["max_replicas_per_model"]).To(Equal(1),
+				"unset must default to single-replica behavior, not capacity 0")
+
+			labels := body["labels"].(map[string]string)
+			Expect(labels).To(HaveKeyWithValue("node.replica-slots", "1"))
+		})
+
+		It("preserves user-provided labels alongside the auto-label", func() {
+			cmd := &WorkerCMD{
+				Addr:                "worker.example.com:50051",
+				MaxReplicasPerModel: 2,
+				NodeLabels:          "tier=fast,gpu=a100",
+			}
+			body := cmd.registrationBody()
+			labels := body["labels"].(map[string]string)
+			Expect(labels).To(HaveKeyWithValue("tier", "fast"))
+			Expect(labels).To(HaveKeyWithValue("gpu", "a100"))
+			Expect(labels).To(HaveKeyWithValue("node.replica-slots", "2"))
+		})
+	})
+})
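The keying contract pinned by these tests can also be shown as a round trip. This is a minimal sketch: `buildProcessKey` is copied from the worker.go hunk, while `splitProcessKey` is a hypothetical inverse added here for illustration only and does not appear in this diff:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// buildProcessKey matches the function added in core/cli/worker.go.
func buildProcessKey(modelID, backend string, replicaIndex int) string {
	base := modelID
	if base == "" {
		base = backend
	}
	return fmt.Sprintf("%s#%d", base, replicaIndex)
}

// splitProcessKey is hypothetical: it recovers the base name and replica
// index from a key, reporting ok=false when no numeric "#N" suffix exists.
func splitProcessKey(key string) (base string, replica int, ok bool) {
	i := strings.LastIndex(key, "#")
	if i < 0 {
		return key, 0, false
	}
	n, err := strconv.Atoi(key[i+1:])
	if err != nil {
		return key, 0, false
	}
	return key[:i], n, true
}

func main() {
	key := buildProcessKey("Qwen3-35B", "llama-cpp", 1)
	base, replica, ok := splitProcessKey(key)
	fmt.Println(key, base, replica, ok)
}
```

Splitting on the last `#` keeps the sketch usable for model IDs that themselves contain `#`, which is an assumption about naming rather than something this diff specifies.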
--- a/core/config/meta/constants.go
+++ b/core/config/meta/constants.go
@@ -37,14 +37,6 @@ var CacheTypeOptions = []FieldOption{
 	{Value: "q4_1", Label: "Q4_1"},
 	{Value: "q5_0", Label: "Q5_0"},
 	{Value: "q5_1", Label: "Q5_1"},
-	// TurboQuant KV-cache types — accepted by the turboquant and
-	// buun-llama-cpp fork backends; stock llama-cpp will reject them at load.
-	{Value: "turbo2", Label: "Turbo2 (TurboQuant)"},
-	{Value: "turbo3", Label: "Turbo3 (TurboQuant)"},
-	{Value: "turbo4", Label: "Turbo4 (TurboQuant)"},
-	// Trellis-Coded Quantization variants — buun-llama-cpp only.
-	{Value: "turbo2_tcq", Label: "Turbo2 TCQ (buun-llama-cpp)"},
-	{Value: "turbo3_tcq", Label: "Turbo3 TCQ (buun-llama-cpp)"},
 }

 var DiffusersPipelineOptions = []FieldOption{
--- a/core/gallery/importers/llama-cpp.go
+++ b/core/gallery/importers/llama-cpp.go
@@ -34,7 +34,6 @@ func (i *LlamaCPPImporter) AdditionalBackends() []KnownBackendEntry {
 	return []KnownBackendEntry{
 		{Name: "ik-llama-cpp", Modality: "text", Description: "GGUF drop-in replacement for llama-cpp with ik-quants"},
 		{Name: "turboquant", Modality: "text", Description: "GGUF drop-in replacement for llama-cpp with TurboQuant optimizations"},
-		{Name: "buun-llama-cpp", Modality: "text", Description: "GGUF drop-in replacement for llama-cpp with DFlash speculative decoding and TurboQuant/TCQ KV-cache quantization"},
 	}
 }

@@ -128,7 +127,7 @@ func (i *LlamaCPPImporter) Import(details Details) (gallery.ModelConfig, error)
 	backend := "llama-cpp"
 	if b, ok := preferencesMap["backend"].(string); ok {
 		switch b {
-		case "ik-llama-cpp", "turboquant", "buun-llama-cpp":
+		case "ik-llama-cpp", "turboquant":
 			backend = b
 		}
 	}
--- a/core/gallery/importers/llama-cpp_test.go
+++ b/core/gallery/importers/llama-cpp_test.go
@@ -181,23 +181,6 @@ var _ = Describe("LlamaCPPImporter", func() {
 			Expect(modelConfig.Files[0].Filename).To(Equal("my-model.gguf"))
 		})

-		It("swaps the emitted backend to buun-llama-cpp when preferred", func() {
-			preferences := json.RawMessage(`{"backend": "buun-llama-cpp"}`)
-			details := Details{
-				URI:         "https://example.com/my-model.gguf",
-				Preferences: preferences,
-			}
-
-			modelConfig, err := importer.Import(details)
-
-			Expect(err).ToNot(HaveOccurred())
-			Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: buun-llama-cpp"), fmt.Sprintf("Model config: %+v", modelConfig))
-			Expect(modelConfig.ConfigFile).NotTo(ContainSubstring("backend: llama-cpp\n"), fmt.Sprintf("Model config: %+v", modelConfig))
-			Expect(modelConfig.ConfigFile).To(ContainSubstring("model: my-model.gguf"), fmt.Sprintf("Model config: %+v", modelConfig))
-			Expect(len(modelConfig.Files)).To(Equal(1))
-			Expect(modelConfig.Files[0].Filename).To(Equal("my-model.gguf"))
-		})
-
 		It("keeps backend: llama-cpp for unknown backend preferences", func() {
 			// Unknown backend values must not leak into the emitted YAML —
 			// we only honour the two curated drop-in replacements.
@@ -392,7 +375,7 @@ var _ = Describe("LlamaCPPImporter", func() {
 	})

 	Context("AdditionalBackends", func() {
-		It("advertises ik-llama-cpp, turboquant, and buun-llama-cpp as drop-in replacements", func() {
+		It("advertises ik-llama-cpp and turboquant as drop-in replacements", func() {
 			entries := importer.AdditionalBackends()

 			names := make([]string, 0, len(entries))
@@ -401,7 +384,7 @@ var _ = Describe("LlamaCPPImporter", func() {
 				names = append(names, e.Name)
 				byName[e.Name] = e
 			}
-			Expect(names).To(ConsistOf("ik-llama-cpp", "turboquant", "buun-llama-cpp"))
+			Expect(names).To(ConsistOf("ik-llama-cpp", "turboquant"))

 			ik := byName["ik-llama-cpp"]
 			Expect(ik.Modality).To(Equal("text"))
@@ -410,10 +393,6 @@ var _ = Describe("LlamaCPPImporter", func() {
 			tq := byName["turboquant"]
 			Expect(tq.Modality).To(Equal("text"))
 			Expect(tq.Description).NotTo(BeEmpty())
-
-			bn := byName["buun-llama-cpp"]
-			Expect(bn.Modality).To(Equal("text"))
-			Expect(bn.Description).NotTo(BeEmpty())
 		})
 	})
 })
--- a/core/http/endpoints/localai/backend.go
+++ b/core/http/endpoints/localai/backend.go
@@ -98,7 +98,7 @@ func (mgs *BackendEndpointService) GetAllStatusEndpoint() echo.HandlerFunc {
 // @Param request body GalleryBackend true "query params"
 // @Success 200 {object} schema.BackendResponse "Response"
 // @Router /backends/apply [post]
-func (mgs *BackendEndpointService) ApplyBackendEndpoint() echo.HandlerFunc {
+func (mgs *BackendEndpointService) ApplyBackendEndpoint(systemState *system.SystemState) echo.HandlerFunc {
 	return func(c echo.Context) error {
 		input := new(GalleryBackend)
 		// Get input data from the request body
@@ -106,6 +106,18 @@ func (mgs *BackendEndpointService) ApplyBackendEndpoint() echo.HandlerFunc {
 			return err
 		}

+		// In distributed mode, refuse to fan out a hardware-specific build to
+		// every node — a CPU build landing on a GPU cluster is almost always
+		// wrong, and the silent footgun is exactly what this guard exists for.
+		// Auto-resolving (meta) backends are fine because each node picks its
+		// own variant. Tooling can recover by hitting
+		// POST /api/nodes/{id}/backends/install per target node.
+		if mgs.backendApplier.BackendManager().IsDistributed() && input.ID != "" {
+			if guard := concreteFanOutGuard(c, mgs.galleries, systemState, input.ID); guard != nil {
+				return guard
+			}
+		}
+
 		uuid, err := uuid.NewUUID()
 		if err != nil {
 			return err
@@ -120,6 +132,66 @@ func (mgs *BackendEndpointService) ApplyBackendEndpoint() echo.HandlerFunc {
 	}
 }

+// concreteFanOutGuard returns a 409 response if the requested backend is a
+// hardware-specific build (not auto-resolving / meta). The caller invokes it
+// only in distributed mode. It looks up the backend in the configured
+// galleries; if the lookup fails (gallery unreachable, name not found), the
+// guard stays out of the way and lets the install enqueue normally, since a
+// missing name surfaces from the worker as a clearer error than the guard
+// could produce here. The response body is written for humans, with `code`
+// and `meta_alternative` as the programmatic contract for tooling.
+func concreteFanOutGuard(c echo.Context, galleries []config.Gallery, systemState *system.SystemState, backendID string) error {
+	// Use the unfiltered listing because in distributed mode the frontend's
+	// hardware is irrelevant — the install targets workers, not us — and the
+	// filtered list would hide variants that don't match the frontend host
+	// (e.g. a CUDA build on a CPU-only frontend), preventing the guard from
+	// firing for exactly the cases it's meant to protect against.
+	available, err := gallery.AvailableBackendsUnfiltered(galleries, systemState)
+	if err != nil {
+		return nil
+	}
+	requested := available.FindByName(backendID)
+	if requested == nil || requested.IsMeta() {
+		return nil
+	}
+
+	// Try to find an auto-resolving (meta) backend that has this concrete
+	// variant in its CapabilitiesMap, so we can suggest it as a one-shot
+	// alternative. Optional — empty string is fine if no parent exists.
+	metaAlternative := ""
+	for _, b := range available {
+		if !b.IsMeta() {
+			continue
+		}
+		for _, concrete := range b.CapabilitiesMap {
+			if concrete == backendID {
+				metaAlternative = b.Name
+				break
+			}
+		}
+		if metaAlternative != "" {
+			break
+		}
+	}
+
+	msg := fmt.Sprintf(
+		"Backend %q is a hardware-specific build and won't run correctly on every node in this cluster. In distributed mode, install it on specific nodes:\n\n  POST /api/nodes/{node_id}/backends/install\n  {\"backend\": %q}",
+		backendID, backendID,
+	)
+	if metaAlternative != "" {
+		msg += fmt.Sprintf(
+			"\n\nTo install across all nodes, use the auto-resolving backend %q — each node picks its own variant based on its hardware.",
+			metaAlternative,
+		)
+	}
+
+	return c.JSON(409, map[string]any{
+		"error":            msg,
+		"code":             "concrete_backend_requires_target",
+		"meta_alternative": metaAlternative,
+	})
+}
+
 // DeleteBackendEndpoint lets delete backends from a LocalAI instance
 // @Summary delete backends from LocalAI.
 // @Tags backends
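
Reviewer note: the `meta_alternative` lookup in `concreteFanOutGuard` is a reverse search through each meta backend's `CapabilitiesMap`. A minimal standalone sketch of that search follows. `backendEntry`, its `isMeta` rule (a non-empty map), and the backend names are assumptions for illustration; the real gallery types and `IsMeta` may differ.

```go
package main

import "fmt"

// backendEntry is a hypothetical, trimmed-down gallery entry.
type backendEntry struct {
	Name            string
	CapabilitiesMap map[string]string // capability -> concrete backend name
}

// isMeta assumes a backend is auto-resolving when it maps capabilities
// to concrete variants. The real IsMeta may use a different rule.
func (b backendEntry) isMeta() bool { return len(b.CapabilitiesMap) > 0 }

// findMetaAlternative returns the name of the first meta backend that lists
// backendID as one of its concrete variants, or "" when none does.
func findMetaAlternative(available []backendEntry, backendID string) string {
	for _, b := range available {
		if !b.isMeta() {
			continue
		}
		for _, concrete := range b.CapabilitiesMap {
			if concrete == backendID {
				return b.Name
			}
		}
	}
	return ""
}

func main() {
	available := []backendEntry{
		{Name: "vllm", CapabilitiesMap: map[string]string{
			"nvidia":  "cuda12-vllm",
			"default": "cpu-vllm",
		}},
		{Name: "cuda12-vllm"},
		{Name: "cpu-vllm"},
	}
	fmt.Println(findMetaAlternative(available, "cuda12-vllm"))           // vllm
	fmt.Printf("%q\n", findMetaAlternative(available, "standalone-gpu")) // ""
}
```

Returning early from a helper avoids the double `break` in the inline loop while keeping the same first-match semantics.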
--- a/core/http/endpoints/localai/nodes.go
+++ b/core/http/endpoints/localai/nodes.go
@@ -73,6 +73,10 @@ type RegisterNodeRequest struct {
 	AvailableRAM  uint64 `json:"available_ram,omitempty"`
 	GPUVendor     string            `json:"gpu_vendor,omitempty"`
 	Labels        map[string]string `json:"labels,omitempty"`
+	// MaxReplicasPerModel is the per-node cap on replicas of any single model.
+	// Workers older than this field omit it; we coerce 0 → 1 below to preserve
+	// historical single-replica behavior.
+	MaxReplicasPerModel int `json:"max_replicas_per_model,omitempty"`
 }

 // RegisterNodeEndpoint registers a new backend node.
@@ -131,17 +135,26 @@ func RegisterNodeEndpoint(registry *nodes.NodeRegistry, expectedToken string, au
 			tokenHash = hex.EncodeToString(h[:])
 		}

+		// Coerce 0 → 1 for backward compat with workers that don't send the field.
+		// GORM's `default:1` only fires for a missing column; once Go zero-values
+		// reach the struct field they're written as 0 unless explicitly set here.
+		maxReplicasPerModel := req.MaxReplicasPerModel
+		if maxReplicasPerModel < 1 {
+			maxReplicasPerModel = 1
+		}
+
 		node := &nodes.BackendNode{
-			Name:          req.Name,
-			NodeType:      nodeType,
-			Address:       req.Address,
-			HTTPAddress:   req.HTTPAddress,
-			TokenHash:     tokenHash,
-			TotalVRAM:     req.TotalVRAM,
-			AvailableVRAM: req.AvailableVRAM,
-			TotalRAM:      req.TotalRAM,
-			AvailableRAM:  req.AvailableRAM,
-			GPUVendor:     req.GPUVendor,
+			Name:                req.Name,
+			NodeType:            nodeType,
+			Address:             req.Address,
+			HTTPAddress:         req.HTTPAddress,
+			TokenHash:           tokenHash,
+			TotalVRAM:           req.TotalVRAM,
+			AvailableVRAM:       req.AvailableVRAM,
+			TotalRAM:            req.TotalRAM,
+			AvailableRAM:        req.AvailableRAM,
+			GPUVendor:           req.GPUVendor,
+			MaxReplicasPerModel: maxReplicasPerModel,
 		}

 		ctx := c.Request().Context()
@@ -363,6 +376,9 @@ func ResumeNodeEndpoint(registry *nodes.NodeRegistry) echo.HandlerFunc {
 }

 // InstallBackendOnNodeEndpoint triggers backend installation on a worker node via NATS.
+// Backend can be either a gallery ID (resolved against BackendGalleries) or a
+// direct URI install (URI + Name + optional Alias) — same shape as the
+// standalone /api/backends/install-external path, just scoped to one node.
 func InstallBackendOnNodeEndpoint(unloader nodes.NodeCommandSender) echo.HandlerFunc {
 	return func(c echo.Context) error {
 		if unloader == nil {
@@ -372,17 +388,27 @@ func InstallBackendOnNodeEndpoint(unloader nodes.NodeCommandSender) echo.Handler
 		var req struct {
 			Backend          string `json:"backend"`
 			BackendGalleries string `json:"backend_galleries,omitempty"`
+			URI              string `json:"uri,omitempty"`
+			Name             string `json:"name,omitempty"`
+			Alias            string `json:"alias,omitempty"`
 		}
-		if err := c.Bind(&req); err != nil || req.Backend == "" {
-			return c.JSON(http.StatusBadRequest, nodeError(http.StatusBadRequest, "backend name required"))
+		if err := c.Bind(&req); err != nil {
+			return c.JSON(http.StatusBadRequest, nodeError(http.StatusBadRequest, "invalid request body"))
 		}
-		reply, err := unloader.InstallBackend(nodeID, req.Backend, "", req.BackendGalleries, "", "", "")
+		// Either a gallery backend name or a direct URI must be supplied.
+		if req.Backend == "" && req.URI == "" {
+			return c.JSON(http.StatusBadRequest, nodeError(http.StatusBadRequest, "backend name or uri required"))
+		}
+		// Admin-driven backend install: not tied to a specific replica slot
+		// (no model is being loaded). Pass replica 0 to match the worker's
+		// admin process-key convention (`backend#0`).
+		reply, err := unloader.InstallBackend(nodeID, req.Backend, "", req.BackendGalleries, req.URI, req.Name, req.Alias, 0)
 		if err != nil {
-			xlog.Error("Failed to install backend on node", "node", nodeID, "backend", req.Backend, "error", err)
+			xlog.Error("Failed to install backend on node", "node", nodeID, "backend", req.Backend, "uri", req.URI, "error", err)
 			return c.JSON(http.StatusInternalServerError, nodeError(http.StatusInternalServerError, "failed to install backend on node"))
 		}
 		if !reply.Success {
-			xlog.Error("Backend install failed on node", "node", nodeID, "backend", req.Backend, "error", reply.Error)
+			xlog.Error("Backend install failed on node", "node", nodeID, "backend", req.Backend, "uri", req.URI, "error", reply.Error)
 			return c.JSON(http.StatusInternalServerError, nodeError(http.StatusInternalServerError, "backend installation failed"))
 		}
 		return c.JSON(http.StatusOK, map[string]string{"message": "backend installed"})
@@ -457,8 +483,8 @@ func UnloadModelOnNodeEndpoint(unloader nodes.NodeCommandSender, registry *nodes
 			xlog.Error("Failed to stop backend after model unload", "node", nodeID, "model", req.ModelName, "error", err)
 			return c.JSON(http.StatusInternalServerError, nodeError(http.StatusInternalServerError, "model unloaded but backend stop failed"))
 		}
-		// Remove from registry
-		registry.RemoveNodeModel(c.Request().Context(), nodeID, req.ModelName)
+		// Remove every replica of this model on the node from the registry.
+		registry.RemoveAllNodeModelReplicas(c.Request().Context(), nodeID, req.ModelName)
 		return c.JSON(http.StatusOK, map[string]string{"message": "model unloaded"})
 	}
 }
@@ -484,7 +510,7 @@ func DeleteModelOnNodeEndpoint(unloader nodes.NodeCommandSender, registry *nodes
 			// Non-fatal — backend process may not be running
 			xlog.Warn("StopBackend failed during model deletion (non-fatal)", "node", nodeID, "model", req.ModelName, "error", err)
 		}
-		registry.RemoveNodeModel(c.Request().Context(), nodeID, req.ModelName)
+		registry.RemoveAllNodeModelReplicas(c.Request().Context(), nodeID, req.ModelName)
 		return c.JSON(http.StatusOK, map[string]string{"message": "model deleted from node"})
 	}
 }
@@ -659,6 +685,78 @@ func GetNodeLabelsEndpoint(registry *nodes.NodeRegistry) echo.HandlerFunc {
 	}
 }

+// UpdateMaxReplicasPerModelRequest is the body for the per-node replica cap endpoint.
+type UpdateMaxReplicasPerModelRequest struct {
+	// Value is the new per-model replica cap on this node. Must be >= 1.
+	Value int `json:"value"`
+}
+
+// UpdateMaxReplicasPerModelEndpoint sets the per-node cap on how many replicas
+// of any one model can be loaded concurrently. The corresponding
+// `node.replica-slots` auto-label is refreshed so existing AND-selectors keep
+// matching, and any unsatisfiable scheduling cooldowns are cleared so the
+// reconciler retries on the next tick.
+//
+// This is a transient admin override — a worker re-registration restores the
+// value the worker was started with (--max-replicas-per-model). For permanent
+// fleet changes, change the worker flag.
+//
+// @Summary Update a node's max replicas per model
+// @Tags Nodes
+// @Param id path string true "Node ID"
+// @Param request body UpdateMaxReplicasPerModelRequest true "New value"
+// @Success 200 {object} map[string]int
+// @Failure 400 {object} map[string]any "value must be >= 1"
+// @Failure 404 {object} map[string]any "node not found"
+// @Router /api/nodes/{id}/max-replicas-per-model [put]
+func UpdateMaxReplicasPerModelEndpoint(registry *nodes.NodeRegistry) echo.HandlerFunc {
+	return func(c echo.Context) error {
+		ctx := c.Request().Context()
+		nodeID := c.Param("id")
+		if _, err := registry.Get(ctx, nodeID); err != nil {
+			return c.JSON(http.StatusNotFound, nodeError(http.StatusNotFound, "node not found"))
+		}
+		var req UpdateMaxReplicasPerModelRequest
+		if err := c.Bind(&req); err != nil {
+			return c.JSON(http.StatusBadRequest, nodeError(http.StatusBadRequest, "invalid request body"))
+		}
+		if req.Value < 1 {
+			return c.JSON(http.StatusBadRequest, nodeError(http.StatusBadRequest, "value must be >= 1"))
+		}
+		if err := registry.UpdateMaxReplicasPerModel(ctx, nodeID, req.Value); err != nil {
+			xlog.Error("Failed to update max_replicas_per_model", "node", nodeID, "value", req.Value, "error", err)
+			return c.JSON(http.StatusInternalServerError, nodeError(http.StatusInternalServerError, "failed to update max replicas per model"))
+		}
+		return c.JSON(http.StatusOK, map[string]int{"max_replicas_per_model": req.Value})
+	}
+}
+
+// ResetMaxReplicasPerModelEndpoint clears the admin override on a node, so
+// the next worker re-registration is allowed to update the value from its
+// CLI flag again. The current value is left in place until the worker calls
+// register.
+//
+// @Summary Reset a node's max replicas per model to the worker default
+// @Tags Nodes
+// @Param id path string true "Node ID"
+// @Success 200 {object} map[string]bool
+// @Failure 404 {object} map[string]any "node not found"
+// @Router /api/nodes/{id}/max-replicas-per-model [delete]
+func ResetMaxReplicasPerModelEndpoint(registry *nodes.NodeRegistry) echo.HandlerFunc {
+	return func(c echo.Context) error {
+		ctx := c.Request().Context()
+		nodeID := c.Param("id")
+		if _, err := registry.Get(ctx, nodeID); err != nil {
+			return c.JSON(http.StatusNotFound, nodeError(http.StatusNotFound, "node not found"))
+		}
+		if err := registry.ResetMaxReplicasPerModel(ctx, nodeID); err != nil {
+			xlog.Error("Failed to reset max_replicas_per_model override", "node", nodeID, "error", err)
+			return c.JSON(http.StatusInternalServerError, nodeError(http.StatusInternalServerError, "failed to reset override"))
+		}
+		return c.JSON(http.StatusOK, map[string]bool{"reset": true})
+	}
+}
+
 // SetNodeLabelsEndpoint replaces all labels for a node.
 func SetNodeLabelsEndpoint(registry *nodes.NodeRegistry) echo.HandlerFunc {
 	return func(c echo.Context) error {
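
Reviewer note: this file treats an out-of-range replica cap in two different ways. Registration coerces anything below 1 to 1 for backward compatibility, while the admin override endpoint rejects it with a 400. The sketch below contrasts the two policies. The function and error names are hypothetical and no HTTP or registry wiring is shown.

```go
package main

import (
	"errors"
	"fmt"
)

// errValueTooSmall mirrors the "value must be >= 1" message returned by the
// override endpoint. The variable name itself is hypothetical.
var errValueTooSmall = errors.New("value must be >= 1")

// coerceRegisteredCap applies the registration policy: older workers omit the
// field, so zero (or any value below 1) falls back to a single replica.
func coerceRegisteredCap(reported int) int {
	if reported < 1 {
		return 1
	}
	return reported
}

// validateOverrideCap applies the admin-override policy: an explicit request
// below 1 is a caller error and is rejected rather than silently adjusted.
func validateOverrideCap(requested int) (int, error) {
	if requested < 1 {
		return 0, errValueTooSmall
	}
	return requested, nil
}

func main() {
	fmt.Println(coerceRegisteredCap(0)) // 1
	fmt.Println(coerceRegisteredCap(4)) // 4

	if _, err := validateOverrideCap(0); errors.Is(err, errValueTooSmall) {
		fmt.Println("rejected:", err)
	}
	value, _ := validateOverrideCap(3)
	fmt.Println(value) // 3
}
```

The asymmetry looks deliberate: a missing field from an old worker is expected input, whereas an admin sending 0 is more likely a mistake worth surfacing.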
--- a/core/http/react-ui/e2e/manage-logs-link.spec.js
+++ b/core/http/react-ui/e2e/manage-logs-link.spec.js
@@ -1,29 +1,32 @@
 import { test, expect } from '@playwright/test'

 test.describe('Manage Page - Backend Logs Link', () => {
-  test('models table shows terminal icon for logs', async ({ page }) => {
+  test('row action menu exposes Backend logs entry with terminal icon', async ({ page }) => {
    await page.goto('/app/manage')
-    // Wait for models to load
    await expect(page.locator('.table')).toBeVisible({ timeout: 10_000 })

-    // Check for terminal icon (backend logs link)
-    const terminalIcon = page.locator('a[title="Backend logs"] i.fa-terminal')
-    await expect(terminalIcon.first()).toBeVisible()
+    // Row actions live behind the kebab (ActionMenu) — open the first row's menu.
+    const trigger = page.locator('button.action-menu__trigger').first()
+    await expect(trigger).toBeVisible()
+    await trigger.click()
+
+    const logsItem = page.getByRole('menuitem', { name: 'Backend logs' })
+    await expect(logsItem).toBeVisible()
+    await expect(logsItem.locator('i.fa-terminal')).toBeVisible()
  })

-  test('terminal icon links to backend-logs page', async ({ page }) => {
+  test('Backend logs menu item navigates to backend-logs page', async ({ page }) => {
    await page.goto('/app/manage')
    await expect(page.locator('.table')).toBeVisible({ timeout: 10_000 })

-    const logsLink = page.locator('a[title="Backend logs"]').first()
-    await expect(logsLink).toBeVisible()
+    const trigger = page.locator('button.action-menu__trigger').first()
+    await expect(trigger).toBeVisible()
+    await trigger.click()

-    // Link uses href="#" with onClick for navigation
-    const href = await logsLink.getAttribute('href')
-    expect(href).toBe('#')
+    const logsItem = page.getByRole('menuitem', { name: 'Backend logs' })
+    await expect(logsItem).toBeVisible()
+    await logsItem.click()

-    // Click and verify navigation
-    await logsLink.click()
    await expect(page).toHaveURL(/\/app\/backend-logs\//)
  })
 })
--- a/core/http/react-ui/e2e/nodes-per-node-backend-actions.spec.js
+++ b/core/http/react-ui/e2e/nodes-per-node-backend-actions.spec.js
@@ -0,0 +1,166 @@
+import { test, expect } from '@playwright/test'
+
+// These specs cover the per-node backend row in the Nodes page:
+//   - the upgrade affordance is self-explanatory (icon + tooltip)
+//   - a delete affordance is present and goes through ConfirmDialog
+//
+// We mock the distributed-mode API so the tests can run against the
+// standalone ui-test-server without spinning up workers/NATS.
+
+const NODE_ID = 'test-node-1'
+const NODE_NAME = 'worker-test'
+const BACKEND_NAME = 'cuda12-vllm-development'
+
+async function mockDistributedNodes(page, { onDelete } = {}) {
+  await page.route('**/api/nodes', (route) => {
+    route.fulfill({
+      status: 200,
+      contentType: 'application/json',
+      body: JSON.stringify([
+        {
+          id: NODE_ID,
+          name: NODE_NAME,
+          node_type: 'backend',
+          address: '10.0.0.1:50051',
+          http_address: '10.0.0.1:8090',
+          status: 'healthy',
+          total_vram: 0,
+          available_vram: 0,
+          total_ram: 8_000_000_000,
+          available_ram: 4_000_000_000,
+          gpu_vendor: '',
+          last_heartbeat: new Date().toISOString(),
+          created_at: new Date().toISOString(),
+          updated_at: new Date().toISOString(),
+        },
+      ]),
+    })
+  })
+
+  await page.route('**/api/nodes/scheduling', (route) => {
+    route.fulfill({
+      status: 200,
+      contentType: 'application/json',
+      body: '[]',
+    })
+  })
+
+  await page.route(`**/api/nodes/${NODE_ID}/models`, (route) => {
+    route.fulfill({
+      status: 200,
+      contentType: 'application/json',
+      body: '[]',
+    })
+  })
+
+  await page.route(`**/api/nodes/${NODE_ID}/backends`, (route) => {
+    route.fulfill({
+      status: 200,
+      contentType: 'application/json',
+      body: JSON.stringify([
+        {
+          name: BACKEND_NAME,
+          is_system: false,
+          is_meta: false,
+          installed_at: new Date().toISOString(),
+        },
+      ]),
+    })
+  })
+
+  await page.route(`**/api/nodes/${NODE_ID}/backends/delete`, async (route) => {
+    if (onDelete) {
+      await onDelete(route)
+    }
+    route.fulfill({
+      status: 200,
+      contentType: 'application/json',
+      body: JSON.stringify({ message: 'backend deleted' }),
+    })
+  })
+}
+
+async function expandNodeAndWaitForBackends(page) {
+  await page.goto('/app/nodes')
+  // Click the row to expand it. The chevron toggle and the row both work,
+  // but clicking the name cell is the most user-like.
+  await page.getByText(NODE_NAME).first().click()
+  // Backends, Capacity and Labels live behind a "Manage" <details>
+  // disclosure (the drawer was distilled to keep at-a-glance content
+  // lean — see distill refactor in the multi-replica branch). Open it
+  // by clicking the summary inside the .node-manage scope so the
+  // per-node backend table is in the DOM before assertions run.
+  await page.locator('.node-manage > summary').first().click()
+  await expect(page.getByRole('cell', { name: BACKEND_NAME, exact: true })).toBeVisible({ timeout: 10_000 })
+}
+
+test.describe('Nodes page — per-node backend actions', () => {
+  test('upgrade affordance is self-explanatory (not "Reinstall backend" with a sync icon)', async ({ page }) => {
+    await mockDistributedNodes(page)
+    await expandNodeAndWaitForBackends(page)
+
+    // Negative: the old, ambiguous wording must not be used.
+    await expect(page.locator('button[title="Reinstall backend"]')).toHaveCount(0)
+    await expect(page.locator('button[title="Reinstall backend"] i.fa-sync-alt')).toHaveCount(0)
+
+    // Positive: a self-explanatory upgrade affordance is rendered next to the
+    // backend row. We accept an arrow-up, arrows-rotate, or up-long glyph; all
+    // three read as "upgrade" in FontAwesome 6.
+    const upgradeBtn = page.locator('button[title="Upgrade backend on this node"]')
+    await expect(upgradeBtn).toBeVisible()
+    const iconClass = await upgradeBtn.locator('i').getAttribute('class')
+    expect(iconClass).toMatch(/fa-(arrow-up|arrows-rotate|up-long)/)
+  })
+
+  test('per-node backend row shows a delete (trash) button next to upgrade', async ({ page }) => {
+    await mockDistributedNodes(page)
+    await expandNodeAndWaitForBackends(page)
+
+    const deleteBtn = page.locator('button[title="Delete backend from this node"]')
+    await expect(deleteBtn).toBeVisible()
+    await expect(deleteBtn.locator('i.fa-trash')).toBeVisible()
+  })
+
+  test('clicking delete opens the confirm dialog and POSTs to the per-node delete endpoint', async ({ page }) => {
+    let postedBody = null
+    await mockDistributedNodes(page, {
+      onDelete: async (route) => {
+        postedBody = route.request().postDataJSON()
+      },
+    })
+    await expandNodeAndWaitForBackends(page)
+
+    await page.locator('button[title="Delete backend from this node"]').click()
+
+    // ConfirmDialog uses role="alertdialog" and a danger confirm button.
+    const dialog = page.getByRole('alertdialog')
+    await expect(dialog).toBeVisible()
+    const confirmBtn = dialog.locator('button.btn-danger')
+    await expect(confirmBtn).toBeVisible()
+    await confirmBtn.click()
+
+    // Wait until the POST landed.
+    await expect.poll(() => postedBody, { timeout: 5_000 }).toEqual({ backend: BACKEND_NAME })
+  })
+
+  test('clicking delete and cancelling does not POST', async ({ page }) => {
+    let deleteCalls = 0
+    await mockDistributedNodes(page, {
+      onDelete: () => {
+        deleteCalls += 1
+      },
+    })
+    await expandNodeAndWaitForBackends(page)
+
+    await page.locator('button[title="Delete backend from this node"]').click()
+
+    const dialog = page.getByRole('alertdialog')
+    await expect(dialog).toBeVisible()
+    await dialog.getByRole('button', { name: /cancel/i }).click()
+    await expect(dialog).toBeHidden()
+
+    // Give any errant request a moment to fire so a regression would be caught.
+    await page.waitForTimeout(500)
+    expect(deleteCalls).toBe(0)
+  })
+})
--- a/core/http/react-ui/index.html
+++ b/core/http/react-ui/index.html
@@ -7,7 +7,7 @@
    <link rel="icon" type="image/svg+xml" href="/favicon.svg" />
    <link rel="preconnect" href="https://fonts.googleapis.com" />
    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
-    <link href="https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;600;700&family=JetBrains+Mono:wght@400;500;600&display=swap" rel="stylesheet" />
+    <link href="https://fonts.googleapis.com/css2?family=Geist:wght@300..700&family=Geist+Mono:wght@300..700&display=swap" rel="stylesheet" />
  </head>
  <body>
    <div id="root"></div>
--- a/core/http/react-ui/package-lock.json
+++ b/core/http/react-ui/package-lock.json
@@ -3258,9 +3258,9 @@
      }
    },
    "node_modules/postcss": {
-      "version": "8.5.8",
-      "resolved": "https://registry.npmjs.org/postcss/-/postcss-8.5.8.tgz",
-      "integrity": "sha512-OW/rX8O/jXnm82Ey1k44pObPtdblfiuWnrd8X7GJ7emImCOstunGbXUpp7HdBrFQX6rJzn3sPT397Wp5aCwCHg==",
+      "version": "8.5.10",
+      "resolved": "https://registry.npmjs.org/postcss/-/postcss-8.5.10.tgz",
+      "integrity": "sha512-pMMHxBOZKFU6HgAZ4eyGnwXF/EvPGGqUr0MnZ5+99485wwW41kW91A4LOGxSHhgugZmSChL5AlElNdwlNgcnLQ==",
      "dev": true,
      "funding": [
        {
@@ -3276,6 +3276,7 @@
          "url": "https://github.com/sponsors/ai"
        }
      ],
+      "license": "MIT",
      "dependencies": {
        "nanoid": "^3.3.11",
        "picocolors": "^1.1.1",
--- a/core/http/react-ui/src/App.css
+++ b/core/http/react-ui/src/App.css
--- a/core/http/react-ui/src/App.jsx
+++ b/core/http/react-ui/src/App.jsx
@@ -1,9 +1,11 @@
-import { useState, useEffect } from 'react'
-import { Outlet, useLocation } from 'react-router-dom'
+import { useState, useEffect, useRef } from 'react'
+import { Outlet, useLocation, useNavigate } from 'react-router-dom'
 import Sidebar from './components/Sidebar'
 import OperationsBar from './components/OperationsBar'
 import { ToastContainer, useToast } from './components/Toast'
 import { systemApi } from './utils/api'
+import { useTheme } from './contexts/ThemeContext'
+import { useAuth } from './context/AuthContext'

 const COLLAPSED_KEY = 'localai_sidebar_collapsed'

@@ -15,6 +17,10 @@ export default function App() {
  const { toasts, addToast, removeToast } = useToast()
  const [version, setVersion] = useState('')
  const location = useLocation()
+  const navigate = useNavigate()
+  const { theme, toggleTheme } = useTheme()
+  const { authEnabled, user } = useAuth()
+  const hamburgerRef = useRef(null)
  const isChatRoute = location.pathname.match(/\/chat(\/|$)/) || location.pathname.match(/\/agents\/[^/]+\/chat/)

  useEffect(() => {
@@ -34,26 +40,80 @@ export default function App() {
    window.scrollTo(0, 0)
  }, [location.pathname])

+  // Drawer polish: lock body scroll, close on Escape, return focus to the
+  // hamburger when the drawer closes. Only engages when the drawer is open;
+  // desktop and tablet rail mode are unaffected.
+  useEffect(() => {
+    if (!sidebarOpen) return
+    const prevOverflow = document.body.style.overflow
+    document.body.style.overflow = 'hidden'
+    const onKey = (e) => { if (e.key === 'Escape') setSidebarOpen(false) }
+    window.addEventListener('keydown', onKey)
+    return () => {
+      document.body.style.overflow = prevOverflow
+      window.removeEventListener('keydown', onKey)
+      // Restore focus to the trigger so keyboard users land back where
+      // they invoked the drawer from.
+      hamburgerRef.current?.focus()
+    }
+  }, [sidebarOpen])
+
  const layoutClasses = [
    'app-layout',
    isChatRoute ? 'app-layout-chat' : '',
    sidebarCollapsed ? 'sidebar-is-collapsed' : '',
  ].filter(Boolean).join(' ')

+  const showAvatar = authEnabled && user
+  const accountLabel = user?.name || user?.email || 'Account'
+
  return (
    <div className={layoutClasses}>
      <Sidebar isOpen={sidebarOpen} onClose={() => setSidebarOpen(false)} />
-      <main className="main-content">
+      <main className="main-content" {...(sidebarOpen ? { 'aria-hidden': 'true', inert: '' } : {})}>
        <OperationsBar />
-        {/* Mobile header */}
+        {/* Mobile header — primary actions reachable without opening the
+            drawer. Hamburger is the only way to expand the nav on phones;
+            theme toggle and account avatar are mirrored from the sidebar
+            footer so they remain one tap away. */}
        <header className="mobile-header">
          <button
+            ref={hamburgerRef}
            className="hamburger-btn"
            onClick={() => setSidebarOpen(true)}
+            aria-label="Open menu"
+            aria-expanded={sidebarOpen}
+            aria-controls="app-sidebar"
          >
-            <i className="fas fa-bars" />
+            <i className="fas fa-bars" aria-hidden="true" />
          </button>
          <span className="mobile-title">LocalAI</span>
+          <div className="mobile-header-actions">
+            <button
+              type="button"
+              className="mobile-header-btn"
+              onClick={toggleTheme}
+              aria-label={`Switch to ${theme === 'dark' ? 'light' : 'dark'} mode`}
+              title={`Switch to ${theme === 'dark' ? 'light' : 'dark'} mode`}
+            >
+              <i className={`fas ${theme === 'dark' ? 'fa-sun' : 'fa-moon'}`} aria-hidden="true" />
+            </button>
+            {showAvatar && (
+              <button
+                type="button"
+                className="mobile-header-btn mobile-header-avatar"
+                onClick={() => navigate('/app/account')}
+                aria-label={`Account: ${accountLabel}`}
+                title={accountLabel}
+              >
+                {user.avatarUrl ? (
+                  <img src={user.avatarUrl} alt="" />
+                ) : (
+                  <i className="fas fa-user-circle" aria-hidden="true" />
+                )}
+              </button>
+            )}
+          </div>
        </header>
        <div className="main-content-inner">
          <div className="page-transition" key={location.pathname}>
--- a/core/http/react-ui/src/components/ActionMenu.jsx
+++ b/core/http/react-ui/src/components/ActionMenu.jsx
@@ -0,0 +1,141 @@
+import { useRef, useState, useEffect, useCallback } from 'react'
+import Popover from './Popover'
+
+// ActionMenu renders a kebab (three-dot) button that opens a popover with a
+// list of row actions. Replaces the inline cluster of icon buttons that made
+// dense tables feel like a control panel — actions stay out of the way until
+// the user reaches for them, the way Linear/Vercel/Notion handle row menus.
+//
+// Items shape:
+//   { key, icon?, label, onClick, danger?, disabled?, hidden?, shortcut? }
+//   { divider: true }                       // visual separator
+//   { type: 'badge', icon?, label }         // non-interactive badge row
+//
+// Hidden items are filtered out so callers can write conditional menus
+// inline (`{ key: 'stop', hidden: !isRunning, ... }` style) without ternaries.
+//
+// Keyboard:
+//   Arrows / Home / End  - move highlight (skipping dividers + badges)
+//   Enter / Space        - activate
+//   Escape               - close, return focus to trigger
+export default function ActionMenu({ items, ariaLabel = 'Actions', triggerLabel, compact = false }) {
+  const triggerRef = useRef(null)
+  const [open, setOpen] = useState(false)
+  const [activeIdx, setActiveIdx] = useState(-1)
+
+  const interactive = (Array.isArray(items) ? items : []).filter(it => it && !it.divider && it.type !== 'badge' && !it.hidden)
+  const visible = (Array.isArray(items) ? items : []).filter(it => it && !it.hidden)
+
+  const close = useCallback(() => {
+    setOpen(false)
+    setActiveIdx(-1)
+  }, [])
+
+  // Move highlight to the first interactive item when opening, so keyboard
+  // users land somewhere meaningful instead of having to arrow into the menu.
+  useEffect(() => {
+    if (open && activeIdx === -1 && interactive.length > 0) {
+      setActiveIdx(0)
+    }
+  }, [open, activeIdx, interactive.length])
+
+  const handleTriggerKeyDown = (e) => {
+    if (e.key === 'ArrowDown' || e.key === 'Enter' || e.key === ' ') {
+      e.preventDefault()
+      e.stopPropagation()
+      setOpen(true)
+    }
+  }
+
+  const handleMenuKeyDown = (e) => {
+    if (e.key === 'ArrowDown') {
+      e.preventDefault()
+      setActiveIdx(i => Math.min(interactive.length - 1, (i < 0 ? -1 : i) + 1))
+    } else if (e.key === 'ArrowUp') {
+      e.preventDefault()
+      setActiveIdx(i => Math.max(0, (i < 0 ? interactive.length : i) - 1))
+    } else if (e.key === 'Home') {
+      e.preventDefault()
+      setActiveIdx(0)
+    } else if (e.key === 'End') {
+      e.preventDefault()
+      setActiveIdx(interactive.length - 1)
+    } else if (e.key === 'Enter' || e.key === ' ') {
+      e.preventDefault()
+      const item = interactive[activeIdx]
+      if (item && !item.disabled) {
+        close()
+        item.onClick?.()
+      }
+    }
+  }
+
+  if (interactive.length === 0 && !visible.some(it => it.type === 'badge')) {
+    return null
+  }
+
+  return (
+    <>
+      <button
+        ref={triggerRef}
+        type="button"
+        className={`action-menu__trigger${compact ? ' action-menu__trigger--compact' : ''}${open ? ' is-open' : ''}`}
+        aria-haspopup="menu"
+        aria-expanded={open}
+        aria-label={triggerLabel || ariaLabel}
+        onClick={(e) => { e.stopPropagation(); setOpen(v => !v) }}
+        onKeyDown={handleTriggerKeyDown}
+      >
+        <i className="fas fa-ellipsis-vertical" aria-hidden="true" />
+      </button>
+      <Popover anchor={triggerRef} open={open} onClose={close} ariaLabel={ariaLabel}>
+        <div
+          role="menu"
+          aria-label={ariaLabel}
+          className="action-menu"
+          onKeyDown={handleMenuKeyDown}
+          // Capture focus when the menu opens so arrow keys work without the
+          // user clicking inside first.
+          tabIndex={-1}
+          ref={el => { if (el && open) el.focus() }}
+        >
+          {visible.map((item, i) => {
+            if (item.divider) {
+              return <div key={`d-${i}`} className="action-menu__divider" role="separator" />
+            }
+            if (item.type === 'badge') {
+              return (
+                <div key={item.key || `b-${i}`} className="action-menu__badge" role="presentation">
+                  {item.icon && <i className={`fas ${item.icon}`} aria-hidden="true" />}
+                  <span>{item.label}</span>
+                </div>
+              )
+            }
+            const idx = interactive.indexOf(item)
+            const active = idx === activeIdx
+            return (
+              <button
+                key={item.key}
+                type="button"
+                role="menuitem"
+                disabled={item.disabled}
+                className={`action-menu__item${item.danger ? ' is-danger' : ''}${active ? ' is-active' : ''}`}
+                onMouseEnter={() => setActiveIdx(idx)}
+                onClick={(e) => {
+                  e.stopPropagation()
+                  if (item.disabled) return
+                  close()
+                  item.onClick?.()
+                }}
+              >
+                {item.icon && <i className={`fas ${item.icon} action-menu__icon`} aria-hidden="true" />}
+                <span className="action-menu__label">{item.label}</span>
+                {item.shortcut && <span className="action-menu__shortcut">{item.shortcut}</span>}
+              </button>
+            )
+          })}
+        </div>
+      </Popover>
+    </>
+  )
+}
--- a/core/http/react-ui/src/components/ClientMCPDropdown.jsx
+++ b/core/http/react-ui/src/components/ClientMCPDropdown.jsx
@@ -80,7 +80,7 @@ export default function ClientMCPDropdown({
                placeholder="Server URL (e.g. https://mcp.example.com/sse)"
                value={url}
                onChange={e => setUrl(e.target.value)}
-                style={{ width: '100%', marginBottom: '4px' }}
+                style={{ width: '100%', marginBottom: 'var(--spacing-xs)' }}
              />
              <input
                type="text"
@@ -88,7 +88,7 @@ export default function ClientMCPDropdown({
                placeholder="Name (optional)"
                value={name}
                onChange={e => setName(e.target.value)}
-                style={{ width: '100%', marginBottom: '4px' }}
+                style={{ width: '100%', marginBottom: 'var(--spacing-xs)' }}
              />
              <input
                type="password"
@@ -96,13 +96,13 @@ export default function ClientMCPDropdown({
                placeholder="Auth token (optional)"
                value={authToken}
                onChange={e => setAuthToken(e.target.value)}
-                style={{ width: '100%', marginBottom: '4px' }}
+                style={{ width: '100%', marginBottom: 'var(--spacing-xs)' }}
              />
              <label style={{ display: 'flex', alignItems: 'center', gap: '6px', fontSize: '0.8rem', marginBottom: '6px' }}>
                <input type="checkbox" checked={useProxy} onChange={e => setUseProxy(e.target.checked)} />
                Use CORS proxy
              </label>
-              <div style={{ display: 'flex', gap: '4px', justifyContent: 'flex-end' }}>
+              <div style={{ display: 'flex', gap: 'var(--spacing-xs)', justifyContent: 'flex-end' }}>
                <button type="button" className="btn btn-sm btn-secondary" onClick={() => setAddDialog(false)}>Cancel</button>
                <button type="button" className="btn btn-sm btn-primary" onClick={handleAdd} disabled={!url.trim()}>Add</button>
              </div>
--- a/core/http/react-ui/src/components/ConfigFieldRenderer.jsx
+++ b/core/http/react-ui/src/components/ConfigFieldRenderer.jsx
@@ -135,7 +135,7 @@ function JsonEditor({ value, onChange }) {
        className="input"
        value={text}
        onChange={e => handleChange(e.target.value)}
-        style={{ width: '100%', minHeight: 80, fontFamily: 'monospace', fontSize: '0.8125rem', resize: 'vertical' }}
+        style={{ width: '100%', minHeight: 80, fontFamily: 'var(--font-mono)', fontSize: '0.8125rem', resize: 'vertical' }}
      />
      {parseError && <div style={{ color: 'var(--color-error)', fontSize: '0.75rem', marginTop: 2 }}>{parseError}</div>}
    </div>
--- a/core/http/react-ui/src/components/FieldBrowser.jsx
+++ b/core/http/react-ui/src/components/FieldBrowser.jsx
@@ -158,7 +158,7 @@ export default function FieldBrowser({ fields, activeFieldPaths, onAddField }) {
                      {field.description}
                    </div>
                  )}
-                  <div style={{ fontSize: '0.6875rem', color: 'var(--color-text-muted)', marginTop: 1, fontFamily: 'monospace' }}>
+                  <div style={{ fontSize: '0.6875rem', color: 'var(--color-text-muted)', marginTop: 1, fontFamily: 'var(--font-mono)' }}>
                    {field.path}
                  </div>
                </div>
--- a/core/http/react-ui/src/components/GalleryLoader.jsx
+++ b/core/http/react-ui/src/components/GalleryLoader.jsx
@@ -0,0 +1,79 @@
+import { useState, useEffect } from 'react'
+
+const LOADING_PHRASES = [
+  { text: 'Loading models...', icon: 'fa-brain' },
+  { text: 'Fetching gallery...', icon: 'fa-download' },
+  { text: 'Checking availability...', icon: 'fa-circle-check' },
+  { text: 'Almost ready...', icon: 'fa-hourglass-half' },
+  { text: 'Preparing gallery...', icon: 'fa-store' },
+]
+
+// GalleryLoader is the animated skeleton used while the gallery list loads.
+// Used by Models, Backends, and (now) the Manage page so an empty fetch state
+// reads the same everywhere instead of one tab showing pulsing dots and the
+// other showing "Loading...".
+export default function GalleryLoader() {
+  const [idx, setIdx] = useState(() => Math.floor(Math.random() * LOADING_PHRASES.length))
+  const [fade, setFade] = useState(true)
+
+  useEffect(() => {
+    const interval = setInterval(() => {
+      setFade(false)
+      setTimeout(() => {
+        setIdx(prev => (prev + 1) % LOADING_PHRASES.length)
+        setFade(true)
+      }, 300)
+    }, 2800)
+    return () => clearInterval(interval)
+  }, [])
+
+  const phrase = LOADING_PHRASES[idx]
+
+  return (
+    <div style={{
+      display: 'flex', flexDirection: 'column', alignItems: 'center',
+      justifyContent: 'center', padding: 'var(--spacing-xl) var(--spacing-md)',
+      minHeight: '280px', gap: 'var(--spacing-lg)',
+    }}>
+      <div style={{ display: 'flex', gap: 'var(--spacing-sm)' }}>
+        {[0, 1, 2, 3, 4].map(i => (
+          <div key={i} style={{
+            width: 10, height: 10, borderRadius: '50%',
+            background: 'var(--color-primary)',
+            animation: `galleryDot 1.4s ease-in-out ${i * 0.15}s infinite`,
+          }} />
+        ))}
+      </div>
+      <div style={{
+        display: 'flex', alignItems: 'center', gap: 'var(--spacing-sm)',
+        opacity: fade ? 1 : 0,
+        transition: 'opacity 300ms ease',
+        color: 'var(--color-text-secondary)',
+        fontSize: '0.9375rem',
+        fontWeight: 500,
+      }}>
+        <i className={`fas ${phrase.icon}`} style={{ color: 'var(--color-accent)', fontSize: '1.125rem' }} />
+        {phrase.text}
+      </div>
+      <div style={{ width: '100%', maxWidth: '700px', display: 'flex', flexDirection: 'column', gap: '12px' }}>
+        {[0.9, 0.7, 0.5].map((opacity, i) => (
+          <div key={i} style={{
+            height: '48px', borderRadius: 'var(--radius-md)',
+            background: 'var(--color-bg-tertiary)', opacity,
+            animation: `galleryShimmer 1.8s ease-in-out ${i * 0.2}s infinite`,
+          }} />
+        ))}
+      </div>
+      <style>{`
+        @keyframes galleryDot {
+          0%, 80%, 100% { transform: scale(0.4); opacity: 0.3; }
+          40% { transform: scale(1); opacity: 1; }
+        }
+        @keyframes galleryShimmer {
+          0%, 100% { opacity: var(--shimmer-base, 0.15); }
+          50% { opacity: var(--shimmer-peak, 0.3); }
+        }
+      `}</style>
+    </div>
+  )
+}
--- a/core/http/react-ui/src/components/ManageSummary.jsx
+++ b/core/http/react-ui/src/components/ManageSummary.jsx
@@ -0,0 +1,47 @@
+import StatCard from './StatCard'
+
+// ManageSummary anchors the Manage page with the same StatCard pattern the
+// Nodes dashboard uses, so the page reads as a real overview rather than
+// "two tabs in a hat". Counts are derived in-memory by the parent — this
+// component is purely presentational. Cards are clickable and route the
+// user to the relevant tab + filter.
+export default function ManageSummary({
+  modelsCount,
+  backendsCount,
+  runningCount,
+  updatesCount,
+  onCardClick,
+}) {
+  const click = (tab, filter) => onCardClick && onCardClick(tab, filter)
+
+  return (
+    <div className="stat-grid manage-summary">
+      <StatCard
+        icon="fas fa-brain"
+        label="Models Installed"
+        value={modelsCount}
+        onClick={() => click('models', 'all')}
+      />
+      <StatCard
+        icon="fas fa-server"
+        label="Backends Installed"
+        value={backendsCount}
+        onClick={() => click('backends', 'all')}
+      />
+      <StatCard
+        icon="fas fa-circle-play"
+        label="Currently Running"
+        value={runningCount}
+        accentVar={runningCount > 0 ? '--color-success' : undefined}
+        onClick={() => click('models', 'running')}
+      />
+      <StatCard
+        icon="fas fa-arrow-up"
+        label="Updates Available"
+        value={updatesCount}
+        accentVar={updatesCount > 0 ? '--color-warning' : undefined}
+        onClick={() => click('backends', updatesCount > 0 ? 'upgradable' : 'all')}
+      />
+    </div>
+  )
+}
--- a/core/http/react-ui/src/components/MetaBadgeRow.jsx
+++ b/core/http/react-ui/src/components/MetaBadgeRow.jsx
@@ -0,0 +1,30 @@
+// MetaBadgeRow renders the System / User / Meta / Dev badge cluster the same
+// way everywhere — Manage tabs and (in future) Install gallery. The badges
+// already exist as classes; this component locks down the icons + labels so
+// the same backend type doesn't read "User" in one tab and "downloaded" in
+// another.
+export default function MetaBadgeRow({ isSystem, isMeta, isDevelopment }) {
+  return (
+    <div className="badge-row">
+      {isSystem ? (
+        <span className="badge badge-info" title="Bundled with the LocalAI runtime">
+          <i className="fas fa-shield-alt" /> System
+        </span>
+      ) : (
+        <span className="badge badge-success" title="Installed from the gallery or external source">
+          <i className="fas fa-download" /> User
+        </span>
+      )}
+      {isMeta && (
+        <span className="badge badge-accent" title="Meta backend — selects a concrete variant per node">
+          <i className="fas fa-layer-group" /> Meta
+        </span>
+      )}
+      {isDevelopment && (
+        <span className="badge badge-warning" title="Marked as development / pre-release by the gallery">
+          <i className="fas fa-flask" /> Dev
+        </span>
+      )}
+    </div>
+  )
+}
--- a/core/http/react-ui/src/components/NodeInstallPicker.jsx
+++ b/core/http/react-ui/src/components/NodeInstallPicker.jsx
@@ -0,0 +1,668 @@
+import { useState, useMemo, useEffect, useRef } from 'react'
+import Modal from './Modal'
+import SearchableSelect from './SearchableSelect'
+import { nodesApi } from '../utils/api'
+
+// NodeInstallPicker is the single multi-node install surface used both from
+// the Backends gallery split-button and from the "Install on more nodes" `+`
+// affordance in the Nodes column. Submit fires N parallel per-node install
+// calls; rows transition inline so the user sees per-node success/failure
+// without leaving the modal.
+//
+// Props:
+//   open               — controls visibility
+//   onClose            — close handler (header X / Cancel / Esc / backdrop)
+//   onComplete         — fired after at least one node install succeeded;
+//                        gallery uses this to refetch and update the Nodes
+//                        column without a manual reload
+//   backend            — { name, isMeta, capabilities, metaBackendFor }
+//   nodes              — BackendNode[] from /api/nodes
+//   installedNodeIds   — Set/array of node IDs that already have this backend
+//   initialSelection   — optional pre-selected node IDs (e.g. "missing nodes"
+//                        when opened from the Nodes column `+` affordance)
+
+const STATUS_LABELS = { healthy: 'Healthy', draining: 'Draining', unhealthy: 'Unhealthy', offline: 'Offline' }
+
+function formatVRAM(bytes) {
+  if (!bytes || bytes === 0) return null
+  const gb = bytes / (1024 * 1024 * 1024)
+  return gb >= 1 ? `${gb.toFixed(1)} GB` : `${(bytes / (1024 * 1024)).toFixed(0)} MB`
+}
+
+function gpuVendorLabel(vendor) {
+  const labels = { nvidia: 'NVIDIA', amd: 'AMD', intel: 'Intel', vulkan: 'Vulkan' }
+  return labels[vendor] || null
+}
+
+// hardwareTargetOf finds the capability key that points to a concrete
+// variant in the parent meta's CapabilitiesMap, e.g. cpu-llama-cpp listed
+// as {"cpu": "cpu-llama-cpp"} yields "cpu". Falls back to "" when the parent
+// is unknown (the gallery list payload still gives us metaBackendFor).
+function hardwareTargetOf(backend, allBackends) {
+  if (!backend || !backend.name || backend.isMeta) return ''
+  const parentName = backend.metaBackendFor
+  if (!parentName) return ''
+  const parent = (allBackends || []).find(b => b.name === parentName || b.id === parentName)
+  if (!parent || !parent.capabilities) return ''
+  for (const [cap, concreteName] of Object.entries(parent.capabilities)) {
+    if (concreteName === backend.name) return cap
+  }
+  return ''
+}
+
+// humanTargetLabel turns a capability key into a user-facing phrase used in
+// the picker header note: "CPU build", "CUDA 12 build", etc. Keep it
+// concrete and product-recognisable, not the raw token from the gallery.
+function humanTargetLabel(target) {
+  if (!target) return 'hardware-specific build'
+  const t = target.toLowerCase()
+  if (t.startsWith('cpu') || t === 'default') return 'CPU build'
+  if (t.includes('cuda-13') || t.includes('cuda13')) return 'CUDA 13 build'
+  if (t.includes('cuda-12') || t.includes('cuda12')) return 'CUDA 12 build'
+  if (t.includes('cuda')) return 'NVIDIA CUDA build'
+  if (t.includes('l4t')) return 'NVIDIA Jetson (L4T) build'
+  if (t.includes('nvidia')) return 'NVIDIA build'
+  if (t.includes('rocm') || t.includes('amd')) return 'AMD ROCm build'
+  if (t.includes('metal')) return 'Apple Metal build'
+  if (t.includes('sycl') || t.includes('intel')) return 'Intel SYCL build'
+  if (t.includes('vulkan')) return 'Vulkan build'
+  if (t.includes('darwin-x86')) return 'macOS x86 build'
+  return 'hardware-specific build'
+}
+
+// suitabilityFor returns the picker's per-row suitability state for the
+// requested backend. Already-installed wins over compatible/override so
+// the user sees a single signal per row.
+function suitabilityFor({ node, backend, hardwareTarget, alreadyInstalled }) {
+  if (alreadyInstalled) return 'installed'
+  // backend can be null on the first render before pickerBackend is set:
+  // this function is invoked from useMemo, which runs regardless of the
+  // outer open guard. Treat missing data as "compatible" so the placeholder
+  // render doesn't blow up; the early return below the hooks keeps the
+  // picker from painting anything in that state.
+  if (!backend || backend.isMeta || !hardwareTarget) return 'compatible'
+  const vendor = (node.gpu_vendor || '').toLowerCase()
+  const t = hardwareTarget.toLowerCase()
+  if (t.startsWith('cpu') || t === 'default') {
+    // CPU builds always run; they're never marked Override (running CPU on a
+    // GPU node is the headline use case the user is choosing intentionally).
+    return 'compatible'
+  }
+  if (t.includes('nvidia') || t.includes('cuda') || t.includes('l4t')) {
+    return vendor === 'nvidia' ? 'compatible' : 'override'
+  }
+  if (t.includes('amd') || t.includes('rocm') || t.includes('hip')) {
+    return vendor === 'amd' ? 'compatible' : 'override'
+  }
+  if (t.includes('intel') || t.includes('sycl')) {
+    return vendor === 'intel' ? 'compatible' : 'override'
+  }
+  if (t.includes('metal') || t.includes('darwin')) {
+    // No vendor reporting for Metal; trust the user.
+    return 'compatible'
+  }
+  return 'compatible'
+}
+
+export default function NodeInstallPicker({
+  open, onClose, onComplete,
+  backend,
+  nodes = [],
+  allBackends = [],
+  installedNodeIds = [],
+  initialSelection,
+  addToast,
+}) {
+  const [search, setSearch] = useState('')
+  const [showHealthy, setShowHealthy] = useState(true)
+  const [showDraining, setShowDraining] = useState(false)
+  const [selected, setSelected] = useState(() => new Set())
+  const [overrideVariant, setOverrideVariant] = useState('') // chosen concrete name
+  const [overrideExpanded, setOverrideExpanded] = useState(false)
+  const [submitting, setSubmitting] = useState(false)
+  const [showMismatchConfirm, setShowMismatchConfirm] = useState(false)
+  // Per-node submission state: { [nodeId]: { status: 'pending'|'installing'|'done'|'error', error?, version? } }
+  const [perNode, setPerNode] = useState({})
+  const headerInputRef = useRef(null)
+
+  // Backend-derived metadata used throughout the picker.
+  const hardwareTarget = useMemo(() => hardwareTargetOf(backend, allBackends), [backend, allBackends])
+  const targetLabel = humanTargetLabel(hardwareTarget)
+  const concreteVariants = useMemo(() => {
+    if (!backend?.isMeta || !backend.capabilities) return []
+    return Object.entries(backend.capabilities).map(([cap, concrete]) => ({
+      value: concrete,
+      label: `${concrete}  ·  ${cap}`,
+    }))
+  }, [backend])
+
+  // Pending nodes are surgically removed from the list — they can't accept
+  // installs until approved. Surface the count instead of dead-disabled rows.
+  const pendingCount = nodes.filter(n => n.status === 'pending').length
+  const backendNodes = nodes.filter(n =>
+    (!n.node_type || n.node_type === 'backend') && n.status !== 'pending'
+  )
+
+  const installedSet = useMemo(() => {
+    const s = new Set()
+    if (Array.isArray(installedNodeIds)) installedNodeIds.forEach(id => s.add(id))
+    else if (installedNodeIds && typeof installedNodeIds.has === 'function') {
+      installedNodeIds.forEach(id => s.add(id))
+    }
+    return s
+  }, [installedNodeIds])
+
+  const filteredNodes = useMemo(() => {
+    let list = backendNodes
+    if (!showHealthy) list = list.filter(n => n.status !== 'healthy')
+    if (!showDraining) list = list.filter(n => n.status !== 'draining')
+    if (search.trim()) {
+      const q = search.toLowerCase()
+      list = list.filter(n =>
+        (n.name || '').toLowerCase().includes(q) ||
+        Object.entries(n.labels || {}).some(([k, v]) => `${k}=${v}`.toLowerCase().includes(q))
+      )
+    }
+    return list
+  }, [backendNodes, showHealthy, showDraining, search])
+
+  // Pre-seed selection on open. Reset all transient state so reopening
+  // doesn't surface ghost progress from the prior submit.
+  useEffect(() => {
+    if (!open) return
+    const initial = new Set()
+    if (Array.isArray(initialSelection)) initialSelection.forEach(id => initial.add(id))
+    setSelected(initial)
+    setSearch('')
+    setOverrideVariant('')
+    setOverrideExpanded(false)
+    setPerNode({})
+    setSubmitting(false)
+    setShowMismatchConfirm(false)
+  }, [open, initialSelection])
+
+  // Auto-expand the variant override disclosure when at least one selected
+  // node lacks a working GPU. This is the headline use case the feature
+  // exists for, so surface it instead of hiding it behind a click.
+  useEffect(() => {
+    if (!backend?.isMeta) return
+    const someGPUMissing = Array.from(selected).some(id => {
+      const n = backendNodes.find(x => x.id === id)
+      return n && (!n.gpu_vendor || n.gpu_vendor === '' || n.gpu_vendor === 'unknown')
+    })
+    if (someGPUMissing && !overrideExpanded) setOverrideExpanded(true)
+  }, [selected, backend, backendNodes]) // eslint-disable-line react-hooks/exhaustive-deps
+
+  // The effective backend that gets installed on each node. For
+  // hardware-specific backends this is just backend.name. For meta backends
+  // with no override, the worker picks per-node — we pass backend.name and
+  // the worker resolves. With an override set, the picker installs that
+  // exact concrete variant on every selected node.
+  const effectiveBackendName = overrideVariant || backend?.name
+
+  const counts = useMemo(() => {
+    let already = 0, overrides = 0
+    selected.forEach(id => {
+      const n = backendNodes.find(x => x.id === id)
+      if (!n) return
+      if (installedSet.has(id)) { already++; return }
+      const eff = overrideVariant
+        ? { name: overrideVariant, isMeta: false, metaBackendFor: backend?.name }
+        : backend
+      const target = overrideVariant ? hardwareTargetOf(eff, allBackends) : hardwareTarget
+      const s = suitabilityFor({ node: n, backend: eff, hardwareTarget: target, alreadyInstalled: false })
+      if (s === 'override') overrides++
+    })
+    return { already, overrides, selected: selected.size }
+  }, [selected, backendNodes, installedSet, overrideVariant, backend, hardwareTarget, allBackends])
+
+  const toggle = (nodeId) => {
+    setSelected(prev => {
+      const next = new Set(prev)
+      next.has(nodeId) ? next.delete(nodeId) : next.add(nodeId)
+      return next
+    })
+  }
+
+  const selectAllHealthy = () => {
+    setSelected(new Set(filteredNodes.filter(n => n.status === 'healthy').map(n => n.id)))
+  }
+  const selectCompatible = () => {
+    const eff = overrideVariant
+      ? { name: overrideVariant, isMeta: false, metaBackendFor: backend?.name }
+      : backend
+    const target = overrideVariant ? hardwareTargetOf(eff, allBackends) : hardwareTarget
+    setSelected(new Set(
+      filteredNodes
+        .filter(n => suitabilityFor({ node: n, backend: eff, hardwareTarget: target, alreadyInstalled: false }) === 'compatible')
+        .map(n => n.id)
+    ))
+  }
+  const clearSelection = () => setSelected(new Set())
+
+  const submit = async () => {
+    if (selected.size === 0 || submitting) return
+    if (counts.overrides > 0 && !showMismatchConfirm) {
+      setShowMismatchConfirm(true)
+      return
+    }
+    setShowMismatchConfirm(false)
+    setSubmitting(true)
+    const ids = Array.from(selected)
+    setPerNode(prev => {
+      const next = { ...prev }
+      ids.forEach(id => { next[id] = { status: 'installing' } })
+      return next
+    })
+
+    const results = await Promise.allSettled(ids.map(id =>
+      nodesApi.installBackend(id, effectiveBackendName)
+        .then(r => ({ id, ok: true, message: r?.message }))
+        .catch(err => ({ id, ok: false, error: err?.message || 'install failed' }))
+    ))
+
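+    // Tally outside the state updater: React may defer updaters or run them
+    // twice (StrictMode), so counters mutated inside one are unreliable.
+    const outcomes = results
+      .filter(r => r.status === 'fulfilled')
+      .map(r => r.value)
+    const successCount = outcomes.filter(v => v.ok).length
+    const failCount = outcomes.length - successCount
+    setPerNode(prev => {
+      const next = { ...prev }
+      for (const v of outcomes) {
+        next[v.id] = v.ok
+          ? { status: 'done' }
+          : { status: 'error', error: v.error }
+      }
+      return next
+    })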
+    setSubmitting(false)
+
+    if (successCount > 0 && onComplete) onComplete()
+
+    if (failCount === 0) {
+      addToast?.(`Installed on ${successCount} node${successCount === 1 ? '' : 's'}`, 'success')
+      setTimeout(() => onClose?.(), 800)
+    } else if (successCount === 0) {
+      addToast?.(`Install failed on all ${failCount} node${failCount === 1 ? '' : 's'}`, 'error')
+    } else {
+      addToast?.(`Installed on ${successCount}, failed on ${failCount}`, 'warning')
+    }
+  }
+
+  const retryFailed = async () => {
+    const failedIds = Object.entries(perNode)
+      .filter(([, v]) => v.status === 'error')
+      .map(([id]) => id)
+    if (failedIds.length === 0) return
+    setSelected(new Set(failedIds))
+    // Replace state for failed rows so they show "installing" again, not stale errors.
+    setPerNode(prev => {
+      const next = { ...prev }
+      failedIds.forEach(id => { next[id] = { status: 'installing' } })
+      return next
+    })
+    setSubmitting(true)
+    const results = await Promise.allSettled(failedIds.map(id =>
+      nodesApi.installBackend(id, effectiveBackendName)
+        .then(r => ({ id, ok: true, message: r?.message }))
+        .catch(err => ({ id, ok: false, error: err?.message || 'install failed' }))
+    ))
+    // Count outcomes outside the updater; updater functions must stay pure.
+    const outcomes = results.filter(r => r.status === 'fulfilled').map(r => r.value)
+    const successCount = outcomes.filter(v => v.ok).length
+    const failCount = outcomes.length - successCount
+    setPerNode(prev => {
+      const next = { ...prev }
+      for (const v of outcomes) {
+        next[v.id] = v.ok ? { status: 'done' } : { status: 'error', error: v.error }
+      }
+      return next
+    })
+    setSubmitting(false)
+    if (successCount > 0 && onComplete) onComplete()
+    if (failCount === 0) {
+      addToast?.(`Installed on ${successCount} node${successCount === 1 ? '' : 's'}`, 'success')
+      setTimeout(() => onClose?.(), 800)
+    }
+  }
+
+  const doneCount = Object.values(perNode).filter(v => v.status === 'done').length
+  const errorCount = Object.values(perNode).filter(v => v.status === 'error').length
+  const totalAttempted = Object.keys(perNode).length
+
+  if (!open || !backend) return null
+
+  const noNodes = backendNodes.length === 0
+
+  return (
+    <Modal onClose={onClose} maxWidth="780px">
+      <div style={{
+        padding: 'var(--spacing-md) var(--spacing-lg)',
+        borderBottom: '1px solid var(--color-border-subtle)',
+        display: 'flex',
+        alignItems: 'center',
+        justifyContent: 'space-between',
+        gap: 'var(--spacing-sm)',
+      }}>
+        <h2 style={{ margin: 0, fontSize: '1rem', display: 'flex', alignItems: 'center', gap: 'var(--spacing-sm)' }}>
+          <i className="fas fa-cog" style={{ color: 'var(--color-primary)' }} />
+          Install <span style={{ fontFamily: 'var(--font-mono)' }}>{backend.name}</span>
+          {backend.isMeta ? (
+            <span className="badge badge-info" style={{ fontSize: '0.6875rem' }}>Auto-resolving</span>
+          ) : (
+            <span className="badge badge-warning" style={{ fontSize: '0.6875rem' }}>Hardware-specific</span>
+          )}
+        </h2>
+        <button
+          type="button"
+          className="btn btn-ghost btn-sm"
+          onClick={onClose}
+          aria-label="Close"
+          style={{ fontSize: '1.125rem', lineHeight: 1, padding: '4px 10px' }}
+        >×</button>
+      </div>
+
+      <div style={{ padding: 'var(--spacing-md) var(--spacing-lg)' }}>
+        {!backend.isMeta && (
+          <div className="card" style={{
+            marginBottom: 'var(--spacing-md)',
+            padding: 'var(--spacing-sm) var(--spacing-md)',
+            background: 'var(--color-warning-light)',
+            border: '1px solid var(--color-warning-border)',
+            borderRadius: 'var(--radius-md)',
+            display: 'flex',
+            alignItems: 'center',
+            gap: 'var(--spacing-sm)',
+          }}>
+            <i className="fas fa-microchip" style={{ color: 'var(--color-warning)' }} />
+            <span style={{ color: 'var(--color-warning)', fontSize: '0.8125rem' }}>
+              {targetLabel}. Install only on nodes where you want this build to run.
+              {hardwareTarget && ` Targets: ${humanTargetLabel(hardwareTarget).replace(' build', '')}.`}
+            </span>
+          </div>
+        )}
+
+        {noNodes ? (
+          <div className="empty-state" style={{ padding: 'var(--spacing-xl) 0' }}>
+            <div className="empty-state-icon"><i className="fas fa-server" /></div>
+            <h3 className="empty-state-title">No backend nodes available</h3>
+            <p className="empty-state-text">
+              Approve pending workers or register new ones.
+              {pendingCount > 0 && ` (${pendingCount} awaiting approval.)`}
+            </p>
+            <a className="btn btn-secondary btn-sm" href="/app/nodes">
+              <i className="fas fa-network-wired" /> Manage nodes
+            </a>
+          </div>
+        ) : (
+          <>
+            {/* Filter row */}
+            <div style={{ display: 'flex', gap: 'var(--spacing-sm)', alignItems: 'center', marginBottom: 'var(--spacing-sm)', flexWrap: 'wrap' }}>
+              <div className="search-bar" style={{ flex: 1, minWidth: 180 }}>
+                <i className="fas fa-search search-icon" />
+                <input
+                  ref={headerInputRef}
+                  className="input"
+                  placeholder="Filter nodes by name or label..."
+                  value={search}
+                  onChange={e => setSearch(e.target.value)}
+                />
+              </div>
+              <button className="btn btn-secondary btn-sm" onClick={selectAllHealthy} type="button">
+                Select all healthy
+              </button>
+              <button className="btn btn-secondary btn-sm" onClick={selectCompatible} type="button">
+                Select compatible nodes
+              </button>
+              {selected.size > 0 && (
+                <button className="btn btn-ghost btn-sm" onClick={clearSelection} type="button">
+                  Clear
+                </button>
+              )}
+            </div>
+
+            {/* Variant override (auto-resolving only) */}
+            {backend.isMeta && concreteVariants.length > 0 && (
+              <div style={{ marginBottom: 'var(--spacing-sm)' }}>
+                <button
+                  type="button"
+                  className="btn btn-ghost btn-sm"
+                  onClick={() => setOverrideExpanded(v => !v)}
+                  aria-expanded={overrideExpanded}
+                  style={{ padding: '4px 8px' }}
+                >
+                  <i className={`fas fa-chevron-${overrideExpanded ? 'down' : 'right'}`} style={{ marginRight: 4, fontSize: '0.625rem' }} />
+                  Override variant for selected nodes…
+                </button>
+                {overrideExpanded && (
+                  <div className="card" style={{ marginTop: 4, padding: 'var(--spacing-sm) var(--spacing-md)' }}>
+                    <p style={{ fontSize: '0.75rem', color: 'var(--color-text-secondary)', marginTop: 0, marginBottom: 'var(--spacing-xs)' }}>
+                      By default each node picks its own variant. Override to install one specific variant on every selected node — useful when GPU detection fails on a node and you want the CPU build there instead.
+                    </p>
+                    <SearchableSelect
+                      value={overrideVariant}
+                      onChange={setOverrideVariant}
+                      options={concreteVariants}
+                      placeholder="Per-node auto-resolve (default)"
+                      allOption={{ value: '', label: 'Per-node auto-resolve (default)' }}
+                    />
+                  </div>
+                )}
+              </div>
+            )}
+
+            {/* Node table */}
+            <div className="table-container" style={{ marginBottom: 'var(--spacing-sm)', maxHeight: '40vh', overflowY: 'auto' }}>
+              <table className="table" style={{ margin: 0 }}>
+                <thead>
+                  <tr>
+                    <th style={{ width: 28 }}>
+                      <input
+                        type="checkbox"
+                        aria-label="Select all visible"
+                        checked={filteredNodes.length > 0 && filteredNodes.every(n => selected.has(n.id))}
+                        onChange={(e) => {
+                          setSelected(prev => {
+                            const next = new Set(prev)
+                            if (e.target.checked) filteredNodes.forEach(n => next.add(n.id))
+                            else filteredNodes.forEach(n => next.delete(n.id))
+                            return next
+                          })
+                        }}
+                      />
+                    </th>
+                    <th>Node</th>
+                    <th>Status</th>
+                    <th>Hardware</th>
+                    <th>Suitability</th>
+                  </tr>
+                </thead>
+                <tbody>
+                  {filteredNodes.map(node => {
+                    const installed = installedSet.has(node.id)
+                    const eff = overrideVariant
+                      ? { name: overrideVariant, isMeta: false, metaBackendFor: backend.name }
+                      : backend
+                    const target = overrideVariant ? hardwareTargetOf(eff, allBackends) : hardwareTarget
+                    const suit = suitabilityFor({ node, backend: eff, hardwareTarget: target, alreadyInstalled: installed })
+                    const isSel = selected.has(node.id)
+                    const rowState = perNode[node.id]
+                    const vendor = gpuVendorLabel(node.gpu_vendor)
+                    const totalVRAM = formatVRAM(node.total_vram)
+                    const totalRAM = formatVRAM(node.total_ram)
+                    return (
+                      <tr key={node.id}>
+                        <td>
+                          <input
+                            type="checkbox"
+                            aria-label={`Select ${node.name}`}
+                            aria-disabled={rowState?.status === 'installing'}
+                            checked={isSel}
+                            onChange={() => toggle(node.id)}
+                          />
+                        </td>
+                        <td>
+                          <div style={{ display: 'flex', flexDirection: 'column', gap: 2 }}>
+                            <span style={{ fontWeight: 500, fontSize: '0.875rem' }}>{node.name}</span>
+                            {node.labels && Object.keys(node.labels).length > 0 && (
+                              <div style={{ display: 'flex', flexWrap: 'wrap', gap: 3 }}>
+                                {Object.entries(node.labels).slice(0, 3).map(([k, v]) => (
+                                  <span key={k} className="cell-mono" style={{
+                                    padding: '1px 5px', borderRadius: 'var(--radius-sm)', fontSize: '0.6875rem',
+                                    background: 'var(--color-bg-tertiary)', border: '1px solid var(--color-border-subtle)',
+                                  }}>{k}={v}</span>
+                                ))}
+                                {Object.keys(node.labels).length > 3 && (
+                                  <span className="cell-muted" style={{ fontSize: '0.6875rem' }}>
+                                    +{Object.keys(node.labels).length - 3}
+                                  </span>
+                                )}
+                              </div>
+                            )}
+                          </div>
+                        </td>
+                        <td>
+                          <span style={{ fontSize: '0.8125rem' }}>
+                            {STATUS_LABELS[node.status] || node.status}
+                          </span>
+                        </td>
+                        <td style={{ fontSize: '0.8125rem', fontFamily: 'var(--font-mono)', color: 'var(--color-text-secondary)' }}>
+                          {totalVRAM ? (
+                            <>{vendor && <span style={{ marginRight: 4 }}>{vendor}</span>}{totalVRAM}</>
+                          ) : totalRAM ? (
+                            <span>CPU · {totalRAM}</span>
+                          ) : <span className="cell-muted">—</span>}
+                        </td>
+                        <td>
+                          {rowState?.status === 'installing' ? (
+                            <span className="badge badge-info">
+                              <i className="fas fa-spinner fa-spin" style={{ marginRight: 4 }} />Installing
+                            </span>
+                          ) : rowState?.status === 'done' ? (
+                            <span className="badge badge-success">
+                              <i className="fas fa-check" style={{ marginRight: 4 }} />Installed
+                            </span>
+                          ) : rowState?.status === 'error' ? (
+                            <button
+                              type="button"
+                              className="badge badge-error"
+                              title={rowState.error}
+                              aria-describedby={`err-${node.id}`}
+                              style={{ border: 'none', cursor: 'help' }}
+                            >
+                              <i className="fas fa-exclamation-triangle" style={{ marginRight: 4 }} />Failed
+                              <span id={`err-${node.id}`} style={{ position: 'absolute', left: -9999 }}>{rowState.error}</span>
+                            </button>
+                          ) : suit === 'installed' ? (
+                            <span className="badge" style={{ background: 'var(--color-bg-tertiary)', color: 'var(--color-text-muted)' }}>
+                              Installed
+                            </span>
+                          ) : suit === 'override' ? (
+                            <span className="badge badge-warning">
+                              <i className="fas fa-exclamation-circle" style={{ marginRight: 4 }} />Override
+                            </span>
+                          ) : (
+                            <span className="badge badge-success" style={{ background: 'var(--color-success-light)', color: 'var(--color-success)' }}>
+                              Compatible
+                            </span>
+                          )}
+                        </td>
+                      </tr>
+                    )
+                  })}
+                  {filteredNodes.length === 0 && (
+                    <tr>
+                      <td colSpan={5} style={{ textAlign: 'center', padding: 'var(--spacing-md)', color: 'var(--color-text-muted)' }}>
+                        No nodes match the current filters.
+                      </td>
+                    </tr>
+                  )}
+                </tbody>
+              </table>
+            </div>
+
+            {pendingCount > 0 && (
+              <p style={{ fontSize: '0.75rem', color: 'var(--color-text-muted)', marginTop: 0, marginBottom: 'var(--spacing-sm)' }}>
+                +{pendingCount} awaiting approval — <a href="/app/nodes" style={{ color: 'var(--color-primary)' }}>approve from Nodes</a>.
+              </p>
+            )}
+
+            {/* Mismatch confirm */}
+            {showMismatchConfirm && (
+              <div className="card" style={{
+                marginBottom: 'var(--spacing-sm)',
+                padding: 'var(--spacing-md)',
+                background: 'var(--color-warning-light)',
+                border: '1px solid var(--color-warning-border)',
+                borderRadius: 'var(--radius-md)',
+              }}>
+                <p style={{ marginTop: 0, marginBottom: 'var(--spacing-sm)', color: 'var(--color-warning)', fontSize: '0.875rem' }}>
+                  Installing {targetLabel.toLowerCase()} on {counts.overrides} node{counts.overrides === 1 ? '' : 's'} whose hardware does not match. Inference there will run on the chosen build, not the native GPU. Continue?
+                </p>
+                <div style={{ display: 'flex', gap: 'var(--spacing-sm)', justifyContent: 'flex-end' }}>
+                  <button className="btn btn-secondary btn-sm" type="button" onClick={() => setShowMismatchConfirm(false)}>
+                    Cancel
+                  </button>
+                  <button className="btn btn-primary btn-sm" type="button" onClick={submit}
+                    style={{ background: 'var(--color-warning)', borderColor: 'var(--color-warning)' }}>
+                    Install on {targetLabel.replace(' build', '')}
+                  </button>
+                </div>
+              </div>
+            )}
+          </>
+        )}
+      </div>
+
+      {!noNodes && (
+        <div style={{
+          padding: 'var(--spacing-md) var(--spacing-lg)',
+          borderTop: '1px solid var(--color-border-subtle)',
+          display: 'flex',
+          alignItems: 'center',
+          gap: 'var(--spacing-sm)',
+          flexWrap: 'wrap',
+        }}>
+          <div style={{ flex: 1, fontSize: '0.8125rem', color: 'var(--color-text-secondary)' }}>
+            {totalAttempted > 0 ? (
+              <>
+                {doneCount} of {totalAttempted} done
+                {errorCount > 0 && (
+                  <> · <span className="badge badge-error" style={{ fontSize: '0.6875rem' }}>{errorCount} failed</span></>
+                )}
+              </>
+            ) : (
+              <>
+                {counts.selected} {counts.selected === 1 ? 'node' : 'nodes'} selected
+                {counts.already > 0 && <> · {counts.already} already installed</>}
+                {counts.overrides > 0 && <> · {counts.overrides} override{counts.overrides === 1 ? '' : 's'}</>}
+              </>
+            )}
+          </div>
+          {errorCount > 0 && !submitting && (
+            <button className="btn btn-secondary btn-sm" type="button" onClick={retryFailed}>
+              <i className="fas fa-redo" /> Retry failed nodes
+            </button>
+          )}
+          <button className="btn btn-secondary btn-sm" type="button" onClick={onClose} disabled={submitting}>
+            {totalAttempted > 0 && doneCount > 0 ? 'Close' : 'Cancel'}
+          </button>
+          <button
+            className="btn btn-primary btn-sm"
+            type="button"
+            onClick={submit}
+            disabled={submitting || counts.selected === 0 || showMismatchConfirm}
+          >
+            {submitting ? (
+              <><i className="fas fa-spinner fa-spin" /> Installing…</>
+            ) : (
+              <>Install on {counts.selected} {counts.selected === 1 ? 'node' : 'nodes'}</>
+            )}
+          </button>
+        </div>
+      )}
+    </Modal>
+  )
+}
--- a/core/http/react-ui/src/components/ResourceActions.jsx
+++ b/core/http/react-ui/src/components/ResourceActions.jsx
@@ -0,0 +1,29 @@
+// ResourceActions groups row-level buttons into a lifecycle cluster (start,
+// stop, pin, reinstall, upgrade) and a destructive cluster (delete) with a
+// thin divider between them, so a destructive intent visually separates from
+// a routine one. Replaces the old 4px-gap row of buttons in the Manage page
+// where Stop / Pin / Delete sat shoulder-to-shoulder with no visual cue
+// to distinguish "click to fiddle" from "click to throw away".
+//
+// `lifecycle` and `destructive` accept any ReactNode — typically one or more
+// <button>s. The wrapping div stops click propagation so action clicks don't
+// also expand the row.
+export default function ResourceActions({ lifecycle, destructive }) {
+  const hasLifecycle = !!lifecycle
+  const hasDestructive = !!destructive
+  if (!hasLifecycle && !hasDestructive) return null
+
+  return (
+    <div className="resource-actions" onClick={e => e.stopPropagation()}>
+      {hasLifecycle && (
+        <div className="resource-actions__group">{lifecycle}</div>
+      )}
+      {hasLifecycle && hasDestructive && (
+        <span className="resource-actions__divider" aria-hidden="true" />
+      )}
+      {hasDestructive && (
+        <div className="resource-actions__group">{destructive}</div>
+      )}
+    </div>
+  )
+}
--- a/core/http/react-ui/src/components/ResourceMonitor.jsx
+++ b/core/http/react-ui/src/components/ResourceMonitor.jsx
@@ -51,7 +51,7 @@ export default function ResourceMonitor() {
                  <div className="resource-bar-container" style={{ flex: 1 }}>
                    <div className="resource-bar" style={{ width: `${pct}%`, background: color }} />
                  </div>
-                  <span style={{ fontSize: '0.8125rem', fontWeight: 600, fontFamily: "'JetBrains Mono', monospace", color, minWidth: '3em', textAlign: 'right' }}>
+                  <span style={{ fontSize: '0.8125rem', fontWeight: 600, fontFamily: 'var(--font-mono)', color, minWidth: '3em', textAlign: 'right' }}>
                    {pct.toFixed(0)}%
                  </span>
                </div>
@@ -76,7 +76,7 @@ export default function ResourceMonitor() {
            <div className="resource-bar-container" style={{ flex: 1 }}>
              <div className="resource-bar" style={{ width: `${ram.usage_percent || 0}%`, background: percentColor(ram.usage_percent || 0) }} />
            </div>
-            <span style={{ fontSize: '0.8125rem', fontWeight: 600, fontFamily: "'JetBrains Mono', monospace", color: percentColor(ram.usage_percent || 0), minWidth: '3em', textAlign: 'right' }}>
+            <span style={{ fontSize: '0.8125rem', fontWeight: 600, fontFamily: 'var(--font-mono)', color: percentColor(ram.usage_percent || 0), minWidth: '3em', textAlign: 'right' }}>
              {(ram.usage_percent || 0).toFixed(0)}%
            </span>
          </div>
@@ -91,7 +91,7 @@ export default function ResourceMonitor() {
      {isGpu && aggregate.gpu_count > 1 && (
        <div style={{ fontSize: '0.75rem', color: 'var(--color-text-secondary)', marginTop: 'var(--spacing-sm)', display: 'flex', justifyContent: 'space-between' }}>
          <span>Total VRAM</span>
-          <span style={{ fontFamily: "'JetBrains Mono', monospace" }}>
+          <span style={{ fontFamily: 'var(--font-mono)' }}>
            {formatBytes(aggregate.used_memory)} / {formatBytes(aggregate.total_memory)} ({aggregate.usage_percent?.toFixed(1)}%)
          </span>
        </div>
@@ -101,7 +101,7 @@ export default function ResourceMonitor() {
      {resources.storage_size != null && (
        <div style={{ fontSize: '0.75rem', color: 'var(--color-text-secondary)', marginTop: 'var(--spacing-sm)', display: 'flex', justifyContent: 'space-between' }}>
          <span>Models storage</span>
-          <span style={{ fontFamily: "'JetBrains Mono', monospace", color: 'var(--color-text-primary)' }}>
+          <span style={{ fontFamily: 'var(--font-mono)', color: 'var(--color-text-primary)' }}>
            {formatBytes(resources.storage_size)}
          </span>
        </div>
--- a/core/http/react-ui/src/components/ResourceRow.jsx
+++ b/core/http/react-ui/src/components/ResourceRow.jsx
@@ -0,0 +1,81 @@
+import { Fragment } from 'react'
+
+// ResourceRow renders the visible row + its conditional detail row as a pair
+// of <tr>s, so the existing .table styling keeps applying and the Manage page
+// can re-use the gallery's expand-to-detail interaction without inventing a
+// new table system. The consumer owns the cells (which pass through as
+// children) — this component only manages the click-to-expand handler, the
+// dimmed state for disabled rows, and the colSpan'd detail row beneath.
+//
+// `onToggleExpand` fires on row click only. Buttons / toggles inside cells
+// must call e.stopPropagation() (or be wrapped in an .actions-stop wrapper)
+// to avoid double-triggering the expand.
+export default function ResourceRow({
+  expanded,
+  onToggleExpand,
+  detail,
+  colSpan,
+  dimmed,
+  className = '',
+  children,
+}) {
+  return (
+    <Fragment>
+      <tr
+        className={`resource-row${dimmed ? ' is-dimmed' : ''}${expanded ? ' is-expanded' : ''} ${className}`.trim()}
+        onClick={onToggleExpand}
+        style={{ cursor: onToggleExpand ? 'pointer' : 'default' }}
+      >
+        {children}
+      </tr>
+      {expanded && detail && (
+        <tr className="resource-row__detail-row">
+          <td colSpan={colSpan} className="resource-row__detail-cell">
+            {detail}
+          </td>
+        </tr>
+      )}
+    </Fragment>
+  )
+}
+
+// ChevronCell is the small rotating chevron used as the leftmost cell of an
+// expandable row. Mirrors the Nodes/Models/Backends gallery affordance so
+// users see the same "click to expand" cue everywhere.
+export function ChevronCell({ expanded }) {
+  return (
+    <td className="resource-row__chevron-cell">
+      <span className={`row-chevron${expanded ? ' is-expanded' : ''}`} aria-hidden="true">
+        <i className="fas fa-chevron-right" />
+      </span>
+    </td>
+  )
+}
+
+// IconCell renders the 48px brand icon shell, the same one the Install
+// gallery uses. `icon` is the image URL (from gallery metadata); when it is
+// absent we fall back to a FontAwesome glyph so custom-imported items still
+// get a placeholder instead of an empty square. Broken URLs are not detected.
+export function IconCell({ icon, fallback = 'fa-cube', alt = '' }) {
+  return (
+    <td className="resource-row__icon-cell">
+      <div className="resource-row__icon">
+        {icon ? (
+          <img src={icon} alt={alt} loading="lazy" />
+        ) : (
+          <i className={`fas ${fallback}`} />
+        )}
+      </div>
+    </td>
+  )
+}
+
+// StopPropagationCell wraps cell contents that contain interactive controls
+// (Toggle, action buttons) so a click on them doesn't also expand the row.
+export function StopPropagationCell({ children, ...props }) {
+  return (
+    <td {...props} onClick={e => e.stopPropagation()}>
+      {children}
+    </td>
+  )
+}
--- a/core/http/react-ui/src/components/SearchableSelect.jsx
+++ b/core/http/react-ui/src/components/SearchableSelect.jsx
@@ -116,7 +116,7 @@ export default function SearchableSelect({
        aria-expanded={open}
        onClick={() => { if (!disabled) { setOpen(!open); setQuery(''); setFocusIndex(-1) } }}
        style={{
-          width: '100%', padding: '4px 8px', fontSize: '0.8125rem',
+          width: '100%', padding: 'var(--spacing-xs) var(--spacing-sm)', fontSize: '0.8125rem',
          cursor: disabled ? 'not-allowed' : 'pointer',
          display: 'flex', alignItems: 'center', gap: '6px',
          background: 'var(--color-bg-primary)', border: '1px solid var(--color-border)',
@@ -145,7 +145,7 @@ export default function SearchableSelect({
              value={query}
              onChange={(e) => { setQuery(e.target.value); setFocusIndex(-1) }}
              onKeyDown={handleKeyDown}
-              style={{ width: '100%', padding: '4px 8px', fontSize: '0.8125rem' }}
+              style={{ width: '100%', padding: 'var(--spacing-xs) var(--spacing-sm)', fontSize: '0.8125rem' }}
            />
          </div>
          <div ref={listRef} role="listbox" style={{ overflowY: 'auto', maxHeight: 'min(200px, 50vh)' }}>
--- a/core/http/react-ui/src/components/Sidebar.jsx
+++ b/core/http/react-ui/src/components/Sidebar.jsx
@@ -1,4 +1,4 @@
-import { useState, useEffect } from 'react'
+import { useState, useEffect, useRef } from 'react'
 import { NavLink, useNavigate, useLocation } from 'react-router-dom'
 import ThemeToggle from './ThemeToggle'
 import { useAuth } from '../context/AuthContext'
@@ -107,11 +107,22 @@ export default function Sidebar({ isOpen, onClose }) {
  const { isAdmin, authEnabled, user, logout, hasFeature } = useAuth()
  const navigate = useNavigate()
  const location = useLocation()
+  const closeBtnRef = useRef(null)

  useEffect(() => {
    fetch(apiUrl('/api/features')).then(r => r.json()).then(setFeatures).catch(() => {})
  }, [])

+  // Move focus into the drawer when opened on mobile/tablet so keyboard
+  // and screen-reader users land inside the dialog. Targeting the close
+  // button avoids hijacking the visual focus to a nav item the user may
+  // not have meant to activate.
+  useEffect(() => {
+    if (!isOpen) return
+    const id = window.requestAnimationFrame(() => closeBtnRef.current?.focus())
+    return () => window.cancelAnimationFrame(id)
+  }, [isOpen])
+
  // Auto-expand section containing the active route
  useEffect(() => {
    for (const section of sections) {
@@ -168,7 +179,11 @@ export default function Sidebar({ isOpen, onClose }) {
    <>
      {isOpen && <div className="sidebar-overlay" onClick={onClose} />}

-      <aside className={`sidebar ${isOpen ? 'open' : ''} ${collapsed ? 'collapsed' : ''}`}>
+      <aside
+        id="app-sidebar"
+        className={`sidebar ${isOpen ? 'open' : ''} ${collapsed ? 'collapsed' : ''}`}
+        aria-label="Primary navigation"
+      >
        {/* Logo */}
        <div className="sidebar-header">
          <a href="./" className="sidebar-logo-link">
@@ -177,8 +192,13 @@ export default function Sidebar({ isOpen, onClose }) {
          <a href="./" className="sidebar-logo-icon" title="LocalAI">
            <img src={apiUrl('/static/logo.png')} alt="LocalAI" className="sidebar-logo-icon-img" />
          </a>
-          <button className="sidebar-close-btn" onClick={onClose} aria-label="Close menu">
-            <i className="fas fa-times" />
+          <button
+            ref={closeBtnRef}
+            className="sidebar-close-btn"
+            onClick={onClose}
+            aria-label="Close menu"
+          >
+            <i className="fas fa-times" aria-hidden="true" />
          </button>
        </div>

--- a/core/http/react-ui/src/components/StatCard.jsx
+++ b/core/http/react-ui/src/components/StatCard.jsx
@@ -0,0 +1,39 @@
+// StatCard renders a single cluster/dashboard metric card. The left accent
+// bar + icon chip color is driven by `accentVar` (a CSS custom property name,
+// e.g. "--color-success") so the card reads as semantic without the caller
+// having to reach into colors directly. `onClick` upgrades the card to a
+// keyboard-focusable button — used by the Manage page so cards double as
+// shortcuts to the relevant tab + filter.
+export default function StatCard({ icon, label, value, color, accentVar, onClick }) {
+  const accent = color || (accentVar ? `var(${accentVar})` : 'var(--color-text-primary)')
+  const interactive = typeof onClick === 'function'
+
+  const handleKeyDown = interactive
+    ? (e) => {
+        if (e.key === 'Enter' || e.key === ' ') {
+          e.preventDefault()
+          onClick(e)
+        }
+      }
+    : undefined
+
+  return (
+    <div
+      className="stat-card"
+      data-clickable={interactive ? 'true' : undefined}
+      role={interactive ? 'button' : undefined}
+      tabIndex={interactive ? 0 : undefined}
+      onClick={interactive ? onClick : undefined}
+      onKeyDown={handleKeyDown}
+      style={accentVar ? { ['--stat-accent']: `var(${accentVar})` } : undefined}
+    >
+      <div className="stat-card__body">
+        <div className="stat-card__label">{label}</div>
+        <div className="stat-card__value" style={{ color: accent }}>{value}</div>
+      </div>
+      <div className="stat-card__icon" style={accentVar ? { color: accent } : undefined}>
+        <i className={icon} />
+      </div>
+    </div>
+  )
+}
--- a/core/http/react-ui/src/components/TemplateSelector.jsx
+++ b/core/http/react-ui/src/components/TemplateSelector.jsx
@@ -24,7 +24,7 @@ export default function TemplateSelector({ onSelect }) {
            <p style={{ fontSize: '0.8125rem', color: 'var(--color-text-secondary)', lineHeight: 1.5, margin: 0 }}>
              {t.description}
            </p>
-            <div style={{ display: 'flex', flexWrap: 'wrap', gap: '4px', marginTop: 'var(--spacing-xs)' }}>
+            <div style={{ display: 'flex', flexWrap: 'wrap', gap: 'var(--spacing-xs)', marginTop: 'var(--spacing-xs)' }}>
              {Object.keys(t.fields).filter(k => k !== 'name').map(k => (
                <span key={k} className="badge" style={{
                  fontSize: '0.6875rem', background: 'var(--color-bg-tertiary)',
--- a/core/http/react-ui/src/components/UnifiedMCPDropdown.jsx
+++ b/core/http/react-ui/src/components/UnifiedMCPDropdown.jsx
@@ -187,7 +187,7 @@ export default function UnifiedMCPDropdown({
                    placeholder="Server URL (e.g. https://mcp.example.com/sse)"
                    value={url}
                    onChange={e => setUrl(e.target.value)}
-                    style={{ width: '100%', marginBottom: '4px' }}
+                    style={{ width: '100%', marginBottom: 'var(--spacing-xs)' }}
                  />
                  <input
                    type="text"
@@ -195,7 +195,7 @@ export default function UnifiedMCPDropdown({
                    placeholder="Name (optional)"
                    value={name}
                    onChange={e => setName(e.target.value)}
-                    style={{ width: '100%', marginBottom: '4px' }}
+                    style={{ width: '100%', marginBottom: 'var(--spacing-xs)' }}
                  />
                  <input
                    type="password"
@@ -203,13 +203,13 @@ export default function UnifiedMCPDropdown({
                    placeholder="Auth token (optional)"
                    value={authToken}
                    onChange={e => setAuthToken(e.target.value)}
-                    style={{ width: '100%', marginBottom: '4px' }}
+                    style={{ width: '100%', marginBottom: 'var(--spacing-xs)' }}
                  />
                  <label style={{ display: 'flex', alignItems: 'center', gap: '6px', fontSize: '0.8rem', marginBottom: '6px' }}>
                    <input type="checkbox" checked={useProxy} onChange={e => setUseProxy(e.target.checked)} />
                    Use CORS proxy
                  </label>
-                  <div style={{ display: 'flex', gap: '4px', justifyContent: 'flex-end' }}>
+                  <div style={{ display: 'flex', gap: 'var(--spacing-xs)', justifyContent: 'flex-end' }}>
                    <button type="button" className="btn btn-sm btn-secondary" onClick={() => setAddDialog(false)}>Cancel</button>
                    <button type="button" className="btn btn-sm btn-primary" onClick={handleAddClient} disabled={!url.trim()}>Add</button>
                  </div>
--- a/core/http/react-ui/src/hooks/useDistributedMode.js
+++ b/core/http/react-ui/src/hooks/useDistributedMode.js
@@ -0,0 +1,40 @@
+import { useState, useEffect, useCallback } from 'react'
+import { nodesApi } from '../utils/api'
+
+// useDistributedMode probes /api/nodes to decide whether the running LocalAI
+// is in distributed mode. The endpoint returns 503 when distributed mode is
+// disabled — we treat any failure as standalone, mirroring the detection
+// pattern in pages/Nodes.jsx so UI behaviour matches the Nodes page.
+//
+// Returns:
+//   enabled: true when the most recent /api/nodes probe succeeded
+//   nodes:   array from the most recent probe ([] after a failed probe)
+//   loading: true until the first probe completes
+//   refetch: manual trigger; the picker calls this after install/delete
+//
+// Components that need a live nodes list (e.g. install picker) re-call
+// refetch after operations complete. The hook does not poll on its own —
+// the Nodes page handles its own 5s polling and the Backends gallery only
+// needs a one-shot read on mount.
+export function useDistributedMode() {
+  const [enabled, setEnabled] = useState(false)
+  const [nodes, setNodes] = useState([])
+  const [loading, setLoading] = useState(true)
+
+  const probe = useCallback(async () => {
+    try {
+      const data = await nodesApi.list()
+      setNodes(Array.isArray(data) ? data : [])
+      setEnabled(true)
+    } catch {
+      setEnabled(false)
+      setNodes([])
+    } finally {
+      setLoading(false)
+    }
+  }, [])
+
+  useEffect(() => { probe() }, [probe])
+
+  return { enabled, nodes, loading, refetch: probe }
+}
--- a/core/http/react-ui/src/hooks/useGalleryEnrichment.js
+++ b/core/http/react-ui/src/hooks/useGalleryEnrichment.js
@@ -0,0 +1,53 @@
+import { useState, useEffect, useCallback } from 'react'
+import { modelsApi, backendsApi } from '../utils/api'
+
+// useGalleryEnrichment fetches the full model + backend gallery once and
+// returns lookup helpers used by the Manage page. The Manage list APIs only
+// know name/version/alias — descriptions, icons, licenses, tags, and links
+// live on the gallery side. Cross-referencing here lets us light up the
+// installed lists with the same metadata the Install pages show, instead of
+// rendering them as bare names.
+//
+// Items not present in the gallery (custom imports, external OCI installs)
+// resolve to `null` — callers fall back to a neutral icon + "no description".
+export function useGalleryEnrichment() {
+  const [modelMap, setModelMap] = useState(() => new Map())
+  const [backendMap, setBackendMap] = useState(() => new Map())
+  const [loaded, setLoaded] = useState(false)
+
+  useEffect(() => {
+    let cancelled = false
+    Promise.allSettled([
+      modelsApi.list({ items: 9999, page: 1 }),
+      backendsApi.list({ items: 9999, page: 1 }),
+    ]).then(([m, b]) => {
+      if (cancelled) return
+      const mm = new Map()
+      if (m.status === 'fulfilled') {
+        const list = m.value?.models || []
+        for (const x of list) {
+          const key = x.name || x.id
+          if (key) mm.set(key, x)
+        }
+      }
+      const bm = new Map()
+      if (b.status === 'fulfilled') {
+        const raw = b.value
+        const list = Array.isArray(raw?.backends) ? raw.backends : Array.isArray(raw) ? raw : []
+        for (const x of list) {
+          const key = x.name || x.id
+          if (key) bm.set(key, x)
+        }
+      }
+      setModelMap(mm)
+      setBackendMap(bm)
+      setLoaded(true)
+    })
+    return () => { cancelled = true }
+  }, [])
+
+  const enrichModel = useCallback((name) => (name ? modelMap.get(name) || null : null), [modelMap])
+  const enrichBackend = useCallback((name) => (name ? backendMap.get(name) || null : null), [backendMap])
+
+  return { enrichModel, enrichBackend, loaded }
+}
--- a/core/http/react-ui/src/hooks/useModels.js
+++ b/core/http/react-ui/src/hooks/useModels.js
@@ -6,9 +6,9 @@ export function useModels(capability) {
  const [loading, setLoading] = useState(true)
  const [error, setError] = useState(null)

-  const fetchModels = useCallback(async () => {
+  const fetchModels = useCallback(async ({ silent = false } = {}) => {
    try {
-      setLoading(true)
+      if (!silent) setLoading(true)
      const data = await modelsApi.listCapabilities()
      let items = data?.data || []
      if (capability) {
@@ -30,15 +30,19 @@ export function useModels(capability) {
        setError(err.message)
      }
    } finally {
-      setLoading(false)
+      if (!silent) setLoading(false)
    }
  }, [capability])

+  // Subsequent refetches stay silent so consumers don't blank their tables
+  // (e.g. the Manage page auto-refreshes every 10s in distributed mode).
+  const refetch = useCallback(() => fetchModels({ silent: true }), [fetchModels])
+
  useEffect(() => {
    fetchModels()
  }, [fetchModels])

-  return { models, loading, error, refetch: fetchModels }
+  return { models, loading, error, refetch }
 }

 export function useGalleryModels(params = {}) {
--- a/core/http/react-ui/src/index.css
+++ b/core/http/react-ui/src/index.css
@@ -12,14 +12,17 @@ html {
 }

 body {
-  font-family: 'Space Grotesk', -apple-system, BlinkMacSystemFont, sans-serif;
+  font-family: var(--font-sans);
  font-size: var(--text-base);
  font-weight: var(--font-weight-regular);
  line-height: var(--leading-normal);
+  font-feature-settings: "ss01", "ss03", "cv11";
+  letter-spacing: -0.005em;
  min-height: 100%;
  background-color: var(--color-bg-primary);
  color: var(--color-text-primary);
-  transition: background-color 200ms ease, color 200ms ease;
+  transition: background-color var(--duration-normal) var(--ease-default),
+              color var(--duration-normal) var(--ease-default);
 }

 #root {
@@ -27,36 +30,73 @@ body {
  min-height: 100dvh;
 }

-/* Scrollbar */
-::-webkit-scrollbar { width: 6px; height: 6px; }
-::-webkit-scrollbar-track { background: var(--color-bg-primary); }
-::-webkit-scrollbar-thumb { background: var(--color-bg-secondary); border-radius: 3px; }
-::-webkit-scrollbar-thumb:hover { background: var(--color-primary); }
-* { scrollbar-width: thin; scrollbar-color: var(--color-bg-secondary) var(--color-bg-primary); }
+/* Global selection + focus */
+::selection {
+  background: var(--color-primary-light);
+  color: var(--color-text-primary);
+}

-/* Typography */
+/* Scrollbar — slightly wider, warmer thumb */
+::-webkit-scrollbar { width: 10px; height: 10px; }
+::-webkit-scrollbar-track { background: transparent; }
+::-webkit-scrollbar-thumb {
+  background: var(--color-border-default);
+  border-radius: var(--radius-sm);
+  border: 2px solid var(--color-bg-primary);
+}
+::-webkit-scrollbar-thumb:hover { background: var(--color-border-strong); }
+* { scrollbar-width: thin; scrollbar-color: var(--color-border-default) transparent; }
+
+/* Typography — editorial hierarchy */
 h1, h2, h3, h4, h5, h6 {
-  font-family: 'Space Grotesk', sans-serif;
+  font-family: var(--font-sans);
  color: var(--color-text-primary);
  line-height: var(--leading-tight);
-  letter-spacing: -0.01em;
 }
-h1 { font-size: var(--text-2xl); font-weight: var(--font-weight-semibold); }
-h2 { font-size: var(--text-xl); font-weight: var(--font-weight-semibold); }
-h3 { font-size: var(--text-lg); font-weight: var(--font-weight-semibold); }
-h4 { font-size: var(--text-base); font-weight: var(--font-weight-semibold); }
-h5, h6 { font-size: var(--text-sm); font-weight: var(--font-weight-semibold); }
+h1 { font-size: var(--text-3xl); font-weight: var(--font-weight-medium); letter-spacing: -0.02em; }
+h2 { font-size: var(--text-2xl); font-weight: var(--font-weight-medium); letter-spacing: -0.015em; }
+h3 { font-size: var(--text-xl);  font-weight: var(--font-weight-medium); letter-spacing: -0.01em; }
+h4 { font-size: var(--text-lg);  font-weight: var(--font-weight-medium); letter-spacing: -0.005em; }
+h5 { font-size: var(--text-base);font-weight: var(--font-weight-semibold); }
+h6 {
+  font-size: var(--text-xs); font-weight: var(--font-weight-semibold);
+  text-transform: uppercase; letter-spacing: 0.12em;
+  color: var(--color-text-muted);
+}

-code, pre {
-  font-family: 'JetBrains Mono', monospace;
+code, pre, kbd, .mono {
+  font-family: var(--font-mono);
+}
+
+kbd {
+  display: inline-block;
+  padding: 1px 5px;
+  font-size: 0.75em;
+  font-weight: var(--font-weight-medium);
+  background: var(--color-bg-tertiary);
+  border: 1px solid var(--color-border-default);
+  border-radius: var(--radius-sm);
+  color: var(--color-text-secondary);
+  line-height: 1.4;
 }

 a {
  color: var(--color-primary);
  text-decoration: none;
+  transition: color var(--duration-fast) var(--ease-default);
 }
 a:hover {
  color: var(--color-primary-hover);
 }

+/* Honor prefers-reduced-motion globally */
+@media (prefers-reduced-motion: reduce) {
+  *, *::before, *::after {
+    animation-duration: 0.01ms !important;
+    animation-iteration-count: 1 !important;
+    transition-duration: 0.01ms !important;
+    scroll-behavior: auto !important;
+  }
+}
+
 /* Utility classes */
--- a/core/http/react-ui/src/pages/Account.jsx
+++ b/core/http/react-ui/src/pages/Account.jsx
@@ -403,7 +403,7 @@ export default function Account() {

  if (!authEnabled) {
    return (
-      <div className="page">
+      <div className="page page--narrow">
        <div className="empty-state">
          <div className="empty-state-icon"><i className="fas fa-user-gear" /></div>
          <h2 className="empty-state-title">Account unavailable</h2>
@@ -418,7 +418,7 @@ export default function Account() {
  const visibleTabs = isLocal ? TABS : TABS.filter(t => t.id !== 'security')

  return (
-    <div className="page account-page">
+    <div className="page page--narrow account-page">
      {/* Header */}
      <div className="page-header">
        <h1 className="page-title">Account</h1>
--- a/core/http/react-ui/src/pages/AgentChat.jsx
+++ b/core/http/react-ui/src/pages/AgentChat.jsx
@@ -101,6 +101,7 @@ export default function AgentChat() {
  const messagesEndRef = useRef(null)
  const messagesRef = useRef(null)
  const textareaRef = useRef(null)
+  const stickToBottomRef = useRef(true)
  const eventSourceRef = useRef(null)
  const messageIdCounter = useRef(0)
  const addMessageRef = useRef(addMessage)
@@ -260,11 +261,31 @@ export default function AgentChat() {
    }
  }, [name, userId, addToast, nextId])

-  // Auto-scroll to bottom
+  // Track whether the user is pinned to the bottom. If they scroll up
+  // while a response is streaming, stop forcing them back down.
  useEffect(() => {
+    const el = messagesRef.current
+    if (!el) return
+    const onScroll = () => {
+      const distanceFromBottom = el.scrollHeight - el.scrollTop - el.clientHeight
+      stickToBottomRef.current = distanceFromBottom < 80
+    }
+    el.addEventListener('scroll', onScroll, { passive: true })
+    return () => el.removeEventListener('scroll', onScroll)
+  }, [])
+
+  // Auto-scroll only when the user hasn't scrolled away from the bottom.
+  useEffect(() => {
+    if (!stickToBottomRef.current) return
    messagesEndRef.current?.scrollIntoView({ behavior: 'smooth' })
  }, [messages, streamContent, streamReasoning, streamToolCalls])

+  // When switching conversations, snap to bottom and re-pin.
+  useEffect(() => {
+    stickToBottomRef.current = true
+    messagesEndRef.current?.scrollIntoView({ behavior: 'auto' })
+  }, [activeId])
+
  // Highlight code blocks
  useEffect(() => {
    if (messagesRef.current) highlightAll(messagesRef.current)
--- a/core/http/react-ui/src/pages/AgentCreate.jsx
+++ b/core/http/react-ui/src/pages/AgentCreate.jsx
@@ -97,7 +97,7 @@ function FormField({ field, value, onChange, disabled }) {
            rows={5}
            disabled={disabled}
            style={field.name.includes('prompt') || field.name.includes('template') || field.name.includes('script')
-              ? { fontFamily: "'JetBrains Mono', monospace", fontSize: '0.8125rem' } : undefined}
+              ? { fontFamily: 'var(--font-mono)', fontSize: '0.8125rem' } : undefined}
          />
        </div>
      )
@@ -624,7 +624,7 @@ export default function AgentCreate() {
                  value={mcpRawJson}
                  onChange={(e) => setMcpRawJson(e.target.value)}
                  rows={16}
-                  style={{ fontFamily: 'monospace', fontSize: '0.85rem', whiteSpace: 'pre' }}
+                  style={{ fontFamily: 'var(--font-mono)', fontSize: '0.85rem', whiteSpace: 'pre' }}
                  placeholder={'{\n  "mcpServers": {\n    "my-server": {\n      "command": "npx",\n      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"],\n      "env": {}\n    }\n  }\n}'}
                />
              </div>
@@ -812,14 +812,14 @@ export default function AgentCreate() {

  if (loading) {
    return (
-      <div className="page" style={{ display: 'flex', justifyContent: 'center', padding: 'var(--spacing-xl)' }}>
+      <div className="page page--narrow" style={{ display: 'flex', justifyContent: 'center', padding: 'var(--spacing-xl)' }}>
        <i className="fas fa-spinner fa-spin" style={{ fontSize: '2rem', color: 'var(--color-primary)' }} />
      </div>
    )
  }

  return (
-    <div className="page">
+    <div className="page page--narrow">
      <style>{`
        .agent-form-container {
          display: flex;
--- a/core/http/react-ui/src/pages/AgentJobDetails.jsx
+++ b/core/http/react-ui/src/pages/AgentJobDetails.jsx
@@ -43,7 +43,7 @@ function TraceCard({ trace, index }) {
            {trace.type || 'unknown'}
          </span>
          {trace.tool_name && (
-            <span style={{ fontFamily: "'JetBrains Mono', monospace", fontSize: '0.75rem', color: 'var(--color-text-secondary)' }}>
+            <span style={{ fontFamily: 'var(--font-mono)', fontSize: '0.75rem', color: 'var(--color-text-secondary)' }}>
              {trace.tool_name}
            </span>
          )}
@@ -60,7 +60,7 @@ function TraceCard({ trace, index }) {
          {trace.content && (
            <pre style={{
              whiteSpace: 'pre-wrap', wordBreak: 'break-word', margin: 0,
-              fontFamily: "'JetBrains Mono', monospace", fontSize: '0.75rem',
+              fontFamily: 'var(--font-mono)', fontSize: '0.75rem',
              color: 'var(--color-text-secondary)', lineHeight: 1.6,
            }}>
              {trace.content}
@@ -71,7 +71,7 @@ function TraceCard({ trace, index }) {
              <span style={{ fontSize: '0.6875rem', fontWeight: 600, color: 'var(--color-text-muted)' }}>Arguments:</span>
              <pre style={{
                whiteSpace: 'pre-wrap', wordBreak: 'break-word', margin: '4px 0 0',
-                fontFamily: "'JetBrains Mono', monospace", fontSize: '0.75rem',
+                fontFamily: 'var(--font-mono)', fontSize: '0.75rem',
                color: 'var(--color-text-secondary)', lineHeight: 1.5,
              }}>
                {typeof trace.arguments === 'string' ? trace.arguments : JSON.stringify(trace.arguments, null, 2)}
@@ -162,9 +162,9 @@ export default function AgentJobDetails() {
    return rendered
  }

-  if (loading) return <div className="page" style={{ display: 'flex', justifyContent: 'center', padding: 'var(--spacing-xl)' }}><LoadingSpinner size="lg" /></div>
+  if (loading) return <div className="page page--narrow" style={{ display: 'flex', justifyContent: 'center', padding: 'var(--spacing-xl)' }}><LoadingSpinner size="lg" /></div>
  if (!job) return (
-    <div className="page">
+    <div className="page page--narrow">
      <div className="empty-state">
        <div className="empty-state-icon"><i className="fas fa-search" /></div>
        <h2 className="empty-state-title">Job not found</h2>
@@ -177,7 +177,7 @@ export default function AgentJobDetails() {
  const traces = Array.isArray(job.traces) ? job.traces : []

  return (
-    <div className="page" style={{ maxWidth: 900 }}>
+    <div className="page page--narrow">
      <div className="page-header" style={{ display: 'flex', justifyContent: 'space-between', alignItems: 'center' }}>
        <div>
          <h1 className="page-title">Job Details</h1>
@@ -207,7 +207,7 @@ export default function AgentJobDetails() {
        <div style={{ display: 'grid', gridTemplateColumns: 'repeat(3, 1fr)', gap: 'var(--spacing-md)' }}>
          <div>
            <span className="form-label">Job ID</span>
-            <p style={{ fontFamily: "'JetBrains Mono', monospace", fontSize: '0.8125rem', wordBreak: 'break-all' }}>{job.id}</p>
+            <p style={{ fontFamily: 'var(--font-mono)', fontSize: '0.8125rem', wordBreak: 'break-all' }}>{job.id}</p>
          </div>
          <div>
            <span className="form-label">Task</span>
@@ -264,7 +264,7 @@ export default function AgentJobDetails() {
          </h3>
          <div style={{ display: 'flex', flexWrap: 'wrap', gap: 'var(--spacing-xs)' }}>
            {Object.entries(job.cron_parameters).map(([k, v]) => (
-              <span key={k} className="badge badge-info" style={{ fontFamily: "'JetBrains Mono', monospace", fontSize: '0.75rem' }}>
+              <span key={k} className="badge badge-info" style={{ fontFamily: 'var(--font-mono)', fontSize: '0.75rem' }}>
                {k}={v}
              </span>
            ))}
@@ -281,7 +281,7 @@ export default function AgentJobDetails() {
          </h3>
          <div style={{ display: 'flex', flexWrap: 'wrap', gap: 'var(--spacing-xs)' }}>
            {Object.entries(job.parameters).map(([k, v]) => (
-              <span key={k} className="badge badge-info" style={{ fontFamily: "'JetBrains Mono', monospace", fontSize: '0.75rem' }}>
+              <span key={k} className="badge badge-info" style={{ fontFamily: 'var(--font-mono)', fontSize: '0.75rem' }}>
                {k}={v}
              </span>
            ))}
--- a/core/http/react-ui/src/pages/AgentJobs.jsx
+++ b/core/http/react-ui/src/pages/AgentJobs.jsx
@@ -213,7 +213,7 @@ export default function AgentJobs() {
  // Wizard: no models installed
  if (!loading && models.length === 0) {
    return (
-      <div className="page">
+      <div className="page page--wide">
        <div className="page-header">
          <h1 className="page-title">Agent Jobs</h1>
          <p className="page-subtitle">Manage agent tasks and automated workflows</p>
@@ -240,7 +240,7 @@ export default function AgentJobs() {
  // Wizard: models but no MCP
  if (!loading && models.length > 0 && !hasMCPModels && tasks.length === 0) {
    return (
-      <div className="page">
+      <div className="page page--wide">
        <div className="page-header">
          <h1 className="page-title">Agent Jobs</h1>
          <p className="page-subtitle">Manage agent tasks and automated workflows</p>
@@ -253,7 +253,7 @@ export default function AgentJobs() {
          </p>
          <div style={{ background: 'var(--color-bg-primary)', borderRadius: 'var(--radius-md)', padding: 'var(--spacing-md)', maxWidth: 500, margin: '0 auto var(--spacing-md)', textAlign: 'left' }}>
            <p style={{ fontSize: '0.8125rem', fontWeight: 600, marginBottom: 'var(--spacing-xs)' }}>Example MCP configuration (YAML):</p>
-            <pre style={{ fontSize: '0.75rem', fontFamily: "'JetBrains Mono', monospace", color: 'var(--color-text-secondary)', whiteSpace: 'pre-wrap' }}>{`mcp:
+            <pre style={{ fontSize: '0.75rem', fontFamily: 'var(--font-mono)', color: 'var(--color-text-secondary)', whiteSpace: 'pre-wrap' }}>{`mcp:
  stdio:
    - name: my-tool
      command: /path/to/tool
@@ -273,7 +273,7 @@ export default function AgentJobs() {
  }

  return (
-    <div className="page">
+    <div className="page page--wide">
      <div className="page-header" style={{ display: 'flex', justifyContent: 'space-between', alignItems: 'center' }}>
        <div>
          <h1 className="page-title">Agent Jobs</h1>
@@ -345,7 +345,7 @@ export default function AgentJobs() {
                      </td>
                      <td>
                        {task.cron ? (
-                          <span className="badge badge-info" style={{ fontFamily: "'JetBrains Mono', monospace", fontSize: '0.6875rem' }}>
+                          <span className="badge badge-info" style={{ fontFamily: 'var(--font-mono)', fontSize: '0.6875rem' }}>
                            {task.cron}
                          </span>
                        ) : '-'}
@@ -426,7 +426,7 @@ export default function AgentJobs() {
                  {filteredJobs.map(job => (
                    <tr key={job.id}>
                      <td>
-                        <a onClick={() => navigate(`/app/agent-jobs/jobs/${job.id}`)} style={{ cursor: 'pointer', color: 'var(--color-primary)', fontFamily: "'JetBrains Mono', monospace", fontSize: '0.8125rem' }}>
+                        <a onClick={() => navigate(`/app/agent-jobs/jobs/${job.id}`)} style={{ cursor: 'pointer', color: 'var(--color-primary)', fontFamily: 'var(--font-mono)', fontSize: '0.8125rem' }}>
                          {job.id?.slice(0, 12)}...
                        </a>
                      </td>
@@ -510,7 +510,7 @@ export default function AgentJobs() {
                <tbody>
                  {(items || []).map(job => (
                    <tr key={job.id}>
-                      <td style={{ fontFamily: "'JetBrains Mono', monospace", fontSize: '0.8125rem' }}>{job.id?.slice(0, 12)}...</td>
+                      <td style={{ fontFamily: 'var(--font-mono)', fontSize: '0.8125rem' }}>{job.id?.slice(0, 12)}...</td>
                      <td>{job.task_id || '-'}</td>
                      <td>{statusBadge(job.status)}</td>
                      <td style={{ fontSize: '0.8125rem', color: 'var(--color-text-secondary)' }}>{formatDate(job.created_at)}</td>
@@ -566,7 +566,7 @@ export default function AgentJobs() {
                  onChange={(e) => setExecuteParams(e.target.value)}
                  rows={5}
                  placeholder={`topic=AI trends\nformat=markdown`}
-                  style={{ fontFamily: "'JetBrains Mono', monospace", fontSize: '0.8125rem' }}
+                  style={{ fontFamily: 'var(--font-mono)', fontSize: '0.8125rem' }}
                />
                <p style={{ fontSize: '0.75rem', color: 'var(--color-text-muted)', marginTop: 'var(--spacing-xs)' }}>
                  These will be available as {'{{.parameter_name}}'} in the prompt template.
@@ -590,7 +590,7 @@ export default function AgentJobs() {
                        {executeMultimedia[type].map((item, i) => (
                          <div key={i} style={{
                            display: 'flex', alignItems: 'center', justifyContent: 'space-between',
-                            background: 'var(--color-bg-primary)', borderRadius: 'var(--radius-sm)', padding: '4px 8px', fontSize: '0.75rem',
+                            background: 'var(--color-bg-primary)', borderRadius: 'var(--radius-sm)', padding: 'var(--spacing-xs) var(--spacing-sm)', fontSize: '0.75rem',
                          }}>
                            <span style={{ overflow: 'hidden', textOverflow: 'ellipsis', whiteSpace: 'nowrap' }}>{item.name || item.url?.slice(0, 40)}</span>
                            <button onClick={() => removeMultimedia(type, i)} style={{ background: 'none', border: 'none', color: 'var(--color-error)', cursor: 'pointer', padding: '2px 4px' }}>
--- a/core/http/react-ui/src/pages/AgentStatus.jsx
+++ b/core/http/react-ui/src/pages/AgentStatus.jsx
@@ -260,7 +260,7 @@ export default function AgentStatus() {
  const tree = buildTree(observables)

  return (
-    <div className="page">
+    <div className="page page--wide">
      <style>{`
        .as-card {
          background: var(--color-bg-secondary);
@@ -294,7 +294,7 @@ export default function AgentStatus() {
        .as-id {
          font-size: 0.6875rem;
          color: var(--color-text-muted);
-          font-family: 'JetBrains Mono', monospace;
+          font-family: var(--font-mono);
        }
        .as-summary-item {
          display: flex; align-items: center; gap: 6px;
@@ -303,7 +303,7 @@ export default function AgentStatus() {
        }
        .as-summary-item i { font-size: 0.625rem; flex-shrink: 0; }
        .as-summary-creation i { color: var(--color-primary); }
-        .as-summary-tool-call i { color: #f59e0b; }
+        .as-summary-tool-call i { color: var(--color-warning); }
        .as-summary-completion i { color: var(--color-success); }
        .as-summary-error i { color: var(--color-error); }
        .as-card-body {
@@ -327,13 +327,13 @@ export default function AgentStatus() {
          background: var(--color-bg-tertiary); color: var(--color-text-muted);
          margin-right: 4px; vertical-align: middle;
        }
-        .as-tag-error { background: var(--color-error); color: #fff; }
+        .as-tag-error { background: var(--color-error); color: var(--color-text-inverse); }
        .as-error-text { color: var(--color-error); }
        .as-raw { margin-top: var(--spacing-sm); }
        .as-raw summary { font-size: 0.75rem; color: var(--color-text-muted); cursor: pointer; }
        .as-json {
          background: var(--color-bg-tertiary); border-radius: var(--radius-sm);
-          padding: var(--spacing-sm); font-family: 'JetBrains Mono', monospace;
+          padding: var(--spacing-sm); font-family: var(--font-mono);
          font-size: 0.75rem; overflow-x: auto; white-space: pre-wrap;
          word-break: break-word; max-height: 300px; overflow-y: auto;
        }
--- a/core/http/react-ui/src/pages/AgentTaskDetails.jsx
+++ b/core/http/react-ui/src/pages/AgentTaskDetails.jsx
@@ -159,12 +159,12 @@ export default function AgentTaskDetails() {

  const formatDate = (d) => d ? new Date(d).toLocaleString() : '-'

-  if (loading) return <div className="page" style={{ display: 'flex', justifyContent: 'center', padding: 'var(--spacing-xl)' }}><LoadingSpinner size="lg" /></div>
+  if (loading) return <div className="page page--narrow" style={{ display: 'flex', justifyContent: 'center', padding: 'var(--spacing-xl)' }}><LoadingSpinner size="lg" /></div>

  // View mode
  if (!isNew && !isEdit) {
    return (
-      <div className="page" style={{ maxWidth: 900 }}>
+      <div className="page page--narrow">
        <div className="page-header" style={{ display: 'flex', justifyContent: 'space-between', alignItems: 'center' }}>
          <div>
            <h1 className="page-title">{task.name || 'Task Details'}</h1>
@@ -198,7 +198,7 @@ export default function AgentTaskDetails() {
            {task.cron && (
              <div>
                <span className="form-label">Cron Schedule</span>
-                <p style={{ fontFamily: "'JetBrains Mono', monospace", fontSize: '0.8125rem' }}>{task.cron}</p>
+                <p style={{ fontFamily: 'var(--font-mono)', fontSize: '0.8125rem' }}>{task.cron}</p>
              </div>
            )}
          </div>
@@ -229,13 +229,13 @@ export default function AgentTaskDetails() {
          <div style={{ display: 'flex', flexDirection: 'column', gap: 'var(--spacing-md)' }}>
            <div>
              <span className="form-label">Execute by name</span>
-              <pre style={{ background: 'var(--color-bg-primary)', padding: 'var(--spacing-sm)', borderRadius: 'var(--radius-md)', fontSize: '0.75rem', fontFamily: "'JetBrains Mono', monospace", whiteSpace: 'pre-wrap', overflow: 'auto' }}>
+              <pre style={{ background: 'var(--color-bg-primary)', padding: 'var(--spacing-sm)', borderRadius: 'var(--radius-md)', fontSize: '0.75rem', fontFamily: 'var(--font-mono)', whiteSpace: 'pre-wrap', overflow: 'auto' }}>
 {`curl -X POST ${window.location.origin}${basePath}/api/agent/tasks/${encodeURIComponent(task.name)}/execute`}
              </pre>
            </div>
            <div>
              <span className="form-label">Execute with multimedia</span>
-              <pre style={{ background: 'var(--color-bg-primary)', padding: 'var(--spacing-sm)', borderRadius: 'var(--radius-md)', fontSize: '0.75rem', fontFamily: "'JetBrains Mono', monospace", whiteSpace: 'pre-wrap', overflow: 'auto' }}>
+              <pre style={{ background: 'var(--color-bg-primary)', padding: 'var(--spacing-sm)', borderRadius: 'var(--radius-md)', fontSize: '0.75rem', fontFamily: 'var(--font-mono)', whiteSpace: 'pre-wrap', overflow: 'auto' }}>
 {`curl -X POST ${window.location.origin}${basePath}/api/agent/tasks/${encodeURIComponent(task.name)}/execute \\
  -H "Content-Type: application/json" \\
  -d '{"multimedia": {"images": [{"url": "https://example.com/image.jpg"}]}}'`}
@@ -243,7 +243,7 @@ export default function AgentTaskDetails() {
            </div>
            <div>
              <span className="form-label">Check job status</span>
-              <pre style={{ background: 'var(--color-bg-primary)', padding: 'var(--spacing-sm)', borderRadius: 'var(--radius-md)', fontSize: '0.75rem', fontFamily: "'JetBrains Mono', monospace", whiteSpace: 'pre-wrap', overflow: 'auto' }}>
+              <pre style={{ background: 'var(--color-bg-primary)', padding: 'var(--spacing-sm)', borderRadius: 'var(--radius-md)', fontSize: '0.75rem', fontFamily: 'var(--font-mono)', whiteSpace: 'pre-wrap', overflow: 'auto' }}>
 {`curl ${window.location.origin}${basePath}/api/agent/jobs/<job-id>`}
              </pre>
            </div>
@@ -261,7 +261,7 @@ export default function AgentTaskDetails() {
              <div key={i} style={{ background: 'var(--color-bg-primary)', borderRadius: 'var(--radius-md)', padding: 'var(--spacing-sm)', marginBottom: 'var(--spacing-sm)' }}>
                <div style={{ display: 'flex', gap: 'var(--spacing-sm)', fontSize: '0.8125rem' }}>
                  <span className="badge badge-info">{wh.method || 'POST'}</span>
-                  <span style={{ fontFamily: "'JetBrains Mono', monospace" }}>{wh.url}</span>
+                  <span style={{ fontFamily: 'var(--font-mono)' }}>{wh.url}</span>
                </div>
              </div>
            ))}
@@ -283,7 +283,7 @@ export default function AgentTaskDetails() {
                <tbody>
                  {jobHistory.map(job => (
                    <tr key={job.id}>
-                      <td style={{ fontFamily: "'JetBrains Mono', monospace", fontSize: '0.8125rem' }}>
+                      <td style={{ fontFamily: 'var(--font-mono)', fontSize: '0.8125rem' }}>
                        {job.id?.slice(0, 12)}...
                      </td>
                      <td>{statusBadge(job.status)}</td>
@@ -306,7 +306,7 @@ export default function AgentTaskDetails() {

  // Edit/Create form
  return (
-    <div className="page" style={{ maxWidth: 900 }}>
+    <div className="page page--narrow">
      <div className="page-header" style={{ display: 'flex', justifyContent: 'space-between', alignItems: 'center' }}>
        <h1 className="page-title">{isNew ? 'Create Task' : 'Edit Task'}</h1>
        <button className="btn btn-secondary btn-sm" onClick={() => navigate('/app/agent-jobs')}>
@@ -351,7 +351,7 @@ export default function AgentTaskDetails() {
              onChange={(e) => updateField('prompt', e.target.value)}
              rows={8}
              placeholder={`Write a summary about {{.topic}} in {{.format}} format.`}
-              style={{ fontFamily: "'JetBrains Mono', monospace", fontSize: '0.8125rem' }}
+              style={{ fontFamily: 'var(--font-mono)', fontSize: '0.8125rem' }}
            />
            <p style={{ fontSize: '0.75rem', color: 'var(--color-text-muted)', marginTop: 'var(--spacing-xs)' }}>
              Use {'{{.parameter_name}}'} for dynamic parameters. Parameters are provided when executing the task.
@@ -376,7 +376,7 @@ export default function AgentTaskDetails() {
              value={task.cron}
              onChange={(e) => { updateField('cron', e.target.value); validateCron(e.target.value) }}
              placeholder="0 */6 * * *"
-              style={{ fontFamily: "'JetBrains Mono', monospace" }}
+              style={{ fontFamily: 'var(--font-mono)' }}
            />
            {cronError && <p style={{ color: 'var(--color-error)', fontSize: '0.75rem', marginTop: 4 }}>{cronError}</p>}
            <p style={{ fontSize: '0.75rem', color: 'var(--color-text-muted)', marginTop: 'var(--spacing-xs)' }}>
@@ -392,7 +392,7 @@ export default function AgentTaskDetails() {
                onChange={(e) => updateField('cron_parameters', e.target.value)}
                rows={3}
                placeholder={`topic=daily news\nformat=bullet points`}
-                style={{ fontFamily: "'JetBrains Mono', monospace", fontSize: '0.8125rem' }}
+                style={{ fontFamily: 'var(--font-mono)', fontSize: '0.8125rem' }}
              />
              <p style={{ fontSize: '0.75rem', color: 'var(--color-text-muted)', marginTop: 'var(--spacing-xs)' }}>
                Default parameters used when the cron triggers the task.
@@ -437,7 +437,7 @@ export default function AgentTaskDetails() {
                </div>
                <div className="form-group" style={{ marginTop: 'var(--spacing-xs)' }}>
                  <label className="form-label">Headers (JSON)</label>
-                  <input className="input" value={ms.headers} onChange={(e) => updateMultimediaSource(i, 'headers', e.target.value)} placeholder='{"Authorization": "Bearer ..."}' style={{ fontFamily: "'JetBrains Mono', monospace", fontSize: '0.8125rem' }} />
+                  <input className="input" value={ms.headers} onChange={(e) => updateMultimediaSource(i, 'headers', e.target.value)} placeholder='{"Authorization": "Bearer ..."}' style={{ fontFamily: 'var(--font-mono)', fontSize: '0.8125rem' }} />
                </div>
              </div>
            ))
@@ -479,7 +479,7 @@ export default function AgentTaskDetails() {
                </div>
                <div className="form-group" style={{ marginTop: 'var(--spacing-xs)' }}>
                  <label className="form-label">Headers (JSON)</label>
-                  <input className="input" value={wh.headers} onChange={(e) => updateWebhook(i, 'headers', e.target.value)} placeholder='{"Content-Type": "application/json"}' style={{ fontFamily: "'JetBrains Mono', monospace", fontSize: '0.8125rem' }} />
+                  <input className="input" value={wh.headers} onChange={(e) => updateWebhook(i, 'headers', e.target.value)} placeholder='{"Content-Type": "application/json"}' style={{ fontFamily: 'var(--font-mono)', fontSize: '0.8125rem' }} />
                </div>
                <div className="form-group" style={{ marginTop: 'var(--spacing-xs)' }}>
                  <label className="form-label">Payload Template (Go template syntax)</label>
@@ -489,7 +489,7 @@ export default function AgentTaskDetails() {
                    onChange={(e) => updateWebhook(i, 'payload_template', e.target.value)}
                    rows={3}
                    placeholder={`{"text": "Job {{.Status}}: {{if .Error}}Error: {{.Error}}{{else}}{{.Result}}{{end}}"}`}
-                    style={{ fontFamily: "'JetBrains Mono', monospace", fontSize: '0.8125rem' }}
+                    style={{ fontFamily: 'var(--font-mono)', fontSize: '0.8125rem' }}
                  />
                  <p style={{ fontSize: '0.75rem', color: 'var(--color-text-muted)', marginTop: 2 }}>
                    Available: {'{{.Job}}'} {'{{.Task}}'} {'{{.Result}}'} {'{{.Error}}'} {'{{.Status}}'}
--- a/core/http/react-ui/src/pages/Agents.jsx
+++ b/core/http/react-ui/src/pages/Agents.jsx
@@ -136,7 +136,7 @@ export default function Agents() {
  }

  return (
-    <div className="page">
+    <div className="page page--wide">
      <style>{`
        .agents-import-input { display: none; }
        .agents-toolbar {
--- a/core/http/react-ui/src/pages/BackendLogs.jsx
+++ b/core/http/react-ui/src/pages/BackendLogs.jsx
@@ -149,7 +149,7 @@ function BackendLogsDetail({ modelId }) {
  }

  return (
-    <div className="page">
+    <div className="page page--wide">
      <div className="page-header">
        <div>
          <h1 className="page-title" style={{ marginBottom: 0 }}>
@@ -229,7 +229,7 @@ function BackendLogsDetail({ modelId }) {
            borderRadius: 'var(--radius-md)',
            overflow: 'auto',
            maxHeight: 'calc(100vh - 280px)',
-            fontFamily: 'JetBrains Mono, Consolas, monospace',
+            fontFamily: 'var(--font-mono)',
            fontSize: '0.75rem',
            lineHeight: '1.5',
          }}
@@ -283,7 +283,7 @@ export default function BackendLogs() {

  // No model specified — redirect to System page
  return (
-    <div className="page">
+    <div className="page page--wide">
      <div className="empty-state">
        <div className="empty-state-icon"><i className="fas fa-terminal" /></div>
        <h2 className="empty-state-title">No model selected</h2>