chore(turboquant): bump fork to 4d24ad87 and patch ggml-hip for new f16-turbo fattn-vec instances

Bump TURBOQUANT_VERSION from 627ebbc6 to 4d24ad87, which pulls in upstream
commit fa4e8be0a0ce ("fix(cuda): add F16-K + TURBO-V dispatch cases in
fattn.cu"). That commit adds three new template instance files under
ggml-cuda/template-instances/:

  - fattn-vec-instance-f16-turbo2_0.cu
  - fattn-vec-instance-f16-turbo3_0.cu
  - fattn-vec-instance-f16-turbo4_0.cu

and wires matching FATTN_VEC_CASES_ALL_D(GGML_TYPE_F16,
GGML_TYPE_TURBO{2,3,4}_0) dispatch cases into fattn.cu.

The dispatch cases are compiled into the HIP build (fattn.cu is shared with
ggml-hip via hipify), but the fork forgot to mirror the new source files into
ggml/src/ggml-hip/CMakeLists.txt. CMake's ROCm branch carries a hand-curated
template-instance list (used when GGML_CUDA_FA_ALL_QUANTS is OFF, which is the
default), so the HIP build ends up with the extern template declarations but
no matching instantiations; the -gpu-rocm-hipblas-turboquant job failed at
link time (~90min into the 3h+ build).

Add patches/0001-ggml-hip-add-f16-turbo-vec-instances.patch, which the
existing apply-patches.sh machinery applies to the cloned fork sources after
fetch. The patch appends the three new f16-turbo instance files to ggml-hip's
source list in the same interleaved order used by ggml-cuda's CMakeLists.txt.
Drop this patch once the fork syncs the ROCm list (the build will fail fast if
the anchor context goes stale, which is the signal to retire it).

CUDA builds were unaffected (ggml-cuda's CMakeLists.txt was updated upstream);
the failure was isolated to HIP.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
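
The CUDA/HIP list drift described in this commit message could be caught before the link step. The following is a minimal sketch, not part of the fork or of LocalAI: the function name `missing_hip_instances` is hypothetical, and it assumes both CMakeLists.txt files name each instance file explicitly as `fattn-vec-instance-*.cu`.

```shell
# Hypothetical helper: print every fattn-vec instance file that is named in
# the first file ($1, e.g. ggml-cuda/CMakeLists.txt) but not mentioned in the
# second ($2, e.g. ggml-hip/CMakeLists.txt).
missing_hip_instances() {
    grep -oE 'fattn-vec-instance-[A-Za-z0-9_-]+\.cu' "$1" | sort -u |
    while IFS= read -r f; do
        # Fixed-string match: report the file only if $2 never names it.
        grep -qF -- "$f" "$2" || printf '%s\n' "$f"
    done
}
```

Run against the two lists, empty output would mean the HIP list covers every instance the CUDA list names; any printed file name is a candidate for an unresolved symbol at link time.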
chore: ⬆️ Update ggml-org/llama.cpp to 5a4cd6741fc33227cdacb329f355ab21f8481de2 (#9479)
2026-05-22 07:38:26 -04:00 · 2026-04-22 07:13:47 +00:00 · 2026-04-22 08:58:19 +02:00 · 2026-04-22 08:22:05 +02:00 · 2026-04-21 22:06:35 +02:00 · 2026-04-21 21:59:33 +02:00
73 changed files with 4500 additions and 755 deletions
--- a/.agents/ai-coding-assistants.md
+++ b/.agents/ai-coding-assistants.md
@@ -0,0 +1,101 @@
+# AI Coding Assistants
+
+This document provides guidance for AI tools and developers using AI
+assistance when contributing to LocalAI.
+
+**LocalAI follows the same guidelines as the Linux kernel project for
+AI-assisted contributions.** See the upstream policy here:
+<https://docs.kernel.org/process/coding-assistants.html>
+
+The rules below mirror that policy, adapted to LocalAI's license and
+project layout. If anything is unclear, the kernel document is the
+authoritative reference for intent.
+
+AI tools helping with LocalAI development should follow the standard
+project development process:
+
+- [CONTRIBUTING.md](../CONTRIBUTING.md) — development workflow, commit
+  conventions, and PR guidelines
+- [.agents/coding-style.md](coding-style.md) — code style, editorconfig,
+  logging, and documentation conventions
+- [.agents/building-and-testing.md](building-and-testing.md) — build and
+  test procedures
+
+## Licensing and Legal Requirements
+
+All contributions must comply with LocalAI's licensing requirements:
+
+- LocalAI is licensed under the **MIT License** — see the [LICENSE](../LICENSE)
+  file
+- New source files should use the SPDX license identifier `MIT` where
+  applicable to the file type
+- Contributions must be compatible with the MIT License and must not
+  introduce code under incompatible licenses (e.g., GPL) without an
+  explicit discussion with maintainers
+
+## Signed-off-by and Developer Certificate of Origin
+
+**AI agents MUST NOT add `Signed-off-by` tags.** Only humans can legally
+certify the Developer Certificate of Origin (DCO). The human submitter
+is responsible for:
+
+- Reviewing all AI-generated code
+- Ensuring compliance with licensing requirements
+- Adding their own `Signed-off-by` tag (when the project requires DCO)
+  to certify the contribution
+- Taking full responsibility for the contribution
+
+AI agents MUST NOT add `Co-Authored-By` trailers for themselves either.
+A human reviewer owns the contribution; the AI's involvement is recorded
+via `Assisted-by` (see below).
+
+## Attribution
+
+When AI tools contribute to LocalAI development, proper attribution helps
+track the evolving role of AI in the development process. Contributions
+should include an `Assisted-by` tag in the commit message trailer in the
+following format:
+
+```
+Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]
+```
+
+Where:
+
+- `AGENT_NAME` — name of the AI tool or framework (e.g., `Claude`,
+  `Copilot`, `Cursor`)
+- `MODEL_VERSION` — specific model version used (e.g.,
+  `claude-opus-4-7`, `gpt-5`)
+- `[TOOL1] [TOOL2]` — optional specialized analysis tools invoked by the
+  agent (e.g., `golangci-lint`, `staticcheck`, `go vet`)
+
+Basic development tools (git, go, make, editors) should **not** be listed.
+
+### Example
+
+```
+fix(llama-cpp): handle empty tool call arguments
+
+Previously the parser panicked when the model returned a tool call with
+an empty arguments object. Fall back to an empty JSON object in that
+case so downstream consumers receive a valid payload.
+
+Assisted-by: Claude:claude-opus-4-7 golangci-lint
+Signed-off-by: Jane Developer <jane@example.com>
+```
+
+## Scope and Responsibility
+
+Using an AI assistant does not reduce the contributor's responsibility.
+The human submitter must:
+
+- Understand every line that lands in the PR
+- Verify that generated code compiles, passes tests, and follows the
+  project style
+- Confirm that any referenced APIs, flags, or file paths actually exist
+  in the current tree (AI models may hallucinate identifiers)
+- Not submit AI output verbatim without review
+
+Reviewers may ask for clarification on any change regardless of how it
+was produced. "An AI wrote it" is not an acceptable answer to a design
+question.
--- a/.github/workflows/backend.yml
+++ b/.github/workflows/backend.yml
@@ -30,6 +30,7 @@ jobs:
      skip-drivers: ${{ matrix.skip-drivers }}
      context: ${{ matrix.context }}
      ubuntu-version: ${{ matrix.ubuntu-version }}
+      amdgpu-targets: ${{ matrix.amdgpu-targets }}
    secrets:
      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
@@ -1623,19 +1624,6 @@ jobs:
            dockerfile: "./backend/Dockerfile.python"
            context: "./"
            ubuntu-version: '2404'
-          - build-type: 'hipblas'
-            cuda-major-version: ""
-            cuda-minor-version: ""
-            platforms: 'linux/amd64'
-            tag-latest: 'auto'
-            tag-suffix: '-gpu-rocm-hipblas-whisperx'
-            runs-on: 'bigger-runner'
-            base-image: "rocm/dev-ubuntu-24.04:7.2.1"
-            skip-drivers: 'false'
-            backend: "whisperx"
-            dockerfile: "./backend/Dockerfile.python"
-            context: "./"
-            ubuntu-version: '2404'
          - build-type: 'hipblas'
            cuda-major-version: ""
            cuda-minor-version: ""
--- a/.github/workflows/backend_build.yml
+++ b/.github/workflows/backend_build.yml
@@ -58,6 +58,11 @@ on:
        required: false
        default: '2204'
        type: string
+      amdgpu-targets:
+        description: 'AMD GPU targets for ROCm/HIP builds'
+        required: false
+        default: 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201'
+        type: string
    secrets:
      dockerUsername:
        required: false
@@ -214,6 +219,7 @@ jobs:
            BASE_IMAGE=${{ inputs.base-image }}
            BACKEND=${{ inputs.backend }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
+            AMDGPU_TARGETS=${{ inputs.amdgpu-targets }}
          context: ${{ inputs.context }}
          file: ${{ inputs.dockerfile }}
          cache-from: type=gha
@@ -235,6 +241,7 @@ jobs:
            BASE_IMAGE=${{ inputs.base-image }}
            BACKEND=${{ inputs.backend }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
+            AMDGPU_TARGETS=${{ inputs.amdgpu-targets }}
          context: ${{ inputs.context }}
          file: ${{ inputs.dockerfile }}
          cache-from: type=gha
--- a/.github/workflows/gallery-agent.yaml
+++ b/.github/workflows/gallery-agent.yaml
@@ -54,24 +54,41 @@ jobs:
          REPO: ${{ github.repository }}
          SEARCH: 'gallery agent in:title'
        run: |
-          # Walk open gallery-agent PRs and act on maintainer comments:
+          # Walk gallery-agent PRs and act on maintainer comments:
          #   /gallery-agent blacklist → label `gallery-agent/blacklisted` + close (never repropose)
          #   /gallery-agent recreate  → close without label (next run may repropose)
          # Only comments from OWNER / MEMBER / COLLABORATOR are honored so
          # random users can't drive the bot.
+          #
+          # We scan both open PRs AND recently-closed PRs that don't already
+          # carry the blacklist label. This covers the common flow where a
+          # maintainer writes /gallery-agent blacklist and immediately clicks
+          # Close — without this, the next scheduled run wouldn't see the
+          # command (PR is already closed) and would repropose the model.
          gh label create gallery-agent/blacklisted \
            --repo "$REPO" --color ededed \
            --description "gallery-agent must not repropose this model" 2>/dev/null || true

-          prs=$(gh pr list --repo "$REPO" --state open --search "$SEARCH" --json number --jq '.[].number')
+          prs_open=$(gh pr list --repo "$REPO" --state open --search "$SEARCH" \
+            --json number --jq '.[].number')
+          # Closed PRs from the last 14 days that don't yet have the blacklist label.
+          # Bounded window keeps the scan cheap while covering late-applied commands.
+          since=$(date -u -d '14 days ago' +%Y-%m-%d)
+          prs_closed=$(gh pr list --repo "$REPO" --state closed \
+            --search "$SEARCH closed:>=$since -label:gallery-agent/blacklisted" \
+            --json number --jq '.[].number')
+          prs=$(printf '%s\n%s\n' "$prs_open" "$prs_closed" | sort -u | sed '/^$/d')
          for pr in $prs; do
+            state=$(gh pr view "$pr" --repo "$REPO" --json state --jq '.state')
            cmds=$(gh pr view "$pr" --repo "$REPO" --json comments \
              --jq '.comments[] | select(.authorAssociation=="OWNER" or .authorAssociation=="MEMBER" or .authorAssociation=="COLLABORATOR") | .body')
            if echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+blacklist([[:space:]]|$)'; then
-              echo "PR #$pr: blacklist command found"
+              echo "PR #$pr: blacklist command found (state=$state)"
              gh pr edit "$pr" --repo "$REPO" --add-label gallery-agent/blacklisted || true
-              gh pr close "$pr" --repo "$REPO" --comment "Blacklisted via \`/gallery-agent blacklist\`. This model will not be reproposed." || true
-            elif echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+recreate([[:space:]]|$)'; then
+              if [ "$state" = "OPEN" ]; then
+                gh pr close "$pr" --repo "$REPO" --comment "Blacklisted via \`/gallery-agent blacklist\`. This model will not be reproposed." || true
+              fi
+            elif [ "$state" = "OPEN" ] && echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+recreate([[:space:]]|$)'; then
              echo "PR #$pr: recreate command found"
              gh pr close "$pr" --repo "$REPO" --comment "Closed via \`/gallery-agent recreate\`. The next scheduled run will propose this model again." || true
            fi
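
The decision logic in the hunk above (blacklist wins over recreate, closed PRs are only labelled, recreate is honored only on open PRs) can be isolated from the `gh` calls. A minimal sketch, assuming the same grep patterns as the workflow; the function name `decide_action` and its output tokens are hypothetical and exist only for illustration.

```shell
# Hypothetical helper mirroring the workflow's branching.
#   $1 = PR state as compared in the workflow (OPEN or anything else)
#   $2 = bodies of the maintainer comments on that PR
# Prints one of: label-and-close, label-only, close, skip.
decide_action() {
    bl='(^|[[:space:]])/gallery-agent[[:space:]]+blacklist([[:space:]]|$)'
    rc='(^|[[:space:]])/gallery-agent[[:space:]]+recreate([[:space:]]|$)'
    if printf '%s\n' "$2" | grep -qE "$bl"; then
        # Blacklist always labels; closing only makes sense on an open PR.
        if [ "$1" = "OPEN" ]; then echo label-and-close; else echo label-only; fi
    elif [ "$1" = "OPEN" ] && printf '%s\n' "$2" | grep -qE "$rc"; then
        echo close
    else
        echo skip
    fi
}
```

For example, `decide_action CLOSED "/gallery-agent blacklist"` prints `label-only`, which is the case the widened scan is meant to cover.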
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,11 +1,23 @@
 # LocalAI Agent Instructions

-This file is an index to detailed topic guides in the `.agents/` directory. Read the relevant file(s) for the task at hand — you don't need to load all of them.
+This file is the entry point for AI coding assistants (Claude Code, Cursor, Copilot, Codex, Aider, etc.) working on LocalAI. It is an index to detailed topic guides in the `.agents/` directory. Read the relevant file(s) for the task at hand — you don't need to load all of them.
+
+Human contributors: see [CONTRIBUTING.md](CONTRIBUTING.md) for the development workflow.
+
+## Policy for AI-Assisted Contributions
+
+LocalAI follows the Linux kernel project's [guidelines for AI coding assistants](https://docs.kernel.org/process/coding-assistants.html). Before submitting AI-assisted code, read [.agents/ai-coding-assistants.md](.agents/ai-coding-assistants.md). Key rules:
+
+- **No `Signed-off-by` from AI.** Only the human submitter may sign off on the Developer Certificate of Origin.
+- **No `Co-Authored-By: <AI>` trailers.** The human contributor owns the change.
+- **Use an `Assisted-by:` trailer** to attribute AI involvement. Format: `Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]`.
+- **The human submitter is responsible** for reviewing, testing, and understanding every line of generated code.

 ## Topics

 | File | When to read |
 |------|-------------|
+| [.agents/ai-coding-assistants.md](.agents/ai-coding-assistants.md) | Policy for AI-assisted contributions — licensing, DCO, attribution |
 | [.agents/building-and-testing.md](.agents/building-and-testing.md) | Building the project, running tests, Docker builds for specific platforms |
 | [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist |
 | [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -13,6 +13,7 @@ Thank you for your interest in contributing to LocalAI! We appreciate your time
  - [Development Workflow](#development-workflow)
  - [Creating a Pull Request (PR)](#creating-a-pull-request-pr)
 - [Coding Guidelines](#coding-guidelines)
+- [AI Coding Assistants](#ai-coding-assistants)
 - [Testing](#testing)
 - [Documentation](#documentation)
 - [Community and Communication](#community-and-communication)
@@ -185,7 +186,7 @@ Before jumping into a PR for a massive feature or big change, it is preferred to

 This project uses an [`.editorconfig`](.editorconfig) file to define formatting standards (indentation, line endings, charset, etc.). Please configure your editor to respect it.

-For AI-assisted development, see [`CLAUDE.md`](CLAUDE.md) for agent-specific guidelines including build instructions and backend architecture details.
+For AI-assisted development, see [`AGENTS.md`](AGENTS.md) (or the equivalent [`CLAUDE.md`](CLAUDE.md) symlink) for agent-specific guidelines including build instructions and backend architecture details. Contributions produced with AI assistance must follow the rules in the [AI Coding Assistants](#ai-coding-assistants) section below.

 ### General Principles

@@ -211,6 +212,26 @@ For AI-assisted development, see [`CLAUDE.md`](CLAUDE.md) for agent-specific gui
 - Reviewers will check for correctness, test coverage, adherence to these guidelines, and clarity of intent.
 - Be responsive to review feedback and keep discussions constructive.

+## AI Coding Assistants
+
+LocalAI follows the **same guidelines as the Linux kernel project** for AI-assisted contributions: <https://docs.kernel.org/process/coding-assistants.html>.
+
+The full policy for this repository lives in [`.agents/ai-coding-assistants.md`](.agents/ai-coding-assistants.md). Summary:
+
+- **AI agents MUST NOT add `Signed-off-by` tags.** Only humans can certify the Developer Certificate of Origin.
+- **AI agents MUST NOT add `Co-Authored-By` trailers** attributing themselves as co-authors.
+- **Attribute AI involvement with an `Assisted-by` trailer** in the commit message:
+
+  ```
+  Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]
+  ```
+
+  Example: `Assisted-by: Claude:claude-opus-4-7 golangci-lint`
+
+  Basic development tools (git, go, make, editors) should not be listed.
+- **The human submitter is responsible** for reviewing, testing, and fully understanding every line of AI-generated code — including verifying that any referenced APIs, flags, or file paths actually exist in the tree.
+- Contributions must remain compatible with LocalAI's **MIT License**.
+
 ## Testing

 All new features and bug fixes should include test coverage. The project uses [Ginkgo](https://onsi.github.io/ginkgo/) as its test framework.
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@

-IK_LLAMA_VERSION?=8befd92ea5f702494ea9813fe42a52fb015db5fe
+IK_LLAMA_VERSION?=d4824131580b94ffa7b0e91c955e2b237c2fe16e
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@

-LLAMA_VERSION?=4f02d4733934179386cbc15b3454be26237940bb
+LLAMA_VERSION?=5a4cd6741fc33227cdacb329f355ab21f8481de2
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/turboquant/Makefile
+++ b/backend/cpp/turboquant/Makefile
@@ -1,7 +1,7 @@

 # Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
 # Auto-bumped nightly by .github/workflows/bump_deps.yaml.
-TURBOQUANT_VERSION?=45f8a066ed5f5bb38c695cec532f6cef9f4efa9d
+TURBOQUANT_VERSION?=4d24ad87b8ed2ad160809af41930f1e04b83f234
 LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant

 CMAKE_ARGS?=
--- a/backend/cpp/turboquant/patch-grpc-server.sh
+++ b/backend/cpp/turboquant/patch-grpc-server.sh
@@ -1,13 +1,22 @@
 #!/bin/bash
-# Augment the shared backend/cpp/llama-cpp/grpc-server.cpp allow-list of KV-cache
-# types so the gRPC `LoadModel` call accepts the TurboQuant-specific
-# `turbo2` / `turbo3` / `turbo4` cache types.
+# Patch the shared backend/cpp/llama-cpp/grpc-server.cpp *copy* used by the
+# turboquant build to account for two gaps between upstream and the fork:
 #
-# We do this on the *copy* sitting in turboquant-<flavor>-build/, never on the
-# original under backend/cpp/llama-cpp/, so the stock llama-cpp build keeps
-# compiling against vanilla upstream which does not know about GGML_TYPE_TURBO*.
+#   1. Augment the kv_cache_types[] allow-list so `LoadModel` accepts the
+#      fork-specific `turbo2` / `turbo3` / `turbo4` cache types.
+#   2. Replace `get_media_marker()` (added upstream in ggml-org/llama.cpp#21962,
+#      server-side random per-instance marker) with the legacy "<__media__>"
+#      literal. The fork branched before that PR, so server-common.cpp has no
+#      get_media_marker symbol. The fork's mtmd_default_marker() still returns
+#      "<__media__>", and Go-side tooling falls back to that sentinel when the
+#      backend does not expose media_marker, so substituting the literal keeps
+#      behavior identical on the turboquant path.
 #
-# Idempotent: skips the insertion if the marker is already present (so re-runs
+# We patch the *copy* sitting in turboquant-<flavor>-build/, never the original
+# under backend/cpp/llama-cpp/, so the stock llama-cpp build keeps compiling
+# against vanilla upstream.
+#
+# Idempotent: skips each insertion if its marker is already present (so re-runs
 # of the same build dir don't double-insert).

 set -euo pipefail
@@ -25,33 +34,47 @@ if [[ ! -f "$SRC" ]]; then
 fi

 if grep -q 'GGML_TYPE_TURBO2_0' "$SRC"; then
-    echo "==> $SRC already has TurboQuant cache types, skipping"
-    exit 0
+    echo "==> $SRC already has TurboQuant cache types, skipping KV allow-list patch"
+else
+    echo "==> patching $SRC to allow turbo2/turbo3/turbo4 KV-cache types"
+
+    # Insert the three TURBO entries right after the first `    GGML_TYPE_Q5_1,`
+    # line (the kv_cache_types[] allow-list). Using awk because the builder image
+    # does not ship python3, and GNU sed's multi-line `a\` quoting is awkward.
+    awk '
+        /^    GGML_TYPE_Q5_1,$/ && !done {
+            print
+            print "    // turboquant fork extras — added by patch-grpc-server.sh"
+            print "    GGML_TYPE_TURBO2_0,"
+            print "    GGML_TYPE_TURBO3_0,"
+            print "    GGML_TYPE_TURBO4_0,"
+            done = 1
+            next
+        }
+        { print }
+        END {
+            if (!done) {
+                print "patch-grpc-server.sh: anchor `    GGML_TYPE_Q5_1,` not found" > "/dev/stderr"
+                exit 1
+            }
+        }
+    ' "$SRC" > "$SRC.tmp"
+    mv "$SRC.tmp" "$SRC"
+
+    echo "==> KV allow-list patch OK"
 fi

-echo "==> patching $SRC to allow turbo2/turbo3/turbo4 KV-cache types"
+if grep -q 'get_media_marker()' "$SRC"; then
+    echo "==> patching $SRC to replace get_media_marker() with legacy \"<__media__>\" literal"
+    # Only one call site today (ModelMetadata), but replace all occurrences to
+    # stay robust if upstream adds more. Use a temp file to avoid relying on
+    # sed -i portability (the builder image uses GNU sed, but keeping this
+    # consistent with the awk block above).
+    sed 's/get_media_marker()/"<__media__>"/g' "$SRC" > "$SRC.tmp"
+    mv "$SRC.tmp" "$SRC"
+    echo "==> get_media_marker() substitution OK"
+else
+    echo "==> $SRC has no get_media_marker() call, skipping media-marker patch"
+fi

-# Insert the three TURBO entries right after the first `    GGML_TYPE_Q5_1,`
-# line (the kv_cache_types[] allow-list). Using awk because the builder image
-# does not ship python3, and GNU sed's multi-line `a\` quoting is awkward.
-awk '
-    /^    GGML_TYPE_Q5_1,$/ && !done {
-        print
-        print "    // turboquant fork extras — added by patch-grpc-server.sh"
-        print "    GGML_TYPE_TURBO2_0,"
-        print "    GGML_TYPE_TURBO3_0,"
-        print "    GGML_TYPE_TURBO4_0,"
-        done = 1
-        next
-    }
-    { print }
-    END {
-        if (!done) {
-            print "patch-grpc-server.sh: anchor `    GGML_TYPE_Q5_1,` not found" > "/dev/stderr"
-            exit 1
-        }
-    }
-' "$SRC" > "$SRC.tmp"
-mv "$SRC.tmp" "$SRC"
-
-echo "==> patched OK"
+echo "==> all patches applied"
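
The media-marker substitution in the script above is safe to re-run because its output no longer contains the pattern it searches for. A minimal sketch of that property, using the same sed expression on stdin; the wrapper name `replace_media_marker` and the sample C++ line in the usage note are hypothetical, not taken from grpc-server.cpp.

```shell
# Hypothetical wrapper around the script's sed expression: reads source text
# on stdin and writes it to stdout with every get_media_marker() call replaced
# by the legacy "<__media__>" string literal. In a basic regular expression
# the parentheses are literal, so only the exact call text matches.
replace_media_marker() {
    sed 's/get_media_marker()/"<__media__>"/g'
}
```

Piping a line such as `meta.set_marker(get_media_marker());` through the function once or twice should give the same result, since the second pass finds nothing left to replace.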
--- a/backend/cpp/turboquant/patches/0001-ggml-hip-add-f16-turbo-vec-instances.patch
+++ b/backend/cpp/turboquant/patches/0001-ggml-hip-add-f16-turbo-vec-instances.patch
@@ -0,0 +1,47 @@
+From: LocalAI turboquant backend maintainers <noreply@localai.io>
+Subject: ggml-hip: add F16-K + TURBO-V fattn-vec template instances
+
+Upstream commit fa4e8be0a0ce ("fix(cuda): add F16-K + TURBO-V dispatch cases
+in fattn.cu") added three new template instance files under ggml-cuda/:
+
+  - fattn-vec-instance-f16-turbo2_0.cu
+  - fattn-vec-instance-f16-turbo3_0.cu
+  - fattn-vec-instance-f16-turbo4_0.cu
+
+and registered them in ggml/src/ggml-cuda/CMakeLists.txt. The companion
+dispatch cases FATTN_VEC_CASES_ALL_D(GGML_TYPE_F16, GGML_TYPE_TURBO{2,3,4}_0)
+were added to ggml/src/ggml-cuda/fattn.cu, which is shared with the HIP
+build path via hipify.
+
+However, ggml/src/ggml-hip/CMakeLists.txt carries its own explicit list of
+template instance sources (used when GGML_CUDA_FA_ALL_QUANTS is OFF, which
+is the default) and was never updated for the new F16-K + TURBO-V combos.
+The HIP build therefore compiles the dispatch cases (which reference
+ggml_cuda_flash_attn_ext_vec_case<D, F16, TURBO*>) without ever compiling
+the matching template instantiations, causing a link-time failure in the
+-gpu-rocm-hipblas-turboquant CI job.
+
+Add the three new template instance files to ggml-hip's list so the HIP
+build links cleanly. Drop this patch once the fork picks up the
+corresponding upstream sync in ggml-hip/CMakeLists.txt.
+
+--- a/ggml/src/ggml-hip/CMakeLists.txt
+++ b/ggml/src/ggml-hip/CMakeLists.txt
+@@ -85,14 +85,17 @@ else()
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo3_0-turbo3_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo3_0-q8_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-q8_0-turbo3_0.cu
++        ../ggml-cuda/template-instances/fattn-vec-instance-f16-turbo3_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo2_0-turbo2_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo2_0-q8_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-q8_0-turbo2_0.cu
++        ../ggml-cuda/template-instances/fattn-vec-instance-f16-turbo2_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo3_0-turbo2_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo2_0-turbo3_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo4_0-turbo4_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo4_0-q8_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-q8_0-turbo4_0.cu
++        ../ggml-cuda/template-instances/fattn-vec-instance-f16-turbo4_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo4_0-turbo3_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo3_0-turbo4_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo4_0-turbo2_0.cu
--- a/backend/cpp/turboquant/patches/0001-server-respect-the-ignore-eos-flag.patch
+++ b/backend/cpp/turboquant/patches/0001-server-respect-the-ignore-eos-flag.patch
@@ -1,83 +0,0 @@
-From 660600081fb7b9b769ded5c805a2d39a419f0a0d Mon Sep 17 00:00:00 2001
-From: Yuri Khrustalev <ykhrustalev@users.noreply.github.com>
-Date: Wed, 8 Apr 2026 11:12:15 -0400
-Subject: [PATCH] server: respect the ignore eos flag (#21203)
-
----
- tools/server/server-context.cpp | 3 +++
- tools/server/server-context.h   | 3 +++
- tools/server/server-task.cpp    | 3 ++-
- tools/server/server-task.h      | 1 +
- 4 files changed, 9 insertions(+), 1 deletion(-)
-
-diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
-index 9d3ac538..b31981c5 100644
---- a/tools/server/server-context.cpp
-+++ b/tools/server/server-context.cpp
-@@ -3033,6 +3033,8 @@ server_context_meta server_context::get_meta() const {
-         /* fim_rep_token          */ llama_vocab_fim_rep(impl->vocab),
-         /* fim_sep_token          */ llama_vocab_fim_sep(impl->vocab),
- 
-+        /* logit_bias_eog         */ impl->params_base.sampling.logit_bias_eog,
-+
-         /* model_vocab_type       */ llama_vocab_type(impl->vocab),
-         /* model_vocab_n_tokens   */ llama_vocab_n_tokens(impl->vocab),
-         /* model_n_ctx_train      */ llama_model_n_ctx_train(impl->model),
-@@ -3117,6 +3119,7 @@ std::unique_ptr<server_res_generator> server_routes::handle_completions_impl(
-                     ctx_server.vocab,
-                     params,
-                     meta->slot_n_ctx,
-+                    meta->logit_bias_eog,
-                     data);
-             task.id_slot = json_value(data, "id_slot", -1);
- 
-diff --git a/tools/server/server-context.h b/tools/server/server-context.h
-index d7ce8735..6ea9afc0 100644
---- a/tools/server/server-context.h
-+++ b/tools/server/server-context.h
-@@ -39,6 +39,9 @@ struct server_context_meta {
-     llama_token fim_rep_token;
-     llama_token fim_sep_token;
- 
-+    // sampling
-+    std::vector<llama_logit_bias> logit_bias_eog;
-+
-     // model meta
-     enum llama_vocab_type model_vocab_type;
-     int32_t model_vocab_n_tokens;
-diff --git a/tools/server/server-task.cpp b/tools/server/server-task.cpp
-index 4cc87bc5..856b3f0e 100644
---- a/tools/server/server-task.cpp
-+++ b/tools/server/server-task.cpp
-@@ -239,6 +239,7 @@ task_params server_task::params_from_json_cmpl(
-         const llama_vocab * vocab,
-         const common_params & params_base,
-         const int n_ctx_slot,
-+        const std::vector<llama_logit_bias> & logit_bias_eog,
-         const json & data) {
-     task_params params;
- 
-@@ -562,7 +563,7 @@ task_params server_task::params_from_json_cmpl(
-         if (params.sampling.ignore_eos) {
-             params.sampling.logit_bias.insert(
-                     params.sampling.logit_bias.end(),
--                    defaults.sampling.logit_bias_eog.begin(), defaults.sampling.logit_bias_eog.end());
-+                    logit_bias_eog.begin(), logit_bias_eog.end());
-         }
-     }
- 
-diff --git a/tools/server/server-task.h b/tools/server/server-task.h
-index d855bf08..243e47a8 100644
--- a/tools/server/server-task.h
-+++ b/tools/server/server-task.h
-@@ -209,6 +209,7 @@ struct server_task {
-         const llama_vocab * vocab,
-         const common_params & params_base,
-         const int n_ctx_slot,
-+        const std::vector<llama_logit_bias> & logit_bias_eog,
-         const json & data);
- 
-     // utility function
--- 
-2.43.0
-
--- a/backend/go/stablediffusion-ggml/Makefile
+++ b/backend/go/stablediffusion-ggml/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # stablediffusion.cpp (ggml)
 STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
-STABLEDIFFUSION_GGML_VERSION?=7d33d4b2ddeafa672761a5880ec33bdff452504d
+STABLEDIFFUSION_GGML_VERSION?=44cca3d626d301e2215d5e243277e8f0e65bfa78

 CMAKE_ARGS+=-DGGML_MAX_NAME=128

--- a/backend/go/stablediffusion-ggml/gosd.cpp
+++ b/backend/go/stablediffusion-ggml/gosd.cpp
@@ -1106,6 +1106,11 @@ static int ffmpeg_mux_raw_to_mp4(sd_image_t* frames, int num_frames, int fps, co
            const_cast<char*>("-c:v"), const_cast<char*>("libx264"),
            const_cast<char*>("-pix_fmt"), const_cast<char*>("yuv420p"),
            const_cast<char*>("-movflags"), const_cast<char*>("+faststart"),
+            // Force MP4 container. Distributed LocalAI hands us a staging
+            // path (e.g. /staging/localai-output-NNN.tmp) with a non-standard
+            // extension; relying on filename suffix makes ffmpeg bail with
+            // "Unable to choose an output format".
+            const_cast<char*>("-f"), const_cast<char*>("mp4"),
            const_cast<char*>(dst),
            nullptr
        };
--- a/backend/go/whisper/Makefile
+++ b/backend/go/whisper/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # whisper.cpp version
 WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp
-WHISPER_CPP_VERSION?=166c20b473d5f4d04052e699f992f625ea2a2fdd
+WHISPER_CPP_VERSION?=fc674574ca27cac59a15e5b22a09b9d9ad62aafe
 SO_TARGET?=libgowhisper.so

 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
--- a/backend/index.yaml
+++ b/backend/index.yaml
@@ -587,7 +587,6 @@
  alias: "whisperx"
  capabilities:
    nvidia: "cuda12-whisperx"
-    amd: "rocm-whisperx"
    metal: "metal-whisperx"
    default: "cpu-whisperx"
    nvidia-cuda-13: "cuda13-whisperx"
@@ -1008,6 +1007,20 @@
    nvidia-cuda-12: "cuda12-turboquant-development"
    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-turboquant-development"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-turboquant-development"
+- !!merge <<: *stablediffusionggml
+  name: "stablediffusion-ggml-development"
+  capabilities:
+    default: "cpu-stablediffusion-ggml-development"
+    nvidia: "cuda12-stablediffusion-ggml-development"
+    intel: "intel-sycl-f16-stablediffusion-ggml-development"
+    # amd: "rocm-stablediffusion-ggml-development"
+    vulkan: "vulkan-stablediffusion-ggml-development"
+    nvidia-l4t: "nvidia-l4t-arm64-stablediffusion-ggml-development"
+    metal: "metal-stablediffusion-ggml-development"
+    nvidia-cuda-13: "cuda13-stablediffusion-ggml-development"
+    nvidia-cuda-12: "cuda12-stablediffusion-ggml-development"
+    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-stablediffusion-ggml-development"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-stablediffusion-ggml-development"
 - !!merge <<: *neutts
  name: "cpu-neutts"
  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-neutts"
@@ -2731,7 +2744,6 @@
  name: "whisperx-development"
  capabilities:
    nvidia: "cuda12-whisperx-development"
-    amd: "rocm-whisperx-development"
    metal: "metal-whisperx-development"
    default: "cpu-whisperx-development"
    nvidia-cuda-13: "cuda13-whisperx-development"
@@ -2757,16 +2769,6 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-whisperx"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-whisperx
- !!merge <<: *whisperx
-  name: "rocm-whisperx"
-  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-whisperx"
-  mirrors:
-    - localai/localai-backends:latest-gpu-rocm-hipblas-whisperx
- !!merge <<: *whisperx
-  name: "rocm-whisperx-development"
-  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-whisperx"
-  mirrors:
-    - localai/localai-backends:master-gpu-rocm-hipblas-whisperx
 - !!merge <<: *whisperx
  name: "cuda13-whisperx"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-whisperx"
--- a/backend/python/whisperx/requirements-hipblas.txt
+++ b/backend/python/whisperx/requirements-hipblas.txt
@@ -1,6 +0,0 @@
-# whisperx hard-pins torch~=2.8.0, which is not available in the rocm7.x indexes
-# (they start at torch 2.10). Keep rocm6.4 wheels here — they still load against
-# the rocm7.2.1 runtime via AMD's forward-compatibility window.
--extra-index-url https://download.pytorch.org/whl/rocm6.4
-torch==2.8.0+rocm6.4
-whisperx @ git+https://github.com/m-bain/whisperX.git
--- a/backend/rust/kokoros/src/service.rs
+++ b/backend/rust/kokoros/src/service.rs
@@ -341,6 +341,16 @@ impl Backend for KokorosService {
        Err(Status::unimplemented("Not supported"))
    }

+    type AudioTranscriptionStreamStream =
+        ReceiverStream<Result<backend::TranscriptStreamResponse, Status>>;
+
+    async fn audio_transcription_stream(
+        &self,
+        _: Request<backend::TranscriptRequest>,
+    ) -> Result<Response<Self::AudioTranscriptionStreamStream>, Status> {
+        Err(Status::unimplemented("Not supported"))
+    }
+
    async fn sound_generation(
        &self,
        _: Request<backend::SoundGenerationRequest>,
--- a/core/application/distributed.go
+++ b/core/application/distributed.go
@@ -242,14 +242,20 @@ func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB) (*Distribut
 		DB:            authDB,
 	})

-	// Create ReplicaReconciler for auto-scaling model replicas
+	// Create ReplicaReconciler for auto-scaling model replicas. Adapter +
+	// RegistrationToken feed the state-reconciliation passes: pending op
+	// drain uses the adapter, and model health probes use the token to auth
+	// against workers' gRPC HealthCheck.
 	reconciler := nodes.NewReplicaReconciler(nodes.ReplicaReconcilerOptions{
-		Registry:       registry,
-		Scheduler:      router,
-		Unloader:       remoteUnloader,
-		DB:             authDB,
-		Interval:       30 * time.Second,
-		ScaleDownDelay: 5 * time.Minute,
+		Registry:          registry,
+		Scheduler:         router,
+		Unloader:          remoteUnloader,
+		Adapter:           remoteUnloader,
+		RegistrationToken: cfg.Distributed.RegistrationToken,
+		DB:                authDB,
+		Interval:          30 * time.Second,
+		ScaleDownDelay:    5 * time.Minute,
+		ProbeStaleAfter:   2 * time.Minute,
 	})

 	// Create ModelRouterAdapter to wire into ModelLoader
--- a/core/application/startup.go
+++ b/core/application/startup.go
@@ -235,7 +235,12 @@ func New(opts ...config.AppOption) (*Application, error) {
 	// In distributed mode, uses PostgreSQL advisory lock so only one frontend
 	// instance runs periodic checks (avoids duplicate upgrades across replicas).
 	if len(options.BackendGalleries) > 0 {
-		uc := NewUpgradeChecker(options, application.ModelLoader(), application.distributedDB())
+		// Pass a lazy getter for the backend manager so the checker always
+		// uses the active one — DistributedBackendManager is swapped in above
+		// and asks workers for their installed backends, which is what
+		// upgrade detection needs in distributed mode.
+		bmFn := func() galleryop.BackendManager { return application.GalleryService().BackendManager() }
+		uc := NewUpgradeChecker(options, application.ModelLoader(), application.distributedDB(), bmFn)
 		application.upgradeChecker = uc
 		go uc.Run(options.Context)
 	}
--- a/core/application/upgrade_checker.go
+++ b/core/application/upgrade_checker.go
@@ -8,6 +8,7 @@ import (
 	"github.com/mudler/LocalAI/core/config"
 	"github.com/mudler/LocalAI/core/gallery"
 	"github.com/mudler/LocalAI/core/services/advisorylock"
+	"github.com/mudler/LocalAI/core/services/galleryop"
 	"github.com/mudler/LocalAI/pkg/model"
 	"github.com/mudler/LocalAI/pkg/system"
 	"github.com/mudler/xlog"
@@ -26,6 +27,12 @@ type UpgradeChecker struct {
 	galleries   []config.Gallery
 	systemState *system.SystemState
 	db          *gorm.DB // non-nil in distributed mode
+	// backendManagerFn lazily returns the current backend manager (may be
+	// swapped from Local to Distributed after startup). It is called on every
+	// check so the UpgradeChecker uses whichever manager is active. In
+	// distributed mode this makes the check ask workers instead of the (empty)
+	// frontend filesystem, which is why upgrades previously never surfaced.
+	backendManagerFn func() galleryop.BackendManager

 	checkInterval time.Duration
 	stop          chan struct{}
@@ -40,18 +47,22 @@ type UpgradeChecker struct {
 // NewUpgradeChecker creates a new UpgradeChecker service.
 // Pass db=nil for standalone mode, or a *gorm.DB for distributed mode
 // (uses advisory locks so only one instance runs periodic checks).
-func NewUpgradeChecker(appConfig *config.ApplicationConfig, ml *model.ModelLoader, db *gorm.DB) *UpgradeChecker {
+// backendManagerFn is optional; when set, CheckUpgrades is routed through
+// the active backend manager — required in distributed mode so the check
+// aggregates from workers rather than the empty frontend filesystem.
+func NewUpgradeChecker(appConfig *config.ApplicationConfig, ml *model.ModelLoader, db *gorm.DB, backendManagerFn func() galleryop.BackendManager) *UpgradeChecker {
 	return &UpgradeChecker{
-		appConfig:     appConfig,
-		modelLoader:   ml,
-		galleries:     appConfig.BackendGalleries,
-		systemState:   appConfig.SystemState,
-		db:            db,
-		checkInterval: 6 * time.Hour,
-		stop:          make(chan struct{}),
-		done:          make(chan struct{}),
-		triggerCh:     make(chan struct{}, 1),
-		lastUpgrades:  make(map[string]gallery.UpgradeInfo),
+		appConfig:        appConfig,
+		modelLoader:      ml,
+		galleries:        appConfig.BackendGalleries,
+		systemState:      appConfig.SystemState,
+		db:               db,
+		backendManagerFn: backendManagerFn,
+		checkInterval:    6 * time.Hour,
+		stop:             make(chan struct{}),
+		done:             make(chan struct{}),
+		triggerCh:        make(chan struct{}, 1),
+		lastUpgrades:     make(map[string]gallery.UpgradeInfo),
 	}
 }

@@ -64,13 +75,16 @@ func NewUpgradeChecker(appConfig *config.ApplicationConfig, ml *model.ModelLoade
 func (uc *UpgradeChecker) Run(ctx context.Context) {
 	defer close(uc.done)

-	// Initial delay: don't slow down startup
+	// Initial delay: don't slow down startup. Short enough that operators
+	// don't stare at an empty upgrade banner for long; long enough that
+	// workers have registered and reported their installed backends.
+	initialDelay := 10 * time.Second
 	select {
 	case <-ctx.Done():
 		return
 	case <-uc.stop:
 		return
-	case <-time.After(30 * time.Second):
+	case <-time.After(initialDelay):
 	}

 	// First check always runs locally (to warm the cache on this instance)
@@ -144,7 +158,18 @@ func (uc *UpgradeChecker) GetAvailableUpgrades() map[string]gallery.UpgradeInfo
 }

 func (uc *UpgradeChecker) runCheck(ctx context.Context) {
-	upgrades, err := gallery.CheckBackendUpgrades(ctx, uc.galleries, uc.systemState)
+	var (
+		upgrades map[string]gallery.UpgradeInfo
+		err      error
+	)
+	if uc.backendManagerFn != nil {
+		if bm := uc.backendManagerFn(); bm != nil {
+			upgrades, err = bm.CheckUpgrades(ctx)
+		}
+	}
+	if upgrades == nil && err == nil {
+		upgrades, err = gallery.CheckBackendUpgrades(ctx, uc.galleries, uc.systemState)
+	}

 	uc.mu.Lock()
 	uc.lastCheckTime = time.Now()
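Review note: the lazy getter plus fallback in runCheck can be sketched in isolation as below. This is a minimal sketch, not the project's code: `upgradeSource`, `workerSource`, `check`, `localCheck`, and `demo` are hypothetical stand-ins, and the map value type is simplified to a string.

```go
package main

import (
	"context"
	"fmt"
)

// upgradeSource is a hypothetical, trimmed-down stand-in for the
// backend manager interface used by the upgrade checker.
type upgradeSource interface {
	CheckUpgrades(ctx context.Context) (map[string]string, error)
}

// workerSource pretends to aggregate upgrade info from workers.
type workerSource struct{}

func (workerSource) CheckUpgrades(context.Context) (map[string]string, error) {
	return map[string]string{"llama-cpp": "v2"}, nil
}

// localCheck stands in for the single-node filesystem check.
func localCheck(context.Context) (map[string]string, error) {
	return map[string]string{"local-only": "v1"}, nil
}

// check prefers the manager returned by getter. It falls back to the local
// check only when the manager path produced neither a result nor an error.
func check(ctx context.Context, getter func() upgradeSource) (map[string]string, error) {
	var (
		upgrades map[string]string
		err      error
	)
	if getter != nil {
		if src := getter(); src != nil {
			upgrades, err = src.CheckUpgrades(ctx)
		}
	}
	if upgrades == nil && err == nil {
		upgrades, err = localCheck(ctx)
	}
	return upgrades, err
}

// demo runs check with or without a manager getter.
func demo(useManager bool) map[string]string {
	var getter func() upgradeSource
	if useManager {
		getter = func() upgradeSource { return workerSource{} }
	}
	upgrades, _ := check(context.Background(), getter)
	return upgrades
}

func main() {
	fmt.Println(demo(true), demo(false))
}
```

One consequence of the `upgrades == nil && err == nil` condition: a manager that returns a nil map with a nil error also triggers the local fallback, while a non-nil empty map does not.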
--- a/core/backend/llm.go
+++ b/core/backend/llm.go
@@ -40,6 +40,12 @@ type TokenUsage struct {
 	ChatDeltas             []*proto.ChatDelta // per-chunk deltas from C++ autoparser (only set during streaming)
 }

+func needsThinkingProbe(c *config.ModelConfig) bool {
+	return c.TemplateConfig.UseTokenizerTemplate &&
+		(c.ReasoningConfig.DisableReasoning == nil ||
+			c.ReasoningConfig.DisableReasoningTagPrefill == nil)
+}
+
 // HasChatDeltaContent returns true if any chat delta carries content or reasoning text.
 // Used to decide whether to prefer C++ autoparser deltas over Go-side tag extraction.
 func (t TokenUsage) HasChatDeltaContent() bool {
@@ -100,11 +106,9 @@ func ModelInference(ctx context.Context, s string, messages schema.Messages, ima
 	// tokenizer template path is active) and the multimodal media marker (needed
 	// by custom chat templates so markers line up with what mtmd expects).
 	// We probe whenever any of those slots is still empty.
-	needsThinkingProbe := c.TemplateConfig.UseTokenizerTemplate &&
-		c.ReasoningConfig.DisableReasoning == nil &&
-		c.ReasoningConfig.DisableReasoningTagPrefill == nil
+	shouldProbeThinking := needsThinkingProbe(c)
 	needsMarkerProbe := c.MediaMarker == ""
-	if needsThinkingProbe || needsMarkerProbe {
+	if shouldProbeThinking || needsMarkerProbe {
 		modelOpts := grpcModelOpts(*c, o.SystemState.Model.ModelsPath)
 		config.DetectThinkingSupportFromBackend(ctx, c, inferenceModel, modelOpts)
 		// Update the config in the loader so it persists for future requests
--- a/core/backend/llm_probe_test.go
+++ b/core/backend/llm_probe_test.go
@@ -0,0 +1,29 @@
+package backend
+
+import (
+	"github.com/mudler/LocalAI/core/config"
+
+	"github.com/gpustack/gguf-parser-go/util/ptr"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+var _ = Describe("thinking probe gating", func() {
+	It("probes tokenizer-template models when any reasoning default is still unset", func() {
+		cfg := &config.ModelConfig{
+			TemplateConfig: config.TemplateConfig{UseTokenizerTemplate: true},
+		}
+		Expect(needsThinkingProbe(cfg)).To(BeTrue())
+
+		cfg.ReasoningConfig.DisableReasoning = ptr.To(true)
+		Expect(needsThinkingProbe(cfg)).To(BeTrue())
+
+		cfg.ReasoningConfig.DisableReasoningTagPrefill = ptr.To(true)
+		Expect(needsThinkingProbe(cfg)).To(BeFalse())
+	})
+
+	It("does not probe when tokenizer templates are disabled", func() {
+		cfg := &config.ModelConfig{}
+		Expect(needsThinkingProbe(cfg)).To(BeFalse())
+	})
+})
--- a/core/cli/run.go
+++ b/core/cli/run.go
@@ -507,7 +507,7 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {

 	app, err := application.New(opts...)
 	if err != nil {
-		return fmt.Errorf("failed basic startup tasks with error %s", err.Error())
+		return fmt.Errorf("LocalAI failed to start: %w.\nTroubleshooting steps:\n  1. Check that your models directory exists and is accessible: %s\n  2. Verify model config files are valid YAML: 'local-ai util usecase-heuristic <config>'\n  3. Check available disk space and file permissions\n  4. Run with --log-level=debug for more details\nSee https://localai.io/basics/troubleshooting/ for more help", err, r.ModelsPath)
 	}

 	appHTTP, err := http.API(app)
--- a/core/cli/transcript.go
+++ b/core/cli/transcript.go
@@ -3,7 +3,6 @@ package cli
 import (
 	"context"
 	"encoding/json"
-	"errors"
 	"fmt"
 	"strings"

@@ -60,7 +59,7 @@ func (t *TranscriptCMD) Run(ctx *cliContext.Context) error {

 	c, exists := cl.GetModelConfig(t.Model)
 	if !exists {
-		return errors.New("model not found")
+		return fmt.Errorf("model %q not found. Run 'local-ai models list' to see available models, or install one with 'local-ai models install <model>'. See https://localai.io/models/ for more information", t.Model)
 	}

 	c.Threads = &t.Threads
--- a/core/cli/util.go
+++ b/core/cli/util.go
@@ -74,7 +74,7 @@ func (u *CreateOCIImageCMD) Run(ctx *cliContext.Context) error {

 func (u *GGUFInfoCMD) Run(ctx *cliContext.Context) error {
 	if len(u.Args) == 0 {
-		return fmt.Errorf("no GGUF file provided")
+		return fmt.Errorf("no GGUF file provided. Usage: local-ai util gguf-info <path-to-file.gguf>\nGGUF is a binary format for storing quantized language models. You can download GGUF models from https://huggingface.co or install one with 'local-ai models install <model>'")
 	}
 	// We try to guess only if we don't have a template defined already
 	f, err := gguf.ParseGGUFFile(u.Args[0])
--- a/core/cli/worker.go
+++ b/core/cli/worker.go
@@ -21,6 +21,7 @@ import (
 	"github.com/mudler/LocalAI/core/cli/workerregistry"
 	"github.com/mudler/LocalAI/core/config"
 	"github.com/mudler/LocalAI/core/gallery"
+	"github.com/mudler/LocalAI/core/services/galleryop"
 	"github.com/mudler/LocalAI/core/services/messaging"
 	"github.com/mudler/LocalAI/core/services/nodes"
 	"github.com/mudler/LocalAI/core/services/storage"
@@ -597,12 +598,20 @@ func (s *backendSupervisor) installBackend(req messaging.BackendInstallRequest)
 	// Try to find the backend binary
 	backendPath := s.findBackend(req.Backend)
 	if backendPath == "" {
-		// Backend not found locally — try auto-installing from gallery
-		xlog.Info("Backend not found locally, attempting gallery install", "backend", req.Backend)
-		if err := gallery.InstallBackendFromGallery(
-			context.Background(), galleries, s.systemState, s.ml, req.Backend, nil, false,
-		); err != nil {
-			return "", fmt.Errorf("installing backend from gallery: %w", err)
+		if req.URI != "" {
+			xlog.Info("Backend not found locally, attempting external install", "backend", req.Backend, "uri", req.URI)
+			if err := galleryop.InstallExternalBackend(
+				context.Background(), galleries, s.systemState, s.ml, nil, req.URI, req.Name, req.Alias,
+			); err != nil {
+				return "", fmt.Errorf("installing external backend from %q: %w", req.URI, err)
+			}
+		} else {
+			xlog.Info("Backend not found locally, attempting gallery install", "backend", req.Backend)
+			if err := gallery.InstallBackendFromGallery(
+				context.Background(), galleries, s.systemState, s.ml, req.Backend, nil, false,
+			); err != nil {
+				return "", fmt.Errorf("installing backend from gallery: %w", err)
+			}
 		}
 		// Re-register after install and retry
 		gallery.RegisterBackends(s.systemState, s.ml)
@@ -738,6 +747,9 @@ func (s *backendSupervisor) subscribeLifecycleEvents() {
 			if b.Metadata != nil {
 				info.InstalledAt = b.Metadata.InstalledAt
 				info.GalleryURL = b.Metadata.GalleryURL
+				info.Version = b.Metadata.Version
+				info.URI = b.Metadata.URI
+				info.Digest = b.Metadata.Digest
 			}
 			infos = append(infos, info)
 		}
--- a/core/cli/worker/worker_p2p.go
+++ b/core/cli/worker/worker_p2p.go
@@ -38,7 +38,7 @@ func (r *P2P) Run(ctx *cliContext.Context) error {
 	// Check if the token is set
 	// as we always need it.
 	if r.Token == "" {
-		return fmt.Errorf("Token is required")
+		return fmt.Errorf("a P2P token is required to join the network. Set it via the LOCALAI_TOKEN environment variable or the --token flag. You can generate a token by running 'local-ai run --p2p' on the main node. See https://localai.io/features/distribute/ for more information")
 	}

 	port, err := freeport.GetFreePort()
--- a/core/config/gguf.go
+++ b/core/config/gguf.go
@@ -125,19 +125,7 @@ func DetectThinkingSupportFromBackend(ctx context.Context, cfg *ModelConfig, bac
 			return
 		}

-		cfg.ReasoningConfig.DisableReasoning = ptr.To(!metadata.SupportsThinking)
-
-		// Use the rendered template to detect if thinking token is at the end
-		// This reuses the existing DetectThinkingStartToken function
-		if metadata.RenderedTemplate != "" {
-			thinkingStartToken := reasoning.DetectThinkingStartToken(metadata.RenderedTemplate, &cfg.ReasoningConfig)
-			thinkingForcedOpen := thinkingStartToken != ""
-			cfg.ReasoningConfig.DisableReasoningTagPrefill = ptr.To(!thinkingForcedOpen)
-			xlog.Debug("[gguf] DetectThinkingSupportFromBackend: thinking support detected", "supports_thinking", metadata.SupportsThinking, "thinking_forced_open", thinkingForcedOpen, "thinking_start_token", thinkingStartToken)
-		} else {
-			cfg.ReasoningConfig.DisableReasoningTagPrefill = ptr.To(true)
-			xlog.Debug("[gguf] DetectThinkingSupportFromBackend: thinking support detected", "supports_thinking", metadata.SupportsThinking, "thinking_forced_open", false)
-		}
+		applyDetectedThinkingConfig(cfg, metadata)

 		// Extract tool format markers from autoparser analysis
 		if tf := metadata.GetToolFormat(); tf != nil && tf.FormatType != "" {
@@ -180,3 +168,34 @@ func DetectThinkingSupportFromBackend(ctx context.Context, cfg *ModelConfig, bac
 		}
 	}
 }
+
+func applyDetectedThinkingConfig(cfg *ModelConfig, metadata *pb.ModelMetadataResponse) {
+	if cfg == nil || metadata == nil {
+		return
+	}
+
+	// Respect explicit YAML/user config. Backend probing should only fill defaults
+	// when the reasoning mode has not already been set.
+	if cfg.ReasoningConfig.DisableReasoning == nil {
+		cfg.ReasoningConfig.DisableReasoning = ptr.To(!metadata.SupportsThinking)
+	}
+
+	// Respect explicit prefill config for the same reason. Only infer the
+	// default prefill behavior when the user did not set it.
+	if cfg.ReasoningConfig.DisableReasoningTagPrefill == nil {
+		// Use the rendered template to detect if thinking token is at the end.
+		// This reuses the existing DetectThinkingStartToken function.
+		if metadata.RenderedTemplate != "" {
+			thinkingStartToken := reasoning.DetectThinkingStartToken(metadata.RenderedTemplate, &cfg.ReasoningConfig)
+			thinkingForcedOpen := thinkingStartToken != ""
+			cfg.ReasoningConfig.DisableReasoningTagPrefill = ptr.To(!thinkingForcedOpen)
+			xlog.Debug("[gguf] DetectThinkingSupportFromBackend: thinking support detected", "supports_thinking", metadata.SupportsThinking, "thinking_forced_open", thinkingForcedOpen, "thinking_start_token", thinkingStartToken)
+		} else {
+			cfg.ReasoningConfig.DisableReasoningTagPrefill = ptr.To(true)
+			xlog.Debug("[gguf] DetectThinkingSupportFromBackend: thinking support detected", "supports_thinking", metadata.SupportsThinking, "thinking_forced_open", false)
+		}
+		return
+	}
+
+	xlog.Debug("[gguf] DetectThinkingSupportFromBackend: preserving explicit reasoning config", "supports_thinking", metadata.SupportsThinking, "disable_reasoning", *cfg.ReasoningConfig.DisableReasoning, "disable_reasoning_tag_prefill", *cfg.ReasoningConfig.DisableReasoningTagPrefill)
+}
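Review note: the rule applyDetectedThinkingConfig follows (fill a setting only while its pointer is still nil) can be shown with a small standalone sketch. The types and helper names here (`reasoningConfig`, `boolPtr`, `applyDefaults`) are simplified, hypothetical stand-ins, and the two boolean arguments replace the values the backend probe would report.

```go
package main

import "fmt"

// reasoningConfig is a simplified stand-in for the real reasoning config:
// a nil pointer means "not set by the user".
type reasoningConfig struct {
	DisableReasoning           *bool
	DisableReasoningTagPrefill *bool
}

func boolPtr(b bool) *bool { return &b }

// applyDefaults fills only the fields that are still nil, so explicit
// user settings always win over probed values.
func applyDefaults(cfg *reasoningConfig, supportsThinking, forcedOpen bool) {
	if cfg.DisableReasoning == nil {
		cfg.DisableReasoning = boolPtr(!supportsThinking)
	}
	if cfg.DisableReasoningTagPrefill == nil {
		cfg.DisableReasoningTagPrefill = boolPtr(!forcedOpen)
	}
}

func main() {
	// The user explicitly disabled reasoning but left prefill unset.
	cfg := &reasoningConfig{DisableReasoning: boolPtr(true)}
	applyDefaults(cfg, true, true)
	fmt.Println(*cfg.DisableReasoning, *cfg.DisableReasoningTagPrefill) // prints "true false"
}
```

This mirrors the "preserves explicit disable while still inferring missing prefill" case in the accompanying test file: the explicit value survives and only the missing one is inferred.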
--- a/core/config/gguf_reasoning_test.go
+++ b/core/config/gguf_reasoning_test.go
@@ -0,0 +1,101 @@
+package config
+
+import (
+	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
+	"github.com/mudler/LocalAI/pkg/reasoning"
+
+	"github.com/gpustack/gguf-parser-go/util/ptr"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+var _ = Describe("GGUF backend metadata reasoning defaults", func() {
+	It("fills reasoning defaults when unset", func() {
+		cfg := &ModelConfig{
+			TemplateConfig: TemplateConfig{UseTokenizerTemplate: true},
+		}
+
+		applyDetectedThinkingConfig(cfg, &pb.ModelMetadataResponse{
+			SupportsThinking: true,
+			RenderedTemplate: "{{ bos_token }}<think>",
+		})
+
+		Expect(cfg.ReasoningConfig.DisableReasoning).ToNot(BeNil())
+		Expect(*cfg.ReasoningConfig.DisableReasoning).To(BeFalse())
+		Expect(cfg.ReasoningConfig.DisableReasoningTagPrefill).ToNot(BeNil())
+		Expect(*cfg.ReasoningConfig.DisableReasoningTagPrefill).To(BeFalse())
+	})
+
+	It("preserves fully explicit reasoning settings", func() {
+		cfg := &ModelConfig{
+			TemplateConfig: TemplateConfig{UseTokenizerTemplate: true},
+			ReasoningConfig: reasoning.Config{
+				DisableReasoning:           ptr.To(true),
+				DisableReasoningTagPrefill: ptr.To(true),
+			},
+		}
+
+		applyDetectedThinkingConfig(cfg, &pb.ModelMetadataResponse{
+			SupportsThinking: true,
+			RenderedTemplate: "{{ bos_token }}<think>",
+		})
+
+		Expect(cfg.ReasoningConfig.DisableReasoning).ToNot(BeNil())
+		Expect(*cfg.ReasoningConfig.DisableReasoning).To(BeTrue())
+		Expect(cfg.ReasoningConfig.DisableReasoningTagPrefill).ToNot(BeNil())
+		Expect(*cfg.ReasoningConfig.DisableReasoningTagPrefill).To(BeTrue())
+	})
+
+	It("preserves explicit disable while still inferring missing prefill", func() {
+		cfg := &ModelConfig{
+			TemplateConfig: TemplateConfig{UseTokenizerTemplate: true},
+			ReasoningConfig: reasoning.Config{
+				DisableReasoning: ptr.To(true),
+			},
+		}
+
+		applyDetectedThinkingConfig(cfg, &pb.ModelMetadataResponse{
+			SupportsThinking: true,
+			RenderedTemplate: "{{ bos_token }}<think>",
+		})
+
+		Expect(cfg.ReasoningConfig.DisableReasoning).ToNot(BeNil())
+		Expect(*cfg.ReasoningConfig.DisableReasoning).To(BeTrue())
+		Expect(cfg.ReasoningConfig.DisableReasoningTagPrefill).ToNot(BeNil())
+		Expect(*cfg.ReasoningConfig.DisableReasoningTagPrefill).To(BeFalse())
+	})
+
+	It("preserves explicit prefill while still inferring missing disable flag", func() {
+		cfg := &ModelConfig{
+			TemplateConfig: TemplateConfig{UseTokenizerTemplate: true},
+			ReasoningConfig: reasoning.Config{
+				DisableReasoningTagPrefill: ptr.To(true),
+			},
+		}
+
+		applyDetectedThinkingConfig(cfg, &pb.ModelMetadataResponse{
+			SupportsThinking: true,
+			RenderedTemplate: "{{ bos_token }}<think>",
+		})
+
+		Expect(cfg.ReasoningConfig.DisableReasoning).ToNot(BeNil())
+		Expect(*cfg.ReasoningConfig.DisableReasoning).To(BeFalse())
+		Expect(cfg.ReasoningConfig.DisableReasoningTagPrefill).ToNot(BeNil())
+		Expect(*cfg.ReasoningConfig.DisableReasoningTagPrefill).To(BeTrue())
+	})
+
+	It("defaults to disabling reasoning when backend does not support thinking", func() {
+		cfg := &ModelConfig{
+			TemplateConfig: TemplateConfig{UseTokenizerTemplate: true},
+		}
+
+		applyDetectedThinkingConfig(cfg, &pb.ModelMetadataResponse{
+			SupportsThinking: false,
+		})
+
+		Expect(cfg.ReasoningConfig.DisableReasoning).ToNot(BeNil())
+		Expect(*cfg.ReasoningConfig.DisableReasoning).To(BeTrue())
+		Expect(cfg.ReasoningConfig.DisableReasoningTagPrefill).ToNot(BeNil())
+		Expect(*cfg.ReasoningConfig.DisableReasoningTagPrefill).To(BeTrue())
+	})
+})
--- a/core/config/model_config_loader.go
+++ b/core/config/model_config_loader.go
@@ -193,9 +193,9 @@ func (bcl *ModelConfigLoader) ReadModelConfig(file string, opts ...ConfigLoaderO
 		bcl.configs[c.Name] = *c
 	} else {
 		if err != nil {
-			return fmt.Errorf("config is not valid: %w", err)
+			return fmt.Errorf("model config %q is not valid: %w. Ensure the YAML file has a valid 'name' field and correct syntax. See https://localai.io/docs/getting-started/customize-model/ for config reference", file, err)
 		}
-		return fmt.Errorf("config is not valid")
+		return fmt.Errorf("model config %q is not valid. Ensure the YAML file has a valid 'name' field and correct syntax. See https://localai.io/docs/getting-started/customize-model/ for config reference", file)
 	}

 	return nil
@@ -373,9 +373,9 @@ func (bcl *ModelConfigLoader) LoadModelConfigsFromPath(path string, opts ...Conf
 		files = append(files, info)
 	}
 	for _, file := range files {
-		// Skip templates, YAML and .keep files
-		if !strings.Contains(file.Name(), ".yaml") && !strings.Contains(file.Name(), ".yml") ||
-			strings.HasPrefix(file.Name(), ".") {
+		// Only load real YAML config files and ignore dotfiles or backup variants
+		ext := strings.ToLower(filepath.Ext(file.Name()))
+		if (ext != ".yaml" && ext != ".yml") || strings.HasPrefix(file.Name(), ".") {
 			continue
 		}

--- a/core/config/model_test.go
+++ b/core/config/model_test.go
@@ -2,6 +2,7 @@ package config

 import (
 	"os"
+	"path/filepath"

 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
@@ -109,5 +110,50 @@ options:
 			Expect(testModel.Options).To(ContainElements("foo", "bar", "baz"))

 		})
+
+		It("Only loads files ending with yaml or yml", func() {
+			tmpdir, err := os.MkdirTemp("", "model-config-loader")
+			Expect(err).ToNot(HaveOccurred())
+			defer os.RemoveAll(tmpdir)
+
+			err = os.WriteFile(filepath.Join(tmpdir, "foo.yaml"), []byte(
+				`name: "foo-model"
+description: "formal config"
+backend: "llama-cpp"
+parameters:
+  model: "foo.gguf"
+`), 0644)
+			Expect(err).ToNot(HaveOccurred())
+
+			err = os.WriteFile(filepath.Join(tmpdir, "foo.yaml.bak"), []byte(
+				`name: "foo-model"
+description: "backup config"
+backend: "llama-cpp"
+parameters:
+  model: "foo-backup.gguf"
+`), 0644)
+			Expect(err).ToNot(HaveOccurred())
+
+			err = os.WriteFile(filepath.Join(tmpdir, "foo.yaml.bak.123"), []byte(
+				`name: "foo-backup-only"
+description: "timestamped backup config"
+backend: "llama-cpp"
+parameters:
+  model: "foo-timestamped.gguf"
+`), 0644)
+			Expect(err).ToNot(HaveOccurred())
+
+			bcl := NewModelConfigLoader(tmpdir)
+			err = bcl.LoadModelConfigsFromPath(tmpdir)
+			Expect(err).ToNot(HaveOccurred())
+
+			configs := bcl.GetAllModelsConfigs()
+			Expect(configs).To(HaveLen(1))
+			Expect(configs[0].Name).To(Equal("foo-model"))
+			Expect(configs[0].Description).To(Equal("formal config"))
+
+			_, exists := bcl.GetModelConfig("foo-backup-only")
+			Expect(exists).To(BeFalse())
+		})
 	})
 })
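Review note: the difference between the removed substring check and the new extension check in LoadModelConfigsFromPath is easiest to see side by side. The sketch below restates both conditions as standalone predicates; `oldFilter` and `newFilter` are illustrative names, not functions in the codebase.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// oldFilter mirrors the removed check: a file is loaded when its name
// contains ".yaml" or ".yml" anywhere and it is not a dotfile.
func oldFilter(name string) bool {
	skip := !strings.Contains(name, ".yaml") && !strings.Contains(name, ".yml") ||
		strings.HasPrefix(name, ".")
	return !skip
}

// newFilter mirrors the added check: only the final extension counts,
// compared case-insensitively, and dotfiles are still ignored.
func newFilter(name string) bool {
	ext := strings.ToLower(filepath.Ext(name))
	return (ext == ".yaml" || ext == ".yml") && !strings.HasPrefix(name, ".")
}

func main() {
	names := []string{"foo.yaml", "foo.yaml.bak", "foo.yaml.bak.123", "FOO.YML", ".hidden.yaml"}
	for _, name := range names {
		fmt.Printf("%-18s old=%-5v new=%v\n", name, oldFilter(name), newFilter(name))
	}
}
```

Besides rejecting backup variants such as `foo.yaml.bak`, the new check also accepts upper-case extensions, which the case-sensitive substring match did not.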
--- a/core/gallery/backends.go
+++ b/core/gallery/backends.go
@@ -110,7 +110,13 @@ func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery,
 		if err != nil {
 			return err
 		}
-		if backends.Exists(name) {
+		// Only short-circuit if the install is *actually usable*. An orphaned
+		// meta entry whose concrete was removed still shows up in
+		// ListSystemBackends with a RunFile pointing at a path that no longer
+		// exists; returning early there leaves the caller with a broken
+		// alias and the worker fails with "backend not found after install
+		// attempt" on every retry. Re-install in that case.
+		if existing, ok := backends.Get(name); ok && isBackendRunnable(existing) {
 			return nil
 		}
 	}
@@ -375,17 +381,44 @@ func DeleteBackendFromSystem(systemState *system.SystemState, name string) error
 	}

 	if metadata != nil && metadata.MetaBackendFor != "" {
-		metaBackendDirectory := filepath.Join(systemState.Backend.BackendsPath, metadata.MetaBackendFor)
-		xlog.Debug("Deleting meta backend", "backendDirectory", metaBackendDirectory)
-		if _, err := os.Stat(metaBackendDirectory); os.IsNotExist(err) {
-			return fmt.Errorf("meta backend %q not found", metadata.MetaBackendFor)
+		concreteDirectory := filepath.Join(systemState.Backend.BackendsPath, metadata.MetaBackendFor)
+		xlog.Debug("Deleting concrete backend referenced by meta", "concreteDirectory", concreteDirectory)
+		// If the concrete the meta points to is already gone (earlier delete,
+		// partial install, or manual cleanup), keep going and remove the
+		// orphaned meta dir. Previously we returned an error here, which made
+		// the orphaned meta impossible to uninstall from the UI — the delete
+		// kept failing and every subsequent install short-circuited because
+		// the stale meta metadata made ListSystemBackends.Exists(name) true.
+		if _, statErr := os.Stat(concreteDirectory); statErr == nil {
+			os.RemoveAll(concreteDirectory)
+		} else if os.IsNotExist(statErr) {
+			xlog.Warn("Concrete backend referenced by meta not found — removing orphaned meta only",
+				"meta", name, "concrete", metadata.MetaBackendFor)
+		} else {
+			return statErr
 		}
-		os.RemoveAll(metaBackendDirectory)
 	}

 	return os.RemoveAll(backendDirectory)
 }

+// isBackendRunnable reports whether the given backend entry can actually be
+// invoked: its RunFile must be set and must point at an existing
+// non-directory path. For a meta backend the RunFile resolves to the
+// concrete's run.sh, so a meta whose concrete was removed is not runnable.
+// Used to guard the "already installed" short-circuit so an
+// orphaned meta pointing at a missing concrete triggers a real reinstall
+// rather than being silently skipped.
+func isBackendRunnable(b SystemBackend) bool {
+	if b.RunFile == "" {
+		return false
+	}
+	if fi, err := os.Stat(b.RunFile); err != nil || fi.IsDir() {
+		return false
+	}
+	return true
+}
+
 type SystemBackend struct {
 	Name             string
 	RunFile          string
@@ -394,6 +427,23 @@ type SystemBackend struct {
 	Metadata         *BackendMetadata
 	UpgradeAvailable bool   `json:"upgrade_available,omitempty"`
 	AvailableVersion string `json:"available_version,omitempty"`
+	// Nodes holds per-node attribution in distributed mode. Empty in single-node.
+	// Each entry describes a node that has this backend installed, with the
+	// version/digest it reports. Lets the UI surface drift and per-node status.
+	Nodes []NodeBackendRef `json:"nodes,omitempty"`
+}
+
+// NodeBackendRef describes one node's view of an installed backend. Used both
+// for per-node attribution in the UI and for drift detection during upgrade
+// checks (a cluster with mismatched versions/digests is flagged upgradeable).
+type NodeBackendRef struct {
+	NodeID      string `json:"node_id"`
+	NodeName    string `json:"node_name"`
+	NodeStatus  string `json:"node_status"` // healthy | unhealthy | offline | draining | pending
+	Version     string `json:"version,omitempty"`
+	Digest      string `json:"digest,omitempty"`
+	URI         string `json:"uri,omitempty"`
+	InstalledAt string `json:"installed_at,omitempty"`
 }

 type SystemBackends map[string]SystemBackend
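Review note: the runnable check added above boils down to "the run file path is set, exists, and is not a directory". A minimal standalone sketch of that check follows; `runnable` is a hypothetical helper that takes the path directly instead of a SystemBackend.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// runnable reports whether runFile names an existing non-directory path.
func runnable(runFile string) bool {
	if runFile == "" {
		return false
	}
	fi, err := os.Stat(runFile)
	return err == nil && !fi.IsDir()
}

func main() {
	dir, err := os.MkdirTemp("", "runnable-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// A run file that exists on disk.
	present := filepath.Join(dir, "run.sh")
	if err := os.WriteFile(present, []byte("#!/bin/sh\n"), 0o755); err != nil {
		panic(err)
	}
	// A run file whose parent directory was removed (the orphaned-meta case).
	missing := filepath.Join(dir, "gone", "run.sh")

	fmt.Println(runnable(present), runnable(missing), runnable(dir), runnable(""))
}
```

The check does not verify that the file is executable; it only distinguishes a present run file from a missing path or a directory, which is enough to detect the orphaned-meta case described in the comment.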
--- a/core/gallery/backends_test.go
+++ b/core/gallery/backends_test.go
@@ -952,6 +952,58 @@ var _ = Describe("Gallery Backends", func() {
 			err = DeleteBackendFromSystem(systemState, "non-existent")
 			Expect(err).To(HaveOccurred())
 		})
+
+		It("removes an orphaned meta backend whose concrete is missing", func() {
+			// Real scenario from the dev cluster: the concrete got wiped
+			// (partial install, manual cleanup, previous crash) but the meta
+			// directory + metadata.json still points at it. The old code
+			// errored with "meta backend X not found" and left the orphan in
+			// place, making the backend impossible to uninstall.
+			metaName := "meta-backend"
+			concreteName := "concrete-backend-that-vanished"
+			metaPath := filepath.Join(tempDir, metaName)
+			Expect(os.MkdirAll(metaPath, 0750)).To(Succeed())
+
+			meta := BackendMetadata{Name: metaName, MetaBackendFor: concreteName}
+			data, err := json.MarshalIndent(meta, "", "  ")
+			Expect(err).NotTo(HaveOccurred())
+			Expect(os.WriteFile(filepath.Join(metaPath, "metadata.json"), data, 0644)).To(Succeed())
+
+			// Concrete directory intentionally absent.
+			systemState, err := system.GetSystemState(system.WithBackendPath(tempDir))
+			Expect(err).NotTo(HaveOccurred())
+
+			Expect(DeleteBackendFromSystem(systemState, metaName)).To(Succeed())
+			Expect(metaPath).NotTo(BeADirectory())
+		})
+	})
+
+	Describe("InstallBackendFromGallery — orphaned meta reinstall", func() {
+		It("reports an orphaned meta as not runnable so install is not skipped", func() {
+			// Seed state: meta dir exists with metadata pointing at a
+			// concrete that was removed from disk. ListSystemBackends still
+			// surfaces the meta via its metadata.Name → the old short-circuit
+			// at `if backends.Exists(name) { return nil }` returned silently,
+			// leaving the worker's findBackend() with a dead alias forever.
+			// The fix: require the backend to be runnable before we skip.
+			metaName := "meta-orphan"
+			concreteName := "concrete-gone"
+			metaPath := filepath.Join(tempDir, metaName)
+			Expect(os.MkdirAll(metaPath, 0750)).To(Succeed())
+			meta := BackendMetadata{Name: metaName, MetaBackendFor: concreteName}
+			data, err := json.MarshalIndent(meta, "", "  ")
+			Expect(err).NotTo(HaveOccurred())
+			Expect(os.WriteFile(filepath.Join(metaPath, "metadata.json"), data, 0644)).To(Succeed())
+
+			systemState, err := system.GetSystemState(system.WithBackendPath(tempDir))
+			Expect(err).NotTo(HaveOccurred())
+
+			listed, err := ListSystemBackends(systemState)
+			Expect(err).NotTo(HaveOccurred())
+			b, ok := listed.Get(metaName)
+			Expect(ok).To(BeTrue())
+			Expect(isBackendRunnable(b)).To(BeFalse()) // concrete run.sh absent
+		})
 	})

 	Describe("ListSystemBackends", func() {
--- a/core/gallery/upgrade.go
+++ b/core/gallery/upgrade.go
@@ -23,22 +23,45 @@ type UpgradeInfo struct {
 	AvailableVersion string `json:"available_version"`
 	InstalledDigest  string `json:"installed_digest,omitempty"`
 	AvailableDigest  string `json:"available_digest,omitempty"`
+	// NodeDrift lists nodes whose installed version or digest differs from
+	// the cluster majority. Non-empty means the cluster has diverged and an
+	// upgrade will realign it. Empty in single-node mode.
+	NodeDrift []NodeDriftInfo `json:"node_drift,omitempty"`
 }

-// CheckBackendUpgrades compares installed backends against gallery entries
-// and returns a map of backend names to UpgradeInfo for those that have
-// newer versions or different OCI digests available.
+// NodeDriftInfo describes one node that disagrees with the cluster majority
+// on which version/digest of a backend is installed.
+type NodeDriftInfo struct {
+	NodeID   string `json:"node_id"`
+	NodeName string `json:"node_name"`
+	Version  string `json:"version,omitempty"`
+	Digest   string `json:"digest,omitempty"`
+}
+
+// CheckBackendUpgrades is the single-node entrypoint. Distributed callers use
+// CheckUpgradesAgainst directly with their aggregated SystemBackends.
 func CheckBackendUpgrades(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState) (map[string]UpgradeInfo, error) {
+	installed, err := ListSystemBackends(systemState)
+	if err != nil {
+		return nil, fmt.Errorf("failed to list installed backends: %w", err)
+	}
+	return CheckUpgradesAgainst(ctx, galleries, systemState, installed)
+}
+
+// CheckUpgradesAgainst compares a caller-supplied SystemBackends set against
+// the gallery. Fixes the distributed-mode bug where the old code passed the
+// frontend's (empty) local filesystem through ListSystemBackends and so never
+// surfaced any upgrades.
+//
+// Cluster drift policy: if a backend's per-node versions/digests disagree, the
+// row is flagged upgradeable regardless of whether any node matches the gallery
+// — next Upgrade All realigns the cluster. NodeDrift lists the outliers.
+func CheckUpgradesAgainst(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, installedBackends SystemBackends) (map[string]UpgradeInfo, error) {
 	galleryBackends, err := AvailableBackends(galleries, systemState)
 	if err != nil {
 		return nil, fmt.Errorf("failed to list available backends: %w", err)
 	}

-	installedBackends, err := ListSystemBackends(systemState)
-	if err != nil {
-		return nil, fmt.Errorf("failed to list installed backends: %w", err)
-	}
-
 	result := make(map[string]UpgradeInfo)

 	for _, installed := range installedBackends {
@@ -57,34 +80,48 @@ func CheckBackendUpgrades(ctx context.Context, galleries []config.Gallery, syste
 		}

 		installedVersion := installed.Metadata.Version
+		installedDigest := installed.Metadata.Digest
 		galleryVersion := galleryEntry.Version

-		// If both sides have versions, compare them
+		// Detect cluster drift: does every node report the same version+digest?
+		// In single-node mode Nodes is nil, so majority is zero-valued and drift is nil.
+		majority, drift := summarizeNodeDrift(installed.Nodes)
+		if majority.version != "" {
+			installedVersion = majority.version
+		}
+		if majority.digest != "" {
+			installedDigest = majority.digest
+		}
+
+		makeInfo := func(info UpgradeInfo) UpgradeInfo {
+			info.NodeDrift = drift
+			return info
+		}
+
+		// If versions are available on both sides, they're the source of truth.
 		if galleryVersion != "" && installedVersion != "" {
-			if galleryVersion != installedVersion {
-				result[installed.Metadata.Name] = UpgradeInfo{
+			if galleryVersion != installedVersion || len(drift) > 0 {
+				result[installed.Metadata.Name] = makeInfo(UpgradeInfo{
 					BackendName:      installed.Metadata.Name,
 					InstalledVersion: installedVersion,
 					AvailableVersion: galleryVersion,
-				}
+				})
 			}
-			// Versions match — no upgrade needed
 			continue
 		}

-		// Gallery has a version but installed doesn't — this happens for backends
-		// installed before version tracking was added. Flag as upgradeable so
-		// users can re-install to pick up version metadata.
+		// Gallery has a version but installed doesn't — backends installed before
+		// version tracking was added. Flag as upgradeable to pick up metadata.
 		if galleryVersion != "" && installedVersion == "" {
-			result[installed.Metadata.Name] = UpgradeInfo{
+			result[installed.Metadata.Name] = makeInfo(UpgradeInfo{
 				BackendName:      installed.Metadata.Name,
 				InstalledVersion: "",
 				AvailableVersion: galleryVersion,
-			}
+			})
 			continue
 		}

-		// Fall back to OCI digest comparison when versions are unavailable
+		// Fall back to OCI digest comparison when versions are unavailable.
 		if downloader.URI(galleryEntry.URI).LooksLikeOCI() {
 			remoteDigest, err := oci.GetImageDigest(galleryEntry.URI, "", nil, nil)
 			if err != nil {
@@ -92,21 +129,68 @@ func CheckBackendUpgrades(ctx context.Context, galleries []config.Gallery, syste
 				continue
 			}
 			// If we have a stored digest, compare; otherwise any remote digest
-			// means we can't confirm we're up to date — flag as upgradeable
-			if installed.Metadata.Digest == "" || remoteDigest != installed.Metadata.Digest {
-				result[installed.Metadata.Name] = UpgradeInfo{
+			// means we can't confirm we're up to date, so flag as upgradeable. Node drift flags it too.
+			if installedDigest == "" || remoteDigest != installedDigest || len(drift) > 0 {
+				result[installed.Metadata.Name] = makeInfo(UpgradeInfo{
 					BackendName:     installed.Metadata.Name,
-					InstalledDigest: installed.Metadata.Digest,
+					InstalledDigest: installedDigest,
 					AvailableDigest: remoteDigest,
-				}
+				})
 			}
+		} else if len(drift) > 0 {
+			// No version/digest path but nodes disagree — still worth flagging.
+			result[installed.Metadata.Name] = makeInfo(UpgradeInfo{
+				BackendName:      installed.Metadata.Name,
+				InstalledVersion: installedVersion,
+				InstalledDigest:  installedDigest,
+			})
 		}
-		// No version info and non-OCI URI — cannot determine, skip
 	}

 	return result, nil
 }

+// summarizeNodeDrift collapses per-node version/digest tuples to a majority
+// pair and returns the outliers. In single-node mode (empty nodes slice) this
+// returns zero values and a nil drift list.
+func summarizeNodeDrift(nodes []NodeBackendRef) (majority struct{ version, digest string }, drift []NodeDriftInfo) {
+	if len(nodes) == 0 {
+		return majority, nil
+	}
+
+	type key struct{ version, digest string }
+	counts := map[key]int{}
+	var topKey key
+	var topCount int
+	for _, n := range nodes {
+		k := key{n.Version, n.Digest}
+		counts[k]++
+		if counts[k] > topCount {
+			topCount = counts[k]
+			topKey = k
+		}
+	}
+
+	majority.version = topKey.version
+	majority.digest = topKey.digest
+
+	if len(counts) == 1 {
+		return majority, nil // unanimous — no drift
+	}
+	for _, n := range nodes {
+		if n.Version == majority.version && n.Digest == majority.digest {
+			continue
+		}
+		drift = append(drift, NodeDriftInfo{
+			NodeID:   n.NodeID,
+			NodeName: n.NodeName,
+			Version:  n.Version,
+			Digest:   n.Digest,
+		})
+	}
+	return majority, drift
+}
+
 // UpgradeBackend upgrades a single backend to the latest gallery version using
 // an atomic swap with backup-based rollback on failure.
 func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, galleries []config.Gallery, backendName string, downloadStatus func(string, string, string, float64)) error {
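The majority/outlier logic in summarizeNodeDrift above can be sketched in isolation. This is a minimal, self-contained illustration rather than the project's code: `nodeRef`, `pair`, and `summarize` are hypothetical stand-ins for NodeBackendRef and the helper in the diff.

```go
package main

import "fmt"

// nodeRef is a hypothetical stand-in for the diff's NodeBackendRef.
type nodeRef struct {
	Name, Version, Digest string
}

// pair is the (version, digest) tuple the nodes effectively vote on.
type pair struct{ Version, Digest string }

// summarize returns the most common (version, digest) pair and the nodes
// that deviate from it. The comparison is strict, so on a tie the pair
// that reached the top count first (in slice order) wins.
func summarize(nodes []nodeRef) (pair, []nodeRef) {
	var top pair
	if len(nodes) == 0 {
		return top, nil // single-node mode: nothing to compare
	}
	counts := map[pair]int{}
	topCount := 0
	for _, n := range nodes {
		k := pair{n.Version, n.Digest}
		counts[k]++
		if counts[k] > topCount {
			topCount, top = counts[k], k
		}
	}
	if len(counts) == 1 {
		return top, nil // unanimous: no drift
	}
	var drift []nodeRef
	for _, n := range nodes {
		if (pair{n.Version, n.Digest}) != top {
			drift = append(drift, n)
		}
	}
	return top, drift
}

func main() {
	majority, drift := summarize([]nodeRef{
		{Name: "worker-1", Version: "2.0.0"},
		{Name: "worker-2", Version: "2.0.0"},
		{Name: "worker-3", Version: "1.0.0"},
	})
	fmt.Println(majority.Version, len(drift), drift[0].Name)
}
```

One consequence worth noting: with an even split between two versions, the reported majority depends on node order, so callers that need a stable answer would have to sort the nodes first.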
--- a/core/gallery/upgrade_test.go
+++ b/core/gallery/upgrade_test.go
@@ -144,6 +144,97 @@ var _ = Describe("Upgrade Detection and Execution", func() {
 		})
 	})

+	// CheckUpgradesAgainst is the entry point used by DistributedBackendManager.
+	// It takes installed backends directly — typically aggregated from workers —
+	// instead of reading the frontend filesystem. These tests exercise drift
+	// detection, which is the feature the distributed path relies on.
+	Describe("CheckUpgradesAgainst (distributed)", func() {
+		It("flags upgrade when cluster nodes disagree on version, even if gallery matches majority", func() {
+			writeGalleryYAML([]GalleryBackend{
+				{
+					Metadata: Metadata{Name: "my-backend"},
+					URI:      filepath.Join(tempDir, "some-source"),
+					Version:  "2.0.0",
+				},
+			})
+
+			installed := SystemBackends{
+				"my-backend": SystemBackend{
+					Name:     "my-backend",
+					Metadata: &BackendMetadata{Name: "my-backend", Version: "2.0.0"},
+					Nodes: []NodeBackendRef{
+						{NodeID: "a", NodeName: "worker-1", Version: "2.0.0"},
+						{NodeID: "b", NodeName: "worker-2", Version: "2.0.0"},
+						{NodeID: "c", NodeName: "worker-3", Version: "1.0.0"}, // drift
+					},
+				},
+			}
+
+			upgrades, err := CheckUpgradesAgainst(context.Background(), galleries, systemState, installed)
+			Expect(err).NotTo(HaveOccurred())
+			Expect(upgrades).To(HaveKey("my-backend"))
+			info := upgrades["my-backend"]
+			Expect(info.AvailableVersion).To(Equal("2.0.0"))
+			Expect(info.NodeDrift).To(HaveLen(1))
+			Expect(info.NodeDrift[0].NodeName).To(Equal("worker-3"))
+			Expect(info.NodeDrift[0].Version).To(Equal("1.0.0"))
+		})
+
+		It("does not flag upgrade when all nodes agree and match gallery", func() {
+			writeGalleryYAML([]GalleryBackend{
+				{
+					Metadata: Metadata{Name: "my-backend"},
+					URI:      filepath.Join(tempDir, "some-source"),
+					Version:  "2.0.0",
+				},
+			})
+
+			installed := SystemBackends{
+				"my-backend": SystemBackend{
+					Name:     "my-backend",
+					Metadata: &BackendMetadata{Name: "my-backend", Version: "2.0.0"},
+					Nodes: []NodeBackendRef{
+						{NodeID: "a", NodeName: "worker-1", Version: "2.0.0"},
+						{NodeID: "b", NodeName: "worker-2", Version: "2.0.0"},
+					},
+				},
+			}
+
+			upgrades, err := CheckUpgradesAgainst(context.Background(), galleries, systemState, installed)
+			Expect(err).NotTo(HaveOccurred())
+			Expect(upgrades).To(BeEmpty())
+		})
+
+		It("surfaces empty-installed-version path the old distributed code silently missed", func() {
+			// Simulates the real-world bug: worker has a backend, its version
+			// is empty (pre-tracking or OCI-pinned-to-latest), gallery has a
+			// version. Pre-fix CheckUpgrades returned nothing; now the upgrade is surfaced.
+			writeGalleryYAML([]GalleryBackend{
+				{
+					Metadata: Metadata{Name: "my-backend"},
+					URI:      filepath.Join(tempDir, "some-source"),
+					Version:  "2.0.0",
+				},
+			})
+
+			installed := SystemBackends{
+				"my-backend": SystemBackend{
+					Name:     "my-backend",
+					Metadata: &BackendMetadata{Name: "my-backend"},
+					Nodes: []NodeBackendRef{
+						{NodeID: "a", NodeName: "worker-1"},
+					},
+				},
+			}
+
+			upgrades, err := CheckUpgradesAgainst(context.Background(), galleries, systemState, installed)
+			Expect(err).NotTo(HaveOccurred())
+			Expect(upgrades).To(HaveKey("my-backend"))
+			Expect(upgrades["my-backend"].InstalledVersion).To(BeEmpty())
+			Expect(upgrades["my-backend"].AvailableVersion).To(Equal("2.0.0"))
+		})
+	})
+
 	Describe("UpgradeBackend", func() {
 		It("should replace backend directory and update metadata", func() {
 			// Install v1
--- a/core/http/endpoints/localai/backend_monitor.go
+++ b/core/http/endpoints/localai/backend_monitor.go
@@ -9,19 +9,26 @@ import (
 // BackendMonitorEndpoint returns the status of the specified backend
 // @Summary Backend monitor endpoint
 // @Tags monitoring
-// @Param request body schema.BackendMonitorRequest true "Backend statistics request"
+// @Param model query string true "Name of the model to monitor"
 // @Success 200 {object} proto.StatusResponse "Response"
 // @Router /backend/monitor [get]
 func BackendMonitorEndpoint(bm *monitoring.BackendMonitorService) echo.HandlerFunc {
 	return func(c echo.Context) error {
-
-		input := new(schema.BackendMonitorRequest)
-		// Get input data from the request body
-		if err := c.Bind(input); err != nil {
-			return err
+		model := c.QueryParam("model")
+		// Fall back to binding the request body so pre-existing clients that
+		// sent `{"model": "..."}` with GET keep working.
+		if model == "" {
+			input := new(schema.BackendMonitorRequest)
+			if err := c.Bind(input); err != nil {
+				return err
+			}
+			model = input.Model
+		}
+		if model == "" {
+			return echo.NewHTTPError(400, "model query parameter is required")
 		}

-		resp, err := bm.CheckAndSample(input.Model)
+		resp, err := bm.CheckAndSample(model)
 		if err != nil {
 			return err
 		}
--- a/core/http/endpoints/localai/nodes.go
+++ b/core/http/endpoints/localai/nodes.go
@@ -376,7 +376,7 @@ func InstallBackendOnNodeEndpoint(unloader nodes.NodeCommandSender) echo.Handler
 		if err := c.Bind(&req); err != nil || req.Backend == "" {
 			return c.JSON(http.StatusBadRequest, nodeError(http.StatusBadRequest, "backend name required"))
 		}
-		reply, err := unloader.InstallBackend(nodeID, req.Backend, "", req.BackendGalleries)
+		reply, err := unloader.InstallBackend(nodeID, req.Backend, "", req.BackendGalleries, "", "", "")
 		if err != nil {
 			xlog.Error("Failed to install backend on node", "node", nodeID, "backend", req.Backend, "error", err)
 			return c.JSON(http.StatusInternalServerError, nodeError(http.StatusInternalServerError, "failed to install backend on node"))
--- a/core/http/endpoints/localai/settings.go
+++ b/core/http/endpoints/localai/settings.go
@@ -110,6 +110,27 @@ func UpdateSettingsEndpoint(app *application.Application) echo.HandlerFunc {
 			})
 		}

+		// The UI reads ApiKeys from GET /api/settings, which already returns the
+		// merged env+runtime list. When the user clicks Save, the same merged
+		// list comes back in the POST body. Strip the env-supplied keys from
+		// the incoming list before we persist or re-merge, otherwise each save
+		// duplicates the env keys on top of the previous merge (#9071).
+		if settings.ApiKeys != nil {
+			envKeys := startupConfig.ApiKeys
+			envSet := make(map[string]struct{}, len(envKeys))
+			for _, k := range envKeys {
+				envSet[k] = struct{}{}
+			}
+			runtimeOnly := make([]string, 0, len(*settings.ApiKeys))
+			for _, k := range *settings.ApiKeys {
+				if _, fromEnv := envSet[k]; fromEnv {
+					continue
+				}
+				runtimeOnly = append(runtimeOnly, k)
+			}
+			settings.ApiKeys = &runtimeOnly
+		}
+
 		settingsFile := filepath.Join(appConfig.DynamicConfigsDir, "runtime_settings.json")
 		settingsJSON, err := json.MarshalIndent(settings, "", "  ")
 		if err != nil {
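The env-key stripping added to UpdateSettingsEndpoint is an order-preserving set difference. A minimal sketch, assuming plain string slices in place of the settings structs (`stripEnvKeys` is a hypothetical helper name):

```go
package main

import "fmt"

// stripEnvKeys returns the entries of incoming that are not present in
// envKeys, preserving their original order.
func stripEnvKeys(incoming, envKeys []string) []string {
	envSet := make(map[string]struct{}, len(envKeys))
	for _, k := range envKeys {
		envSet[k] = struct{}{}
	}
	runtimeOnly := make([]string, 0, len(incoming))
	for _, k := range incoming {
		if _, fromEnv := envSet[k]; fromEnv {
			continue // supplied by the environment, do not persist
		}
		runtimeOnly = append(runtimeOnly, k)
	}
	return runtimeOnly
}

func main() {
	merged := []string{"env-key", "runtime-key", "env-key"}
	fmt.Println(stripEnvKeys(merged, []string{"env-key"}))
}
```

With this approach, a key the user adds at runtime that happens to equal an env-supplied key is dropped from the persisted list as well; it remains effective only through the environment.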
--- a/core/http/endpoints/openai/chat.go
+++ b/core/http/endpoints/openai/chat.go
@@ -147,6 +147,7 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 		result := ""
 		lastEmittedCount := 0
 		sentInitialRole := false
+		sentReasoning := false
 		hasChatDeltaToolCalls := false
 		hasChatDeltaContent := false

@@ -190,6 +191,7 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 					}},
 					Object: "chat.completion.chunk",
 				}
+				sentReasoning = true
 			}

 			// Stream content deltas (cleaned of reasoning tags) while no tool calls
@@ -363,7 +365,12 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 			functionResults = functions.ParseFunctionCall(cleanedResult, config.FunctionsConfig)
 		}
 		xlog.Debug("[ChatDeltas] final tool call decision", "tool_calls", len(functionResults), "text_content", *textContentToReturn)
-		noActionToRun := len(functionResults) > 0 && functionResults[0].Name == noAction || len(functionResults) == 0
+		// noAction is a sentinel "just answer" pseudo-function — not a real
+		// tool call. Scan the whole slice rather than only index 0 so we
+		// don't drop a real tool call that happens to follow a noAction
+		// entry, and so the default branch isn't entered with only noAction
+		// entries to emit as tool_calls.
+		noActionToRun := !hasRealCall(functionResults, noAction)

 		switch {
 		case noActionToRun:
@@ -377,108 +384,31 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 				usage.TimingPromptProcessing = tokenUsage.TimingPromptProcessing
 			}

-			if sentInitialRole {
-				// Content was already streamed during the callback — just emit usage.
-				delta := &schema.Message{}
-				if reasoning != "" && extractor.Reasoning() == "" {
-					delta.Reasoning = &reasoning
-				}
-				responses <- schema.OpenAIResponse{
-					ID: id, Created: created, Model: req.Model,
-					Choices: []schema.Choice{{Delta: delta, Index: 0}},
-					Object:  "chat.completion.chunk",
-					Usage:   usage,
-				}
-			} else {
-				// Content was NOT streamed — send everything at once (fallback).
-				responses <- schema.OpenAIResponse{
-					ID: id, Created: created, Model: req.Model,
-					Choices: []schema.Choice{{Delta: &schema.Message{Role: "assistant"}, Index: 0}},
-					Object:  "chat.completion.chunk",
-				}
-
-				result, err := handleQuestion(config, functionResults, extractor.CleanedContent(), prompt)
-				if err != nil {
-					xlog.Error("error handling question", "error", err)
-					return err
-				}
-
-				delta := &schema.Message{Content: &result}
-				if reasoning != "" {
-					delta.Reasoning = &reasoning
-				}
-				responses <- schema.OpenAIResponse{
-					ID: id, Created: created, Model: req.Model,
-					Choices: []schema.Choice{{Delta: delta, Index: 0}},
-					Object:  "chat.completion.chunk",
-					Usage:   usage,
+			var result string
+			if !sentInitialRole {
+				var hqErr error
+				result, hqErr = handleQuestion(config, functionResults, extractor.CleanedContent(), prompt)
+				if hqErr != nil {
+					xlog.Error("error handling question", "error", hqErr)
+					return hqErr
 				}
 			}
+			for _, chunk := range buildNoActionFinalChunks(
+				id, req.Model, created,
+				sentInitialRole, sentReasoning,
+				result, reasoning, usage,
+			) {
+				responses <- chunk
+			}

 		default:
-			for i, ss := range functionResults {
-				name, args := ss.Name, ss.Arguments
-				toolCallID := ss.ID
-				if toolCallID == "" {
-					toolCallID = id
-				}
-
-				if i < lastEmittedCount {
-					// Already emitted during streaming by the incremental
-					// JSON/XML parser — skip to avoid duplicate tool calls.
-					continue
-				}
-
-				// Tool call not yet emitted — send name + args (two chunks).
-				initialMessage := schema.OpenAIResponse{
-					ID:      id,
-					Created: created,
-					Model:   req.Model,
-					Choices: []schema.Choice{{
-						Delta: &schema.Message{
-							Role: "assistant",
-							ToolCalls: []schema.ToolCall{
-								{
-									Index: i,
-									ID:    toolCallID,
-									Type:  "function",
-									FunctionCall: schema.FunctionCall{
-										Name: name,
-									},
-								},
-							},
-						},
-						Index:        0,
-						FinishReason: nil,
-					}},
-					Object: "chat.completion.chunk",
-				}
-				responses <- initialMessage
-
-				responses <- schema.OpenAIResponse{
-					ID:      id,
-					Created: created,
-					Model:   req.Model,
-					Choices: []schema.Choice{{
-						Delta: &schema.Message{
-							Role:    "assistant",
-							Content: textContentToReturn,
-							ToolCalls: []schema.ToolCall{
-								{
-									Index: i,
-									ID:    toolCallID,
-									Type:  "function",
-									FunctionCall: schema.FunctionCall{
-										Arguments: args,
-									},
-								},
-							},
-						},
-						Index:        0,
-						FinishReason: nil,
-					}},
-					Object: "chat.completion.chunk",
-				}
+			for _, chunk := range buildDeferredToolCallChunks(
+				id, req.Model, created,
+				functionResults, lastEmittedCount,
+				sentInitialRole, *textContentToReturn,
+				sentReasoning, reasoning,
+			) {
+				responses <- chunk
 			}
 		}

--- a/core/http/endpoints/openai/chat_emit.go
+++ b/core/http/endpoints/openai/chat_emit.go
@@ -0,0 +1,233 @@
+package openai
+
+import (
+	"fmt"
+
+	"github.com/mudler/LocalAI/core/schema"
+	"github.com/mudler/LocalAI/pkg/functions"
+)
+
+// hasRealCall reports whether functionResults contains at least one
+// entry whose Name is something other than the noAction sentinel.
+// Used by processTools to decide between the "answer the question"
+// path and the real tool-call flush.
+func hasRealCall(functionResults []functions.FuncCallResults, noAction string) bool {
+	for _, fc := range functionResults {
+		if fc.Name != noAction {
+			return true
+		}
+	}
+	return false
+}
+
+// buildNoActionFinalChunks produces the closing SSE chunks for the
+// noActionToRun branch of processTools (i.e. the model chose the "answer"
+// pseudo-function or emitted no tool calls at all).
+//
+// When content was already streamed (contentAlreadyStreamed=true) the
+// helper emits a single trailing usage chunk, optionally carrying
+// reasoning that was produced but not streamed incrementally. When
+// content was not streamed it emits a role chunk followed by a
+// content+reasoning+usage chunk — the "send everything at once" fallback.
+//
+// Reasoning re-emission is guarded by reasoningAlreadyStreamed, not by
+// probing the extractor's Go-side state: the C++ autoparser delivers
+// reasoning through ProcessChatDeltaReasoning which populates a
+// separate accumulator that extractor.Reasoning() does not expose.
+// Without this guard the callback would stream reasoning incrementally
+// and the final chunk would duplicate it.
+func buildNoActionFinalChunks(
+	id, model string,
+	created int,
+	contentAlreadyStreamed bool,
+	reasoningAlreadyStreamed bool,
+	content string,
+	reasoning string,
+	usage schema.OpenAIUsage,
+) []schema.OpenAIResponse {
+	var out []schema.OpenAIResponse
+
+	if contentAlreadyStreamed {
+		delta := &schema.Message{}
+		if reasoning != "" && !reasoningAlreadyStreamed {
+			r := reasoning
+			delta.Reasoning = &r
+		}
+		out = append(out, schema.OpenAIResponse{
+			ID: id, Created: created, Model: model,
+			Choices: []schema.Choice{{Delta: delta, Index: 0}},
+			Object:  "chat.completion.chunk",
+			Usage:   usage,
+		})
+		return out
+	}
+
+	// Content was not streamed — send role, then content (+reasoning) + usage.
+	out = append(out, schema.OpenAIResponse{
+		ID: id, Created: created, Model: model,
+		Choices: []schema.Choice{{
+			Delta: &schema.Message{Role: "assistant"},
+			Index: 0,
+		}},
+		Object: "chat.completion.chunk",
+	})
+
+	c := content
+	delta := &schema.Message{Content: &c}
+	if reasoning != "" && !reasoningAlreadyStreamed {
+		r := reasoning
+		delta.Reasoning = &r
+	}
+	out = append(out, schema.OpenAIResponse{
+		ID: id, Created: created, Model: model,
+		Choices: []schema.Choice{{Delta: delta, Index: 0}},
+		Object:  "chat.completion.chunk",
+		Usage:   usage,
+	})
+	return out
+}
+
+// buildDeferredToolCallChunks produces the SSE chunks for tool calls that
+// were discovered only during final parsing (i.e. after the streaming
+// callback finished). The caller forwards every returned chunk to the
+// responses channel.
+//
+// Guarantees:
+//   - tool calls with i < lastEmittedCount are skipped (already streamed)
+//   - each emitted call yields two chunks: name-only, then args-only
+//   - no chunk ever carries both non-empty Content and non-empty ToolCalls
+//   - no chunk ever carries both non-empty Reasoning and non-empty ToolCalls
+//   - if !reasoningAlreadyStreamed && reasoningContent != "",
+//     a reasoning chunk is emitted first
+//   - if !contentAlreadyStreamed && textContent != "",
+//     a role chunk followed by a content chunk is emitted (after reasoning)
+//   - chunk order: [reasoning?] [role+content?] (name, args)+
+//   - fallback IDs for empty ss.ID are unique per index so a client can
+//     match tool_result messages back to the right call
+func buildDeferredToolCallChunks(
+	id, model string,
+	created int,
+	functionResults []functions.FuncCallResults,
+	lastEmittedCount int,
+	contentAlreadyStreamed bool,
+	textContent string,
+	reasoningAlreadyStreamed bool,
+	reasoningContent string,
+) []schema.OpenAIResponse {
+	// If every call was already emitted incrementally there's nothing to
+	// flush — and no reason to emit a standalone reasoning/content chunk.
+	hasDeferred := false
+	for i := range functionResults {
+		if i >= lastEmittedCount {
+			hasDeferred = true
+			break
+		}
+	}
+	if !hasDeferred {
+		return nil
+	}
+
+	var out []schema.OpenAIResponse
+
+	// Reasoning first — the callback path at processTools emits reasoning
+	// incrementally in its own chunks, but when the C++ autoparser only
+	// surfaces reasoning as a final aggregate the callback never sees it.
+	// Recover it here (no duplication: contentAlreadyStreamed and
+	// reasoningAlreadyStreamed track what the callback already sent).
+	if !reasoningAlreadyStreamed && reasoningContent != "" {
+		r := reasoningContent
+		out = append(out, schema.OpenAIResponse{
+			ID: id, Created: created, Model: model,
+			Choices: []schema.Choice{{
+				Delta: &schema.Message{Reasoning: &r},
+				Index: 0,
+			}},
+			Object: "chat.completion.chunk",
+		})
+	}
+
+	// Then content, when it wasn't streamed via the callback. Emit role
+	// and content in separate deltas — the OpenAI streaming contract
+	// forbids bundling content alongside tool_calls in one delta.
+	if !contentAlreadyStreamed && textContent != "" {
+		out = append(out, schema.OpenAIResponse{
+			ID: id, Created: created, Model: model,
+			Choices: []schema.Choice{{
+				Delta: &schema.Message{Role: "assistant"},
+				Index: 0,
+			}},
+			Object: "chat.completion.chunk",
+		})
+		c := textContent
+		out = append(out, schema.OpenAIResponse{
+			ID: id, Created: created, Model: model,
+			Choices: []schema.Choice{{
+				Delta: &schema.Message{Content: &c},
+				Index: 0,
+			}},
+			Object: "chat.completion.chunk",
+		})
+	}
+
+	for i, ss := range functionResults {
+		if i < lastEmittedCount {
+			// Already streamed by the incremental JSON/XML parser during
+			// the token callback — skip to avoid a duplicate emission.
+			continue
+		}
+
+		toolCallID := ss.ID
+		if toolCallID == "" {
+			// Unique per-index fallback so multiple empty-ID calls don't
+			// collide on the same request ID (clients match tool results
+			// back by tool_call_id).
+			toolCallID = fmt.Sprintf("%s-%d", id, i)
+		}
+
+		// Name chunk.
+		out = append(out, schema.OpenAIResponse{
+			ID: id, Created: created, Model: model,
+			Choices: []schema.Choice{{
+				Delta: &schema.Message{
+					Role: "assistant",
+					ToolCalls: []schema.ToolCall{{
+						Index: i,
+						ID:    toolCallID,
+						Type:  "function",
+						FunctionCall: schema.FunctionCall{
+							Name: ss.Name,
+						},
+					}},
+				},
+				Index:        0,
+				FinishReason: nil,
+			}},
+			Object: "chat.completion.chunk",
+		})
+
+		// Args chunk — no Content here. Either it was streamed through
+		// the token callback earlier, or the role+content pair above
+		// already delivered it.
+		out = append(out, schema.OpenAIResponse{
+			ID: id, Created: created, Model: model,
+			Choices: []schema.Choice{{
+				Delta: &schema.Message{
+					Role: "assistant",
+					ToolCalls: []schema.ToolCall{{
+						Index: i,
+						ID:    toolCallID,
+						Type:  "function",
+						FunctionCall: schema.FunctionCall{
+							Arguments: ss.Arguments,
+						},
+					}},
+				},
+				Index:        0,
+				FinishReason: nil,
+			}},
+			Object: "chat.completion.chunk",
+		})
+	}
+
+	return out
+}
--- a/core/http/endpoints/openai/chat_emit_test.go
+++ b/core/http/endpoints/openai/chat_emit_test.go
@@ -0,0 +1,717 @@
+package openai
+
+import (
+	"fmt"
+
+	"github.com/mudler/LocalAI/core/schema"
+	"github.com/mudler/LocalAI/pkg/functions"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+// contentOf extracts the string payload from a chunk's delta.Content,
+// transparently handling both *string and string underlying types so
+// assertions don't have to care which one the helper produced.
+func contentOf(ch schema.OpenAIResponse) string {
+	if len(ch.Choices) == 0 || ch.Choices[0].Delta == nil {
+		return ""
+	}
+	switch v := ch.Choices[0].Delta.Content.(type) {
+	case *string:
+		if v == nil {
+			return ""
+		}
+		return *v
+	case string:
+		return v
+	default:
+		return ""
+	}
+}
+
+// reasoningOf mirrors contentOf for the delta.Reasoning field, which is a
+// *string on schema.Message.
+func reasoningOf(ch schema.OpenAIResponse) string {
+	if len(ch.Choices) == 0 || ch.Choices[0].Delta == nil {
+		return ""
+	}
+	r := ch.Choices[0].Delta.Reasoning
+	if r == nil {
+		return ""
+	}
+	return *r
+}
+
+// toolCallsOf returns the ToolCalls slice of a chunk's delta, or nil.
+func toolCallsOf(ch schema.OpenAIResponse) []schema.ToolCall {
+	if len(ch.Choices) == 0 || ch.Choices[0].Delta == nil {
+		return nil
+	}
+	return ch.Choices[0].Delta.ToolCalls
+}
+
+// expectSpecCompliant enforces the invariants on every chunk:
+//   - Object == "chat.completion.chunk"
+//   - Exactly one Choice with Index==0
+//   - No delta ever carries both non-empty Content and non-empty ToolCalls
+//   - No delta ever carries both non-empty Reasoning and non-empty ToolCalls
+func expectSpecCompliant(chunks []schema.OpenAIResponse) {
+	for i, ch := range chunks {
+		Expect(ch.Object).To(Equal("chat.completion.chunk"), "chunk[%d] Object", i)
+		Expect(ch.Choices).To(HaveLen(1), "chunk[%d] Choices length", i)
+		Expect(ch.Choices[0].Index).To(Equal(0), "chunk[%d] Choices[0].Index", i)
+
+		hasContent := contentOf(ch) != ""
+		hasReasoning := reasoningOf(ch) != ""
+		hasToolCalls := len(toolCallsOf(ch)) > 0
+
+		if hasContent && hasToolCalls {
+			Fail(fmt.Sprintf("chunk[%d] violates spec: Content and ToolCalls in same delta", i))
+		}
+		if hasReasoning && hasToolCalls {
+			Fail(fmt.Sprintf("chunk[%d] violates spec: Reasoning and ToolCalls in same delta", i))
+		}
+	}
+}
+
+// expectMetadata asserts every chunk carries the same id/model/created.
+func expectMetadata(chunks []schema.OpenAIResponse, id, model string, created int) {
+	for i, ch := range chunks {
+		Expect(ch.ID).To(Equal(id), "chunk[%d] ID", i)
+		Expect(ch.Model).To(Equal(model), "chunk[%d] Model", i)
+		Expect(ch.Created).To(Equal(created), "chunk[%d] Created", i)
+	}
+}
+
+var _ = Describe("buildDeferredToolCallChunks", func() {
+	const (
+		testID      = "req"
+		testModel   = "test-model"
+		testCreated = 1700000000
+	)
+
+	Describe("Case A — primary bug: content already streamed, 1 deferred call", func() {
+		It("emits only the tool_call chunks, no Content anywhere", func() {
+			results := []functions.FuncCallResults{
+				{Name: "search", Arguments: `{"q":"x"}`, ID: "tc1"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				true, "Let me search…",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(2), "two chunks: name, args")
+
+			// Name chunk
+			tc0 := toolCallsOf(chunks[0])
+			Expect(tc0).To(HaveLen(1))
+			Expect(tc0[0].Index).To(Equal(0))
+			Expect(tc0[0].ID).To(Equal("tc1"))
+			Expect(tc0[0].FunctionCall.Name).To(Equal("search"))
+			Expect(tc0[0].FunctionCall.Arguments).To(BeEmpty())
+			Expect(contentOf(chunks[0])).To(BeEmpty())
+
+			// Args chunk — MUST NOT carry Content
+			tc1 := toolCallsOf(chunks[1])
+			Expect(tc1).To(HaveLen(1))
+			Expect(tc1[0].FunctionCall.Name).To(BeEmpty())
+			Expect(tc1[0].FunctionCall.Arguments).To(Equal(`{"q":"x"}`))
+			Expect(contentOf(chunks[1])).To(BeEmpty(),
+				"args chunk must not duplicate already-streamed content")
+		})
+	})
+
+	Describe("Case B — autoparser / content not streamed", func() {
+		It("emits role, content, then name+args", func() {
+			results := []functions.FuncCallResults{
+				{Name: "do", Arguments: "{}", ID: "tc1"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				false, "Here is my plan…",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(4), "role, content, name, args")
+
+			// Role chunk
+			Expect(chunks[0].Choices[0].Delta.Role).To(Equal("assistant"))
+			Expect(contentOf(chunks[0])).To(BeEmpty())
+			Expect(toolCallsOf(chunks[0])).To(BeEmpty())
+
+			// Content chunk
+			Expect(contentOf(chunks[1])).To(Equal("Here is my plan…"))
+			Expect(toolCallsOf(chunks[1])).To(BeEmpty())
+
+			// Name + args chunks
+			Expect(toolCallsOf(chunks[2])).To(HaveLen(1))
+			Expect(toolCallsOf(chunks[2])[0].FunctionCall.Name).To(Equal("do"))
+			Expect(toolCallsOf(chunks[3])).To(HaveLen(1))
+			Expect(toolCallsOf(chunks[3])[0].FunctionCall.Arguments).To(Equal("{}"))
+		})
+	})
+
+	Describe("Case C — multiple deferred calls, content already streamed", func() {
+		It("emits (name, args) × 3 with no Content anywhere", func() {
+			results := []functions.FuncCallResults{
+				{Name: "a", Arguments: "{}", ID: "tcA"},
+				{Name: "b", Arguments: "{}", ID: "tcB"},
+				{Name: "c", Arguments: "{}", ID: "tcC"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				true, "some narration",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(6))
+
+			for i := 0; i < 3; i++ {
+				Expect(contentOf(chunks[2*i])).To(BeEmpty(),
+					"call #%d name chunk must not carry Content", i)
+				Expect(contentOf(chunks[2*i+1])).To(BeEmpty(),
+					"call #%d args chunk must not carry Content", i)
+				Expect(toolCallsOf(chunks[2*i])[0].Index).To(Equal(i))
+				Expect(toolCallsOf(chunks[2*i+1])[0].Index).To(Equal(i))
+			}
+			Expect(toolCallsOf(chunks[0])[0].FunctionCall.Name).To(Equal("a"))
+			Expect(toolCallsOf(chunks[2])[0].FunctionCall.Name).To(Equal("b"))
+			Expect(toolCallsOf(chunks[4])[0].FunctionCall.Name).To(Equal("c"))
+		})
+	})
+
+	Describe("Case D — partial incremental emission", func() {
+		It("emits only the deferred tail (call #1), skipping #0", func() {
+			results := []functions.FuncCallResults{
+				{Name: "a", Arguments: "{}", ID: "tc0"},
+				{Name: "b", Arguments: "{}", ID: "tc1"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 1,
+				true, "narration",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(2))
+			Expect(toolCallsOf(chunks[0])[0].Index).To(Equal(1))
+			Expect(toolCallsOf(chunks[0])[0].FunctionCall.Name).To(Equal("b"))
+			Expect(toolCallsOf(chunks[1])[0].Index).To(Equal(1))
+			Expect(toolCallsOf(chunks[1])[0].FunctionCall.Arguments).To(Equal("{}"))
+		})
+	})
+
+	Describe("Case E — all calls already emitted incrementally", func() {
+		It("emits nothing", func() {
+			results := []functions.FuncCallResults{
+				{Name: "a", Arguments: "{}", ID: "tc0"},
+				{Name: "b", Arguments: "{}", ID: "tc1"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 2,
+				true, "narration",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(BeEmpty())
+		})
+	})
+
+	Describe("Case F — content not streamed but textContent empty", func() {
+		It("emits only the tool call chunks, no leading role/content", func() {
+			results := []functions.FuncCallResults{
+				{Name: "x", Arguments: "{}", ID: "tcX"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				false, "",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(2))
+			Expect(toolCallsOf(chunks[0])[0].FunctionCall.Name).To(Equal("x"))
+			Expect(toolCallsOf(chunks[1])[0].FunctionCall.Arguments).To(Equal("{}"))
+		})
+	})
+
+	Describe("Case G — empty ss.ID falls back to a unique per-index ID", func() {
+		It("emits a deterministic per-index fallback", func() {
+			results := []functions.FuncCallResults{
+				{Name: "x", Arguments: "{}", ID: ""},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				true, "narration",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(2))
+			expectedID := fmt.Sprintf("%s-%d", testID, 0)
+			Expect(toolCallsOf(chunks[0])[0].ID).To(Equal(expectedID))
+			Expect(toolCallsOf(chunks[1])[0].ID).To(Equal(expectedID))
+		})
+	})
+
+	Describe("Case G2 — multiple empty IDs get distinct fallbacks", func() {
+		It("avoids the collision bug where every empty-ID call shared the request id", func() {
+			results := []functions.FuncCallResults{
+				{Name: "a", Arguments: "{}", ID: ""},
+				{Name: "b", Arguments: "{}", ID: ""},
+				{Name: "c", Arguments: "{}", ID: ""},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				true, "narration",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(6))
+
+			ids := map[string]int{}
+			for _, ch := range chunks {
+				for _, tc := range toolCallsOf(ch) {
+					ids[tc.ID]++
+				}
+			}
+			// Each call yields a name chunk + args chunk → each distinct ID
+			// should appear in exactly two chunks. Three distinct IDs
+			// overall.
+			Expect(ids).To(HaveLen(3), "three distinct per-index fallback IDs")
+			for id, n := range ids {
+				Expect(n).To(Equal(2), "ID %q should appear in exactly 2 chunks", id)
+			}
+		})
+	})
+
+	Describe("Case H — indices preserved across skip with multiple calls", func() {
+		It("emits Index fields matching functionResults positions", func() {
+			results := []functions.FuncCallResults{
+				{Name: "a", Arguments: "{}", ID: "tc0"},
+				{Name: "b", Arguments: "{}", ID: "tc1"},
+				{Name: "c", Arguments: "{}", ID: "tc2"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 1,
+				true, "narration",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(4))
+
+			Expect(toolCallsOf(chunks[0])[0].Index).To(Equal(1))
+			Expect(toolCallsOf(chunks[1])[0].Index).To(Equal(1))
+			Expect(toolCallsOf(chunks[2])[0].Index).To(Equal(2))
+			Expect(toolCallsOf(chunks[3])[0].Index).To(Equal(2))
+		})
+	})
+
+	Describe("Case I — explicit non-empty ID is preserved", func() {
+		It("does not touch ss.ID when it's already set", func() {
+			results := []functions.FuncCallResults{
+				{Name: "x", Arguments: "{}", ID: "abc123"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				true, "narration",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(2))
+			Expect(toolCallsOf(chunks[0])[0].ID).To(Equal("abc123"))
+			Expect(toolCallsOf(chunks[1])[0].ID).To(Equal("abc123"))
+		})
+	})
+
+	Describe("Case J — chunk-shape sanity", func() {
+		It("splits Name into the first chunk and Arguments into the second", func() {
+			results := []functions.FuncCallResults{
+				{Name: "x", Arguments: `{"k":"v"}`, ID: "tcX"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				true, "narration",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(2))
+
+			Expect(toolCallsOf(chunks[0])[0].FunctionCall.Name).To(Equal("x"))
+			Expect(toolCallsOf(chunks[0])[0].FunctionCall.Arguments).To(BeEmpty())
+
+			Expect(toolCallsOf(chunks[1])[0].FunctionCall.Name).To(BeEmpty())
+			Expect(toolCallsOf(chunks[1])[0].FunctionCall.Arguments).To(Equal(`{"k":"v"}`))
+		})
+	})
+
+	Describe("Case K — metadata propagation", func() {
+		It("stamps every chunk with the same id/model/created", func() {
+			results := []functions.FuncCallResults{
+				{Name: "a", Arguments: "{}", ID: "tcA"},
+				{Name: "b", Arguments: "{}", ID: "tcB"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				false, "hello",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			expectMetadata(chunks, testID, testModel, testCreated)
+		})
+	})
+
+	Describe("Case L — Choices[0].Index == 0 invariant", func() {
+		It("is upheld across every branch the helper can take", func() {
+			scenarios := []struct {
+				name              string
+				functionResults   []functions.FuncCallResults
+				lastEmittedCount  int
+				contentStreamed   bool
+				text              string
+				reasoningStreamed bool
+				reasoning         string
+			}{
+				{"streamed-content-deferred-call",
+					[]functions.FuncCallResults{{Name: "a", Arguments: "{}"}},
+					0, true, "hi", true, ""},
+				{"unstreamed-content-deferred-call",
+					[]functions.FuncCallResults{{Name: "a", Arguments: "{}"}},
+					0, false, "hello", true, ""},
+				{"unstreamed-reasoning-and-content",
+					[]functions.FuncCallResults{{Name: "a", Arguments: "{}"}},
+					0, false, "hello", false, "thinking…"},
+				{"partial-incremental",
+					[]functions.FuncCallResults{
+						{Name: "a", Arguments: "{}"},
+						{Name: "b", Arguments: "{}"}},
+					1, true, "hi", true, ""},
+			}
+			for _, sc := range scenarios {
+				chunks := buildDeferredToolCallChunks(
+					testID, testModel, testCreated,
+					sc.functionResults, sc.lastEmittedCount,
+					sc.contentStreamed, sc.text,
+					sc.reasoningStreamed, sc.reasoning,
+				)
+				for i, ch := range chunks {
+					Expect(ch.Choices[0].Index).To(Equal(0),
+						"scenario %q chunk[%d] Choices[0].Index", sc.name, i)
+				}
+			}
+		})
+	})
+
+	Describe("Case M — spec compliance across every scenario", func() {
+		It("never mixes Content or Reasoning with ToolCalls in a single delta", func() {
+			scenarios := []struct {
+				name              string
+				functionResults   []functions.FuncCallResults
+				lastEmittedCount  int
+				contentStreamed   bool
+				text              string
+				reasoningStreamed bool
+				reasoning         string
+			}{
+				{"A", []functions.FuncCallResults{{Name: "a", Arguments: "{}", ID: "tc"}},
+					0, true, "already-streamed", true, ""},
+				{"C", []functions.FuncCallResults{
+					{Name: "a", Arguments: "{}", ID: "tc0"},
+					{Name: "b", Arguments: "{}", ID: "tc1"}},
+					0, true, "already-streamed", true, ""},
+				{"B", []functions.FuncCallResults{{Name: "a", Arguments: "{}", ID: "tc"}},
+					0, false, "plan", true, ""},
+				{"Reasoning-deferred", []functions.FuncCallResults{{Name: "a", Arguments: "{}", ID: "tc"}},
+					0, false, "plan", false, "thinking…"},
+			}
+			for _, sc := range scenarios {
+				chunks := buildDeferredToolCallChunks(
+					testID, testModel, testCreated,
+					sc.functionResults, sc.lastEmittedCount,
+					sc.contentStreamed, sc.text,
+					sc.reasoningStreamed, sc.reasoning,
+				)
+				for i, ch := range chunks {
+					hasContent := contentOf(ch) != ""
+					hasReasoning := reasoningOf(ch) != ""
+					hasToolCalls := len(toolCallsOf(ch)) > 0
+					Expect(hasContent && hasToolCalls).To(BeFalse(),
+						"scenario %q chunk[%d] mixes Content with ToolCalls", sc.name, i)
+					Expect(hasReasoning && hasToolCalls).To(BeFalse(),
+						"scenario %q chunk[%d] mixes Reasoning with ToolCalls", sc.name, i)
+				}
+			}
+		})
+	})
+
+	Describe("Case N — empty functionResults", func() {
+		It("emits nothing, including no leading role/content/reasoning", func() {
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				nil, 0,
+				false, "ignored",
+				false, "ignored",
+			)
+			Expect(chunks).To(BeEmpty())
+		})
+	})
+
+	Describe("Case O — content not streamed but all calls already emitted", func() {
+		It("emits nothing, not even a standalone content chunk", func() {
+			results := []functions.FuncCallResults{
+				{Name: "a", Arguments: "{}", ID: "tc0"},
+				{Name: "b", Arguments: "{}", ID: "tc1"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 2,
+				false, "narration",
+				false, "thinking…",
+			)
+			Expect(chunks).To(BeEmpty(),
+				"no tool_calls to trigger on, so no leading role/content/reasoning either")
+		})
+	})
+
+	Describe("Reasoning — autoparser delivered reasoning only at end", func() {
+		It("emits a leading reasoning chunk when !reasoningAlreadyStreamed", func() {
+			results := []functions.FuncCallResults{
+				{Name: "a", Arguments: "{}", ID: "tc"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				true, "streamed content",
+				false, "model's private thoughts",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(3), "reasoning, name, args")
+
+			Expect(reasoningOf(chunks[0])).To(Equal("model's private thoughts"))
+			Expect(contentOf(chunks[0])).To(BeEmpty())
+			Expect(toolCallsOf(chunks[0])).To(BeEmpty())
+
+			// The following two are the tool_call name + args chunks.
+			Expect(toolCallsOf(chunks[1])[0].FunctionCall.Name).To(Equal("a"))
+			Expect(toolCallsOf(chunks[2])[0].FunctionCall.Arguments).To(Equal("{}"))
+		})
+
+		It("emits reasoning before role+content when neither was streamed", func() {
+			results := []functions.FuncCallResults{
+				{Name: "a", Arguments: "{}", ID: "tc"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				false, "final plan",
+				false, "private thoughts",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(5), "reasoning, role, content, name, args")
+
+			Expect(reasoningOf(chunks[0])).To(Equal("private thoughts"))
+			Expect(chunks[1].Choices[0].Delta.Role).To(Equal("assistant"))
+			Expect(contentOf(chunks[2])).To(Equal("final plan"))
+			Expect(toolCallsOf(chunks[3])[0].FunctionCall.Name).To(Equal("a"))
+			Expect(toolCallsOf(chunks[4])[0].FunctionCall.Arguments).To(Equal("{}"))
+		})
+
+		It("does not re-emit reasoning that was already streamed", func() {
+			results := []functions.FuncCallResults{
+				{Name: "a", Arguments: "{}", ID: "tc"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				true, "streamed",
+				true, "already-sent reasoning",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(2), "only name + args; no reasoning re-emission")
+			for _, ch := range chunks {
+				Expect(reasoningOf(ch)).To(BeEmpty())
+			}
+		})
+	})
+})
+
+var _ = Describe("hasRealCall", func() {
+	const noAction = "answer"
+
+	It("returns false for nil and empty slices", func() {
+		Expect(hasRealCall(nil, noAction)).To(BeFalse())
+		Expect(hasRealCall([]functions.FuncCallResults{}, noAction)).To(BeFalse())
+	})
+
+	It("returns false when every entry is the noAction sentinel", func() {
+		results := []functions.FuncCallResults{
+			{Name: noAction, Arguments: `{"message":"hi"}`},
+			{Name: noAction, Arguments: `{"message":"hello"}`},
+		}
+		Expect(hasRealCall(results, noAction)).To(BeFalse())
+	})
+
+	It("returns true when the only entry is a real call", func() {
+		results := []functions.FuncCallResults{
+			{Name: "search", Arguments: "{}"},
+		}
+		Expect(hasRealCall(results, noAction)).To(BeTrue())
+	})
+
+	It("returns true when a real call follows a noAction entry", func() {
+		// This is the regression the follow-up fixes: the old
+		// functionResults[0].Name == noAction check would declare this
+		// noActionToRun and drop the real call entirely.
+		results := []functions.FuncCallResults{
+			{Name: noAction, Arguments: `{"message":"hi"}`},
+			{Name: "search", Arguments: "{}"},
+		}
+		Expect(hasRealCall(results, noAction)).To(BeTrue())
+	})
+
+	It("returns true when a real call precedes a noAction entry", func() {
+		results := []functions.FuncCallResults{
+			{Name: "search", Arguments: "{}"},
+			{Name: noAction, Arguments: `{"message":"hi"}`},
+		}
+		Expect(hasRealCall(results, noAction)).To(BeTrue())
+	})
+})
+
+var _ = Describe("buildNoActionFinalChunks", func() {
+	const (
+		testID      = "req"
+		testModel   = "test-model"
+		testCreated = 1700000000
+	)
+	usage := schema.OpenAIUsage{PromptTokens: 5, CompletionTokens: 7, TotalTokens: 12}
+
+	Describe("Content streamed — trailing usage chunk", func() {
+		It("emits just one chunk with usage, no content, no reasoning when reasoning was streamed", func() {
+			chunks := buildNoActionFinalChunks(
+				testID, testModel, testCreated,
+				true, true,
+				"", "already-streamed-reasoning", usage,
+			)
+
+			Expect(chunks).To(HaveLen(1))
+			Expect(chunks[0].Usage.TotalTokens).To(Equal(12))
+			Expect(contentOf(chunks[0])).To(BeEmpty())
+			Expect(reasoningOf(chunks[0])).To(BeEmpty(),
+				"reasoning must not be re-emitted once it was streamed via the callback")
+		})
+
+		It("emits a trailing reasoning delivery when reasoning came only at end", func() {
+			chunks := buildNoActionFinalChunks(
+				testID, testModel, testCreated,
+				true, false,
+				"", "autoparser final reasoning", usage,
+			)
+
+			Expect(chunks).To(HaveLen(1))
+			Expect(reasoningOf(chunks[0])).To(Equal("autoparser final reasoning"))
+			Expect(contentOf(chunks[0])).To(BeEmpty())
+			Expect(chunks[0].Usage.TotalTokens).To(Equal(12))
+		})
+
+		It("omits reasoning when it's empty regardless of streamed flag", func() {
+			chunks := buildNoActionFinalChunks(
+				testID, testModel, testCreated,
+				true, false,
+				"", "", usage,
+			)
+
+			Expect(chunks).To(HaveLen(1))
+			Expect(reasoningOf(chunks[0])).To(BeEmpty())
+		})
+	})
+
+	Describe("Content not streamed — role, then content+usage", func() {
+		It("emits role chunk then content chunk without reasoning when reasoning was streamed", func() {
+			chunks := buildNoActionFinalChunks(
+				testID, testModel, testCreated,
+				false, true,
+				"the answer", "already-streamed-reasoning", usage,
+			)
+
+			Expect(chunks).To(HaveLen(2))
+			Expect(chunks[0].Choices[0].Delta.Role).To(Equal("assistant"))
+			Expect(contentOf(chunks[0])).To(BeEmpty())
+
+			Expect(contentOf(chunks[1])).To(Equal("the answer"))
+			Expect(reasoningOf(chunks[1])).To(BeEmpty(),
+				"reasoning must not be re-emitted if it was streamed earlier")
+			Expect(chunks[1].Usage.TotalTokens).To(Equal(12))
+		})
+
+		It("emits role, then content+reasoning when reasoning was not streamed", func() {
+			chunks := buildNoActionFinalChunks(
+				testID, testModel, testCreated,
+				false, false,
+				"the answer", "autoparser final reasoning", usage,
+			)
+
+			Expect(chunks).To(HaveLen(2))
+			Expect(chunks[0].Choices[0].Delta.Role).To(Equal("assistant"))
+
+			Expect(contentOf(chunks[1])).To(Equal("the answer"))
+			Expect(reasoningOf(chunks[1])).To(Equal("autoparser final reasoning"))
+			Expect(chunks[1].Usage.TotalTokens).To(Equal(12))
+		})
+
+		It("still emits content even when reasoning is empty", func() {
+			chunks := buildNoActionFinalChunks(
+				testID, testModel, testCreated,
+				false, false,
+				"just an answer", "", usage,
+			)
+
+			Expect(chunks).To(HaveLen(2))
+			Expect(contentOf(chunks[1])).To(Equal("just an answer"))
+			Expect(reasoningOf(chunks[1])).To(BeEmpty())
+		})
+	})
+
+	Describe("Metadata and shape invariants", func() {
+		It("stamps every chunk with the same id/model/created and object", func() {
+			chunks := buildNoActionFinalChunks(
+				testID, testModel, testCreated,
+				false, false,
+				"hi", "reasoning", usage,
+			)
+			for i, ch := range chunks {
+				Expect(ch.ID).To(Equal(testID), "chunk[%d] ID", i)
+				Expect(ch.Model).To(Equal(testModel), "chunk[%d] Model", i)
+				Expect(ch.Created).To(Equal(testCreated), "chunk[%d] Created", i)
+				Expect(ch.Object).To(Equal("chat.completion.chunk"), "chunk[%d] Object", i)
+				Expect(ch.Choices).To(HaveLen(1))
+				Expect(ch.Choices[0].Index).To(Equal(0))
+			}
+		})
+	})
+})
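Reviewer note: the specs above pin down the emission order of the deferred chunks (reasoning, then role, then content, then a name chunk and an args chunk per call) and the "any real call" rule. The following is a minimal sketch of that behaviour reconstructed only from the test expectations, not the helpers' actual code. The types `call`, `delta`, and `toolDelta` and the functions `hasRealCallSketch` and `deferredChunksSketch` are hypothetical stand-ins for `functions.FuncCallResults`, `schema.OpenAIResponse`, `hasRealCall`, and `buildDeferredToolCallChunks`; the guard against a negative count is an added assumption.

```go
package main

import "fmt"

// call is a hypothetical stand-in for functions.FuncCallResults.
type call struct {
	Name, Arguments, ID string
}

// toolDelta is a hypothetical stand-in for one tool_call entry in a delta.
type toolDelta struct {
	Index     int
	ID        string
	Name      string
	Arguments string
}

// delta is a simplified stand-in for the delta of one streamed chunk.
// Tool is nil when the chunk carries no tool call.
type delta struct {
	Role      string
	Content   string
	Reasoning string
	Tool      *toolDelta
}

// hasRealCallSketch reports whether any result is something other than the
// noAction sentinel, wherever it sits in the slice.
func hasRealCallSketch(results []call, noAction string) bool {
	for _, r := range results {
		if r.Name != noAction {
			return true
		}
	}
	return false
}

// deferredChunksSketch returns the chunks for calls that were not already
// emitted incrementally. Nothing is emitted when no call is left to send.
func deferredChunksSketch(reqID string, results []call, lastEmitted int,
	contentStreamed bool, text string,
	reasoningStreamed bool, reasoning string) []delta {
	if lastEmitted < 0 { // assumption: treat a negative count as zero
		lastEmitted = 0
	}
	if lastEmitted >= len(results) {
		return nil
	}
	var out []delta
	// Reasoning that only arrived at the end goes first, in its own chunk.
	if !reasoningStreamed && reasoning != "" {
		out = append(out, delta{Reasoning: reasoning})
	}
	// Unstreamed content gets a role chunk followed by a content chunk.
	if !contentStreamed && text != "" {
		out = append(out, delta{Role: "assistant"}, delta{Content: text})
	}
	for i := lastEmitted; i < len(results); i++ {
		id := results[i].ID
		if id == "" {
			// Per-index fallback so empty IDs never collide.
			id = fmt.Sprintf("%s-%d", reqID, i)
		}
		// Name and arguments travel in separate chunks with the same index.
		out = append(out,
			delta{Tool: &toolDelta{Index: i, ID: id, Name: results[i].Name}},
			delta{Tool: &toolDelta{Index: i, ID: id, Arguments: results[i].Arguments}},
		)
	}
	return out
}
```

In this sketch no chunk ever sets `Content` or `Reasoning` together with `Tool`, which is the invariant Case M checks across scenarios.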
--- a/core/http/middleware/trace.go
+++ b/core/http/middleware/trace.go
@@ -3,6 +3,7 @@ package middleware
 import (
 	"bytes"
 	"io"
+	"mime"
 	"net/http"
 	"slices"
 	"sync"
@@ -94,7 +95,8 @@ func TraceMiddleware(app *application.Application) echo.MiddlewareFunc {

 			initializeTracing(app.ApplicationConfig().TracingMaxItems)

-			if c.Request().Header.Get("Content-Type") != "application/json" {
+			ct, _, _ := mime.ParseMediaType(c.Request().Header.Get("Content-Type"))
+			if ct != "application/json" {
 				return next(c)
 			}

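Reviewer note: the hunk above replaces an exact string comparison with `mime.ParseMediaType`, so header values that carry parameters (for example `application/json; charset=utf-8`) are still treated as JSON. The following is a small standalone sketch of that check; the helper name `isJSONContentType` is hypothetical and does not appear in the middleware.

```go
package main

import "mime"

// isJSONContentType reports whether a Content-Type header value names the
// application/json media type, ignoring parameters such as charset.
// Like the middleware hunk, it discards the parse error: a header with no
// parsable media type yields an empty type, which is simply "not JSON".
func isJSONContentType(header string) bool {
	ct, _, _ := mime.ParseMediaType(header)
	return ct == "application/json"
}
```

`mime.ParseMediaType` also lowercases the media type, so a header such as `Application/JSON` matches as well.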
--- a/core/http/react-ui/src/App.css
+++ b/core/http/react-ui/src/App.css
@@ -1529,6 +1529,401 @@ select.input {
  background: var(--color-warning-light);
  color: var(--color-warning);
 }
+.badge-accent {
+  background: var(--color-accent-light);
+  color: var(--color-accent);
+}
+
+/* Horizontal row of badges used inside table cells — consistent spacing so
+   cells line up regardless of how many badges are present. */
+.badge-row {
+  display: inline-flex;
+  flex-wrap: wrap;
+  gap: 4px;
+  align-items: center;
+}
+
+/* Vertically stacked cell content (e.g. version + update chip + drift chip).
+   Keeps rows readable at scale without inline style={{...}} everywhere. */
+.cell-stack {
+  display: flex;
+  flex-direction: column;
+  gap: 4px;
+  align-items: flex-start;
+}
+
+.cell-mono {
+  font-family: 'JetBrains Mono', ui-monospace, monospace;
+  font-size: var(--text-xs);
+  color: var(--color-text-primary);
+}
+
+.cell-muted {
+  color: var(--color-text-muted);
+  font-size: var(--text-xs);
+}
+
+.cell-subtle {
+  color: var(--color-text-muted);
+  font-size: var(--text-xs);
+  font-weight: 400;
+  margin-left: 8px;
+}
+
+.cell-name {
+  display: inline-flex;
+  align-items: center;
+  gap: var(--spacing-xs);
+  font-weight: 500;
+}
+.cell-name > i {
+  color: var(--color-accent);
+  font-size: var(--text-xs);
+}
+
+.row-actions {
+  display: flex;
+  gap: var(--spacing-xs);
+  justify-content: flex-end;
+  align-items: center;
+}
+
+/* Softer delete button for dense tables — the destructive confirm dialog
+   already owns the "are you sure" affordance, so the button itself doesn't
+   need to scream. Keeps the delete red readable without dominating rows. */
+.btn.btn-danger-ghost {
+  background: transparent;
+  color: var(--color-error);
+  border-color: transparent;
+}
+.btn.btn-danger-ghost:hover:not(:disabled) {
+  background: var(--color-error-light);
+  color: var(--color-error);
+  border-color: var(--color-error-light);
+}
+
+/* Small count pill used inside tabs ("(3) ↑ 2") so update counts are
+   glanceable without extra rows of UI. */
+.tab-pill {
+  display: inline-flex;
+  align-items: center;
+  gap: 3px;
+  margin-left: 6px;
+  padding: 1px 6px;
+  border-radius: var(--radius-full);
+  font-size: var(--text-xs);
+  font-weight: 600;
+  line-height: 1.4;
+}
+.tab-pill--warning {
+  background: var(--color-warning-light);
+  color: var(--color-warning);
+}
+
+/* Stat cards — uniform-height cluster metrics for the Nodes dashboard.
+   Left accent bar ties the color to the metric's semantic (success/warning/
+   error/primary), icon chip sits top-right, value is left-aligned and
+   prominent so you can scan a row of cards without reading labels. */
+.stat-grid {
+  display: grid;
+  grid-template-columns: repeat(auto-fill, minmax(180px, 1fr));
+  gap: var(--spacing-md);
+  margin-bottom: var(--spacing-xl);
+}
+
+.stat-card {
+  position: relative;
+  display: flex;
+  align-items: center;
+  justify-content: space-between;
+  gap: var(--spacing-sm);
+  padding: var(--spacing-md);
+  min-height: 96px;
+  background: var(--color-bg-raised, var(--color-bg-secondary));
+  border: 1px solid var(--color-border-subtle);
+  border-radius: var(--radius-lg);
+  transition: transform var(--duration-fast) var(--ease-default),
+              box-shadow var(--duration-fast) var(--ease-default),
+              border-color var(--duration-fast) var(--ease-default);
+  overflow: hidden;
+}
+.stat-card::before {
+  content: '';
+  position: absolute;
+  left: 0; top: 0; bottom: 0;
+  width: 3px;
+  background: var(--stat-accent, var(--color-border-subtle));
+  transition: background var(--duration-fast) var(--ease-default);
+}
+.stat-card:hover {
+  transform: translateY(-1px);
+  box-shadow: var(--shadow-sm);
+  border-color: var(--color-border);
+}
+
+.stat-card__body {
+  display: flex;
+  flex-direction: column;
+  gap: 6px;
+  min-width: 0;
+}
+.stat-card__label {
+  font-size: var(--text-xs);
+  font-weight: 600;
+  letter-spacing: 0.08em;
+  text-transform: uppercase;
+  color: var(--color-text-muted);
+  white-space: normal;
+  line-height: 1.2;
+}
+.stat-card__value {
+  font-size: var(--text-2xl);
+  font-weight: 600;
+  font-family: 'JetBrains Mono', ui-monospace, monospace;
+  line-height: 1;
+  color: var(--color-text-primary);
+  word-break: break-word;
+}
+.stat-card__icon {
+  display: inline-flex;
+  align-items: center;
+  justify-content: center;
+  width: 36px;
+  height: 36px;
+  border-radius: var(--radius-md);
+  background: color-mix(in srgb, var(--stat-accent, var(--color-text-muted)) 12%, transparent);
+  color: var(--stat-accent, var(--color-text-muted));
+  font-size: var(--text-lg);
+  flex-shrink: 0;
+}
+
+/* Subtle "Register a new worker" trigger replacing the broken-text chevron
+   link. Still opens the same hint card — just reads like a button now. */
+.nodes-add-worker {
+  display: inline-flex;
+  align-items: center;
+  gap: var(--spacing-xs);
+  padding: var(--spacing-xs) var(--spacing-sm);
+  background: transparent;
+  border: 1px dashed var(--color-border);
+  border-radius: var(--radius-md);
+  color: var(--color-text-secondary);
+  font-size: var(--text-sm);
+  font-family: inherit;
+  font-weight: 500;
+  cursor: pointer;
+  margin-bottom: var(--spacing-md);
+  transition: background var(--duration-fast) var(--ease-default),
+              border-color var(--duration-fast) var(--ease-default),
+              color var(--duration-fast) var(--ease-default);
+}
+.nodes-add-worker:hover {
+  background: var(--color-bg-raised, var(--color-bg-secondary));
+  border-color: var(--color-border-strong);
+  color: var(--color-text-primary);
+}
+
+/* Shared FilterBar layout — search strip + chip row + toggle strip. Lives
+   outside the .filter-bar chip row so the padding and wrapping behavior is
+   consistent between the Backends gallery and the System tabs. */
+.filter-bar-group {
+  display: flex;
+  flex-direction: column;
+  gap: var(--spacing-sm);
+  margin-bottom: var(--spacing-md);
+}
+.filter-bar-group__search {
+  min-width: 200px;
+  flex: 1;
+}
+.filter-bar-group__row {
+  display: flex;
+  gap: var(--spacing-md);
+  align-items: center;
+  flex-wrap: wrap;
+}
+.filter-bar-group__right {
+  display: flex;
+  gap: var(--spacing-md);
+  align-items: center;
+  flex-wrap: wrap;
+  padding-left: var(--spacing-md);
+  border-left: 1px solid var(--color-border-subtle);
+}
+.filter-bar-group__toggle {
+  display: flex;
+  align-items: center;
+  gap: var(--spacing-xs);
+  font-size: var(--text-xs);
+  color: var(--color-text-secondary);
+  cursor: pointer;
+  user-select: none;
+  white-space: nowrap;
+}
+.filter-btn__count {
+  display: inline-flex;
+  align-items: center;
+  justify-content: center;
+  margin-left: 6px;
+  min-width: 18px;
+  padding: 0 5px;
+  background: color-mix(in srgb, currentColor 18%, transparent);
+  border-radius: var(--radius-full);
+  font-size: 0.625rem;
+  font-weight: 600;
+}
+
+/* Popover — floating surface anchored to a trigger element. Uses the .card
+   base so theming is free, adds z-index + fixed-position + scroll cap so it
+   behaves on tables with many rows. Kept deliberately unstyled beyond that
+   — content is expected to provide its own header/body structure. */
+.popover {
+  position: fixed;
+  z-index: 200;
+  min-width: 260px;
+  max-width: min(420px, 95vw);
+  max-height: min(420px, 70vh);
+  display: flex;
+  flex-direction: column;
+  padding: 0; /* sections provide their own padding */
+  overflow: hidden;
+  box-shadow: var(--shadow-lg);
+  animation: popoverIn var(--duration-fast) var(--ease-default);
+}
+
+@keyframes popoverIn {
+  from { opacity: 0; transform: translateY(-4px) scale(0.98); }
+  to   { opacity: 1; transform: translateY(0) scale(1); }
+}
+
+.popover__header {
+  display: flex;
+  align-items: center;
+  gap: var(--spacing-sm);
+  padding: var(--spacing-sm) var(--spacing-md);
+  border-bottom: 1px solid var(--color-border-subtle);
+  font-size: var(--text-sm);
+}
+
+.popover__scroll {
+  overflow: auto;
+  padding: 0;
+}
+
+.popover__table {
+  margin: 0;
+  width: 100%;
+}
+.popover__table th {
+  position: sticky;
+  top: 0;
+  background: var(--color-bg-raised, var(--color-bg-secondary));
+  z-index: 1;
+}
+
+/* Inline-table chip trigger — looks like a badge but is a button (cursor,
+   focus ring inherited from global :focus-visible). */
+.chip-trigger {
+  border: none;
+  cursor: pointer;
+  font-family: inherit;
+}
+.chip-trigger:hover {
+  filter: brightness(1.08);
+}
+
+/* Truncate + ellipsize a long cell (e.g. OCI digest) without breaking the
+   table layout. Tooltip preserves the full value. */
+.cell-truncate {
+  max-width: 160px;
+  overflow: hidden;
+  text-overflow: ellipsis;
+  white-space: nowrap;
+}
+
+/* Compact empty-state used inside expanded drawer sections (e.g. "No
+   models loaded on this node"). Dimmer than the page-level .empty-state
+   because it lives inside another container and shouldn't compete with
+   the row's primary content. */
+.drawer-empty {
+  display: flex;
+  align-items: center;
+  gap: var(--spacing-sm);
+  padding: var(--spacing-sm) var(--spacing-md);
+  background: var(--color-bg-tertiary);
+  border: 1px dashed var(--color-border-subtle);
+  border-radius: var(--radius-md);
+  color: var(--color-text-muted);
+  font-size: var(--text-sm);
+}
+.drawer-empty > i {
+  font-size: var(--text-sm);
+  color: var(--color-text-muted);
+  opacity: 0.8;
+}
+
+/* Node-status indicator — replaces the tiny bullet with a proper LED-style
+   dot next to a bold status label. Colors are applied inline from statusConfig
+   so one primitive handles healthy/unhealthy/draining/pending in one shape. */
+.node-status {
+  display: inline-flex;
+  align-items: center;
+  gap: 8px;
+  font-size: var(--text-sm);
+  font-weight: 600;
+}
+.node-status__dot {
+  width: 8px;
+  height: 8px;
+  border-radius: 50%;
+  box-shadow: 0 0 0 3px color-mix(in srgb, currentColor 15%, transparent);
+  flex-shrink: 0;
+}
+
+/* Row-chevron cell — small 20px toggle used in table rows that expand.
+   The row itself is still clickable; the chevron provides the visible
+   affordance users were missing. */
+.row-chevron {
+  display: inline-flex;
+  align-items: center;
+  justify-content: center;
+  width: 20px;
+  height: 20px;
+  font-size: var(--text-xs);
+  color: var(--color-text-muted);
+  transition: transform var(--duration-fast) var(--ease-default);
+}
+.row-chevron.is-expanded {
+  transform: rotate(90deg);
+  color: var(--color-text-primary);
+}
+
+/* Upgrade banner — the yellow strip operators see when updates are available.
+   Mirrors the gallery so both pages speak the same visual language. */
+.upgrade-banner {
+  display: flex;
+  align-items: center;
+  justify-content: space-between;
+  gap: var(--spacing-md);
+  padding: var(--spacing-sm) var(--spacing-md);
+  margin-bottom: var(--spacing-md);
+  background: var(--color-warning-light);
+  border: 1px solid var(--color-warning);
+  border-radius: var(--radius-md);
+  color: var(--color-warning);
+}
+.upgrade-banner__text {
+  display: inline-flex;
+  align-items: center;
+  gap: var(--spacing-sm);
+  font-weight: 500;
+  font-size: var(--text-sm);
+}
+.upgrade-banner__actions {
+  display: inline-flex;
+  gap: var(--spacing-xs);
+  align-items: center;
+}

 /* Tabs */
 .tabs {
--- a/core/http/react-ui/src/components/FilterBar.jsx
+++ b/core/http/react-ui/src/components/FilterBar.jsx
@@ -0,0 +1,87 @@
+import Toggle from './Toggle'
+
+// FilterBar is the shared search + chip filter + toggles control strip that
+// the Backends gallery pioneered. Pulled into its own component so the System
+// page's two tabs stop looking like a different app — matching visual
+// grammar + matching keyboard behavior.
+//
+// Props:
+//   search:            controlled value for the search input.
+//   onSearchChange:    (value) => void; null disables the search input entirely.
+//   searchPlaceholder: placeholder for the search input.
+//   filters:           [{ key, label, icon?, count? }]; activeFilter is compared
+//                      by key. Omit to hide the chip row.
+//   activeFilter:      currently-selected filter key (use '' for "all" if
+//                      that's the first entry in `filters`).
+//   onFilterChange:    (key) => void.
+//   toggles:           [{ key, label, icon?, checked, onChange }]; optional
+//                      right-side toggle group (e.g. "Show all", "Development").
+//   rightSlot:         arbitrary element rendered after the toggles — use for
+//                      sort controls or extra buttons.
+export default function FilterBar({
+  search,
+  onSearchChange,
+  searchPlaceholder = 'Search...',
+  filters,
+  activeFilter,
+  onFilterChange,
+  toggles,
+  rightSlot,
+}) {
+  const hasFilters = Array.isArray(filters) && filters.length > 0
+  const hasToggles = Array.isArray(toggles) && toggles.length > 0
+
+  return (
+    <div className="filter-bar-group">
+      {onSearchChange && (
+        <div className="search-bar filter-bar-group__search">
+          <i className="fas fa-search search-icon" />
+          <input
+            className="input"
+            placeholder={searchPlaceholder}
+            value={search ?? ''}
+            onChange={e => onSearchChange(e.target.value)}
+            aria-label={searchPlaceholder}
+          />
+        </div>
+      )}
+
+      {(hasFilters || hasToggles || rightSlot) && (
+        <div className="filter-bar-group__row">
+          {hasFilters && (
+            <div className="filter-bar" role="tablist" aria-label="Filter">
+              {filters.map(f => (
+                <button
+                  key={f.key}
+                  role="tab"
+                  aria-selected={activeFilter === f.key}
+                  className={`filter-btn ${activeFilter === f.key ? 'active' : ''}`}
+                  onClick={() => onFilterChange(f.key)}
+                >
+                  {f.icon && <i className={`fas ${f.icon}`} style={{ marginRight: 4 }} />}
+                  {f.label}
+                  {typeof f.count === 'number' && (
+                    <span className="filter-btn__count">{f.count}</span>
+                  )}
+                </button>
+              ))}
+            </div>
+          )}
+
+          {(hasToggles || rightSlot) && (
+            <div className="filter-bar-group__right">
+              {hasToggles && toggles.map(t => (
+                <label key={t.key} className="filter-bar-group__toggle">
+                  <Toggle checked={t.checked} onChange={t.onChange} />
+                  {t.icon && <i className={`fas ${t.icon}`} />}
+                  {t.label}
+                </label>
+              ))}
+              {rightSlot}
+            </div>
+          )}
+        </div>
+      )}
+    </div>
+  )
+}
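The `filters` prop is a plain array of chip descriptors, and `count` is rendered only when it is a number. The sketch below shows one way a caller could derive that array and apply the active chip plus the search string, in plain JavaScript. `PREDICATES`, `buildFilters`, `applyFilter`, and the item shape (`Name`, `IsSystem`) are illustrative assumptions, not part of this change.

```javascript
// Hypothetical helpers: only the { key, label, icon, count } descriptor shape
// is taken from FilterBar; everything else is invented for illustration.
const PREDICATES = {
  all: () => true,
  user: b => !b.IsSystem,
  system: b => !!b.IsSystem,
}

// Build the chip descriptors, attaching a live count to each one.
function buildFilters(items) {
  return [
    { key: 'all', label: 'All', icon: 'fa-layer-group' },
    { key: 'user', label: 'User', icon: 'fa-download' },
    { key: 'system', label: 'System', icon: 'fa-shield-alt' },
  ].map(f => ({ ...f, count: items.filter(PREDICATES[f.key]).length }))
}

// Apply the active chip and a case-insensitive name search together.
// Unknown keys fall back to the "all" predicate.
function applyFilter(items, activeKey, query) {
  const q = (query || '').trim().toLowerCase()
  const pred = PREDICATES[activeKey] || PREDICATES.all
  return items.filter(b => pred(b) && (!q || (b.Name || '').toLowerCase().includes(q)))
}
```

The result of `buildFilters(items)` can be passed as `filters`, and the rows to render come from `applyFilter(items, activeFilter, search)`.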
--- a/core/http/react-ui/src/components/NodeDistributionChip.jsx
+++ b/core/http/react-ui/src/components/NodeDistributionChip.jsx
@@ -0,0 +1,168 @@
+import { useRef, useState } from 'react'
+import Popover from './Popover'
+
+// NodeDistributionChip shows where something is installed/loaded across a
+// cluster. Used by both Manage → Backends (per-row Nodes column, data =
+// gallery NodeBackendRef with version/digest) and by the Models tab (data =
+// LoadedOn with state/status). Supports arbitrary cluster size — small
+// clusters render node-name chips inline, larger clusters collapse to a
+// summary chip and reveal the full per-node table in a popover on click.
+//
+// Field names are intentionally forgiving: both {node_name, node_status} and
+// {NodeName, NodeStatus} are supported so the component works whether it's
+// reading directly off the JSON or off a hydrated class.
+//
+// Props:
+//   nodes:             array of node refs (see shape below).
+//   compactThreshold:  max nodes to render inline before collapsing (default 3).
+//   context:           'backends' (default) shows version/digest; 'models'
+//                      shows state.
+//   emptyLabel:        what to render when nodes is empty (default "—").
+export default function NodeDistributionChip({
+  nodes,
+  compactThreshold = 3,
+  context = 'backends',
+  emptyLabel = '—',
+}) {
+  const triggerRef = useRef(null)
+  const [open, setOpen] = useState(false)
+
+  const list = Array.isArray(nodes) ? nodes : []
+  if (list.length === 0) {
+    return <span className="cell-muted">{emptyLabel}</span>
+  }
+
+  const getName = n => n.node_name ?? n.NodeName ?? ''
+  const getStatus = n => n.node_status ?? n.NodeStatus ?? ''
+  const getState = n => n.state ?? n.State ?? ''
+  const getVersion = n => n.version ?? n.Version ?? ''
+  const getDigest = n => n.digest ?? n.Digest ?? ''
+
+  // Inline mode: render every node as its own chip. Good for small clusters
+  // where seeing the names directly is more useful than a summary.
+  if (list.length <= compactThreshold) {
+    return (
+      <div className="badge-row">
+        {list.map(n => {
+          const status = getStatus(n)
+          const variant = status === 'healthy' ? 'badge-success'
+            : status === 'draining' ? 'badge-info'
+            : 'badge-warning'
+          const title = context === 'models'
+            ? `${getName(n)} — ${getState(n)} (${status})`
+            : `${getName(n)} — ${status}${getVersion(n) ? ` · v${getVersion(n)}` : ''}`
+          return (
+            <span key={n.node_id ?? n.NodeID ?? getName(n)} className={`badge ${variant}`} title={title}>
+              <i className="fas fa-server" /> {getName(n)}
+            </span>
+          )
+        })}
+      </div>
+    )
+  }
+
+  // Summary mode for anything bigger. Count unhealthy/offline explicitly so
+  // the chip tells an operator at a glance whether to click in. "Drift" for
+  // backends = nodes whose (version, digest) differs from the majority.
+  const total = list.length
+  const offline = list.filter(n => {
+    const s = getStatus(n)
+    return s !== 'healthy' && s !== 'draining'
+  }).length
+  const drift = context === 'backends' ? countDrift(list) : 0
+  const severity = offline > 0 || drift > 0 ? 'badge-warning' : 'badge-info'
+
+  return (
+    <>
+      <button
+        ref={triggerRef}
+        type="button"
+        className={`badge ${severity} chip-trigger`}
+        aria-expanded={open}
+        aria-haspopup="dialog"
+        onClick={e => { e.stopPropagation(); setOpen(v => !v) }}
+      >
+        <i className="fas fa-server" />
+        {' '}on {total} node{total === 1 ? '' : 's'}
+        {offline > 0 ? ` · ${offline} offline` : ''}
+        {drift > 0 ? ` · ${drift} drift` : ''}
+      </button>
+      <Popover
+        anchor={triggerRef}
+        open={open}
+        onClose={() => setOpen(false)}
+        ariaLabel={context === 'models' ? 'Model distribution' : 'Backend distribution'}
+      >
+        <div className="popover__header">
+          <strong>{context === 'models' ? 'Loaded' : 'Installed'} on {total} node{total === 1 ? '' : 's'}</strong>
+          {offline > 0 && <span className="badge badge-warning">{offline} offline</span>}
+          {drift > 0 && <span className="badge badge-warning">{drift} drift</span>}
+        </div>
+        <div className="popover__scroll">
+          <table className="table popover__table">
+            <thead>
+              <tr>
+                <th>Node</th>
+                <th>Status</th>
+                {context === 'models' ? <th>State</th> : <>
+                  <th>Version</th>
+                  <th>Digest</th>
+                </>}
+              </tr>
+            </thead>
+            <tbody>
+              {list.map(n => (
+                <tr key={n.node_id ?? n.NodeID ?? getName(n)}>
+                  <td className="cell-mono">{getName(n)}</td>
+                  <td>
+                    <span className={`badge ${getStatus(n) === 'healthy' ? 'badge-success' : 'badge-warning'}`}>
+                      {getStatus(n)}
+                    </span>
+                  </td>
+                  {context === 'models' ? (
+                    <td className="cell-mono">{getState(n) || '—'}</td>
+                  ) : (
+                    <>
+                      <td className="cell-mono">{getVersion(n) ? `v${getVersion(n)}` : '—'}</td>
+                      <td className="cell-mono cell-truncate" title={getDigest(n)}>
+                        {getDigest(n) ? shortenDigest(getDigest(n)) : '—'}
+                      </td>
+                    </>
+                  )}
+                </tr>
+              ))}
+            </tbody>
+          </table>
+        </div>
+      </Popover>
+    </>
+  )
+}
+
+// countDrift counts nodes whose (version, digest) disagrees with the cluster
+// majority. Mirrors the backend summarizeNodeDrift logic so the UI number
+// matches what CheckUpgradesAgainst emits in UpgradeInfo.NodeDrift.
+function countDrift(nodes) {
+  if (nodes.length <= 1) return 0
+  const counts = new Map()
+  for (const n of nodes) {
+    const key = `${n.version ?? n.Version ?? ''}|${n.digest ?? n.Digest ?? ''}`
+    counts.set(key, (counts.get(key) || 0) + 1)
+  }
+  if (counts.size === 1) return 0 // unanimous
+  // Everything outside the largest (version, digest) group counts as drift.
+  let topCount = 0
+  for (const v of counts.values()) {
+    if (v > topCount) topCount = v
+  }
+  return nodes.length - topCount
+}
+
+// shortenDigest trims a full OCI digest to the common 12-char form used in
+// docker/oci tooling. Falls back to the raw value if it doesn't match.
+function shortenDigest(digest) {
+  const m = /^(sha\d+:)?([a-f0-9]+)$/i.exec(digest)
+  if (!m) return digest
+  const hex = m[2]
+  return (m[1] ?? '') + hex.slice(0, 12)
+}
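`countDrift` and `shortenDigest` are module-private, so the sketch below restates their logic in plain JavaScript to show what they return for sample input. The node refs and digests are invented for illustration. In a tie, whichever group is seen first is treated as the majority.

```javascript
// Standalone restatement of the two helpers, for illustration only.
function countDrift(nodes) {
  if (nodes.length <= 1) return 0
  const counts = new Map()
  for (const n of nodes) {
    // Accept both snake_case JSON fields and PascalCase hydrated fields.
    const key = `${n.version ?? n.Version ?? ''}|${n.digest ?? n.Digest ?? ''}`
    counts.set(key, (counts.get(key) || 0) + 1)
  }
  // Nodes outside the largest (version, digest) group are drifted.
  return nodes.length - Math.max(...counts.values())
}

function shortenDigest(digest) {
  const m = /^(sha\d+:)?([a-f0-9]+)$/i.exec(digest)
  if (!m) return digest // not a hex digest: return it unchanged
  return (m[1] ?? '') + m[2].slice(0, 12)
}

// Invented sample data: three nodes agree, one lags behind.
const majorityDigest = 'sha256:' + 'a'.repeat(64)
const sampleNodes = [
  { version: '1.2.0', digest: majorityDigest },
  { version: '1.2.0', digest: majorityDigest },
  { Version: '1.2.0', Digest: majorityDigest }, // PascalCase variant
  { version: '1.1.0', digest: 'sha256:' + 'b'.repeat(64) },
]
console.log(countDrift(sampleNodes))       // 1
console.log(shortenDigest(majorityDigest)) // sha256:aaaaaaaaaaaa
```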
--- a/core/http/react-ui/src/components/Popover.jsx
+++ b/core/http/react-ui/src/components/Popover.jsx
@@ -0,0 +1,86 @@
+import { useEffect, useRef, useState, useCallback } from 'react'
+
+// Minimal popover: positions itself below the trigger, left-aligned with it,
+// flips above when there isn't room below, closes on outside click or Escape,
+// returns focus to the trigger. Uses the existing .card surface so it picks
+// up theme/border/shadow automatically — no new theming work.
+//
+// Props:
+//   anchor:    ref to the trigger DOMElement (required)
+//   open:      boolean
+//   onClose:   () => void
+//   children:  popover body
+//   ariaLabel: accessible label for the dialog
+export default function Popover({ anchor, open, onClose, children, ariaLabel }) {
+  const popoverRef = useRef(null)
+  const [pos, setPos] = useState({ top: 0, left: 0, flipped: false })
+
+  // Compute position from the anchor's bounding box whenever we open or the
+  // viewport changes. 320px is the width reserved when clamping to the right
+  // edge; bigger content grows naturally.
+  const reposition = useCallback(() => {
+    if (!anchor?.current) return
+    const rect = anchor.current.getBoundingClientRect()
+    const popoverHeight = popoverRef.current?.offsetHeight ?? 0
+    const spaceBelow = window.innerHeight - rect.bottom
+    const flipped = popoverHeight > spaceBelow - 16 && rect.top > popoverHeight
+    const top = flipped ? rect.top - popoverHeight - 8 : rect.bottom + 8
+    // Prefer left-aligned; clamp so we don't go off-screen right.
+    const left = Math.min(rect.left, window.innerWidth - 320)
+    setPos({ top, left: Math.max(8, left), flipped })
+  }, [anchor])
+
+  useEffect(() => {
+    if (!open) return
+    reposition()
+    window.addEventListener('resize', reposition)
+    window.addEventListener('scroll', reposition, true)
+    return () => {
+      window.removeEventListener('resize', reposition)
+      window.removeEventListener('scroll', reposition, true)
+    }
+  }, [open, reposition])
+
+  // Close on outside click or Escape. Mousedown (not click) so the close
+  // happens before a parent handler could re-trigger us.
+  useEffect(() => {
+    if (!open) return
+    const onMouseDown = (e) => {
+      if (popoverRef.current && !popoverRef.current.contains(e.target) && !anchor?.current?.contains(e.target)) {
+        onClose()
+      }
+    }
+    const onKey = (e) => { if (e.key === 'Escape') onClose() }
+    document.addEventListener('mousedown', onMouseDown)
+    document.addEventListener('keydown', onKey)
+    return () => {
+      document.removeEventListener('mousedown', onMouseDown)
+      document.removeEventListener('keydown', onKey)
+    }
+  }, [open, onClose, anchor])
+
+  // Return focus to the trigger once the popover closes (never on first
+  // mount); the rAF lets the close paint before focus jumps.
+  const wasOpen = useRef(false)
+  useEffect(() => {
+    if (open) { wasOpen.current = true; return }
+    if (!wasOpen.current || !anchor?.current) return
+    wasOpen.current = false
+    const raf = requestAnimationFrame(() => anchor.current?.focus?.())
+    return () => cancelAnimationFrame(raf)
+  }, [open, anchor])
+
+  if (!open) return null
+
+  return (
+    <div
+      ref={popoverRef}
+      role="dialog"
+      aria-label={ariaLabel}
+      className="popover card"
+      style={{ top: pos.top, left: pos.left }}
+    >
+      {children}
+    </div>
+  )
+}
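The placement rule in `reposition` is easier to check as a pure function. The sketch below pulls the same arithmetic out of the component so it can run without a DOM. `computePopoverPosition` and its `viewport` argument are hypothetical names; the 8px gap, 16px margin, and 320px reserved width are the constants used above.

```javascript
// Hypothetical pure version of Popover's reposition() logic.
// rect:     { top, bottom, left } of the trigger, in viewport coordinates
// viewport: { width, height }
function computePopoverPosition(rect, popoverHeight, viewport) {
  const spaceBelow = viewport.height - rect.bottom
  // Flip above only when it does not fit below AND there is room above.
  const flipped = popoverHeight > spaceBelow - 16 && rect.top > popoverHeight
  const top = flipped ? rect.top - popoverHeight - 8 : rect.bottom + 8
  // Left-align with the trigger, clamped to stay inside the viewport.
  const left = Math.max(8, Math.min(rect.left, viewport.width - 320))
  return { top, left, flipped }
}

// Trigger near the top-left: opens below, left-aligned.
console.log(computePopoverPosition({ top: 100, bottom: 120, left: 50 }, 200, { width: 1000, height: 800 }))
// { top: 128, left: 50, flipped: false }

// Trigger near the bottom-right: flips above and is clamped horizontally.
console.log(computePopoverPosition({ top: 700, bottom: 720, left: 900 }, 200, { width: 1000, height: 800 }))
// { top: 492, left: 680, flipped: true }
```

When the popover fits neither below nor above, the function keeps the below placement instead of flipping.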
--- a/core/http/react-ui/src/pages/Manage.jsx
+++ b/core/http/react-ui/src/pages/Manage.jsx
@@ -3,6 +3,8 @@ import { useNavigate, useOutletContext, useSearchParams } from 'react-router-dom
 import ResourceMonitor from '../components/ResourceMonitor'
 import ConfirmDialog from '../components/ConfirmDialog'
 import Toggle from '../components/Toggle'
+import NodeDistributionChip from '../components/NodeDistributionChip'
+import FilterBar from '../components/FilterBar'
 import { useModels } from '../hooks/useModels'
 import { backendControlApi, modelsApi, backendsApi, systemApi, nodesApi } from '../utils/api'

@@ -11,6 +13,22 @@ const TABS = [
  { key: 'backends', label: 'Backends', icon: 'fa-server' },
 ]

+// formatInstalledAt renders an installed_at timestamp as a short relative
+// (or, past 30 days, absolute) string for dense tables. Returns the raw
+// value if parsing fails so we never display "Invalid Date".
+function formatInstalledAt(value) {
+  if (!value) return '—'
+  const d = new Date(value)
+  if (isNaN(d.getTime())) return value
+  const now = Date.now()
+  const diffMin = Math.floor((now - d.getTime()) / 60000)
+  if (diffMin < 1) return 'just now'
+  if (diffMin < 60) return `${diffMin}m ago`
+  if (diffMin < 60 * 24) return `${Math.floor(diffMin / 60)}h ago`
+  if (diffMin < 60 * 24 * 30) return `${Math.floor(diffMin / (60 * 24))}d ago`
+  return d.toISOString().slice(0, 10)
+}
+
 export default function Manage() {
  const { addToast } = useOutletContext()
  const navigate = useNavigate()
@@ -28,6 +46,24 @@ export default function Manage() {
  const [distributedMode, setDistributedMode] = useState(false)
  const [togglingModels, setTogglingModels] = useState(new Set())
  const [pinningModels, setPinningModels] = useState(new Set())
+  // Filter state per tab. Persisted in the URL query so switching tabs
+  // doesn't lose the filter the operator just set.
+  const [modelsSearch, setModelsSearch] = useState(() => searchParams.get('mq') || '')
+  const [modelsFilter, setModelsFilter] = useState(() => searchParams.get('mf') || 'all')
+  const [backendsSearch, setBackendsSearch] = useState(() => searchParams.get('bq') || '')
+  const [backendsFilter, setBackendsFilter] = useState(() => searchParams.get('bf') || 'all')
+
+  // Sync filter state into the URL so deep-links + tab switches survive.
+  useEffect(() => {
+    const p = new URLSearchParams(searchParams)
+    const setOrDelete = (k, v) => { if (v && v !== 'all') p.set(k, v); else p.delete(k) }
+    setOrDelete('mq', modelsSearch)
+    setOrDelete('mf', modelsFilter)
+    setOrDelete('bq', backendsSearch)
+    setOrDelete('bf', backendsFilter)
+    setSearchParams(p, { replace: true })
+    // eslint-disable-next-line react-hooks/exhaustive-deps
+  }, [modelsSearch, modelsFilter, backendsSearch, backendsFilter])

  const handleTabChange = (tab) => {
    setActiveTab(tab)
@@ -64,6 +100,35 @@ export default function Manage() {
    nodesApi.list().then(() => setDistributedMode(true)).catch(() => {})
  }, [fetchLoadedModels, fetchBackends])

+  // Auto-refresh the Models tab every 10s in distributed mode so ghost models
+  // (loaded on a worker but absent from this frontend's in-memory cache)
+  // clear on their own without the user clicking Update.
+  const [lastSyncedAt, setLastSyncedAt] = useState(() => Date.now())
+  const [nowTick, setNowTick] = useState(() => Date.now())
+  useEffect(() => {
+    if (!distributedMode || activeTab !== 'models') return
+    const interval = setInterval(() => {
+      refetchModels()
+      fetchLoadedModels()
+      setLastSyncedAt(Date.now())
+    }, 10000)
+    return () => clearInterval(interval)
+  }, [distributedMode, activeTab, refetchModels, fetchLoadedModels])
+
+  // Drive the "last synced Ns ago" label without over-rendering the table.
+  useEffect(() => {
+    if (!distributedMode) return
+    const interval = setInterval(() => setNowTick(Date.now()), 1000)
+    return () => clearInterval(interval)
+  }, [distributedMode])
+  const lastSyncedAgo = (() => {
+    const s = Math.max(0, Math.floor((nowTick - lastSyncedAt) / 1000))
+    if (s < 5) return 'just now'
+    if (s < 60) return `${s}s ago`
+    const m = Math.floor(s / 60)
+    return `${m}m ago`
+  })()
+
  // Fetch available backend upgrades
  useEffect(() => {
    if (activeTab === 'backends') {
@@ -196,6 +261,29 @@ export default function Manage() {
    }
  }

+  const [upgradingAll, setUpgradingAll] = useState(false)
+  const [showOnlyUpgradable, setShowOnlyUpgradable] = useState(false)
+  const handleUpgradeAll = async () => {
+    const names = Object.keys(upgrades)
+    if (names.length === 0) return
+    setUpgradingAll(true)
+    try {
+      // Serial upgrade — matches the gallery's Upgrade All behavior.
+      // Each backend upgrade is itself a cluster-wide fan-out, so parallel
+      // calls would multiply load on every worker.
+      for (const name of names) {
+        try {
+          await backendsApi.upgrade(name)
+        } catch (err) {
+          addToast(`Upgrade failed for ${name}: ${err.message}`, 'error')
+        }
+      }
+      addToast(`Upgrade started for ${names.length} backend${names.length === 1 ? '' : 's'}`, 'info')
+    } finally {
+      setUpgradingAll(false)
+    }
+  }
+
  const handleDeleteBackend = (name) => {
    setConfirmDialog({
      title: 'Delete Backend',
@@ -227,29 +315,74 @@ export default function Manage() {

      {/* Tabs */}
      <div className="tabs" style={{ marginTop: 'var(--spacing-lg)', marginBottom: 'var(--spacing-md)' }}>
-        {TABS.map(t => (
-          <button
-            key={t.key}
-            className={`tab ${activeTab === t.key ? 'tab-active' : ''}`}
-            onClick={() => handleTabChange(t.key)}
-          >
-            <i className={`fas ${t.icon}`} style={{ marginRight: 6 }} />
-            {t.label}
-            {t.key === 'models' && !modelsLoading && ` (${models.length})`}
-            {t.key === 'backends' && !backendsLoading && ` (${backends.length})`}
-          </button>
-        ))}
+        {TABS.map(t => {
+          const upgradeCount = t.key === 'backends' ? Object.keys(upgrades).length : 0
+          return (
+            <button
+              key={t.key}
+              className={`tab ${activeTab === t.key ? 'tab-active' : ''}`}
+              onClick={() => handleTabChange(t.key)}
+            >
+              <i className={`fas ${t.icon}`} style={{ marginRight: 6 }} />
+              {t.label}
+              {t.key === 'models' && !modelsLoading && ` (${models.length})`}
+              {t.key === 'backends' && !backendsLoading && ` (${backends.length})`}
+              {upgradeCount > 0 && (
+                <span className="tab-pill tab-pill--warning" title={`${upgradeCount} update${upgradeCount === 1 ? '' : 's'} available`}>
+                  <i className="fas fa-arrow-up" /> {upgradeCount}
+                </span>
+              )}
+            </button>
+          )
+        })}
      </div>

      {/* Models Tab */}
-      {activeTab === 'models' && (
+      {activeTab === 'models' && (() => {
+        // Computed filters — done here so the result is available both to
+        // the FilterBar counts and to the table body.
+        const MODEL_FILTERS = [
+          { key: 'all',      label: 'All',      icon: 'fa-layer-group' },
+          { key: 'running',  label: 'Running',  icon: 'fa-circle-play' },
+          { key: 'idle',     label: 'Idle',     icon: 'fa-pause' },
+          { key: 'disabled', label: 'Disabled', icon: 'fa-ban' },
+          { key: 'pinned',   label: 'Pinned',   icon: 'fa-thumbtack' },
+          ...(distributedMode ? [{ key: 'distributed', label: 'Distributed', icon: 'fa-server' }] : []),
+        ]
+        const passesFilter = (m) => {
+          if (modelsFilter === 'running') return !m.disabled && (loadedModelIds.has(m.id) || (m.loaded_on && m.loaded_on.length > 0))
+          if (modelsFilter === 'idle')    return !m.disabled && !loadedModelIds.has(m.id) && !(m.loaded_on && m.loaded_on.length > 0)
+          if (modelsFilter === 'disabled') return !!m.disabled
+          if (modelsFilter === 'pinned')   return !!m.pinned
+          if (modelsFilter === 'distributed') return Array.isArray(m.loaded_on) && m.loaded_on.length > 0
+          return true
+        }
+        const q = modelsSearch.trim().toLowerCase()
+        const passesSearch = (m) => !q || (m.id || '').toLowerCase().includes(q) || (m.backend || '').toLowerCase().includes(q)
+        const visibleModels = models.filter(m => passesFilter(m) && passesSearch(m))
+        return (
      <div>
-        <div style={{ display: 'flex', alignItems: 'center', justifyContent: 'flex-end', marginBottom: 'var(--spacing-md)' }}>
-          <button className="btn btn-secondary btn-sm" onClick={handleReload} disabled={reloading}>
-            <i className={`fas ${reloading ? 'fa-spinner fa-spin' : 'fa-rotate'}`} />
-            {reloading ? 'Updating...' : 'Update'}
-          </button>
-        </div>
+        <FilterBar
+          search={modelsSearch}
+          onSearchChange={setModelsSearch}
+          searchPlaceholder="Search models by name or backend..."
+          filters={MODEL_FILTERS}
+          activeFilter={modelsFilter}
+          onFilterChange={setModelsFilter}
+          rightSlot={(
+            <>
+              {distributedMode && (
+                <span className="cell-muted" title="Auto-refreshes every 10s in distributed mode so ghost models clear promptly">
+                  <i className="fas fa-rotate" /> Last synced {lastSyncedAgo}
+                </span>
+              )}
+              <button className="btn btn-secondary btn-sm" onClick={handleReload} disabled={reloading}>
+                <i className={`fas ${reloading ? 'fa-spinner fa-spin' : 'fa-rotate'}`} />
+                {reloading ? ' Updating...' : ' Update'}
+              </button>
+            </>
+          )}
+        />

        {modelsLoading ? (
          <div className="card" style={{ padding: 'var(--spacing-xl)', textAlign: 'center', color: 'var(--color-text-muted)' }}>
@@ -274,6 +407,12 @@ export default function Manage() {
              </a>
            </div>
          </div>
+        ) : visibleModels.length === 0 ? (
+          <div className="empty-state">
+            <i className="fas fa-filter" />
+            <p>No models match the current filter.</p>
+            <button className="btn btn-ghost btn-sm" onClick={() => { setModelsSearch(''); setModelsFilter('all') }}>Clear filters</button>
+          </div>
        ) : (
          <div className="table-container">
            <table className="table">
@@ -288,7 +427,7 @@ export default function Manage() {
                </tr>
              </thead>
              <tbody>
-                {models.map(model => (
+                {visibleModels.map(model => (
                  <tr key={model.id} style={{ opacity: model.disabled ? 0.55 : 1, transition: 'opacity 0.2s' }}>
                    {/* Enable/Disable toggle */}
                    <td>
@@ -329,21 +468,33 @@ export default function Manage() {
                        </div>
                      </div>
                    </td>
-                    {/* Status */}
+                    {/* Status / Distribution */}
                    <td>
-                      {model.disabled ? (
-                        <span className="badge" style={{ background: 'var(--color-bg-tertiary)', color: 'var(--color-text-muted)' }}>
-                          <i className="fas fa-ban" style={{ fontSize: '6px' }} /> Disabled
-                        </span>
-                      ) : loadedModelIds.has(model.id) ? (
-                        <span className="badge badge-success">
-                          <i className="fas fa-circle" style={{ fontSize: '6px' }} /> Running
-                        </span>
-                      ) : (
-                        <span className="badge" style={{ background: 'var(--color-bg-tertiary)', color: 'var(--color-text-muted)' }}>
-                          <i className="fas fa-circle" style={{ fontSize: '6px' }} /> Idle
-                        </span>
-                      )}
+                      <div className="cell-stack">
+                        {model.disabled ? (
+                          <span className="badge" style={{ background: 'var(--color-bg-tertiary)', color: 'var(--color-text-muted)' }}>
+                            <i className="fas fa-ban" /> Disabled
+                          </span>
+                        ) : model.loaded_on && model.loaded_on.length > 0 ? (
+                          // Distributed mode: surface where the model is
+                          // actually loaded. Shared chip scales to any cluster
+                          // size (inline for <=3, popover for larger).
+                          <NodeDistributionChip nodes={model.loaded_on} context="models" />
+                        ) : loadedModelIds.has(model.id) ? (
+                          <span className="badge badge-success">
+                            <i className="fas fa-circle" style={{ fontSize: '6px' }} /> Running
+                          </span>
+                        ) : (
+                          <span className="badge" style={{ background: 'var(--color-bg-tertiary)', color: 'var(--color-text-muted)' }}>
+                            <i className="fas fa-circle" style={{ fontSize: '6px' }} /> Idle
+                          </span>
+                        )}
+                        {model.source === 'registry-only' && (
+                          <span className="badge badge-warning" title="Discovered on a worker but not configured locally. Persist the config to make it permanent.">
+                            <i className="fas fa-ghost" /> Adopted
+                          </span>
+                        )}
+                      </div>
                    </td>
                    {/* Backend */}
                    <td>
@@ -394,11 +545,34 @@ export default function Manage() {
          </div>
        )}
      </div>
-      )}
+        )
+      })()}

      {/* Backends Tab */}
      {activeTab === 'backends' && (
      <div>
+        {/* Upgrade banner — mirrors the gallery so operators can't miss updates */}
+        {!backendsLoading && Object.keys(upgrades).length > 0 && (
+          <div className="upgrade-banner">
+            <div className="upgrade-banner__text">
+              <i className="fas fa-arrow-up" />
+              <span>
+                {Object.keys(upgrades).length} backend{Object.keys(upgrades).length === 1 ? ' has' : 's have'} updates available
+              </span>
+            </div>
+            <div className="upgrade-banner__actions">
+              <button
+                className="btn btn-primary btn-sm"
+                onClick={handleUpgradeAll}
+                disabled={upgradingAll}
+              >
+                <i className={`fas ${upgradingAll ? 'fa-spinner fa-spin' : 'fa-arrow-up'}`} />
+                {upgradingAll ? ' Upgrading...' : ' Upgrade all'}
+              </button>
+            </div>
+          </div>
+        )}
+
        {backendsLoading ? (
          <div style={{ textAlign: 'center', padding: 'var(--spacing-md)', color: 'var(--color-text-muted)', fontSize: '0.875rem' }}>
            Loading backends...
@@ -419,109 +593,217 @@ export default function Manage() {
              </a>
            </div>
          </div>
-        ) : (
-          <div className="table-container">
+        ) : (() => {
+          // Count chip badges: show N in the filter buttons so operators can
+          // see at a glance how their chips bucket the list.
+          const upgradableCount = backends.filter(b => upgrades[b.Name]).length
+          const userCount       = backends.filter(b => !b.IsSystem).length
+          const systemCount     = backends.filter(b => b.IsSystem).length
+          const metaCount       = backends.filter(b => b.IsMeta).length
+          const offlineCount    = backends.filter(b => {
+            const n = b.Nodes || b.nodes || []
+            return n.some(x => {
+              const s = x.node_status || x.NodeStatus
+              return s && s !== 'healthy' && s !== 'draining'
+            })
+          }).length
+
+          const BACKEND_FILTERS = [
+            { key: 'all',        label: 'All',        icon: 'fa-layer-group', count: backends.length },
+            { key: 'user',       label: 'User',       icon: 'fa-download',    count: userCount },
+            { key: 'system',     label: 'System',     icon: 'fa-shield-alt',  count: systemCount },
+            { key: 'meta',       label: 'Meta',       icon: 'fa-layer-group', count: metaCount },
+            ...(upgradableCount > 0 ? [{ key: 'upgradable', label: 'Updates', icon: 'fa-arrow-up', count: upgradableCount }] : []),
+            ...(distributedMode && offlineCount > 0 ? [{ key: 'offline', label: 'Offline nodes', icon: 'fa-exclamation-circle', count: offlineCount }] : []),
+          ]
+          const q = backendsSearch.trim().toLowerCase()
+          const passesSearch = (b) => !q
+            || (b.Name || '').toLowerCase().includes(q)
+            || (b.Metadata?.alias || '').toLowerCase().includes(q)
+            || (b.Metadata?.meta_backend_for || '').toLowerCase().includes(q)
+          const passesFilter = (b) => {
+            switch (backendsFilter) {
+              case 'user':       return !b.IsSystem
+              case 'system':     return !!b.IsSystem
+              case 'meta':       return !!b.IsMeta
+              case 'upgradable': return !!upgrades[b.Name]
+              case 'offline': {
+                const n = b.Nodes || b.nodes || []
+                return n.some(x => {
+                  const s = x.node_status || x.NodeStatus
+                  return s && s !== 'healthy' && s !== 'draining'
+                })
+              }
+              default: return true
+            }
+          }
+          // Legacy "showOnlyUpgradable" toggle is now the 'upgradable' chip —
+          // keep backward-compat by mapping it onto the new filter.
+          if (showOnlyUpgradable && backendsFilter !== 'upgradable') {
+            // One-shot reconciliation — the old state becomes the new chip.
+            setBackendsFilter('upgradable')
+            setShowOnlyUpgradable(false)
+          }
+          const visibleBackends = backends.filter(b => passesFilter(b) && passesSearch(b))
+          if (visibleBackends.length === 0) {
+            return (
+              <>
+                <FilterBar
+                  search={backendsSearch}
+                  onSearchChange={setBackendsSearch}
+                  searchPlaceholder="Search backends by name or alias..."
+                  filters={BACKEND_FILTERS}
+                  activeFilter={backendsFilter}
+                  onFilterChange={setBackendsFilter}
+                />
+                <div className="empty-state">
+                  <i className="fas fa-filter" />
+                  <p>No backends match the current filter.</p>
+                  <button className="btn btn-ghost btn-sm" onClick={() => { setBackendsSearch(''); setBackendsFilter('all') }}>Clear filters</button>
+                </div>
+              </>
+            )
+          }
+          return (
+          <>
+            <FilterBar
+              search={backendsSearch}
+              onSearchChange={setBackendsSearch}
+              searchPlaceholder="Search backends by name or alias..."
+              filters={BACKEND_FILTERS}
+              activeFilter={backendsFilter}
+              onFilterChange={setBackendsFilter}
+            />
+            <div className="table-container">
            <table className="table">
              <thead>
                <tr>
                  <th>Name</th>
                  <th>Type</th>
-                  <th>Metadata</th>
+                  <th>Version</th>
+                  {distributedMode && <th>Nodes</th>}
+                  <th>Installed</th>
                  <th style={{ textAlign: 'right' }}>Actions</th>
                </tr>
              </thead>
              <tbody>
-                {backends.map((backend, i) => (
+                {visibleBackends.map((backend, i) => {
+                  const upgradeInfo = upgrades[backend.Name]
+                  const hasDrift = upgradeInfo?.node_drift?.length > 0
+                  const nodes = backend.Nodes || backend.nodes || []
+                  return (
                  <tr key={backend.Name || i}>
                    <td>
-                      <div style={{ display: 'flex', alignItems: 'center', gap: 'var(--spacing-sm)' }}>
-                        <i className="fas fa-cog" style={{ color: 'var(--color-accent)', fontSize: '0.75rem' }} />
-                        <span style={{ fontWeight: 500 }}>{backend.Name}</span>
+                      <div className="cell-name">
+                        <i className="fas fa-cog" />
+                        <span>{backend.Name}</span>
+                        {backend.Metadata?.alias && (
+                          <span className="cell-subtle">alias: {backend.Metadata.alias}</span>
+                        )}
+                        {backend.Metadata?.meta_backend_for && (
+                          <span className="cell-subtle">for: {backend.Metadata.meta_backend_for}</span>
+                        )}
                      </div>
                    </td>
                    <td>
-                      <div style={{ display: 'flex', gap: '4px', flexWrap: 'wrap' }}>
+                      <div className="badge-row">
                        {backend.IsSystem ? (
-                          <span className="badge badge-info" style={{ fontSize: '0.625rem' }}>
-                            <i className="fas fa-shield-alt" style={{ fontSize: '0.5rem', marginRight: 2 }} />System
+                          <span className="badge badge-info">
+                            <i className="fas fa-shield-alt" /> System
                          </span>
                        ) : (
-                          <span className="badge badge-success" style={{ fontSize: '0.625rem' }}>
-                            <i className="fas fa-download" style={{ fontSize: '0.5rem', marginRight: 2 }} />User
+                          <span className="badge badge-success">
+                            <i className="fas fa-download" /> User
                          </span>
                        )}
                        {backend.IsMeta && (
-                          <span className="badge" style={{ background: 'var(--color-accent-light)', color: 'var(--color-accent)', fontSize: '0.625rem' }}>
-                            <i className="fas fa-layer-group" style={{ fontSize: '0.5rem', marginRight: 2 }} />Meta
+                          <span className="badge badge-accent">
+                            <i className="fas fa-layer-group" /> Meta
                          </span>
                        )}
                      </div>
                    </td>
                    <td>
-                      <div style={{ display: 'flex', flexDirection: 'column', gap: 2, fontSize: '0.75rem', color: 'var(--color-text-secondary)' }}>
-                        {backend.Metadata?.alias && (
-                          <span>
-                            <i className="fas fa-tag" style={{ fontSize: '0.5rem', marginRight: 4 }} />
-                            Alias: <span style={{ color: 'var(--color-text-primary)' }}>{backend.Metadata.alias}</span>
+                      <div className="cell-stack">
+                        {backend.Metadata?.version ? (
+                          <span className="cell-mono">v{backend.Metadata.version}</span>
+                        ) : (
+                          <span className="cell-muted">—</span>
+                        )}
+                        {upgradeInfo && (
+                          <span className="badge badge-warning" title={upgradeInfo.available_version ? `Upgrade to v${upgradeInfo.available_version}` : 'Update available'}>
+                            <i className="fas fa-arrow-up" />
+                            {upgradeInfo.available_version ? ` v${upgradeInfo.available_version}` : ' Update available'}
                          </span>
                        )}
-                        {backend.Metadata?.meta_backend_for && (
-                          <span>
-                            <i className="fas fa-link" style={{ fontSize: '0.5rem', marginRight: 4 }} />
-                            For: <span style={{ color: 'var(--color-accent)' }}>{backend.Metadata.meta_backend_for}</span>
+                        {hasDrift && (
+                          <span
+                            className="badge badge-warning"
+                            title={`Drift: ${upgradeInfo.node_drift.map(d => `${d.node_name}${d.version ? ' v' + d.version : ''}`).join(', ')}`}
+                          >
+                            <i className="fas fa-code-branch" />
+                            {' '}Drift: {upgradeInfo.node_drift.length} node{upgradeInfo.node_drift.length === 1 ? '' : 's'}
                          </span>
                        )}
-                        {backend.Metadata?.version && (
-                          <span>
-                            <i className="fas fa-code-branch" style={{ fontSize: '0.5rem', marginRight: 4 }} />
-                            Version: <span style={{ color: 'var(--color-text-primary)' }}>v{backend.Metadata.version}</span>
-                            {upgrades[backend.Name] && (
-                              <span style={{ color: '#856404', marginLeft: 4 }}>
-                                → v{upgrades[backend.Name].available_version}
-                              </span>
-                            )}
-                          </span>
-                        )}
-                        {backend.Metadata?.installed_at && (
-                          <span>
-                            <i className="fas fa-calendar" style={{ fontSize: '0.5rem', marginRight: 4 }} />
-                            {backend.Metadata.installed_at}
-                          </span>
-                        )}
-                        {!backend.Metadata?.alias && !backend.Metadata?.meta_backend_for && !backend.Metadata?.installed_at && '—'}
                      </div>
                    </td>
+                    {distributedMode && (
+                      <td>
+                        <NodeDistributionChip nodes={nodes} context="backends" />
+                      </td>
+                    )}
                    <td>
-                      <div style={{ display: 'flex', gap: 'var(--spacing-xs)', justifyContent: 'flex-end' }}>
-                        {!backend.IsSystem ? (
+                      <span className="cell-muted cell-mono">
+                        {backend.Metadata?.installed_at ? formatInstalledAt(backend.Metadata.installed_at) : '—'}
+                      </span>
+                    </td>
+                    <td>
+                      <div className="row-actions">
+                        {backend.IsSystem ? (
+                          <span className="badge" title="System backends are managed outside the gallery">
+                            <i className="fas fa-lock" /> Protected
+                          </span>
+                        ) : (
                          <>
+                            {upgradeInfo ? (
+                              <button
+                                className="btn btn-primary btn-sm"
+                                onClick={() => handleUpgradeBackend(backend.Name)}
+                                disabled={reinstallingBackends.has(backend.Name)}
+                              >
+                                <i className={`fas ${reinstallingBackends.has(backend.Name) ? 'fa-spinner fa-spin' : 'fa-arrow-up'}`} />
+                                {' '}Upgrade{upgradeInfo.available_version ? ` to v${upgradeInfo.available_version}` : ''}
+                              </button>
+                            ) : (
+                              <button
+                                className="btn btn-secondary btn-sm"
+                                onClick={() => handleReinstallBackend(backend.Name)}
+                                disabled={reinstallingBackends.has(backend.Name)}
+                              >
+                                <i className={`fas ${reinstallingBackends.has(backend.Name) ? 'fa-spinner fa-spin' : 'fa-rotate'}`} />
+                                {' '}Reinstall
+                              </button>
+                            )}
                            <button
-                              className={`btn ${upgrades[backend.Name] ? 'btn-primary' : 'btn-secondary'} btn-sm`}
-                              onClick={() => upgrades[backend.Name] ? handleUpgradeBackend(backend.Name) : handleReinstallBackend(backend.Name)}
-                              disabled={reinstallingBackends.has(backend.Name)}
-                              title={upgrades[backend.Name] ? `Upgrade to v${upgrades[backend.Name]?.available_version || 'latest'}` : 'Reinstall'}
-                            >
-                              <i className={`fas ${reinstallingBackends.has(backend.Name) ? 'fa-spinner fa-spin' : upgrades[backend.Name] ? 'fa-arrow-up' : 'fa-rotate'}`} />
-                            </button>
-                            <button
-                              className="btn btn-danger btn-sm"
+                              className="btn btn-danger-ghost btn-sm"
                              onClick={() => handleDeleteBackend(backend.Name)}
-                              title="Delete"
+                              title="Delete backend (removes from all nodes)"
                            >
                              <i className="fas fa-trash" />
                            </button>
                          </>
-                        ) : (
-                          <span style={{ fontSize: '0.75rem', color: 'var(--color-text-muted)' }}>—</span>
                        )}
                      </div>
                    </td>
                  </tr>
-                ))}
+                  )
+                })}
              </tbody>
            </table>
-          </div>
-        )}
+            </div>
+          </>
+          )
+        })()}
      </div>
      )}

--- a/core/http/react-ui/src/pages/Nodes.jsx
+++ b/core/http/react-ui/src/pages/Nodes.jsx
@@ -51,15 +51,22 @@ const modelStateConfig = {
  idle: { bg: 'var(--color-bg-tertiary)', color: 'var(--color-text-muted)', border: 'var(--color-border-subtle)' },
 }

-function StatCard({ icon, label, value, color }) {
+function StatCard({ icon, label, value, color, accentVar }) {
+  // accentVar: optional CSS variable for the left edge + icon chip, e.g.
+  // "--color-success". When unset the card reads neutral — used for simple
+  // counts so they don't compete with the semantic cards for attention.
+  const accent = color || (accentVar ? `var(${accentVar})` : 'var(--color-text-primary)')
  return (
-    <div className="card" style={{ padding: 'var(--spacing-sm) var(--spacing-md)', flex: '1 1 0', minWidth: 120 }}>
-      <div style={{ display: 'flex', alignItems: 'center', gap: 6, marginBottom: 2 }}>
-        <i className={icon} style={{ color: 'var(--color-text-muted)', fontSize: '0.75rem' }} />
-        <span style={{ fontSize: '0.6875rem', color: 'var(--color-text-muted)', fontWeight: 500, textTransform: 'uppercase', letterSpacing: '0.03em' }}>{label}</span>
+    <div
+      className="stat-card"
+      style={accentVar ? { ['--stat-accent']: `var(${accentVar})` } : undefined}
+    >
+      <div className="stat-card__body">
+        <div className="stat-card__label">{label}</div>
+        <div className="stat-card__value" style={{ color: accent }}>{value}</div>
      </div>
-      <div style={{ fontSize: '1.375rem', fontWeight: 700, fontFamily: 'JetBrains Mono, monospace', color: color || 'var(--color-text-primary)' }}>
-        {value}
+      <div className="stat-card__icon" style={accentVar ? { color: accent } : undefined}>
+        <i className={icon} />
      </div>
    </div>
  )
@@ -543,45 +550,24 @@ export default function Nodes() {
      </div>

      {/* Tabs */}
-      <div style={{ display: 'flex', gap: 'var(--spacing-xs)', marginBottom: 'var(--spacing-lg)', borderBottom: '2px solid var(--color-border)' }}>
+      <div className="tabs" style={{ marginBottom: 'var(--spacing-lg)' }}>
        <button
          onClick={() => setActiveTab('backend')}
-          style={{
-            padding: 'var(--spacing-sm) var(--spacing-lg)',
-            border: 'none', cursor: 'pointer', fontWeight: 600, fontSize: '0.875rem',
-            background: 'none',
-            color: activeTab === 'backend' ? 'var(--color-primary)' : 'var(--color-text-muted)',
-            borderBottom: activeTab === 'backend' ? '2px solid var(--color-primary)' : '2px solid transparent',
-            marginBottom: '-2px',
-          }}
+          className={`tab ${activeTab === 'backend' ? 'tab-active' : ''}`}
        >
          <i className="fas fa-server" style={{ marginRight: 6 }} />
          Backend Workers ({backendNodes.length})
        </button>
        <button
          onClick={() => setActiveTab('agent')}
-          style={{
-            padding: 'var(--spacing-sm) var(--spacing-lg)',
-            border: 'none', cursor: 'pointer', fontWeight: 600, fontSize: '0.875rem',
-            background: 'none',
-            color: activeTab === 'agent' ? 'var(--color-primary)' : 'var(--color-text-muted)',
-            borderBottom: activeTab === 'agent' ? '2px solid var(--color-primary)' : '2px solid transparent',
-            marginBottom: '-2px',
-          }}
+          className={`tab ${activeTab === 'agent' ? 'tab-active' : ''}`}
        >
          <i className="fas fa-robot" style={{ marginRight: 6 }} />
          Agent Workers ({agentNodes.length})
        </button>
        <button
          onClick={() => setActiveTab('scheduling')}
-          style={{
-            padding: 'var(--spacing-sm) var(--spacing-lg)',
-            border: 'none', cursor: 'pointer', fontWeight: 600, fontSize: '0.875rem',
-            background: 'none',
-            color: activeTab === 'scheduling' ? 'var(--color-primary)' : 'var(--color-text-muted)',
-            borderBottom: activeTab === 'scheduling' ? '2px solid var(--color-primary)' : '2px solid transparent',
-            marginBottom: '-2px',
-          }}
+          className={`tab ${activeTab === 'scheduling' ? 'tab-active' : ''}`}
        >
          <i className="fas fa-calendar-alt" style={{ marginRight: 6 }} />
          Scheduling ({schedulingConfigs.length})
@@ -590,13 +576,17 @@ export default function Nodes() {

      {activeTab !== 'scheduling' && <>
      {/* Stat cards */}
-      <div style={{ display: 'flex', gap: 'var(--spacing-md)', marginBottom: 'var(--spacing-xl)', flexWrap: 'wrap' }}>
-        <StatCard icon={activeTab === 'agent' ? 'fas fa-robot' : 'fas fa-server'} label={`Total ${activeTab === 'agent' ? 'Agent' : 'Backend'} Workers`} value={total} />
-        <StatCard icon="fas fa-check-circle" label="Healthy" value={healthy} color="var(--color-success)" />
-        <StatCard icon="fas fa-exclamation-circle" label="Unhealthy" value={unhealthy} color={unhealthy > 0 ? 'var(--color-error)' : undefined} />
-        <StatCard icon="fas fa-hourglass-half" label="Draining" value={draining} color={draining > 0 ? 'var(--color-warning)' : undefined} />
+      <div className="stat-grid">
+        <StatCard icon={activeTab === 'agent' ? 'fas fa-robot' : 'fas fa-server'}
+          label={`Total ${activeTab === 'agent' ? 'Agent' : 'Backend'} Workers`} value={total} />
+        <StatCard icon="fas fa-check-circle" label="Healthy" value={healthy}
+          accentVar={healthy > 0 ? '--color-success' : undefined} />
+        <StatCard icon="fas fa-exclamation-circle" label="Unhealthy" value={unhealthy}
+          accentVar={unhealthy > 0 ? '--color-error' : undefined} />
+        <StatCard icon="fas fa-hourglass-half" label="Draining" value={draining}
+          accentVar={draining > 0 ? '--color-warning' : undefined} />
        {pending > 0 && (
-          <StatCard icon="fas fa-clock" label="Pending" value={pending} color="var(--color-warning)" />
+          <StatCard icon="fas fa-clock" label="Pending" value={pending} accentVar="--color-warning" />
        )}
        {activeTab === 'backend' && (() => {
          const clusterTotalVRAM = backendNodes.reduce((sum, n) => sum + (n.total_vram || 0), 0)
@@ -614,7 +604,7 @@ export default function Nodes() {
              )}
              <StatCard icon="fas fa-cube" label="Models Loaded" value={totalModelsLoaded} />
              <StatCard icon="fas fa-exchange-alt" label="In-Flight Requests" value={totalInFlight}
-                color={totalInFlight > 0 ? 'var(--color-primary)' : undefined} />
+                accentVar={totalInFlight > 0 ? '--color-primary' : undefined} />
            </>
          )
        })()}
@@ -627,15 +617,11 @@ export default function Nodes() {
        <>
          <button
            onClick={() => setShowTips(t => !t)}
-            style={{
-              background: 'none', border: 'none', cursor: 'pointer',
-              color: 'var(--color-primary)', fontSize: '0.8125rem', fontWeight: 500,
-              display: 'flex', alignItems: 'center', gap: 6,
-              padding: 0, marginBottom: 'var(--spacing-md)',
-            }}
+            className="nodes-add-worker"
+            aria-expanded={showTips}
          >
-            <i className={`fas fa-chevron-${showTips ? 'down' : 'right'}`} style={{ fontSize: '0.625rem' }} />
-            Add more workers
+            <i className={`fas ${showTips ? 'fa-chevron-down' : 'fa-plus'}`} />
+            {showTips ? 'Hide instructions' : 'Register a new worker'}
          </button>
          {showTips && <WorkerHintCard addToast={addToast} activeTab={activeTab} hasWorkers />}
        </>
@@ -685,23 +671,28 @@ export default function Nodes() {
                    >
                      <td>
                        <div style={{ display: 'flex', alignItems: 'center', gap: 'var(--spacing-sm)' }}>
-                          <i className="fas fa-server" style={{ color: 'var(--color-text-muted)', fontSize: '0.875rem' }} />
+                          {canExpand && (
+                            <span className={`row-chevron${isExpanded ? ' is-expanded' : ''}`} aria-hidden="true">
+                              <i className="fas fa-chevron-right" />
+                            </span>
+                          )}
+                          <i className="fas fa-server" style={{ color: 'var(--color-text-muted)', fontSize: 'var(--text-sm)' }} />
                          <div>
-                            <div style={{ fontWeight: 600, fontSize: '0.875rem' }}>{node.name}</div>
-                            <div style={{ fontSize: '0.75rem', fontFamily: "'JetBrains Mono', monospace", color: 'var(--color-text-muted)' }}>
+                            <div style={{ fontWeight: 600, fontSize: 'var(--text-sm)' }}>{node.name}</div>
+                            <div className="cell-mono cell-muted">
                              {node.address}
                            </div>
                            {node.labels && Object.keys(node.labels).length > 0 && (
                              <div style={{ display: 'flex', flexWrap: 'wrap', gap: 3, marginTop: 3 }}>
                                {Object.entries(node.labels).slice(0, 5).map(([k, v]) => (
-                                  <span key={k} style={{
-                                    fontSize: '0.625rem', padding: '1px 5px', borderRadius: 3,
-                                    background: 'var(--color-bg-tertiary)', color: 'var(--color-text-muted)',
-                                    fontFamily: "'JetBrains Mono', monospace", border: '1px solid var(--color-border-subtle)',
+                                  <span key={k} className="cell-mono" style={{
+                                    padding: '1px 5px', borderRadius: 3,
+                                    background: 'var(--color-bg-tertiary)',
+                                    border: '1px solid var(--color-border-subtle)',
                                  }}>{k}={v}</span>
                                ))}
                                {Object.keys(node.labels).length > 5 && (
-                                  <span style={{ fontSize: '0.625rem', color: 'var(--color-text-muted)' }}>
+                                  <span className="cell-muted">
                                    +{Object.keys(node.labels).length - 5} more
                                  </span>
                                )}
@@ -711,12 +702,10 @@ export default function Nodes() {
                        </div>
                      </td>
                      <td>
-                        <div style={{ display: 'flex', alignItems: 'center', gap: 6 }}>
-                          <i className="fas fa-circle" style={{ fontSize: '0.5rem', color: status.color }} />
-                          <span style={{ fontSize: '0.8125rem', color: status.color, fontWeight: 500 }}>
-                            {status.label}
-                          </span>
-                        </div>
+                        <span className="node-status" style={{ color: status.color }}>
+                          <span className="node-status__dot" style={{ background: status.color }} />
+                          {status.label}
+                        </span>
                      </td>
                      <td>
                        {hasGPU && totalVRAMStr ? (
@@ -745,38 +734,37 @@ export default function Nodes() {
                        </span>
                      </td>
                      <td style={{ textAlign: 'right' }}>
-                        <div style={{ display: 'flex', gap: 'var(--spacing-xs)', justifyContent: 'flex-end' }} onClick={e => e.stopPropagation()}>
+                        <div className="row-actions" onClick={e => e.stopPropagation()}>
                          {node.status === 'pending' && (
                            <button
                              className="btn btn-primary btn-sm"
                              onClick={() => handleApprove(node.id)}
-                              title="Approve node"
                            >
-                              <i className="fas fa-check" />
+                              <i className="fas fa-check" /> Approve
                            </button>
                          )}
                          {node.status === 'draining' && (
                            <button
                              className="btn btn-secondary btn-sm"
                              onClick={() => handleResume(node.id)}
-                              title="Resume node"
+                              title="Resume accepting requests"
                            >
-                              <i className="fas fa-play" />
+                              <i className="fas fa-play" /> Resume
                            </button>
                          )}
                          {node.status !== 'draining' && node.status !== 'pending' && (
                            <button
                              className="btn btn-secondary btn-sm"
                              onClick={() => handleDrain(node.id)}
-                              title="Drain node"
+                              title="Stop sending new requests to this node"
                            >
-                              <i className="fas fa-pause" />
+                              <i className="fas fa-pause" /> Drain
                            </button>
                          )}
                          <button
-                            className="btn btn-danger btn-sm"
+                            className="btn btn-danger-ghost btn-sm"
                            onClick={() => setConfirmDelete(node)}
-                            title="Remove node"
+                            title="Remove node from cluster"
                          >
                            <i className="fas fa-trash" />
                          </button>
@@ -794,7 +782,10 @@ export default function Nodes() {
                            {!models ? (
                              <LoadingSpinner size="sm" />
                            ) : models.length === 0 ? (
-                              <p style={{ fontSize: '0.8125rem', color: 'var(--color-text-muted)' }}>No models loaded on this node</p>
+                              <div className="drawer-empty">
+                                <i className="fas fa-cube" />
+                                <span>No models loaded on this node yet.</span>
+                              </div>
                            ) : (
                              <table className="table" style={{ margin: 0 }}>
                                <thead>
@@ -870,7 +861,10 @@ export default function Nodes() {
                            {!backends ? (
                              <LoadingSpinner size="sm" />
                            ) : backends.length === 0 ? (
-                              <p style={{ fontSize: '0.8125rem', color: 'var(--color-text-muted)' }}>No backends installed on this node</p>
+                              <div className="drawer-empty">
+                                <i className="fas fa-cogs" />
+                                <span>No backends installed on this node. Install one from the gallery to schedule models here.</span>
+                              </div>
                            ) : (
                              <table className="table" style={{ margin: 0 }}>
                                <thead>
--- a/core/http/routes/ui_api.go
+++ b/core/http/routes/ui_api.go
@@ -23,7 +23,6 @@ import (
 	"github.com/mudler/LocalAI/core/gallery"
 	"github.com/mudler/LocalAI/core/http/auth"
 	"github.com/mudler/LocalAI/core/http/endpoints/localai"
-	"github.com/mudler/LocalAI/core/http/middleware"
 	"github.com/mudler/LocalAI/core/p2p"
 	"github.com/mudler/LocalAI/core/services/galleryop"
 	"github.com/mudler/LocalAI/pkg/model"
@@ -510,28 +509,89 @@ func RegisterUIAPIRoutes(app *echo.Echo, cl *config.ModelConfigLoader, ml *model
 		modelConfigs := cl.GetAllModelsConfigs()
 		modelsWithoutConfig, _ := galleryop.ListModels(cl, ml, config.NoFilterFn, galleryop.LOOSE_ONLY)

+		type loadedOn struct {
+			NodeID     string `json:"node_id"`
+			NodeName   string `json:"node_name"`
+			State      string `json:"state"`
+			NodeStatus string `json:"node_status"`
+		}
 		type modelCapability struct {
-			ID           string   `json:"id"`
-			Capabilities []string `json:"capabilities"`
-			Backend      string   `json:"backend"`
-			Disabled     bool     `json:"disabled"`
-			Pinned       bool     `json:"pinned"`
+			ID           string     `json:"id"`
+			Capabilities []string   `json:"capabilities"`
+			Backend      string     `json:"backend"`
+			Disabled     bool       `json:"disabled"`
+			Pinned       bool       `json:"pinned"`
+			// LoadedOn is populated only when the node registry is active
+			// (distributed mode). It lets the UI show "loaded on worker-1" without
+			// the operator having to expand every node manually. A nil or empty
+			// slice is omitted from the JSON (omitempty), so "no loaded replicas"
+			// and "not in cluster mode" both reach the frontend as "no distribution info".
+			LoadedOn []loadedOn `json:"loaded_on,omitempty"`
+			// Source="registry-only" marks models adopted from the cluster that
+			// have no local config yet (ghosts that the reconciler discovered).
+			Source string `json:"source,omitempty"`
+		}
+
+		// Join with the node registry when we have one (distributed mode). A
+		// single registry fetch + map join beats per-model queries for the
+		// 100-model case.
+		var loadedByModel map[string][]loadedOn
+		if ds := applicationInstance.Distributed(); ds != nil && ds.Registry != nil {
+			nodeModels, err := ds.Registry.ListAllLoadedModels(c.Request().Context())
+			if err == nil {
+				allNodes, _ := ds.Registry.List(c.Request().Context())
+				nameByID := make(map[string]string, len(allNodes))
+				statusByID := make(map[string]string, len(allNodes))
+				for _, n := range allNodes {
+					nameByID[n.ID] = n.Name
+					statusByID[n.ID] = n.Status
+				}
+				loadedByModel = make(map[string][]loadedOn)
+				for _, nm := range nodeModels {
+					loadedByModel[nm.ModelName] = append(loadedByModel[nm.ModelName], loadedOn{
+						NodeID:     nm.NodeID,
+						NodeName:   nameByID[nm.NodeID],
+						State:      nm.State,
+						NodeStatus: statusByID[nm.NodeID],
+					})
+				}
+			}
 		}

 		result := make([]modelCapability, 0, len(modelConfigs)+len(modelsWithoutConfig))
+		seen := make(map[string]bool, len(modelConfigs)+len(modelsWithoutConfig))
 		for _, cfg := range modelConfigs {
+			seen[cfg.Name] = true
 			result = append(result, modelCapability{
 				ID:           cfg.Name,
 				Capabilities: cfg.KnownUsecaseStrings,
 				Backend:      cfg.Backend,
 				Disabled:     cfg.IsDisabled(),
 				Pinned:       cfg.IsPinned(),
+				LoadedOn:     loadedByModel[cfg.Name],
 			})
 		}
 		for _, name := range modelsWithoutConfig {
+			seen[name] = true
 			result = append(result, modelCapability{
 				ID:           name,
 				Capabilities: []string{},
+				LoadedOn:     loadedByModel[name],
+			})
+		}
+		// Emit entries for cluster models that have no local config — these
+		// are the actual ghosts. Without this the operator would have no way
+		// to see a model the cluster is running if its config file wasn't
+		// synced to this frontend's filesystem.
+		for name, loc := range loadedByModel {
+			if seen[name] {
+				continue
+			}
+			result = append(result, modelCapability{
+				ID:           name,
+				Capabilities: []string{},
+				LoadedOn:     loc,
+				Source:       "registry-only",
 			})
 		}

@@ -1397,24 +1457,5 @@ func RegisterUIAPIRoutes(app *echo.Echo, cl *config.ModelConfigLoader, ml *model
 		app.POST("/api/settings", localai.UpdateSettingsEndpoint(applicationInstance), adminMiddleware)
 	}

-	// Logs API (admin only)
-	app.GET("/api/traces", func(c echo.Context) error {
-		if !appConfig.EnableTracing {
-			return c.JSON(503, map[string]any{
-				"error": "Tracing disabled",
-			})
-		}
-		traces := middleware.GetTraces()
-		return c.JSON(200, map[string]any{
-			"traces": traces,
-		})
-	}, adminMiddleware)
-
-	app.POST("/api/traces/clear", func(c echo.Context) error {
-		middleware.ClearTraces()
-		return c.JSON(200, map[string]any{
-			"message": "Traces cleared",
-		})
-	}, adminMiddleware)
 }

--- a/core/services/advisorylock/keys.go
+++ b/core/services/advisorylock/keys.go
@@ -11,4 +11,5 @@ const (
 	KeyHealthCheck      int64 = 104
 	KeySchemaMigrate        int64 = 105
 	KeyBackendUpgradeCheck  int64 = 106
+	KeyStateReconciler      int64 = 107
 )
--- a/core/services/galleryop/service.go
+++ b/core/services/galleryop/service.go
@@ -57,6 +57,16 @@ func (g *GalleryService) SetBackendManager(b BackendManager) {
 	g.backendManager = b
 }

+// BackendManager returns the current backend manager. Callers like the
+// periodic upgrade checker need this so they run CheckUpgrades through the
+// distributed implementation (which asks workers) instead of the frontend's
+// local filesystem — the latter is always empty in distributed deployments.
+func (g *GalleryService) BackendManager() BackendManager {
+	g.Lock()
+	defer g.Unlock()
+	return g.backendManager
+}
+
 // SetNATSClient sets the NATS client for distributed progress publishing.
 func (g *GalleryService) SetNATSClient(nc messaging.Publisher) {
 	g.Lock()
--- a/core/services/messaging/subjects.go
+++ b/core/services/messaging/subjects.go
@@ -124,8 +124,13 @@ func SubjectNodeBackendInstall(nodeID string) string {
 // BackendInstallRequest is the payload for a backend.install NATS request.
 type BackendInstallRequest struct {
 	Backend          string `json:"backend"`
-	ModelID          string `json:"model_id,omitempty"` // unique model identifier — each model gets its own gRPC process
+	ModelID          string `json:"model_id,omitempty"`
 	BackendGalleries string `json:"backend_galleries,omitempty"`
+	// URI is set for external installs (OCI image, URL, or path). When non-empty
+	// the worker routes to InstallExternalBackend instead of the gallery lookup.
+	URI   string `json:"uri,omitempty"`
+	Name  string `json:"name,omitempty"`
+	Alias string `json:"alias,omitempty"`
 }

 // BackendInstallReply is the response from a backend.install NATS request.
@@ -157,6 +162,12 @@ type NodeBackendInfo struct {
 	IsMeta      bool   `json:"is_meta"`
 	InstalledAt string `json:"installed_at,omitempty"`
 	GalleryURL  string `json:"gallery_url,omitempty"`
+	// Version, URI and Digest enable cluster-wide upgrade detection —
+	// without them, the frontend cannot tell whether the installed OCI
+	// image matches the gallery entry, and upgrades silently never surface.
+	Version string `json:"version,omitempty"`
+	URI     string `json:"uri,omitempty"`
+	Digest  string `json:"digest,omitempty"`
 }

 // SubjectNodeBackendStop tells a worker node to stop its gRPC backend process.
--- a/core/services/nodes/managers_distributed.go
+++ b/core/services/nodes/managers_distributed.go
@@ -10,6 +10,7 @@ import (
 	"github.com/mudler/LocalAI/core/gallery"
 	"github.com/mudler/LocalAI/core/services/galleryop"
 	"github.com/mudler/LocalAI/pkg/model"
+	"github.com/mudler/LocalAI/pkg/system"
 	"github.com/mudler/xlog"
 	"github.com/nats-io/nats.go"
 )
@@ -53,6 +54,7 @@ type DistributedBackendManager struct {
 	adapter          *RemoteUnloaderAdapter
 	registry         *NodeRegistry
 	backendGalleries []config.Gallery
+	systemState      *system.SystemState
 }

 // NewDistributedBackendManager creates a DistributedBackendManager.
@@ -62,46 +64,168 @@ func NewDistributedBackendManager(appConfig *config.ApplicationConfig, ml *model
 		adapter:          adapter,
 		registry:         registry,
 		backendGalleries: appConfig.BackendGalleries,
+		systemState:      appConfig.SystemState,
 	}
 }

+// NodeOpStatus is the per-node outcome of a backend lifecycle operation.
+// Returned as part of BackendOpResult so the frontend can surface exactly
+// what happened on each worker instead of a single joined error string.
+type NodeOpStatus struct {
+	NodeID   string `json:"node_id"`
+	NodeName string `json:"node_name"`
+	Status   string `json:"status"` // "success" | "queued" | "error"
+	Error    string `json:"error,omitempty"`
+}
+
+// BackendOpResult aggregates per-node outcomes.
+type BackendOpResult struct {
+	Nodes []NodeOpStatus `json:"nodes"`
+}
+
+// enqueueAndDrainBackendOp is the shared scaffolding for
+// delete/install/upgrade. Every approved backend-type node gets a
+// pending_backend_ops row, so intent is durable even if the node is offline.
+// Healthy nodes get an immediate attempt; success deletes the row, failure
+// records the error and leaves the row for the reconciler to retry.
+//
+// `apply` is the NATS round-trip for one node. Returning an error keeps the
+// row in the queue and marks the per-node status as "error"; returning nil
+// deletes the row and reports "success". For non-healthy nodes the status
+// is "queued" — no attempt is made right now, reconciler will pick it up
+// when the node returns.
+func (d *DistributedBackendManager) enqueueAndDrainBackendOp(ctx context.Context, op, backend string, galleriesJSON []byte, apply func(node BackendNode) error) (BackendOpResult, error) {
+	allNodes, err := d.registry.List(ctx)
+	if err != nil {
+		return BackendOpResult{}, err
+	}
+
+	result := BackendOpResult{Nodes: make([]NodeOpStatus, 0, len(allNodes))}
+	for _, node := range allNodes {
+		// Pending nodes haven't been approved yet — no intent to apply.
+		if node.Status == StatusPending {
+			continue
+		}
+		// Backend lifecycle ops only make sense on backend-type workers.
+		// Agent workers don't subscribe to backend.install/delete/list, so
+		// enqueueing for them guarantees a forever-retrying row that the
+		// reconciler can never drain. Silently skip — they aren't consumers.
+		if node.NodeType != "" && node.NodeType != NodeTypeBackend {
+			continue
+		}
+		if err := d.registry.UpsertPendingBackendOp(ctx, node.ID, backend, op, galleriesJSON); err != nil {
+			xlog.Warn("Failed to enqueue backend op", "op", op, "node", node.Name, "backend", backend, "error", err)
+			result.Nodes = append(result.Nodes, NodeOpStatus{
+				NodeID: node.ID, NodeName: node.Name, Status: "error",
+				Error: fmt.Sprintf("enqueue failed: %v", err),
+			})
+			continue
+		}
+
+		if node.Status != StatusHealthy {
+			// Intent is recorded; reconciler will retry when the node recovers.
+			result.Nodes = append(result.Nodes, NodeOpStatus{
+				NodeID: node.ID, NodeName: node.Name, Status: "queued",
+				Error: fmt.Sprintf("node %s, will retry when healthy", node.Status),
+			})
+			continue
+		}
+
+		applyErr := apply(node)
+		if applyErr == nil {
+			// Delete the row we just upserted by its composite key, since
+			// UpsertPendingBackendOp doesn't return the row ID.
+			if err := d.deletePendingRow(ctx, node.ID, backend, op); err != nil {
+				xlog.Debug("Failed to clear pending backend op after success", "error", err)
+			}
+			result.Nodes = append(result.Nodes, NodeOpStatus{
+				NodeID: node.ID, NodeName: node.Name, Status: "success",
+			})
+			continue
+		}
+
+		// Record failure for backoff. If it's an ErrNoResponders, the node's
+		// gone AWOL — mark unhealthy so the router stops picking it too.
+		errMsg := applyErr.Error()
+		if errors.Is(applyErr, nats.ErrNoResponders) {
+			xlog.Warn("No NATS responders for node, marking unhealthy", "node", node.Name, "nodeID", node.ID)
+			d.registry.MarkUnhealthy(ctx, node.ID)
+		}
+		if id, err := d.findPendingRow(ctx, node.ID, backend, op); err == nil {
+			_ = d.registry.RecordPendingBackendOpFailure(ctx, id, errMsg)
+		}
+		result.Nodes = append(result.Nodes, NodeOpStatus{
+			NodeID: node.ID, NodeName: node.Name, Status: "error", Error: errMsg,
+		})
+	}
+	return result, nil
+}
+
+// findPendingRow looks up the ID of a pending_backend_ops row by its
+// composite key. Used to hand off to RecordPendingBackendOpFailure /
+// DeletePendingBackendOp after UpsertPendingBackendOp upserts by the same
+// composite key.
+func (d *DistributedBackendManager) findPendingRow(ctx context.Context, nodeID, backend, op string) (uint, error) {
+	var row PendingBackendOp
+	if err := d.registry.db.WithContext(ctx).
+		Where("node_id = ? AND backend = ? AND op = ?", nodeID, backend, op).
+		First(&row).Error; err != nil {
+		return 0, err
+	}
+	return row.ID, nil
+}
+
+// deletePendingRow removes the queue row keyed by (nodeID, backend, op).
+func (d *DistributedBackendManager) deletePendingRow(ctx context.Context, nodeID, backend, op string) error {
+	return d.registry.db.WithContext(ctx).
+		Where("node_id = ? AND backend = ? AND op = ?", nodeID, backend, op).
+		Delete(&PendingBackendOp{}).Error
+}
+
+// DeleteBackend fans out backend deletion to every known node. The previous
+// implementation silently skipped non-healthy nodes, which meant zombies
+// reappeared once those nodes returned. Now the intent is durable — see
+// enqueueAndDrainBackendOp — and the reconciler catches up later.
 func (d *DistributedBackendManager) DeleteBackend(name string) error {
-	// Try local deletion but ignore "not found" errors — in distributed mode
-	// the frontend node typically doesn't have backends installed locally;
-	// they only exist on worker nodes.
+	// Local delete first (frontend rarely has backends installed in
+	// distributed mode, but the gallery operation still expects it; ignore
+	// "not found" which is the common case).
 	if err := d.local.DeleteBackend(name); err != nil {
 		if !errors.Is(err, gallery.ErrBackendNotFound) {
 			return err
 		}
 		xlog.Debug("Backend not found locally, will attempt deletion on workers", "backend", name)
 	}
-	// Fan out backend.delete to all healthy nodes
-	allNodes, listErr := d.registry.List(context.Background())
-	if listErr != nil {
-		xlog.Warn("Failed to list nodes for backend deletion fan-out", "error", listErr)
-		return listErr
-	}
-	var errs []error
-	for _, node := range allNodes {
-		if node.Status != StatusHealthy {
-			continue
-		}
-		if _, delErr := d.adapter.DeleteBackend(node.ID, name); delErr != nil {
-			if errors.Is(delErr, nats.ErrNoResponders) {
-				// Node's NATS subscription is gone — likely restarted with a new ID.
-				// Mark it unhealthy so future fan-outs skip it.
-				xlog.Warn("No NATS responders for node, marking unhealthy", "node", node.Name, "nodeID", node.ID)
-				d.registry.MarkUnhealthy(context.Background(), node.ID)
-				continue
-			}
-			xlog.Warn("Failed to propagate backend deletion to worker", "node", node.Name, "backend", name, "error", delErr)
-			errs = append(errs, fmt.Errorf("node %s: %w", node.Name, delErr))
-		}
-	}
-	return errors.Join(errs...)
+
+	ctx := context.Background()
+	_, err := d.enqueueAndDrainBackendOp(ctx, OpBackendDelete, name, nil, func(node BackendNode) error {
+		_, err := d.adapter.DeleteBackend(node.ID, name)
+		return err
+	})
+	return err
 }

-// ListBackends aggregates installed backends from all healthy worker nodes.
+// DeleteBackendDetailed is the per-node-result variant called by the HTTP
+// handler so the UI can render a per-node status drawer. DeleteBackend still
+// returns error-only for callers that don't care about node breakdown.
+func (d *DistributedBackendManager) DeleteBackendDetailed(ctx context.Context, name string) (BackendOpResult, error) {
+	if err := d.local.DeleteBackend(name); err != nil && !errors.Is(err, gallery.ErrBackendNotFound) {
+		return BackendOpResult{}, err
+	}
+	return d.enqueueAndDrainBackendOp(ctx, OpBackendDelete, name, nil, func(node BackendNode) error {
+		_, err := d.adapter.DeleteBackend(node.ID, name)
+		return err
+	})
+}
+
+// ListBackends aggregates installed backends from all worker nodes, preserving
+// per-node attribution. Each SystemBackend.Nodes entry records which node has
+// the backend and the version/digest it reports. The top-level Metadata is
+// populated from the first node seen so single-node-minded callers still work.
+//
+// Pending/offline/draining nodes are skipped because they aren't expected to
+// answer NATS requests; unhealthy nodes are still queried — ErrNoResponders
+// then marks them unhealthy and the loop continues.
 func (d *DistributedBackendManager) ListBackends() (gallery.SystemBackends, error) {
 	result := make(gallery.SystemBackends)
 	allNodes, err := d.registry.List(context.Background())
@@ -110,7 +234,7 @@ func (d *DistributedBackendManager) ListBackends() (gallery.SystemBackends, erro
 	}

 	for _, node := range allNodes {
-		if node.Status != StatusHealthy {
+		if node.Status == StatusPending || node.Status == StatusOffline || node.Status == StatusDraining {
 			continue
 		}
 		reply, err := d.adapter.ListBackends(node.ID)
@@ -128,89 +252,92 @@ func (d *DistributedBackendManager) ListBackends() (gallery.SystemBackends, erro
 			continue
 		}
 		for _, b := range reply.Backends {
-			if _, exists := result[b.Name]; !exists {
-				result[b.Name] = gallery.SystemBackend{
+			ref := gallery.NodeBackendRef{
+				NodeID:      node.ID,
+				NodeName:    node.Name,
+				NodeStatus:  node.Status,
+				Version:     b.Version,
+				Digest:      b.Digest,
+				URI:         b.URI,
+				InstalledAt: b.InstalledAt,
+			}
+			entry, exists := result[b.Name]
+			if !exists {
+				entry = gallery.SystemBackend{
 					Name:     b.Name,
 					IsSystem: b.IsSystem,
 					IsMeta:   b.IsMeta,
 					Metadata: &gallery.BackendMetadata{
+						Name:        b.Name,
 						InstalledAt: b.InstalledAt,
 						GalleryURL:  b.GalleryURL,
+						Version:     b.Version,
+						URI:         b.URI,
+						Digest:      b.Digest,
 					},
 				}
 			}
+			entry.Nodes = append(entry.Nodes, ref)
+			result[b.Name] = entry
 		}
 	}
 	return result, nil
 }

-// InstallBackend fans out backend installation to all healthy worker nodes.
+// InstallBackend fans out installation through the pending-ops queue so
+// non-healthy nodes get retried when they come back instead of being silently
+// skipped. Reply success from the NATS round-trip deletes the queue row;
+// reply.Success==false is treated as an error so the row stays for retry.
 func (d *DistributedBackendManager) InstallBackend(ctx context.Context, op *galleryop.ManagementOp[gallery.GalleryBackend, any], progressCb galleryop.ProgressCallback) error {
-	allNodes, err := d.registry.List(context.Background())
-	if err != nil {
-		return err
-	}
-
 	galleriesJSON, _ := json.Marshal(op.Galleries)
 	backendName := op.GalleryElementName

-	for _, node := range allNodes {
-		if node.Status != StatusHealthy {
-			continue
-		}
-		reply, err := d.adapter.InstallBackend(node.ID, backendName, "", string(galleriesJSON))
+	_, err := d.enqueueAndDrainBackendOp(ctx, OpBackendInstall, backendName, galleriesJSON, func(node BackendNode) error {
+		reply, err := d.adapter.InstallBackend(node.ID, backendName, "", string(galleriesJSON), op.ExternalURI, op.ExternalName, op.ExternalAlias)
 		if err != nil {
-			if errors.Is(err, nats.ErrNoResponders) {
-				xlog.Warn("No NATS responders for node, marking unhealthy", "node", node.Name, "nodeID", node.ID)
-				d.registry.MarkUnhealthy(context.Background(), node.ID)
-				continue
-			}
-			xlog.Warn("Failed to install backend on worker", "node", node.Name, "backend", backendName, "error", err)
-			continue
+			return err
 		}
 		if !reply.Success {
-			xlog.Warn("Backend install failed on worker", "node", node.Name, "backend", backendName, "error", reply.Error)
+			return fmt.Errorf("install failed: %s", reply.Error)
 		}
-	}
-	return nil
+		return nil
+	})
+	return err
 }

-// UpgradeBackend fans out a backend upgrade to all healthy worker nodes.
-// TODO: Add dedicated NATS subject for upgrade (currently reuses install with force flag)
+// UpgradeBackend reuses the install NATS subject (the worker re-downloads
+// from the gallery). Same queue semantics as Install/Delete.
 func (d *DistributedBackendManager) UpgradeBackend(ctx context.Context, name string, progressCb galleryop.ProgressCallback) error {
-	allNodes, err := d.registry.List(context.Background())
-	if err != nil {
-		return err
-	}
-
 	galleriesJSON, _ := json.Marshal(d.backendGalleries)
-	var errs []error

-	for _, node := range allNodes {
-		if node.Status != StatusHealthy {
-			continue
-		}
-		// Reuse install endpoint which will re-download the backend (force mode)
-		reply, err := d.adapter.InstallBackend(node.ID, name, "", string(galleriesJSON))
+	_, err := d.enqueueAndDrainBackendOp(ctx, OpBackendUpgrade, name, galleriesJSON, func(node BackendNode) error {
+		reply, err := d.adapter.InstallBackend(node.ID, name, "", string(galleriesJSON), "", "", "")
 		if err != nil {
-			if errors.Is(err, nats.ErrNoResponders) {
-				xlog.Warn("No NATS responders for node during upgrade, marking unhealthy", "node", node.Name, "nodeID", node.ID)
-				d.registry.MarkUnhealthy(context.Background(), node.ID)
-				continue
-			}
-			errs = append(errs, fmt.Errorf("node %s: %w", node.Name, err))
-			continue
+			return err
 		}
 		if !reply.Success {
-			errs = append(errs, fmt.Errorf("node %s: %s", node.Name, reply.Error))
+			return fmt.Errorf("upgrade failed: %s", reply.Error)
 		}
-	}
-
-	return errors.Join(errs...)
+		return nil
+	})
+	return err
 }

-// CheckUpgrades checks for available backend upgrades.
-// Gallery comparison is global (not per-node), so we delegate to the local manager.
+// CheckUpgrades checks for available backend upgrades across the cluster.
+//
+// The previous implementation delegated to d.local, which called
+// ListSystemBackends on the frontend — but in distributed mode the frontend
+// has no backends installed locally, so the upgrade loop never ran and the UI
+// never surfaced any upgrades. We now feed the cluster-wide aggregation
+// (including per-node versions/digests) into gallery.CheckUpgradesAgainst so
+// digest-based detection actually works and cluster drift is visible.
 func (d *DistributedBackendManager) CheckUpgrades(ctx context.Context) (map[string]gallery.UpgradeInfo, error) {
-	return d.local.CheckUpgrades(ctx)
+	installed, err := d.ListBackends()
+	if err != nil {
+		return nil, err
+	}
+	// systemState is used by AvailableBackends (gallery paths + meta-backend
+	// resolution). The `installed` argument is what the old code got wrong —
+	// it used to come from the empty frontend filesystem.
+	return gallery.CheckUpgradesAgainst(ctx, d.backendGalleries, d.systemState, installed)
 }
--- a/core/services/nodes/reconciler.go
+++ b/core/services/nodes/reconciler.go
@@ -3,26 +3,59 @@ package nodes
 import (
 	"context"
 	"encoding/json"
+	"errors"
+	"fmt"
 	"time"

 	"github.com/mudler/LocalAI/core/services/advisorylock"
+	grpcclient "github.com/mudler/LocalAI/pkg/grpc"
 	"github.com/mudler/xlog"
+	"github.com/nats-io/nats.go"
 	"gorm.io/gorm"
 )

+// ModelProber checks whether a model's backend process is still reachable.
+// Defaults to a gRPC health probe but is overridable in tests so we don't
+// need to stand up a real server. IsAlive returns false both for unreachable
+// and for reachable-but-unhealthy processes; callers treat them the same.
+type ModelProber interface {
+	IsAlive(ctx context.Context, address string) bool
+}
+
+// grpcModelProber does a 1s HealthCheck on the model's stored gRPC address.
+type grpcModelProber struct{ token string }
+
+func (g grpcModelProber) IsAlive(ctx context.Context, address string) bool {
+	client := grpcclient.NewClientWithToken(address, false, nil, false, g.token)
+	probeCtx, cancel := context.WithTimeout(ctx, 1*time.Second)
+	defer cancel()
+	ok, _ := client.HealthCheck(probeCtx)
+	return ok
+}
+
 // ReplicaReconciler periodically ensures model replica counts match their
 // scheduling configs. It scales up replicas when below MinReplicas or when
 // all replicas are busy (up to MaxReplicas), and scales down idle replicas
 // above MinReplicas.
 //
+// Alongside replica scaling it runs two state-reconciliation passes — draining
+// the pending_backend_ops queue and probing loaded models' gRPC addresses to
+// orphan ghosts. Both passes are wrapped in the KeyStateReconciler advisory
+// lock so N frontends don't stampede.
+//
 // Only processes models with auto-scaling enabled (MinReplicas > 0 or MaxReplicas > 0).
 type ReplicaReconciler struct {
 	registry       *NodeRegistry
 	scheduler      ModelScheduler // interface for scheduling new models
 	unloader       NodeCommandSender
+	adapter        *RemoteUnloaderAdapter // NATS sender for pending-op drain
+	prober         ModelProber            // health probe for model gRPC addrs
 	db             *gorm.DB
 	interval       time.Duration
 	scaleDownDelay time.Duration
+	// probeStaleAfter: only probe node_models rows older than this so we
+	// don't hammer every worker every tick for models we just heard from.
+	probeStaleAfter time.Duration
 }

 // ModelScheduler abstracts the scheduling logic needed by the reconciler.
@@ -35,12 +68,21 @@ type ModelScheduler interface {

 // ReplicaReconcilerOptions holds configuration for creating a ReplicaReconciler.
 type ReplicaReconcilerOptions struct {
-	Registry       *NodeRegistry
-	Scheduler      ModelScheduler
-	Unloader       NodeCommandSender
-	DB             *gorm.DB
-	Interval       time.Duration // default 30s
-	ScaleDownDelay time.Duration // default 5m
+	Registry  *NodeRegistry
+	Scheduler ModelScheduler
+	Unloader  NodeCommandSender
+	// Adapter is the NATS sender used to retry pending backend ops. When nil,
+	// the state-reconciler pending-drain pass is a no-op (single-node mode).
+	Adapter *RemoteUnloaderAdapter
+	// RegistrationToken is used by the default gRPC prober when probing model
+	// addresses. Matches the worker's token so HealthCheck auth succeeds.
+	RegistrationToken string
+	// Prober overrides the default gRPC health probe (used by tests).
+	Prober          ModelProber
+	DB              *gorm.DB
+	Interval        time.Duration // default 30s
+	ScaleDownDelay  time.Duration // default 5m
+	ProbeStaleAfter time.Duration // default 2m
 }

 // NewReplicaReconciler creates a new ReplicaReconciler.
@@ -53,13 +95,24 @@ func NewReplicaReconciler(opts ReplicaReconcilerOptions) *ReplicaReconciler {
 	if scaleDownDelay == 0 {
 		scaleDownDelay = 5 * time.Minute
 	}
+	probeStaleAfter := opts.ProbeStaleAfter
+	if probeStaleAfter == 0 {
+		probeStaleAfter = 2 * time.Minute
+	}
+	prober := opts.Prober
+	if prober == nil {
+		prober = grpcModelProber{token: opts.RegistrationToken}
+	}
 	return &ReplicaReconciler{
-		registry:       opts.Registry,
-		scheduler:      opts.Scheduler,
-		unloader:       opts.Unloader,
-		db:             opts.DB,
-		interval:       interval,
-		scaleDownDelay: scaleDownDelay,
+		registry:        opts.Registry,
+		scheduler:       opts.Scheduler,
+		unloader:        opts.Unloader,
+		adapter:         opts.Adapter,
+		prober:          prober,
+		db:              opts.DB,
+		interval:        interval,
+		scaleDownDelay:  scaleDownDelay,
+		probeStaleAfter: probeStaleAfter,
 	}
 }

@@ -78,17 +131,157 @@ func (rc *ReplicaReconciler) Run(ctx context.Context) {
 	}
 }

-// reconcileOnce performs a single reconciliation pass.
-// Uses an advisory lock so only one frontend instance reconciles at a time.
+// reconcileOnce performs a single reconciliation pass. Replica work and
+// state-reconciliation work run under *different* advisory locks so multiple
+// frontends can share load across passes, and one long-running pass doesn't
+// block the other forever if a frontend wedges.
 func (rc *ReplicaReconciler) reconcileOnce(ctx context.Context) {
 	if rc.db != nil {
-		lockKey := advisorylock.KeyFromString("replica-reconciler")
-		_ = advisorylock.WithLockCtx(ctx, rc.db, lockKey, func() error {
+		replicaKey := advisorylock.KeyFromString("replica-reconciler")
+		_ = advisorylock.WithLockCtx(ctx, rc.db, replicaKey, func() error {
 			rc.reconcile(ctx)
 			return nil
 		})
+		// Try, don't block: if another frontend is already running the state
+		// pass, this tick is a no-op. Matches the health monitor pattern.
+		_, _ = advisorylock.TryWithLockCtx(ctx, rc.db, advisorylock.KeyStateReconciler, func() error {
+			rc.reconcileState(ctx)
+			return nil
+		})
 	} else {
 		rc.reconcile(ctx)
+		rc.reconcileState(ctx)
+	}
+}
+
+// reconcileState runs the state-reconciliation passes: drain pending backend
+// ops for freshly-healthy nodes, then probe model gRPC addresses to orphan
+// ghosts. Both passes are best-effort: a failure on one node doesn't stop
+// the rest.
+func (rc *ReplicaReconciler) reconcileState(ctx context.Context) {
+	if rc.adapter != nil {
+		rc.drainPendingBackendOps(ctx)
+	}
+	rc.probeLoadedModels(ctx)
+}
+
+// drainPendingBackendOps retries queued backend ops whose next_retry_at has
+// passed on nodes that are currently healthy. On success the row is deleted;
+// on failure attempts++ and next_retry_at moves out via exponential backoff.
+func (rc *ReplicaReconciler) drainPendingBackendOps(ctx context.Context) {
+	ops, err := rc.registry.ListDuePendingBackendOps(ctx)
+	if err != nil {
+		xlog.Warn("Reconciler: failed to list pending backend ops", "error", err)
+		return
+	}
+	if len(ops) == 0 {
+		return
+	}
+	xlog.Debug("Reconciler: draining pending backend ops", "count", len(ops))
+
+	for _, op := range ops {
+		if err := ctx.Err(); err != nil {
+			return
+		}
+		var applyErr error
+		switch op.Op {
+		case OpBackendDelete:
+			_, applyErr = rc.adapter.DeleteBackend(op.NodeID, op.Backend)
+		case OpBackendInstall, OpBackendUpgrade:
+			reply, err := rc.adapter.InstallBackend(op.NodeID, op.Backend, "", string(op.Galleries), "", "", "")
+			if err != nil {
+				applyErr = err
+			} else if !reply.Success {
+				applyErr = fmt.Errorf("%s failed: %s", op.Op, reply.Error)
+			}
+		default:
+			xlog.Warn("Reconciler: unknown pending op", "op", op.Op, "id", op.ID)
+			continue
+		}
+
+		if applyErr == nil {
+			if err := rc.registry.DeletePendingBackendOp(ctx, op.ID); err != nil {
+				xlog.Warn("Reconciler: failed to delete drained op row", "id", op.ID, "error", err)
+			} else {
+				xlog.Info("Reconciler: pending backend op applied",
+					"op", op.Op, "backend", op.Backend, "node", op.NodeID, "attempts", op.Attempts+1)
+			}
+			continue
+		}
+
+		// ErrNoResponders means the node has no active NATS subscription for
+		// this subject. Either its connection dropped, or it's the wrong
+		// node type entirely. Mark unhealthy so the health monitor's
+		// heartbeat-only pass doesn't immediately flip it back — and so
+		// ListDuePendingBackendOps (which filters by status=healthy) stops
+		// picking the row until the node genuinely recovers.
+		if errors.Is(applyErr, nats.ErrNoResponders) {
+			xlog.Warn("Reconciler: no NATS responders — marking node unhealthy",
+				"op", op.Op, "backend", op.Backend, "node", op.NodeID)
+			_ = rc.registry.MarkUnhealthy(ctx, op.NodeID)
+		}
+
+		// Dead-letter cap: after maxAttempts the row is the reconciler
+		// equivalent of a poison message. Delete it loudly so the queue
+		// doesn't churn NATS every tick forever — operators can re-issue
+		// the op from the UI if they still want it applied.
+		if op.Attempts+1 >= maxPendingBackendOpAttempts {
+			xlog.Error("Reconciler: abandoning pending backend op after max attempts",
+				"op", op.Op, "backend", op.Backend, "node", op.NodeID,
+				"attempts", op.Attempts+1, "last_error", applyErr)
+			if err := rc.registry.DeletePendingBackendOp(ctx, op.ID); err != nil {
+				xlog.Warn("Reconciler: failed to delete abandoned op row", "id", op.ID, "error", err)
+			}
+			continue
+		}
+
+		_ = rc.registry.RecordPendingBackendOpFailure(ctx, op.ID, applyErr.Error())
+		xlog.Warn("Reconciler: pending backend op retry failed",
+			"op", op.Op, "backend", op.Backend, "node", op.NodeID, "attempts", op.Attempts+1, "error", applyErr)
+	}
+}
+
+// maxPendingBackendOpAttempts caps how many times the reconciler retries a
+// failing row before dead-lettering it. Ten attempts at exponential backoff
+// (30s → 15m cap) is >1h of wall-clock patience — well past any transient
+// worker restart or network blip. Poisoned rows beyond that are almost
+// certainly structural (wrong node type, non-existent gallery entry) and no
+// amount of further retrying will help.
+const maxPendingBackendOpAttempts = 10
+
+// probeLoadedModels gRPC-health-checks model addresses that the DB says are
+// loaded. If a model's backend process is gone (OOM, crash, manual restart)
+// we remove the row so ghosts don't linger. Only probes rows older than
+// probeStaleAfter so we don't hammer every worker every tick for models we
+// just heard from.
+func (rc *ReplicaReconciler) probeLoadedModels(ctx context.Context) {
+	var stale []NodeModel
+	cutoff := time.Now().Add(-rc.probeStaleAfter)
+	err := rc.registry.db.WithContext(ctx).
+		Joins("JOIN backend_nodes ON backend_nodes.id = node_models.node_id").
+		Where("node_models.state = ? AND backend_nodes.status = ? AND node_models.updated_at < ? AND node_models.address != ''",
+			"loaded", StatusHealthy, cutoff).
+		Find(&stale).Error
+	if err != nil {
+		xlog.Warn("Reconciler: failed to list loaded models for probe", "error", err)
+		return
+	}
+	for _, m := range stale {
+		if err := ctx.Err(); err != nil {
+			return
+		}
+		if rc.prober.IsAlive(ctx, m.Address) {
+			// Bump updated_at so we don't probe this row again immediately.
+			_ = rc.registry.db.WithContext(ctx).Model(&NodeModel{}).
+				Where("id = ?", m.ID).Update("updated_at", time.Now()).Error
+			continue
+		}
+		if err := rc.registry.RemoveNodeModel(ctx, m.NodeID, m.ModelName); err != nil {
+			xlog.Warn("Reconciler: failed to remove unreachable model", "node", m.NodeID, "model", m.ModelName, "error", err)
+			continue
+		}
+		xlog.Warn("Reconciler: model unreachable, removed from registry",
+			"node", m.NodeID, "model", m.ModelName, "address", m.Address)
 	}
 }

--- a/core/services/nodes/reconciler_test.go
+++ b/core/services/nodes/reconciler_test.go
@@ -239,3 +239,164 @@ var _ = Describe("ReplicaReconciler", func() {
 		})
 	})
 })
+
+// fakeProber lets tests control whether a model's gRPC address "responds".
+type fakeProber struct {
+	alive map[string]bool
+	calls int
+}
+
+func (f *fakeProber) IsAlive(_ context.Context, address string) bool {
+	f.calls++
+	if f.alive == nil {
+		return false
+	}
+	return f.alive[address]
+}
+
+var _ = Describe("ReplicaReconciler — state reconciliation", func() {
+	var (
+		db       *gorm.DB
+		registry *NodeRegistry
+	)
+
+	BeforeEach(func() {
+		if runtime.GOOS == "darwin" {
+			Skip("testcontainers requires Docker, not available on macOS CI")
+		}
+		db = testutil.SetupTestDB()
+		var err error
+		registry, err = NewNodeRegistry(db)
+		Expect(err).ToNot(HaveOccurred())
+	})
+
+	Describe("probeLoadedModels", func() {
+		It("removes loaded models whose gRPC address is unreachable", func() {
+			node := &BackendNode{Name: "n1", NodeType: NodeTypeBackend, Address: "10.0.0.1:50051"}
+			Expect(registry.Register(context.Background(), node, true)).To(Succeed())
+			// Two loaded models — one stale (will probe), one fresh (skipped).
+			stale := &NodeModel{
+				ID:        "stale-1",
+				NodeID:    node.ID,
+				ModelName: "stale-model",
+				Address:   "10.0.0.1:12345",
+				State:     "loaded",
+				UpdatedAt: time.Now().Add(-5 * time.Minute),
+			}
+			fresh := &NodeModel{
+				ID:        "fresh-1",
+				NodeID:    node.ID,
+				ModelName: "fresh-model",
+				Address:   "10.0.0.1:54321",
+				State:     "loaded",
+				UpdatedAt: time.Now(), // within probeStaleAfter
+			}
+			Expect(db.Create(stale).Error).To(Succeed())
+			Expect(db.Create(fresh).Error).To(Succeed())
+
+			prober := &fakeProber{alive: map[string]bool{"10.0.0.1:12345": false}}
+			rc := NewReplicaReconciler(ReplicaReconcilerOptions{
+				Registry:        registry,
+				DB:              db,
+				Prober:          prober,
+				ProbeStaleAfter: 2 * time.Minute,
+			})
+
+			rc.probeLoadedModels(context.Background())
+
+			// Stale was unreachable — row removed.
+			var after []NodeModel
+			Expect(db.Find(&after).Error).To(Succeed())
+			Expect(after).To(HaveLen(1))
+			Expect(after[0].ModelName).To(Equal("fresh-model"))
+			// Prober was only called once (the fresh row was filtered out).
+			Expect(prober.calls).To(Equal(1))
+		})
+
+		It("keeps reachable models and bumps their updated_at", func() {
+			node := &BackendNode{Name: "n1", NodeType: NodeTypeBackend, Address: "10.0.0.1:50051"}
+			Expect(registry.Register(context.Background(), node, true)).To(Succeed())
+			stale := &NodeModel{
+				ID:        "stale-2",
+				NodeID:    node.ID,
+				ModelName: "alive-model",
+				Address:   "10.0.0.1:12345",
+				State:     "loaded",
+				UpdatedAt: time.Now().Add(-5 * time.Minute),
+			}
+			Expect(db.Create(stale).Error).To(Succeed())
+
+			prober := &fakeProber{alive: map[string]bool{"10.0.0.1:12345": true}}
+			rc := NewReplicaReconciler(ReplicaReconcilerOptions{
+				Registry:        registry,
+				DB:              db,
+				Prober:          prober,
+				ProbeStaleAfter: 2 * time.Minute,
+			})
+
+			rc.probeLoadedModels(context.Background())
+
+			var after NodeModel
+			Expect(db.First(&after, "id = ?", "stale-2").Error).To(Succeed())
+			Expect(after.UpdatedAt).To(BeTemporally("~", time.Now(), time.Second))
+		})
+	})
+
+	Describe("UpsertPendingBackendOp + RecordPendingBackendOpFailure", func() {
+		It("upserts on the composite key rather than duplicating rows", func() {
+			node := &BackendNode{Name: "n1", NodeType: NodeTypeBackend, Address: "10.0.0.1:50051"}
+			Expect(registry.Register(context.Background(), node, true)).To(Succeed())
+
+			Expect(registry.UpsertPendingBackendOp(context.Background(), node.ID, "foo", OpBackendDelete, nil)).To(Succeed())
+			// Second call for the same (node, backend, op) should not create a
+			// new row — that's how re-issuing a delete works.
+			Expect(registry.UpsertPendingBackendOp(context.Background(), node.ID, "foo", OpBackendDelete, nil)).To(Succeed())
+
+			var rows []PendingBackendOp
+			Expect(db.Find(&rows).Error).To(Succeed())
+			Expect(rows).To(HaveLen(1))
+		})
+
+		It("increments attempts and moves next_retry_at out on failure", func() {
+			node := &BackendNode{Name: "n1", NodeType: NodeTypeBackend, Address: "10.0.0.1:50051"}
+			Expect(registry.Register(context.Background(), node, true)).To(Succeed())
+			Expect(registry.UpsertPendingBackendOp(context.Background(), node.ID, "foo", OpBackendDelete, nil)).To(Succeed())
+
+			var row PendingBackendOp
+			Expect(db.First(&row).Error).To(Succeed())
+			before := row.NextRetryAt
+
+			Expect(registry.RecordPendingBackendOpFailure(context.Background(), row.ID, "boom")).To(Succeed())
+			Expect(db.First(&row, row.ID).Error).To(Succeed())
+			Expect(row.Attempts).To(Equal(1))
+			Expect(row.LastError).To(Equal("boom"))
+			Expect(row.NextRetryAt).To(BeTemporally(">", before))
+		})
+	})
+
+	Describe("NewNodeRegistry malformed-row pruning", func() {
+		It("drops queue rows for agent nodes and empty backend names on startup", func() {
+			agent := &BackendNode{Name: "agent-1", NodeType: NodeTypeAgent, Address: "x"}
+			Expect(registry.Register(context.Background(), agent, true)).To(Succeed())
+			backend := &BackendNode{Name: "backend-1", NodeType: NodeTypeBackend, Address: "y"}
+			Expect(registry.Register(context.Background(), backend, true)).To(Succeed())
+
+			// Three rows: one for a valid backend node (should survive),
+			// one for an agent node (pruned), one for an empty backend name
+			// on the valid node (pruned).
+			Expect(registry.UpsertPendingBackendOp(context.Background(), backend.ID, "foo", OpBackendInstall, nil)).To(Succeed())
+			Expect(registry.UpsertPendingBackendOp(context.Background(), agent.ID, "foo", OpBackendInstall, nil)).To(Succeed())
+			Expect(registry.UpsertPendingBackendOp(context.Background(), backend.ID, "", OpBackendInstall, nil)).To(Succeed())
+
+			// Re-instantiating the registry runs the cleanup migration.
+			_, err := NewNodeRegistry(db)
+			Expect(err).ToNot(HaveOccurred())
+
+			var rows []PendingBackendOp
+			Expect(db.Find(&rows).Error).To(Succeed())
+			Expect(rows).To(HaveLen(1))
+			Expect(rows[0].NodeID).To(Equal(backend.ID))
+			Expect(rows[0].Backend).To(Equal("foo"))
+		})
+	})
+})
--- a/core/services/nodes/registry.go
+++ b/core/services/nodes/registry.go
@@ -104,6 +104,36 @@ type NodeWithExtras struct {
 	Labels        map[string]string `json:"labels,omitempty"`
 }

+// PendingBackendOp is a durable intent for a backend lifecycle operation
+// (delete/install/upgrade) that needs to eventually apply on a specific node.
+//
+// Without this table, a backend delete against an offline node silently
+// dropped: the frontend skipped the node, the node came back later with the
+// backend still installed, and the operator saw a zombie. Now the intent is
+// recorded regardless of node status; the state reconciler drains the queue
+// whenever a node is healthy and removes the row on success. Reissuing the
+// same operation while a row exists updates NextRetryAt instead of stacking
+// duplicates (see the unique index).
+type PendingBackendOp struct {
+	ID          uint      `gorm:"primaryKey;autoIncrement" json:"id"`
+	NodeID      string    `gorm:"index;size:36;not null;uniqueIndex:idx_pending_backend_op,priority:1" json:"node_id"`
+	Backend     string    `gorm:"index;size:255;not null;uniqueIndex:idx_pending_backend_op,priority:2" json:"backend"`
+	Op          string    `gorm:"size:16;not null;uniqueIndex:idx_pending_backend_op,priority:3" json:"op"`
+	Galleries   []byte    `gorm:"type:bytea" json:"-"` // serialized JSON for install/upgrade retries
+	Attempts    int       `gorm:"default:0" json:"attempts"`
+	LastError   string    `gorm:"type:text" json:"last_error,omitempty"`
+	CreatedAt   time.Time `json:"created_at"`
+	NextRetryAt time.Time `gorm:"index" json:"next_retry_at"`
+}
+
+// Op constants mirror the operation names used by DistributedBackendManager
+// so callers don't repeat stringly-typed values.
+const (
+	OpBackendDelete  = "delete"
+	OpBackendInstall = "install"
+	OpBackendUpgrade = "upgrade"
+)
+
 // NodeRegistry manages backend node registration and lookup in PostgreSQL.
 type NodeRegistry struct {
 	db *gorm.DB
@@ -114,10 +144,34 @@ type NodeRegistry struct {
 // when multiple instances (frontend + workers) start at the same time.
 func NewNodeRegistry(db *gorm.DB) (*NodeRegistry, error) {
 	if err := advisorylock.WithLockCtx(context.Background(), db, advisorylock.KeySchemaMigrate, func() error {
-		return db.AutoMigrate(&BackendNode{}, &NodeModel{}, &NodeLabel{}, &ModelSchedulingConfig{})
+		return db.AutoMigrate(&BackendNode{}, &NodeModel{}, &NodeLabel{}, &ModelSchedulingConfig{}, &PendingBackendOp{})
 	}); err != nil {
 		return nil, fmt.Errorf("migrating node tables: %w", err)
 	}
+
+	// Startup cleanup of queue rows that can never drain: ops targeted at
+	// agent workers (wrong subscription set), at non-existent nodes, or with
+	// an empty backend name. The guard in enqueueAndDrainBackendOp prevents
+	// new ones from being written, but rows persisted by earlier versions
+	// keep the reconciler busy retrying a permanently-failing NATS request
+	// every 30s. Runs under the same migration advisory lock so concurrent
+	// instances don't race; a failure is logged and startup continues.
+	_ = advisorylock.WithLockCtx(context.Background(), db, advisorylock.KeySchemaMigrate, func() error {
+		res := db.Exec(`
+			DELETE FROM pending_backend_ops
+			WHERE backend = ''
+			   OR node_id NOT IN (SELECT id FROM backend_nodes WHERE node_type = ? OR node_type = '')
+		`, NodeTypeBackend)
+		if res.Error != nil {
+			xlog.Warn("Failed to prune malformed pending_backend_ops rows", "error", res.Error)
+			return res.Error
+		}
+		if res.RowsAffected > 0 {
+			xlog.Info("Pruned pending_backend_ops rows (wrong node type or empty backend)", "count", res.RowsAffected)
+		}
+		return nil
+	})
+
 	return &NodeRegistry{db: db}, nil
 }

@@ -946,3 +1000,114 @@ func (r *NodeRegistry) ApplyAutoLabels(ctx context.Context, nodeID string, node
 		_ = r.SetNodeLabel(ctx, nodeID, "node.name", node.Name)
 	}
 }
+
+// UpsertPendingBackendOp records or refreshes a pending backend operation for
+// a node. If a row already exists for (nodeID, backend, op) we keep its
+// Attempts/LastError but replace Galleries and reset NextRetryAt to now, so
+// reissuing the same delete/upgrade nudges it to the front of the queue
+// instead of stacking a duplicate intent.
+func (r *NodeRegistry) UpsertPendingBackendOp(ctx context.Context, nodeID, backend, op string, galleries []byte) error {
+	row := PendingBackendOp{
+		NodeID:      nodeID,
+		Backend:     backend,
+		Op:          op,
+		Galleries:   galleries,
+		NextRetryAt: time.Now(),
+	}
+	return r.db.WithContext(ctx).Clauses(clause.OnConflict{
+		Columns: []clause.Column{{Name: "node_id"}, {Name: "backend"}, {Name: "op"}},
+		DoUpdates: clause.AssignmentColumns([]string{"galleries", "next_retry_at"}),
+	}).Create(&row).Error
+}
+
+// ListDuePendingBackendOps returns queued ops whose NextRetryAt has passed
+// AND whose node is currently healthy. The reconciler drains this list; we
+// filter by node status in the query so a tick doesn't hammer NATS for
+// nodes that obviously can't answer.
+func (r *NodeRegistry) ListDuePendingBackendOps(ctx context.Context) ([]PendingBackendOp, error) {
+	var ops []PendingBackendOp
+	err := r.db.WithContext(ctx).
+		Joins("JOIN backend_nodes ON backend_nodes.id = pending_backend_ops.node_id").
+		Where("pending_backend_ops.next_retry_at <= ? AND backend_nodes.status = ?", time.Now(), StatusHealthy).
+		Order("pending_backend_ops.next_retry_at ASC").
+		Find(&ops).Error
+	if err != nil {
+		return nil, fmt.Errorf("listing due pending backend ops: %w", err)
+	}
+	return ops, nil
+}
+
+// ListPendingBackendOps returns every queued row (for the UI "pending on N
+// nodes" chip and the pre-delete ConfirmDialog).
+func (r *NodeRegistry) ListPendingBackendOps(ctx context.Context) ([]PendingBackendOp, error) {
+	var ops []PendingBackendOp
+	if err := r.db.WithContext(ctx).Order("backend ASC, created_at ASC").Find(&ops).Error; err != nil {
+		return nil, fmt.Errorf("listing pending backend ops: %w", err)
+	}
+	return ops, nil
+}
+
+// DeletePendingBackendOp removes a queue row — called after the op succeeds.
+func (r *NodeRegistry) DeletePendingBackendOp(ctx context.Context, id uint) error {
+	if err := r.db.WithContext(ctx).Delete(&PendingBackendOp{}, id).Error; err != nil {
+		return fmt.Errorf("deleting pending backend op %d: %w", id, err)
+	}
+	return nil
+}
+
+// RecordPendingBackendOpFailure bumps Attempts, captures the error, and
+// pushes NextRetryAt out with exponential backoff capped at 15 minutes.
+func (r *NodeRegistry) RecordPendingBackendOpFailure(ctx context.Context, id uint, errMsg string) error {
+	return r.db.WithContext(ctx).Transaction(func(tx *gorm.DB) error {
+		var row PendingBackendOp
+		if err := tx.First(&row, id).Error; err != nil {
+			return err
+		}
+		row.Attempts++
+		row.LastError = errMsg
+		row.NextRetryAt = time.Now().Add(backoffForAttempt(row.Attempts))
+		return tx.Save(&row).Error
+	})
+}
+
+// backoffForAttempt is exponential from 30s doubling up to a 15m cap. The
+// reconciler tick is 30s so anything shorter would just re-fire immediately.
+func backoffForAttempt(attempts int) time.Duration {
+	const maxBackoff = 15 * time.Minute // avoids shadowing the builtin cap
+	base := 30 * time.Second
+	shift := attempts - 1
+	if shift < 0 {
+		shift = 0
+	}
+	if shift > 10 { // clamp so the shift cannot overflow; 2^10 * 30s is far past the cap
+		shift = 10
+	}
+	d := base << shift
+	if d > maxBackoff {
+		return maxBackoff
+	}
+	return d
+}
+
+// CountPendingBackendOpsByBackend returns a map of backend name to the count
+// of pending rows. Used to decorate Manage → Backends with a "pending on N
+// nodes" chip without exposing the full queue.
+func (r *NodeRegistry) CountPendingBackendOpsByBackend(ctx context.Context) (map[string]int, error) {
+	type row struct {
+		Backend string
+		Count   int
+	}
+	var rows []row
+	err := r.db.WithContext(ctx).Model(&PendingBackendOp{}).
+		Select("backend, COUNT(*) as count").
+		Group("backend").
+		Scan(&rows).Error
+	if err != nil {
+		return nil, fmt.Errorf("counting pending backend ops: %w", err)
+	}
+	out := make(map[string]int, len(rows))
+	for _, r := range rows {
+		out[r.Backend] = r.Count
+	}
+	return out, nil
+}
--- a/core/services/nodes/router.go
+++ b/core/services/nodes/router.go
@@ -504,7 +504,7 @@ func (r *SmartRouter) installBackendOnNode(ctx context.Context, node *BackendNod
 		return "", fmt.Errorf("no NATS connection for backend installation")
 	}

-	reply, err := r.unloader.InstallBackend(node.ID, backendType, modelID, r.galleriesJSON)
+	reply, err := r.unloader.InstallBackend(node.ID, backendType, modelID, r.galleriesJSON, "", "", "")
 	if err != nil {
 		return "", err
 	}
--- a/core/services/nodes/router_test.go
+++ b/core/services/nodes/router_test.go
@@ -244,7 +244,7 @@ type fakeUnloader struct {
 	unloadErr    error
 }

-func (f *fakeUnloader) InstallBackend(_, _, _, _ string) (*messaging.BackendInstallReply, error) {
+func (f *fakeUnloader) InstallBackend(_, _, _, _, _, _, _ string) (*messaging.BackendInstallReply, error) {
 	return f.installReply, f.installErr
 }

--- a/core/services/nodes/unloader.go
+++ b/core/services/nodes/unloader.go
@@ -17,7 +17,7 @@ type backendStopRequest struct {
 // NodeCommandSender abstracts NATS-based commands to worker nodes.
 // Used by HTTP endpoint handlers to avoid coupling to the concrete RemoteUnloaderAdapter.
 type NodeCommandSender interface {
-	InstallBackend(nodeID, backendType, modelID, galleriesJSON string) (*messaging.BackendInstallReply, error)
+	InstallBackend(nodeID, backendType, modelID, galleriesJSON, uri, name, alias string) (*messaging.BackendInstallReply, error)
 	DeleteBackend(nodeID, backendName string) (*messaging.BackendDeleteReply, error)
 	ListBackends(nodeID string) (*messaging.BackendListReply, error)
 	StopBackend(nodeID, backend string) error
@@ -72,7 +72,7 @@ func (a *RemoteUnloaderAdapter) UnloadRemoteModel(modelName string) error {
 // The worker installs the backend from gallery (if not already installed),
 // starts the gRPC process, and replies when ready.
 // Timeout: 5 minutes (gallery install can take a while).
-func (a *RemoteUnloaderAdapter) InstallBackend(nodeID, backendType, modelID, galleriesJSON string) (*messaging.BackendInstallReply, error) {
+func (a *RemoteUnloaderAdapter) InstallBackend(nodeID, backendType, modelID, galleriesJSON, uri, name, alias string) (*messaging.BackendInstallReply, error) {
 	subject := messaging.SubjectNodeBackendInstall(nodeID)
 	xlog.Info("Sending NATS backend.install", "nodeID", nodeID, "backend", backendType, "modelID", modelID)

@@ -80,6 +80,9 @@ func (a *RemoteUnloaderAdapter) InstallBackend(nodeID, backendType, modelID, gal
 		Backend:          backendType,
 		ModelID:          modelID,
 		BackendGalleries: galleriesJSON,
+		URI:              uri,
+		Name:             name,
+		Alias:            alias,
 	}, 5*time.Minute)
 }

--- a/docs/content/features/backend-monitor.md
+++ b/docs/content/features/backend-monitor.md
@@ -14,11 +14,13 @@ LocalAI provides endpoints to monitor and manage running backends. The `/backend

 ### Request

-The request body is JSON:
+The model to monitor is passed as a query parameter:

-| Parameter | Type     | Required | Description                    |
-|-----------|----------|----------|--------------------------------|
-| `model`   | `string` | Yes      | Name of the model to monitor   |
+| Parameter | Type     | Required | Location | Description                    |
+|-----------|----------|----------|----------|--------------------------------|
+| `model`   | `string` | Yes      | query    | Name of the model to monitor   |
+
+For backwards compatibility, a JSON body with the same field is still accepted when the `model` query parameter is not set, but new clients should use the query parameter.

 ### Response

@@ -42,9 +44,7 @@ If the gRPC status call fails, the endpoint falls back to local process metrics:
 ### Usage

 ```bash
-curl http://localhost:8080/backend/monitor \
-  -H "Content-Type: application/json" \
-  -d '{"model": "my-model"}'
+curl "http://localhost:8080/backend/monitor?model=my-model"
 ```

 ### Example response
--- a/docs/content/reference/_index.md
+++ b/docs/content/reference/_index.md
@@ -130,6 +130,19 @@ Reference for system information commands and diagnostics.

 ---

+### 🤖 [AI Coding Assistants](ai-coding-assistants.md)
+Policy for AI-assisted contributions — licensing, DCO, and attribution.
+
+**Key topics:**
+- Aligned with the Linux kernel's AI assistants policy
+- Signed-off-by and DCO rules
+- `Assisted-by` commit trailer format
+- Scope and responsibility of the human submitter
+
+**Recommended for:** Contributors using AI coding assistants (Claude, Copilot, Cursor, Codex, etc.)
+
+---
+
 ## Quick Links

 | Task | Documentation |
@@ -138,6 +151,7 @@ Reference for system information commands and diagnostics.
 | CLI commands | [CLI Reference](cli-reference.md) |
 | Check compatibility | [Compatibility Table](compatibility-table.md) |
 | System diagnostics | [System Info](system-info.md) |
+| Contribute with AI assistance | [AI Coding Assistants](ai-coding-assistants.md) |

 ---

--- a/docs/content/reference/ai-coding-assistants.md
+++ b/docs/content/reference/ai-coding-assistants.md
@@ -0,0 +1,79 @@
+
+++
+disableToc = false
+title = "AI Coding Assistants"
+weight = 28
+++
+
+This document provides guidance for AI tools and developers using AI assistance when contributing to LocalAI.
+
+**LocalAI follows the same guidelines as the Linux kernel project for AI-assisted contributions.** See the upstream policy here: <https://docs.kernel.org/process/coding-assistants.html>. The rules below mirror that policy, adapted to LocalAI's license and project layout.
+
+AI tools helping with LocalAI development should follow the standard project development process:
+
+- [CONTRIBUTING.md](https://github.com/mudler/LocalAI/blob/master/CONTRIBUTING.md) — development workflow, commit conventions, and PR guidelines
+- [AGENTS.md](https://github.com/mudler/LocalAI/blob/master/AGENTS.md) — the agent entry point with links to all detailed topic guides
+- [.agents/ai-coding-assistants.md](https://github.com/mudler/LocalAI/blob/master/.agents/ai-coding-assistants.md) — the full policy source of truth
+
+## Licensing and Legal Requirements
+
+All contributions must comply with LocalAI's licensing requirements:
+
+- LocalAI is licensed under the **MIT License**
+- New source files should use the SPDX license identifier `MIT` where applicable to the file type
+- Contributions must be compatible with the MIT License and must not introduce code under incompatible licenses (e.g., GPL) without an explicit discussion with maintainers
+
+## Signed-off-by and Developer Certificate of Origin
+
+**AI agents MUST NOT add `Signed-off-by` tags.** Only humans can legally certify the Developer Certificate of Origin (DCO). The human submitter is responsible for:
+
+- Reviewing all AI-generated code
+- Ensuring compliance with licensing requirements
+- Adding their own `Signed-off-by` tag (when the project requires DCO) to certify the contribution
+- Taking full responsibility for the contribution
+
+AI agents MUST NOT add `Co-Authored-By` trailers for themselves either. A human reviewer owns the contribution; the AI's involvement is recorded via `Assisted-by` (see below).
+
+## Attribution
+
+When AI tools contribute to LocalAI development, proper attribution helps track the evolving role of AI in the development process. Contributions should include an `Assisted-by` tag in the commit message trailer in the following format:
+
+```
+Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]
+```
+
+Where:
+
+- `AGENT_NAME` — name of the AI tool or framework (e.g., `Claude`, `Copilot`, `Cursor`)
+- `MODEL_VERSION` — specific model version used (e.g., `claude-opus-4-7`, `gpt-5`)
+- `[TOOL1] [TOOL2]` — optional specialized analysis tools invoked by the agent (e.g., `golangci-lint`, `staticcheck`, `go vet`)
+
+Basic development tools (git, go, make, editors) should **not** be listed.
+
+### Example
+
+```
+fix(llama-cpp): handle empty tool call arguments
+
+Previously the parser panicked when the model returned a tool call with
+an empty arguments object. Fall back to an empty JSON object in that
+case so downstream consumers receive a valid payload.
+
+Assisted-by: Claude:claude-opus-4-7 golangci-lint
+Signed-off-by: Jane Developer <jane@example.com>
+```
+
+## Scope and Responsibility
+
+Using an AI assistant does not reduce the contributor's responsibility. The human submitter must:
+
+- Understand every line that lands in the PR
+- Verify that generated code compiles, passes tests, and follows the project style
+- Confirm that any referenced APIs, flags, or file paths actually exist in the current tree (AI models may hallucinate identifiers)
+- Not submit AI output verbatim without review
+
+Reviewers may ask for clarification on any change regardless of how it was produced. "An AI wrote it" is not an acceptable answer to a design question.
+
+{{% notice note %}}
+This policy is a living document. If you're unsure how to apply it to a specific contribution, open an issue or ask in the [Discord channel](https://discord.gg/uJAeKSAGDy) before submitting.
+{{% /notice %}}
--- a/docs/content/reference/compatibility-table.md
+++ b/docs/content/reference/compatibility-table.md
@@ -33,7 +33,7 @@ LocalAI will attempt to automatically load models which are not explicitly confi
 |---------|-------------|-------------|
 | [whisper.cpp](https://github.com/ggml-org/whisper.cpp) | OpenAI Whisper in C/C++ | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
 | [faster-whisper](https://github.com/SYSTRAN/faster-whisper) | Fast Whisper with CTranslate2 | CUDA 12/13, ROCm, Intel, Metal |
-| [WhisperX](https://github.com/m-bain/whisperX) | Word-level timestamps and speaker diarization | CPU, CUDA 12/13, ROCm, Metal |
+| [WhisperX](https://github.com/m-bain/whisperX) | Word-level timestamps and speaker diarization | CPU, CUDA 12/13, Metal |
 | [moonshine](https://github.com/moonshine-ai/moonshine) | Ultra-fast transcription for low-end devices | CPU, CUDA 12/13, Metal |
 | [voxtral](https://github.com/mudler/voxtral.c) | Voxtral Realtime 4B speech-to-text in C | CPU, Metal |
 | [Qwen3-ASR](https://github.com/QwenLM/Qwen3-ASR) | Qwen3 automatic speech recognition | CPU, CUDA 12/13, ROCm, Intel, Metal, Jetson L4T |
--- a/gallery/index.yaml
+++ b/gallery/index.yaml
@@ -1,4 +1,206 @@
 ---
+- name: "qwen3.6-35b-a3b-claude-4.6-opus-reasoning-distilled"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF
+  description: |
+    # 🔥 Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled
+
+    A reasoning SFT fine-tune of `Qwen/Qwen3.6-35B-A3B` on chain-of-thought (CoT) distillation mostly sourced from Claude Opus 4.6. The goal is to preserve Qwen3.6's strong agentic coding and reasoning base while nudging the model toward structured Claude Opus-style reasoning traces and more stable long-form problem solving.
+
+    The training path is text-only. The Qwen3.6 base architecture includes a vision encoder, but this fine-tuning run did not train on image or video examples.
+
+      - **Developed by:** @hesamation
+      - **Base model:** `Qwen/Qwen3.6-35B-A3B`
+      - **License:** apache-2.0
+
+    This fine-tuning run is inspired by Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, including the notebook/training workflow style and Claude Opus reasoning-distillation direction.
+
+    [X](https://x.com/Hesamation) [Discord](https://discord.gg/vtJykN3t)
+
+    ## Benchmark Results
+
+    The MMLU-Pro pass used 70 total questions per model: `--limit 5` across 14 MMLU-Pro subjects. Treat this as a smoke/comparative check, not a release-quality full benchmark.
+
+    ...
+  license: "apache-2.0"
+  tags:
+    - llm
+    - gguf
+    - qwen
+    - reasoning
+  icon: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3.6/Figures/qwen3.6_35b_a3b_score.png
+  overrides:
+    backend: llama-cpp
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    options:
+      - use_jinja:true
+    parameters:
+      min_p: 0
+      model: llama-cpp/models/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled.Q4_K_M.gguf
+      presence_penalty: 1.5
+      repeat_penalty: 1
+      temperature: 0.7
+      top_k: 20
+      top_p: 0.8
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled.Q4_K_M.gguf
+      sha256: fd3bf7586354890a2710d69357c30fb221a31eecf9f3cd9418257d9289e02765
+      uri: https://huggingface.co/hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF/resolve/main/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled.Q4_K_M.gguf
+- name: "qwen3.5-9b-glm5.1-distill-v1"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1-GGUF
+  description: |
+    # 🪐 Qwen3.5-9B-GLM5.1-Distill-v1
+
+    ## 📌 Model Overview
+
+    **Model Name:** `Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1`
+    **Base Model:** Qwen3.5-9B
+    **Training Type:** Supervised Fine-Tuning (SFT, Distillation)
+    **Parameter Scale:** 9B
+    **Training Framework:** Unsloth
+
+    This model is a distilled variant of **Qwen3.5-9B**, trained on high-quality reasoning data derived from **GLM-5.1**.
+
+    The primary goals are to:
+
+      - Improve **structured reasoning ability**
+      - Enhance **instruction-following consistency**
+      - Activate **latent knowledge via better reasoning structure**
+
+    ## 📊 Training Data
+
+    ### Main Dataset
+
+      - `Jackrong/GLM-5.1-Reasoning-1M-Cleaned`
+      - Cleaned from the original `Kassadin88/GLM-5.1-1000000x` dataset.
+      - Generated from a **GLM-5.1 teacher model**
+      - Approximately **700x** the scale of `Qwen3.5-reasoning-700x`
+      - Training used a **filtered subset**, not the full source dataset.
+
+    ### Auxiliary Dataset
+
+      - `Jackrong/Qwen3.5-reasoning-700x`
+
+    ...
+  license: "apache-2.0"
+  tags:
+    - llm
+    - gguf
+    - qwen
+    - instruction-tuned
+    - reasoning
+  icon: https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/BnSg_x99v9bG9T5-8sKa1.png
+  overrides:
+    backend: llama-cpp
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    mmproj: llama-cpp/mmproj/Qwen3.5-9B-GLM5.1-Distill-v1-GGUF/mmproj.gguf
+    options:
+      - use_jinja:true
+    parameters:
+      min_p: 0
+      model: llama-cpp/models/Qwen3.5-9B-GLM5.1-Distill-v1-GGUF/Qwen3.5-9B-GLM5.1-Distill-v1-Q4_K_M.gguf
+      presence_penalty: 1.5
+      repeat_penalty: 1
+      temperature: 0.7
+      top_k: 20
+      top_p: 0.8
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/Qwen3.5-9B-GLM5.1-Distill-v1-GGUF/Qwen3.5-9B-GLM5.1-Distill-v1-Q4_K_M.gguf
+      sha256: f6f1d2b8efb2339ce9d4dd0f0329d2f2e4cf765eda49aa3f6df8f629f871a151
+      uri: https://huggingface.co/Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1-GGUF/resolve/main/Qwen3.5-9B-GLM5.1-Distill-v1-Q4_K_M.gguf
+    - filename: llama-cpp/mmproj/Qwen3.5-9B-GLM5.1-Distill-v1-GGUF/mmproj.gguf
+      sha256: e42c1c2ed0eaf6ea88a6ba10b26b4adf00a96a8c3d1803534a4c41060ad9e86b
+      uri: https://huggingface.co/Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1-GGUF/resolve/main/mmproj.gguf
+- name: "supergemma4-26b-uncensored-v2"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/Jiunsong/supergemma4-26b-uncensored-gguf-v2
+  description: |
+    Hugging Face |
+    GitHub |
+    Launch Blog |
+    Documentation
+
+    License: Apache 2.0 | Authors: Google DeepMind
+
+    Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages.
+
+    Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: **E2B**, **E4B**, **26B A4B**, and **31B**. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI.
+
+    Gemma 4 introduces key **capability and architectural advancements**:
+
+    * **Reasoning** – All models in the family are designed as highly capable reasoners, with configurable thinking modes.
+
+    ...
+  license: "gemma"
+  tags:
+    - llm
+    - gguf
+  icon: https://ai.google.dev/gemma/images/gemma4_banner.png
+  overrides:
+    backend: llama-cpp
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    options:
+      - use_jinja:true
+    parameters:
+      model: llama-cpp/models/supergemma4-26b-uncensored-gguf-v2/supergemma4-26b-uncensored-fast-v2-Q4_K_M.gguf
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/supergemma4-26b-uncensored-gguf-v2/supergemma4-26b-uncensored-fast-v2-Q4_K_M.gguf
+      sha256: e773b0a209d48524f9d485bca0818247f75d7ddde7cce951367a7e441fb59137
+      uri: https://huggingface.co/Jiunsong/supergemma4-26b-uncensored-gguf-v2/resolve/main/supergemma4-26b-uncensored-fast-v2-Q4_K_M.gguf
+- name: "qwopus-glm-18b-merged"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/Jackrong/Qwopus-GLM-18B-Merged-GGUF
+  description: "# \U0001FA90 Qwen3.5-9B-GLM5.1-Distill-v1\n\n## \U0001F4CC Model Overview\n\n**Model Name:** `Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1`\n**Base Model:** Qwen3.5-9B\n**Training Type:** Supervised Fine-Tuning (SFT, Distillation)\n**Parameter Scale:** 9B\n**Training Framework:** Unsloth\n\nThis model is a distilled variant of **Qwen3.5-9B**, trained on high-quality reasoning data derived from **GLM-5.1**.\n\nThe primary goals are to:\n\n  - Improve **structured reasoning ability**\n  - Enhance **instruction-following consistency**\n  - Activate **latent knowledge via better reasoning structure**\n\n## \U0001F4CA Training Data\n\n### Main Dataset\n\n  - `Jackrong/GLM-5.1-Reasoning-1M-Cleaned`\n  - Cleaned from the original `Kassadin88/GLM-5.1-1000000x` dataset.\n  - Generated from a **GLM-5.1 teacher model**\n  - Approximately **700x** the scale of `Qwen3.5-reasoning-700x`\n  - Training used a **filtered subset**, not the full source dataset.\n\n### Auxiliary Dataset\n\n  - `Jackrong/Qwen3.5-reasoning-700x`\n\n...\n"
+  license: "apache-2.0"
+  tags:
+    - llm
+    - gguf
+    - reasoning
+  icon: https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/BnSg_x99v9bG9T5-8sKa1.png
+  overrides:
+    backend: llama-cpp
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    options:
+      - use_jinja:true
+    parameters:
+      model: llama-cpp/models/Qwopus-GLM-18B-Merged-GGUF/Qwopus-GLM-18B-Healed-Q4_K_M.gguf
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/Qwopus-GLM-18B-Merged-GGUF/Qwopus-GLM-18B-Healed-Q4_K_M.gguf
+      sha256: 13bd039f95c9ea46ef1d75905faa7be6ca4e47a5af9d4cf62e298a738a5b195f
+      uri: https://huggingface.co/Jackrong/Qwopus-GLM-18B-Merged-GGUF/resolve/main/Qwopus-GLM-18B-Healed-Q4_K_M.gguf
 - name: "qwen3.6-35b-a3b-apex"
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
  urls:
@@ -887,6 +1089,8 @@
    - gpu
  overrides:
    backend: neutts
+    parameters:
+      model: neuphonic/neutts-air
    known_usecases:
      - tts
 - name: vllm-omni-z-image-turbo
@@ -15186,14 +15390,16 @@
    - gpu
  overrides:
    parameters:
-      model: wan2.1-t2v-1.3B-Q8_0.gguf
+      model: wan2.1_t2v_1.3b-q8_0.gguf
  files:
-    - filename: "wan2.1-t2v-1.3B-Q8_0.gguf"
-      uri: "huggingface://calcuis/wan-gguf/wan2.1-t2v-1.3B-Q8_0.gguf"
+    - filename: "wan2.1_t2v_1.3b-q8_0.gguf"
+      sha256: "8f10260cc26498fee303851ee1c2047918934125731b9b78d4babfce4ec27458"
+      uri: "huggingface://calcuis/wan-gguf/wan2.1_t2v_1.3b-q8_0.gguf"
    - filename: "wan_2.1_vae.safetensors"
      uri: "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors"
    - filename: "umt5-xxl-encoder-Q8_0.gguf"
      uri: "huggingface://city96/umt5-xxl-encoder-gguf/umt5-xxl-encoder-Q8_0.gguf"
+      sha256: 2521d4de0bf9e1cc6549866463ceae85e4ec3239bc6063f7488810be39033bbc
 - name: wan-2.1-i2v-14b-480p-ggml
  license: apache-2.0
  url: "github:mudler/LocalAI/gallery/wan-ggml.yaml@master"
@@ -15214,11 +15420,103 @@
      model: wan2.1-i2v-14b-480p-Q4_K_M.gguf
    options:
      - "clip_vision_path:clip_vision_h.safetensors"
+      - "diffusion_model"
+      - "vae_decode_only:false"
+      - "sampler:euler"
+      - "flow_shift:3.0"
+      - "t5xxl_path:umt5-xxl-encoder-Q8_0.gguf"
+      - "vae_path:wan_2.1_vae.safetensors"
  files:
    - filename: "wan2.1-i2v-14b-480p-Q4_K_M.gguf"
+      sha256: "d91f7139acadb42ea05cdf97b311e5099f714f11fbe4d90916500e2f53cbba82"
      uri: "huggingface://city96/Wan2.1-I2V-14B-480P-gguf/wan2.1-i2v-14b-480p-Q4_K_M.gguf"
    - filename: "wan_2.1_vae.safetensors"
      uri: "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors"
+    - filename: "umt5-xxl-encoder-Q8_0.gguf"
+      uri: "huggingface://city96/umt5-xxl-encoder-gguf/umt5-xxl-encoder-Q8_0.gguf"
+      sha256: 2521d4de0bf9e1cc6549866463ceae85e4ec3239bc6063f7488810be39033bbc
+    - filename: "clip_vision_h.safetensors"
+      uri: "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/clip_vision/clip_vision_h.safetensors"
+- name: wan-2.1-flf2v-14b-720p-ggml
+  license: apache-2.0
+  url: "github:mudler/LocalAI/gallery/wan-ggml.yaml@master"
+  description: |
+    Wan 2.1 FLF2V 14B 720P — first-last-frame-to-video diffusion, GGUF Q4_K_M.
+    Takes a start and end reference image and interpolates a 33-frame clip
+    between them. Unlike the plain I2V variant this model feeds the end
+    frame through clip_vision as well, so it conditions semantically (not
+    just in pixel-space) on both endpoints. That makes it the right choice
+    for seamless loops (start_image == end_image) and clean narrative cuts.
+    Native 720p but accepts 480p resolutions; shares the same VAE, t5xxl
+    text encoder, and clip_vision_h as I2V 14B.
+  urls:
+    - https://huggingface.co/city96/Wan2.1-FLF2V-14B-720P-gguf
+  tags:
+    - image-to-video
+    - first-last-frame-to-video
+    - wan
+    - video-generation
+    - cpu
+    - gpu
+  overrides:
+    parameters:
+      model: wan2.1-flf2v-14b-720p-Q4_K_M.gguf
+    options:
+      - "clip_vision_path:clip_vision_h.safetensors"
+      - "diffusion_model"
+      - "vae_decode_only:false"
+      - "sampler:euler"
+      - "flow_shift:3.0"
+      - "t5xxl_path:umt5-xxl-encoder-Q8_0.gguf"
+      - "vae_path:wan_2.1_vae.safetensors"
+  files:
+    - filename: "wan2.1-flf2v-14b-720p-Q4_K_M.gguf"
+      sha256: "7652d7d8b0795009ff21ed83d806af762aae8a8faa8640dd07b3a67e4dfab445"
+      uri: "huggingface://city96/Wan2.1-FLF2V-14B-720P-gguf/wan2.1-flf2v-14b-720p-Q4_K_M.gguf"
+    - filename: "wan_2.1_vae.safetensors"
+      uri: "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors"
+    - filename: "umt5-xxl-encoder-Q8_0.gguf"
+      uri: "huggingface://city96/umt5-xxl-encoder-gguf/umt5-xxl-encoder-Q8_0.gguf"
+      sha256: 2521d4de0bf9e1cc6549866463ceae85e4ec3239bc6063f7488810be39033bbc
+    - filename: "clip_vision_h.safetensors"
+      uri: "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/clip_vision/clip_vision_h.safetensors"
+- name: wan-2.1-i2v-14b-720p-ggml
+  license: apache-2.0
+  url: "github:mudler/LocalAI/gallery/wan-ggml.yaml@master"
+  description: |
+    Wan 2.1 I2V 14B 720P — image-to-video diffusion, GGUF Q4_K_M.
+    Native 720p sibling of the 480p I2V model: animates a single
+    reference image into a 33-frame clip at up to 1280x720. Trained
+    purely as image-to-video (no first-last-frame interpolation path),
+    so motion is freer and better-suited to single-anchor animation
+    than repurposing the FLF2V 720P variant for i2v. Shares the same
+    VAE, umt5_xxl text encoder, and clip_vision_h as the I2V 14B 480P
+    and FLF2V 14B 720P entries.
+  urls:
+    - https://huggingface.co/city96/Wan2.1-I2V-14B-720P-gguf
+  tags:
+    - image-to-video
+    - wan
+    - video-generation
+    - cpu
+    - gpu
+  overrides:
+    parameters:
+      model: wan2.1-i2v-14b-720p-Q4_K_M.gguf
+    options:
+      - "clip_vision_path:clip_vision_h.safetensors"
+      - "diffusion_model"
+      - "vae_decode_only:false"
+      - "sampler:euler"
+      - "flow_shift:3.0"
+      - "t5xxl_path:umt5-xxl-encoder-Q8_0.gguf"
+      - "vae_path:wan_2.1_vae.safetensors"
+  files:
+    - filename: "wan2.1-i2v-14b-720p-Q4_K_M.gguf"
+      sha256: "ffecd91e4b636d8e3e43f3fa388218158ba447109547bde777c6d67ef4fe42a4"
+      uri: "huggingface://city96/Wan2.1-I2V-14B-720P-gguf/wan2.1-i2v-14b-720p-Q4_K_M.gguf"
+    - filename: "wan_2.1_vae.safetensors"
+      uri: "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors"
    - filename: "umt5-xxl-encoder-Q8_0.gguf"
      uri: "huggingface://city96/umt5-xxl-encoder-gguf/umt5-xxl-encoder-Q8_0.gguf"
    - filename: "clip_vision_h.safetensors"
--- a/gallery/wan-ggml.yaml
+++ b/gallery/wan-ggml.yaml
@@ -9,11 +9,6 @@ config_file: |
    - "diffusion_model"
    - "vae_decode_only:false"
    - "sampler:euler"
-    - "scheduler:discrete"
    - "flow_shift:3.0"
-    - "diffusion_flash_attn:true"
-    - "offload_params_to_cpu:true"
-    - "keep_vae_on_cpu:true"
-    - "keep_clip_on_cpu:true"
    - "t5xxl_path:umt5-xxl-encoder-Q8_0.gguf"
    - "vae_path:wan_2.1_vae.safetensors"
--- a/go.mod
+++ b/go.mod
@@ -8,13 +8,13 @@ require (
 	github.com/Masterminds/sprig/v3 v3.3.0
 	github.com/alecthomas/kong v1.14.0
 	github.com/anthropics/anthropic-sdk-go v1.27.0
-	github.com/aws/aws-sdk-go-v2 v1.41.5
-	github.com/aws/aws-sdk-go-v2/config v1.32.14
-	github.com/aws/aws-sdk-go-v2/credentials v1.19.14
-	github.com/aws/aws-sdk-go-v2/service/s3 v1.97.1
+	github.com/aws/aws-sdk-go-v2 v1.41.6
+	github.com/aws/aws-sdk-go-v2/config v1.32.16
+	github.com/aws/aws-sdk-go-v2/credentials v1.19.15
+	github.com/aws/aws-sdk-go-v2/service/s3 v1.99.1
 	github.com/charmbracelet/glamour v1.0.0
-	github.com/containerd/containerd v1.7.30
-	github.com/coreos/go-oidc/v3 v3.17.0
+	github.com/containerd/containerd v1.7.31
+	github.com/coreos/go-oidc/v3 v3.18.0
 	github.com/dhowden/tag v0.0.0-20240417053706-3d75831295e8
 	github.com/ebitengine/purego v0.10.0
 	github.com/emirpasic/gods/v2 v2.0.0-alpha
@@ -35,7 +35,7 @@ require (
 	github.com/lithammer/fuzzysearch v1.1.8
 	github.com/mholt/archiver/v3 v3.5.1
 	github.com/microcosm-cc/bluemonday v1.0.27
-	github.com/modelcontextprotocol/go-sdk v1.4.1
+	github.com/modelcontextprotocol/go-sdk v1.5.0
 	github.com/mudler/cogito v0.9.5-0.20260315222927-63abdec7189b
 	github.com/mudler/edgevpn v0.31.1
 	github.com/mudler/go-processmanager v0.1.0
@@ -75,24 +75,23 @@ require (
 )

 require (
-	github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.7.7 // indirect
-	github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.18.21 // indirect
-	github.com/aws/aws-sdk-go-v2/internal/configsources v1.4.21 // indirect
-	github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.7.21 // indirect
-	github.com/aws/aws-sdk-go-v2/internal/ini v1.8.6 // indirect
-	github.com/aws/aws-sdk-go-v2/internal/v4a v1.4.21 // indirect
-	github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.13.7 // indirect
-	github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.9.12 // indirect
-	github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.13.21 // indirect
-	github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.19.20 // indirect
-	github.com/aws/aws-sdk-go-v2/service/signin v1.0.9 // indirect
-	github.com/aws/aws-sdk-go-v2/service/sso v1.30.15 // indirect
-	github.com/aws/aws-sdk-go-v2/service/ssooidc v1.35.19 // indirect
-	github.com/aws/aws-sdk-go-v2/service/sts v1.41.10 // indirect
-	github.com/aws/smithy-go v1.24.2 // indirect
+	github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.7.9 // indirect
+	github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.18.22 // indirect
+	github.com/aws/aws-sdk-go-v2/internal/configsources v1.4.22 // indirect
+	github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.7.22 // indirect
+	github.com/aws/aws-sdk-go-v2/internal/v4a v1.4.23 // indirect
+	github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.13.8 // indirect
+	github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.9.14 // indirect
+	github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.13.22 // indirect
+	github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.19.22 // indirect
+	github.com/aws/aws-sdk-go-v2/service/signin v1.0.10 // indirect
+	github.com/aws/aws-sdk-go-v2/service/sso v1.30.16 // indirect
+	github.com/aws/aws-sdk-go-v2/service/ssooidc v1.35.20 // indirect
+	github.com/aws/aws-sdk-go-v2/service/sts v1.42.0 // indirect
+	github.com/aws/smithy-go v1.25.0 // indirect
 	github.com/bahlo/generic-list-go v0.2.0 // indirect
 	github.com/buger/jsonparser v1.1.1 // indirect
-	github.com/go-jose/go-jose/v4 v4.1.3 // indirect
+	github.com/go-jose/go-jose/v4 v4.1.4 // indirect
 	github.com/jinzhu/inflection v1.0.0 // indirect
 	github.com/jinzhu/now v1.1.5 // indirect
 	github.com/mattn/go-sqlite3 v1.14.24 // indirect
--- a/go.sum
+++ b/go.sum
@@ -70,44 +70,42 @@ github.com/anthropics/anthropic-sdk-go v1.27.0 h1:0CWbmBq5ofGAjF2H6lefCNRbnaUMGi
 github.com/anthropics/anthropic-sdk-go v1.27.0/go.mod h1:qUKmaW+uuPB64iy1l+4kOSvaLqPXnHTTBKH6RVZ7q5Q=
 github.com/armon/go-socks5 v0.0.0-20160902184237-e75332964ef5 h1:0CwZNZbxp69SHPdPJAN/hZIm0C4OItdklCFmMRWYpio=
 github.com/armon/go-socks5 v0.0.0-20160902184237-e75332964ef5/go.mod h1:wHh0iHkYZB8zMSxRWpUBQtwG5a7fFgvEO+odwuTv2gs=
-github.com/aws/aws-sdk-go-v2 v1.41.5 h1:dj5kopbwUsVUVFgO4Fi5BIT3t4WyqIDjGKCangnV/yY=
-github.com/aws/aws-sdk-go-v2 v1.41.5/go.mod h1:mwsPRE8ceUUpiTgF7QmQIJ7lgsKUPQOUl3o72QBrE1o=
-github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.7.7 h1:3kGOqnh1pPeddVa/E37XNTaWJ8W6vrbYV9lJEkCnhuY=
-github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.7.7/go.mod h1:lyw7GFp3qENLh7kwzf7iMzAxDn+NzjXEAGjKS2UOKqI=
-github.com/aws/aws-sdk-go-v2/config v1.32.14 h1:opVIRo/ZbbI8OIqSOKmpFaY7IwfFUOCCXBsUpJOwDdI=
-github.com/aws/aws-sdk-go-v2/config v1.32.14/go.mod h1:U4/V0uKxh0Tl5sxmCBZ3AecYny4UNlVmObYjKuuaiOo=
-github.com/aws/aws-sdk-go-v2/credentials v1.19.14 h1:n+UcGWAIZHkXzYt87uMFBv/l8THYELoX6gVcUvgl6fI=
-github.com/aws/aws-sdk-go-v2/credentials v1.19.14/go.mod h1:cJKuyWB59Mqi0jM3nFYQRmnHVQIcgoxjEMAbLkpr62w=
-github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.18.21 h1:NUS3K4BTDArQqNu2ih7yeDLaS3bmHD0YndtA6UP884g=
-github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.18.21/go.mod h1:YWNWJQNjKigKY1RHVJCuupeWDrrHjRqHm0N9rdrWzYI=
-github.com/aws/aws-sdk-go-v2/internal/configsources v1.4.21 h1:Rgg6wvjjtX8bNHcvi9OnXWwcE0a2vGpbwmtICOsvcf4=
-github.com/aws/aws-sdk-go-v2/internal/configsources v1.4.21/go.mod h1:A/kJFst/nm//cyqonihbdpQZwiUhhzpqTsdbhDdRF9c=
-github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.7.21 h1:PEgGVtPoB6NTpPrBgqSE5hE/o47Ij9qk/SEZFbUOe9A=
-github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.7.21/go.mod h1:p+hz+PRAYlY3zcpJhPwXlLC4C+kqn70WIHwnzAfs6ps=
-github.com/aws/aws-sdk-go-v2/internal/ini v1.8.6 h1:qYQ4pzQ2Oz6WpQ8T3HvGHnZydA72MnLuFK9tJwmrbHw=
-github.com/aws/aws-sdk-go-v2/internal/ini v1.8.6/go.mod h1:O3h0IK87yXci+kg6flUKzJnWeziQUKciKrLjcatSNcY=
-github.com/aws/aws-sdk-go-v2/internal/v4a v1.4.21 h1:SwGMTMLIlvDNyhMteQ6r8IJSBPlRdXX5d4idhIGbkXA=
-github.com/aws/aws-sdk-go-v2/internal/v4a v1.4.21/go.mod h1:UUxgWxofmOdAMuqEsSppbDtGKLfR04HGsD0HXzvhI1k=
-github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.13.7 h1:5EniKhLZe4xzL7a+fU3C2tfUN4nWIqlLesfrjkuPFTY=
-github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.13.7/go.mod h1:x0nZssQ3qZSnIcePWLvcoFisRXJzcTVvYpAAdYX8+GI=
-github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.9.12 h1:qtJZ70afD3ISKWnoX3xB0J2otEqu3LqicRcDBqsj0hQ=
-github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.9.12/go.mod h1:v2pNpJbRNl4vEUWEh5ytQok0zACAKfdmKS51Hotc3pQ=
-github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.13.21 h1:c31//R3xgIJMSC8S6hEVq+38DcvUlgFY0FM6mSI5oto=
-github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.13.21/go.mod h1:r6+pf23ouCB718FUxaqzZdbpYFyDtehyZcmP5KL9FkA=
-github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.19.20 h1:siU1A6xjUZ2N8zjTHSXFhB9L/2OY8Dqs0xXiLjF30jA=
-github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.19.20/go.mod h1:4TLZCmVJDM3FOu5P5TJP0zOlu9zWgDWU7aUxWbr+rcw=
-github.com/aws/aws-sdk-go-v2/service/s3 v1.97.1 h1:csi9NLpFZXb9fxY7rS1xVzgPRGMt7MSNWeQ6eo247kE=
-github.com/aws/aws-sdk-go-v2/service/s3 v1.97.1/go.mod h1:qXVal5H0ChqXP63t6jze5LmFalc7+ZE7wOdLtZ0LCP0=
-github.com/aws/aws-sdk-go-v2/service/signin v1.0.9 h1:QKZH0S178gCmFEgst8hN0mCX1KxLgHBKKY/CLqwP8lg=
-github.com/aws/aws-sdk-go-v2/service/signin v1.0.9/go.mod h1:7yuQJoT+OoH8aqIxw9vwF+8KpvLZ8AWmvmUWHsGQZvI=
-github.com/aws/aws-sdk-go-v2/service/sso v1.30.15 h1:lFd1+ZSEYJZYvv9d6kXzhkZu07si3f+GQ1AaYwa2LUM=
-github.com/aws/aws-sdk-go-v2/service/sso v1.30.15/go.mod h1:WSvS1NLr7JaPunCXqpJnWk1Bjo7IxzZXrZi1QQCkuqM=
-github.com/aws/aws-sdk-go-v2/service/ssooidc v1.35.19 h1:dzztQ1YmfPrxdrOiuZRMF6fuOwWlWpD2StNLTceKpys=
-github.com/aws/aws-sdk-go-v2/service/ssooidc v1.35.19/go.mod h1:YO8TrYtFdl5w/4vmjL8zaBSsiNp3w0L1FfKVKenZT7w=
-github.com/aws/aws-sdk-go-v2/service/sts v1.41.10 h1:p8ogvvLugcR/zLBXTXrTkj0RYBUdErbMnAFFp12Lm/U=
-github.com/aws/aws-sdk-go-v2/service/sts v1.41.10/go.mod h1:60dv0eZJfeVXfbT1tFJinbHrDfSJ2GZl4Q//OSSNAVw=
-github.com/aws/smithy-go v1.24.2 h1:FzA3bu/nt/vDvmnkg+R8Xl46gmzEDam6mZ1hzmwXFng=
-github.com/aws/smithy-go v1.24.2/go.mod h1:YE2RhdIuDbA5E5bTdciG9KrW3+TiEONeUWCqxX9i1Fc=
+github.com/aws/aws-sdk-go-v2 v1.41.6 h1:1AX0AthnBQzMx1vbmir3Y4WsnJgiydmnJjiLu+LvXOg=
+github.com/aws/aws-sdk-go-v2 v1.41.6/go.mod h1:dy0UzBIfwSeot4grGvY1AqFWN5zgziMmWGzysDnHFcQ=
+github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.7.9 h1:adBsCIIpLbLmYnkQU+nAChU5yhVTvu5PerROm+/Kq2A=
+github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.7.9/go.mod h1:uOYhgfgThm/ZyAuJGNQ5YgNyOlYfqnGpTHXvk3cpykg=
+github.com/aws/aws-sdk-go-v2/config v1.32.16 h1:Q0iQ7quUgJP0F/SCRTieScnaMdXr9h/2+wze1u3cNeM=
+github.com/aws/aws-sdk-go-v2/config v1.32.16/go.mod h1:duCCnJEFqpt2RC6no1iK6q+8HpwOAkiUua0pY507dQc=
+github.com/aws/aws-sdk-go-v2/credentials v1.19.15 h1:fyvgWTszojq8hEnMi8PPBTvZdTtEVmAVyo+NFLHBhH4=
+github.com/aws/aws-sdk-go-v2/credentials v1.19.15/go.mod h1:gJiYyMOjNg8OEdRWOf3CrFQxM2a98qmrtjx1zuiQfB8=
+github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.18.22 h1:IOGsJ1xVWhsi+ZO7/NW8OuZZBtMJLZbk4P5HDjJO0jQ=
+github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.18.22/go.mod h1:b+hYdbU+jGKfXE8kKM6g1+h+L/Go3vMvzlxBsiuGsxg=
+github.com/aws/aws-sdk-go-v2/internal/configsources v1.4.22 h1:GmLa5Kw1ESqtFpXsx5MmC84QWa/ZrLZvlJGa2y+4kcQ=
+github.com/aws/aws-sdk-go-v2/internal/configsources v1.4.22/go.mod h1:6sW9iWm9DK9YRpRGga/qzrzNLgKpT2cIxb7Vo2eNOp0=
+github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.7.22 h1:dY4kWZiSaXIzxnKlj17nHnBcXXBfac6UlsAx2qL6XrU=
+github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.7.22/go.mod h1:KIpEUx0JuRZLO7U6cbV204cWAEco2iC3l061IxlwLtI=
+github.com/aws/aws-sdk-go-v2/internal/v4a v1.4.23 h1:FPXsW9+gMuIeKmz7j6ENWcWtBGTe1kH8r9thNt5Uxx4=
+github.com/aws/aws-sdk-go-v2/internal/v4a v1.4.23/go.mod h1:7J8iGMdRKk6lw2C+cMIphgAnT8uTwBwNOsGkyOCm80U=
+github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.13.8 h1:HtOTYcbVcGABLOVuPYaIihj6IlkqubBwFj10K5fxRek=
+github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.13.8/go.mod h1:VsK9abqQeGlzPgUr+isNWzPlK2vKe9INMLWnY65f5Xs=
+github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.9.14 h1:xnvDEnw+pnj5mctWiYuFbigrEzSm35x7k4KS/ZkCANg=
+github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.9.14/go.mod h1:yS5rNogD8e0Wu9+l3MUwr6eENBzEeGejvINpN5PAYfY=
+github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.13.22 h1:PUmZeJU6Y1Lbvt9WFuJ0ugUK2xn6hIWUBBbKuOWF30s=
+github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.13.22/go.mod h1:nO6egFBoAaoXze24a2C0NjQCvdpk8OueRoYimvEB9jo=
+github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.19.22 h1:SE+aQ4DEqG53RRCAIHlCf//B2ycxGH7jFkpnAh/kKPM=
+github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.19.22/go.mod h1:ES3ynECd7fYeJIL6+oax+uIEljmfps0S70BaQzbMd/o=
+github.com/aws/aws-sdk-go-v2/service/s3 v1.99.1 h1:kU/eBN5+MWNo/LcbNa4hWDdN76hdcd7hocU5kvu7IsU=
+github.com/aws/aws-sdk-go-v2/service/s3 v1.99.1/go.mod h1:Fw9aqhJicIVee1VytBBjH+l+5ov6/PhbtIK/u3rt/ls=
+github.com/aws/aws-sdk-go-v2/service/signin v1.0.10 h1:a1Fq/KXn75wSzoJaPQTgZO0wHGqE9mjFnylnqEPTchA=
+github.com/aws/aws-sdk-go-v2/service/signin v1.0.10/go.mod h1:p6+MXNxW7IA6dMgHfTAzljuwSKD0NCm/4lbS4t6+7vI=
+github.com/aws/aws-sdk-go-v2/service/sso v1.30.16 h1:x6bKbmDhsgSZwv6q19wY/u3rLk/3FGjJWyqKcIRufpE=
+github.com/aws/aws-sdk-go-v2/service/sso v1.30.16/go.mod h1:CudnEVKRtLn0+3uMV0yEXZ+YZOKnAtUJ5DmDhilVnIw=
+github.com/aws/aws-sdk-go-v2/service/ssooidc v1.35.20 h1:oK/njaL8GtyEihkWMD4k3VgHCT64RQKkZwh0DG5j8ak=
+github.com/aws/aws-sdk-go-v2/service/ssooidc v1.35.20/go.mod h1:JHs8/y1f3zY7U5WcuzoJ/yAYGYtNIVPKLIbp61euvmg=
+github.com/aws/aws-sdk-go-v2/service/sts v1.42.0 h1:ks8KBcZPh3PYISr5dAiXCM5/Thcuxk8l+PG4+A0exds=
+github.com/aws/aws-sdk-go-v2/service/sts v1.42.0/go.mod h1:pFw33T0WLvXU3rw1WBkpMlkgIn54eCB5FYLhjDc9Foo=
+github.com/aws/smithy-go v1.25.0 h1:Sz/XJ64rwuiKtB6j98nDIPyYrV1nVNJ4YU74gttcl5U=
+github.com/aws/smithy-go v1.25.0/go.mod h1:YE2RhdIuDbA5E5bTdciG9KrW3+TiEONeUWCqxX9i1Fc=
 github.com/aymanbagabas/go-osc52/v2 v2.0.1 h1:HwpRHbFMcZLEVr42D4p7XBqjyuxQH5SMiErDT4WkJ2k=
 github.com/aymanbagabas/go-osc52/v2 v2.0.1/go.mod h1:uYgXzlJ7ZpABp8OJ+exZzJJhRNQ2ASbcXHWsFqH8hp8=
 github.com/aymanbagabas/go-udiff v0.2.0 h1:TK0fH4MteXUDspT88n8CKzvK0X9O2xu9yQjWpi6yML8=
@@ -198,8 +196,8 @@ github.com/cloudflare/circl v1.6.1/go.mod h1:uddAzsPgqdMAYatqJ0lsjX1oECcQLIlRpzZ
 github.com/cncf/udpa/go v0.0.0-20191209042840-269d4d468f6f/go.mod h1:M8M6+tZqaGXZJjfX53e64911xZQV5JYwmTeXPW+k8Sc=
 github.com/containerd/cgroups v1.1.0 h1:v8rEWFl6EoqHB+swVNjVoCJE8o3jX7e8nqBGPLaDFBM=
 github.com/containerd/cgroups v1.1.0/go.mod h1:6ppBcbh/NOOUU+dMKrykgaBnK9lCIBxHqJDGwsa1mIw=
-github.com/containerd/containerd v1.7.30 h1:/2vezDpLDVGGmkUXmlNPLCCNKHJ5BbC5tJB5JNzQhqE=
-github.com/containerd/containerd v1.7.30/go.mod h1:fek494vwJClULlTpExsmOyKCMUAbuVjlFsJQc4/j44M=
+github.com/containerd/containerd v1.7.31 h1:jn3IMuTV4Bb1Uwb0MFPW2ASJAD3W1lh6QqqZHIZwDh4=
+github.com/containerd/containerd v1.7.31/go.mod h1:jdwD6s/BhV4XVJGrvtziNPVA+83n66TwptVaPKprq4E=
 github.com/containerd/continuity v0.4.4 h1:/fNVfTJ7wIl/YPMHjf+5H32uFhl63JucB34PlCpMKII=
 github.com/containerd/continuity v0.4.4/go.mod h1:/lNJvtJKUQStBzpVQ1+rasXO1LAWtUQssk28EZvJ3nE=
 github.com/containerd/errdefs v1.0.0 h1:tg5yIfIlQIrxYtu9ajqY42W3lpS19XqdxRQeEwYG8PI=
@@ -212,8 +210,8 @@ github.com/containerd/platforms v0.2.1 h1:zvwtM3rz2YHPQsF2CHYM8+KtB5dvhISiXh5ZpS
 github.com/containerd/platforms v0.2.1/go.mod h1:XHCb+2/hzowdiut9rkudds9bE5yJ7npe7dG/wG+uFPw=
 github.com/containerd/stargz-snapshotter/estargz v0.18.2 h1:yXkZFYIzz3eoLwlTUZKz2iQ4MrckBxJjkmD16ynUTrw=
 github.com/containerd/stargz-snapshotter/estargz v0.18.2/go.mod h1:XyVU5tcJ3PRpkA9XS2T5us6Eg35yM0214Y+wvrZTBrY=
-github.com/coreos/go-oidc/v3 v3.17.0 h1:hWBGaQfbi0iVviX4ibC7bk8OKT5qNr4klBaCHVNvehc=
-github.com/coreos/go-oidc/v3 v3.17.0/go.mod h1:wqPbKFrVnE90vty060SB40FCJ8fTHTxSwyXJqZH+sI8=
+github.com/coreos/go-oidc/v3 v3.18.0 h1:V9orjXynvu5wiC9SemFTWnG4F45v403aIcjWo0d41+A=
+github.com/coreos/go-oidc/v3 v3.18.0/go.mod h1:DYCf24+ncYi+XkIH97GY1+dqoRlbaSI26KVTCI9SrY4=
 github.com/coreos/go-systemd v0.0.0-20181012123002-c6f51f82210d/go.mod h1:F5haX7vjVVG0kc13fIWeqUViNPyEJxv/OmvnBo0Yme4=
 github.com/coreos/go-systemd/v22 v22.5.0/go.mod h1:Y58oyj3AT4RCenI/lSvhwexgC+NSVTIJ3seZv2GcEnc=
 github.com/cpuguy83/dockercfg v0.3.2 h1:DlJTyZGBDlXqUZ2Dk2Q3xHs/FtnooJJVaad2S9GKorA=
@@ -336,8 +334,8 @@ github.com/go-gl/gl v0.0.0-20231021071112-07e5d0ea2e71 h1:5BVwOaUSBTlVZowGO6VZGw
 github.com/go-gl/gl v0.0.0-20231021071112-07e5d0ea2e71/go.mod h1:9YTyiznxEY1fVinfM7RvRcjRHbw2xLBJ3AAGIT0I4Nw=
 github.com/go-gl/glfw/v3.3/glfw v0.0.0-20240506104042-037f3cc74f2a h1:vxnBhFDDT+xzxf1jTJKMKZw3H0swfWk9RpWbBbDK5+0=
 github.com/go-gl/glfw/v3.3/glfw v0.0.0-20240506104042-037f3cc74f2a/go.mod h1:tQ2UAYgL5IevRw8kRxooKSPJfGvJ9fJQFa0TUsXzTg8=
-github.com/go-jose/go-jose/v4 v4.1.3 h1:CVLmWDhDVRa6Mi/IgCgaopNosCaHz7zrMeF9MlZRkrs=
-github.com/go-jose/go-jose/v4 v4.1.3/go.mod h1:x4oUasVrzR7071A4TnHLGSPpNOm2a21K9Kf04k1rs08=
+github.com/go-jose/go-jose/v4 v4.1.4 h1:moDMcTHmvE6Groj34emNPLs/qtYXRVcd6S7NHbHz3kA=
+github.com/go-jose/go-jose/v4 v4.1.4/go.mod h1:x4oUasVrzR7071A4TnHLGSPpNOm2a21K9Kf04k1rs08=
 github.com/go-logr/logr v1.2.2/go.mod h1:jdQByPbusPIv2/zmleS9BjJVeZ6kBagPoEUsqbVz/1A=
 github.com/go-logr/logr v1.4.3 h1:CjnDlHq8ikf6E492q6eKboGOC0T8CDaOvkHCIg8idEI=
 github.com/go-logr/logr v1.4.3/go.mod h1:9T104GzyrTigFIr8wt5mBrctHMim0Nb2HLGrmQ40KvY=
@@ -385,8 +383,8 @@ github.com/gofrs/flock v0.13.0/go.mod h1:jxeyy9R1auM5S6JYDBhDt+E2TCo7DkratH4Pgi8
 github.com/gogo/protobuf v1.1.1/go.mod h1:r8qH/GZQm5c6nD/R0oafs1akxWv10x8SbQlK7atdtwQ=
 github.com/gogo/protobuf v1.3.2 h1:Ov1cvc58UF3b5XjBnZv7+opcTcQFZebYjWzi34vdm4Q=
 github.com/gogo/protobuf v1.3.2/go.mod h1:P1XiOD3dCwIKUDQYPy72D8LYyHL2YPYrpS2s69NZV8Q=
-github.com/golang-jwt/jwt/v5 v5.3.0 h1:pv4AsKCKKZuqlgs5sUmn4x8UlGa0kEVt/puTpKx9vvo=
-github.com/golang-jwt/jwt/v5 v5.3.0/go.mod h1:fxCRLWMO43lRc8nhHWY6LGqRcf+1gQWArsqaEUEa5bE=
+github.com/golang-jwt/jwt/v5 v5.3.1 h1:kYf81DTWFe7t+1VvL7eS+jKFVWaUnK9cB1qbwn63YCY=
+github.com/golang-jwt/jwt/v5 v5.3.1/go.mod h1:fxCRLWMO43lRc8nhHWY6LGqRcf+1gQWArsqaEUEa5bE=
 github.com/golang/glog v0.0.0-20160126235308-23def4e6c14b/go.mod h1:SBH7ygxi8pfUlaOkMMuAQtPIUF8ecWP5IEl/CR7VP2Q=
 github.com/golang/groupcache v0.0.0-20200121045136-8c9f03a8e57e/go.mod h1:cIg4eruTrX1D+g88fzRXU5OdNfaM+9IcxsU14FzY7Hc=
 github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da/go.mod h1:cIg4eruTrX1D+g88fzRXU5OdNfaM+9IcxsU14FzY7Hc=
@@ -691,8 +689,8 @@ github.com/moby/sys/userns v0.1.0 h1:tVLXkFOxVu9A64/yh59slHVv9ahO9UIev4JZusOLG/g
 github.com/moby/sys/userns v0.1.0/go.mod h1:IHUYgu/kao6N8YZlp9Cf444ySSvCmDlmzUcYfDHOl28=
 github.com/moby/term v0.5.2 h1:6qk3FJAFDs6i/q3W/pQ97SX192qKfZgGjCQqfCJkgzQ=
 github.com/moby/term v0.5.2/go.mod h1:d3djjFCrjnB+fl8NJux+EJzu0msscUP+f8it8hPkFLc=
-github.com/modelcontextprotocol/go-sdk v1.4.1 h1:M4x9GyIPj+HoIlHNGpK2hq5o3BFhC+78PkEaldQRphc=
-github.com/modelcontextprotocol/go-sdk v1.4.1/go.mod h1:Bo/mS87hPQqHSRkMv4dQq1XCu6zv4INdXnFZabkNU6s=
+github.com/modelcontextprotocol/go-sdk v1.5.0 h1:CHU0FIX9kpueNkxuYtfYQn1Z0slhFzBZuq+x6IiblIU=
+github.com/modelcontextprotocol/go-sdk v1.5.0/go.mod h1:gggDIhoemhWs3BGkGwd1umzEXCEMMvAnhTrnbXJKKKA=
 github.com/modern-go/concurrent v0.0.0-20180228061459-e0a39a4cb421/go.mod h1:6dJC0mAP4ikYIbvyc7fijjWJddQyLn8Ig3JB5CqoB9Q=
 github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd h1:TRLaZ9cD/w8PVh93nsPXa1VrQ6jlwL5oN8l14QlcNfg=
 github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd/go.mod h1:6dJC0mAP4ikYIbvyc7fijjWJddQyLn8Ig3JB5CqoB9Q=
--- a/pkg/system/capabilities_test.go
+++ b/pkg/system/capabilities_test.go
@@ -159,7 +159,6 @@ var _ = Describe("CapabilityFilterDisabled", func() {
 		os.Setenv(capabilityEnv, "disable")
 		s := &SystemState{}
 		Expect(s.IsBackendCompatible("cuda12-whisperx", "quay.io/nvidia-cuda-12")).To(BeTrue())
-		Expect(s.IsBackendCompatible("rocm-whisperx", "quay.io/rocm")).To(BeTrue())
 		Expect(s.IsBackendCompatible("metal-whisperx", "quay.io/metal-darwin")).To(BeTrue())
 		Expect(s.IsBackendCompatible("intel-whisperx", "quay.io/intel-sycl")).To(BeTrue())
 		Expect(s.IsBackendCompatible("cpu-whisperx", "quay.io/cpu")).To(BeTrue())
--- a/swagger/docs.go
+++ b/swagger/docs.go
@@ -985,13 +985,11 @@ const docTemplate = `{
                "summary": "Backend monitor endpoint",
                "parameters": [
                    {
-                        "description": "Backend statistics request",
-                        "name": "request",
-                        "in": "body",
-                        "required": true,
-                        "schema": {
-                            "$ref": "#/definitions/schema.BackendMonitorRequest"
-                        }
+                        "type": "string",
+                        "description": "Name of the model to monitor",
+                        "name": "model",
+                        "in": "query",
+                        "required": true
                    }
                ],
                "responses": {
@@ -2408,6 +2406,23 @@ const docTemplate = `{
                }
            }
        },
+        "gallery.NodeDriftInfo": {
+            "type": "object",
+            "properties": {
+                "digest": {
+                    "type": "string"
+                },
+                "node_id": {
+                    "type": "string"
+                },
+                "node_name": {
+                    "type": "string"
+                },
+                "version": {
+                    "type": "string"
+                }
+            }
+        },
        "gallery.UpgradeInfo": {
            "type": "object",
            "properties": {
@@ -2425,6 +2440,13 @@ const docTemplate = `{
                },
                "installed_version": {
                    "type": "string"
+                },
+                "node_drift": {
+                    "description": "NodeDrift lists nodes whose installed version or digest differs from\nthe cluster majority. Non-empty means the cluster has diverged and an\nupgrade will realign it. Empty in single-node mode.",
+                    "type": "array",
+                    "items": {
+                        "$ref": "#/definitions/gallery.NodeDriftInfo"
+                    }
                }
            }
        },
--- a/swagger/swagger.json
+++ b/swagger/swagger.json
@@ -982,13 +982,11 @@
                "summary": "Backend monitor endpoint",
                "parameters": [
                    {
-                        "description": "Backend statistics request",
-                        "name": "request",
-                        "in": "body",
-                        "required": true,
-                        "schema": {
-                            "$ref": "#/definitions/schema.BackendMonitorRequest"
-                        }
+                        "type": "string",
+                        "description": "Name of the model to monitor",
+                        "name": "model",
+                        "in": "query",
+                        "required": true
                    }
                ],
                "responses": {
@@ -2405,6 +2403,23 @@
                }
            }
        },
+        "gallery.NodeDriftInfo": {
+            "type": "object",
+            "properties": {
+                "digest": {
+                    "type": "string"
+                },
+                "node_id": {
+                    "type": "string"
+                },
+                "node_name": {
+                    "type": "string"
+                },
+                "version": {
+                    "type": "string"
+                }
+            }
+        },
        "gallery.UpgradeInfo": {
            "type": "object",
            "properties": {
@@ -2422,6 +2437,13 @@
                },
                "installed_version": {
                    "type": "string"
+                },
+                "node_drift": {
+                    "description": "NodeDrift lists nodes whose installed version or digest differs from\nthe cluster majority. Non-empty means the cluster has diverged and an\nupgrade will realign it. Empty in single-node mode.",
+                    "type": "array",
+                    "items": {
+                        "$ref": "#/definitions/gallery.NodeDriftInfo"
+                    }
                }
            }
        },
--- a/swagger/swagger.yaml
+++ b/swagger/swagger.yaml
@@ -157,6 +157,17 @@ definitions:
          type: string
        type: array
    type: object
+  gallery.NodeDriftInfo:
+    properties:
+      digest:
+        type: string
+      node_id:
+        type: string
+      node_name:
+        type: string
+      version:
+        type: string
+    type: object
  gallery.UpgradeInfo:
    properties:
      available_digest:
@@ -169,6 +180,14 @@ definitions:
        type: string
      installed_version:
        type: string
+      node_drift:
+        description: |-
+          NodeDrift lists nodes whose installed version or digest differs from
+          the cluster majority. Non-empty means the cluster has diverged and an
+          upgrade will realign it. Empty in single-node mode.
+        items:
+          $ref: '#/definitions/gallery.NodeDriftInfo'
+        type: array
    type: object
  galleryop.OpStatus:
    properties:
@@ -2363,12 +2382,11 @@ paths:
  /backend/monitor:
    get:
      parameters:
-      - description: Backend statistics request
-        in: body
-        name: request
+      - description: Name of the model to monitor
+        in: query
+        name: model
        required: true
-        schema:
-          $ref: '#/definitions/schema.BackendMonitorRequest'
+        type: string
      responses:
        "200":
          description: Response
--- a/tests/e2e/distributed/node_lifecycle_test.go
+++ b/tests/e2e/distributed/node_lifecycle_test.go
@@ -57,7 +57,7 @@ var _ = Describe("Node Backend Lifecycle (NATS-driven)", Label("Distributed"), f
 			FlushNATS(infra.NC)

 			adapter := nodes.NewRemoteUnloaderAdapter(registry, infra.NC)
-			installReply, err := adapter.InstallBackend(node.ID, "llama-cpp", "", "")
+			installReply, err := adapter.InstallBackend(node.ID, "llama-cpp", "", "", "", "", "")
 			Expect(err).ToNot(HaveOccurred())
 			Expect(installReply.Success).To(BeTrue())
 		})
@@ -78,7 +78,7 @@ var _ = Describe("Node Backend Lifecycle (NATS-driven)", Label("Distributed"), f
 			FlushNATS(infra.NC)

 			adapter := nodes.NewRemoteUnloaderAdapter(registry, infra.NC)
-			installReply, err := adapter.InstallBackend(node.ID, "nonexistent", "", "")
+			installReply, err := adapter.InstallBackend(node.ID, "nonexistent", "", "", "", "", "")
 			Expect(err).ToNot(HaveOccurred())
 			Expect(installReply.Success).To(BeFalse())
 			Expect(installReply.Error).To(ContainSubstring("backend not found"))