fix(turboquant): patch ggml-hip CMakeLists to compile new f16-turbo fattn-vec instances

Fork commit fa4e8be0a0ce ("fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu") added three new template instance files under ggml-cuda/template-instances/ (fattn-vec-instance-f16-turbo{2,3,4}_0.cu) and wired matching FATTN_VEC_CASES_ALL_D(GGML_TYPE_F16, GGML_TYPE_TURBO*) dispatch cases into fattn.cu.

fattn.cu is shared with the HIP build via hipify, but the fork forgot to mirror the new source files into ggml/src/ggml-hip/CMakeLists.txt. CMake's ROCm branch carries a hand-curated template-instance list (used when GGML_CUDA_FA_ALL_QUANTS is OFF, the default), so the HIP build ended up with the extern template declarations but no matching instantiations, and the -gpu-rocm-hipblas-turboquant job failed partway through the 3h+ build.

Add patches/0001-ggml-hip-add-f16-turbo-vec-instances.patch, which the existing apply-patches.sh machinery applies to the cloned fork sources after fetch. The patch appends the three new f16-turbo instance files to ggml-hip's source list in the same interleaved order used by ggml-cuda's CMakeLists.txt.

Drop this patch once the fork syncs the ROCm list. The build will fail fast if the anchor context goes stale, which is the signal to retire it.

CUDA builds were unaffected (ggml-cuda's CMakeLists.txt was updated upstream); the link failure was isolated to HIP.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
⬆️ Update TheTom/llama-cpp-turboquant
2026-05-19 14:17:21 -04:00 · 2026-04-22 07:17:33 +00:00 · 2026-04-21 21:28:32 +00:00 · 2026-04-21 22:06:35 +02:00 · 2026-04-21 21:59:33 +02:00 · 2026-04-21 21:53:10 +02:00
61 changed files with 2251 additions and 520 deletions
--- a/.agents/ai-coding-assistants.md
+++ b/.agents/ai-coding-assistants.md
@@ -0,0 +1,101 @@
+# AI Coding Assistants
+
+This document provides guidance for AI tools and developers using AI
+assistance when contributing to LocalAI.
+
+**LocalAI follows the same guidelines as the Linux kernel project for
+AI-assisted contributions.** See the upstream policy here:
+<https://docs.kernel.org/process/coding-assistants.html>
+
+The rules below mirror that policy, adapted to LocalAI's license and
+project layout. If anything is unclear, the kernel document is the
+authoritative reference for intent.
+
+AI tools helping with LocalAI development should follow the standard
+project development process:
+
+- [CONTRIBUTING.md](../CONTRIBUTING.md) — development workflow, commit
+  conventions, and PR guidelines
+- [.agents/coding-style.md](coding-style.md) — code style, editorconfig,
+  logging, and documentation conventions
+- [.agents/building-and-testing.md](building-and-testing.md) — build and
+  test procedures
+
+## Licensing and Legal Requirements
+
+All contributions must comply with LocalAI's licensing requirements:
+
+- LocalAI is licensed under the **MIT License** — see the [LICENSE](../LICENSE)
+  file
+- New source files should use the SPDX license identifier `MIT` where
+  applicable to the file type
+- Contributions must be compatible with the MIT License and must not
+  introduce code under incompatible licenses (e.g., GPL) without an
+  explicit discussion with maintainers
+
+## Signed-off-by and Developer Certificate of Origin
+
+**AI agents MUST NOT add `Signed-off-by` tags.** Only humans can legally
+certify the Developer Certificate of Origin (DCO). The human submitter
+is responsible for:
+
+- Reviewing all AI-generated code
+- Ensuring compliance with licensing requirements
+- Adding their own `Signed-off-by` tag (when the project requires DCO)
+  to certify the contribution
+- Taking full responsibility for the contribution
+
+AI agents MUST NOT add `Co-Authored-By` trailers for themselves either.
+A human reviewer owns the contribution; the AI's involvement is recorded
+via `Assisted-by` (see below).
+
+## Attribution
+
+When AI tools contribute to LocalAI development, proper attribution helps
+track the evolving role of AI in the development process. Contributions
+should include an `Assisted-by` tag in the commit message trailer in the
+following format:
+
+```
+Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]
+```
+
+Where:
+
+- `AGENT_NAME` — name of the AI tool or framework (e.g., `Claude`,
+  `Copilot`, `Cursor`)
+- `MODEL_VERSION` — specific model version used (e.g.,
+  `claude-opus-4-7`, `gpt-5`)
+- `[TOOL1] [TOOL2]` — optional specialized analysis tools invoked by the
+  agent (e.g., `golangci-lint`, `staticcheck`, `go vet`)
+
+Basic development tools (git, go, make, editors) should **not** be listed.
+
+### Example
+
+```
+fix(llama-cpp): handle empty tool call arguments
+
+Previously the parser panicked when the model returned a tool call with
+an empty arguments object. Fall back to an empty JSON object in that
+case so downstream consumers receive a valid payload.
+
+Assisted-by: Claude:claude-opus-4-7 golangci-lint
+Signed-off-by: Jane Developer <jane@example.com>
+```
+
+## Scope and Responsibility
+
+Using an AI assistant does not reduce the contributor's responsibility.
+The human submitter must:
+
+- Understand every line that lands in the PR
+- Verify that generated code compiles, passes tests, and follows the
+  project style
+- Confirm that any referenced APIs, flags, or file paths actually exist
+  in the current tree (AI models may hallucinate identifiers)
+- Not submit AI output verbatim without review
+
+Reviewers may ask for clarification on any change regardless of how it
+was produced. "An AI wrote it" is not an acceptable answer to a design
+question.
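
The `Assisted-by` trailer format described in the policy file above can be checked mechanically. The following is a minimal sketch, assuming the commit message arrives on stdin; `has_assisted_by` is a hypothetical helper name and is not part of LocalAI or its CI.

```shell
#!/bin/bash
# Hypothetical lint helper (an assumption for illustration, not project code):
# succeeds when the commit message on stdin contains at least one line of the
# form "Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]".
has_assisted_by() {
    grep -qE '^Assisted-by: [^[:space:]:]+:[^[:space:]]+( [^[:space:]]+)*$'
}

# Example message taken from the policy text; prints "trailer ok" on a match.
printf 'fix(llama-cpp): handle empty tool call arguments\n\nAssisted-by: Claude:claude-opus-4-7 golangci-lint\n' \
    | has_assisted_by && echo "trailer ok"
```

The pattern only validates the shape of the trailer. It cannot tell whether the named agent or model is accurate, so that part remains the human submitter's responsibility.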
--- a/.github/workflows/backend.yml
+++ b/.github/workflows/backend.yml
@@ -30,6 +30,7 @@ jobs:
      skip-drivers: ${{ matrix.skip-drivers }}
      context: ${{ matrix.context }}
      ubuntu-version: ${{ matrix.ubuntu-version }}
+      amdgpu-targets: ${{ matrix.amdgpu-targets }}
    secrets:
      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
@@ -1623,19 +1624,6 @@ jobs:
            dockerfile: "./backend/Dockerfile.python"
            context: "./"
            ubuntu-version: '2404'
-          - build-type: 'hipblas'
-            cuda-major-version: ""
-            cuda-minor-version: ""
-            platforms: 'linux/amd64'
-            tag-latest: 'auto'
-            tag-suffix: '-gpu-rocm-hipblas-whisperx'
-            runs-on: 'bigger-runner'
-            base-image: "rocm/dev-ubuntu-24.04:7.2.1"
-            skip-drivers: 'false'
-            backend: "whisperx"
-            dockerfile: "./backend/Dockerfile.python"
-            context: "./"
-            ubuntu-version: '2404'
          - build-type: 'hipblas'
            cuda-major-version: ""
            cuda-minor-version: ""
--- a/.github/workflows/backend_build.yml
+++ b/.github/workflows/backend_build.yml
@@ -58,6 +58,11 @@ on:
        required: false
        default: '2204'
        type: string
+      amdgpu-targets:
+        description: 'AMD GPU targets for ROCm/HIP builds'
+        required: false
+        default: 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201'
+        type: string
    secrets:
      dockerUsername:
        required: false
@@ -214,6 +219,7 @@ jobs:
            BASE_IMAGE=${{ inputs.base-image }}
            BACKEND=${{ inputs.backend }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
+            AMDGPU_TARGETS=${{ inputs.amdgpu-targets }}
          context: ${{ inputs.context }}
          file: ${{ inputs.dockerfile }}
          cache-from: type=gha
@@ -235,6 +241,7 @@ jobs:
            BASE_IMAGE=${{ inputs.base-image }}
            BACKEND=${{ inputs.backend }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
+            AMDGPU_TARGETS=${{ inputs.amdgpu-targets }}
          context: ${{ inputs.context }}
          file: ${{ inputs.dockerfile }}
          cache-from: type=gha
--- a/.github/workflows/gallery-agent.yaml
+++ b/.github/workflows/gallery-agent.yaml
@@ -54,24 +54,41 @@ jobs:
          REPO: ${{ github.repository }}
          SEARCH: 'gallery agent in:title'
        run: |
-          # Walk open gallery-agent PRs and act on maintainer comments:
+          # Walk gallery-agent PRs and act on maintainer comments:
          #   /gallery-agent blacklist → label `gallery-agent/blacklisted` + close (never repropose)
          #   /gallery-agent recreate  → close without label (next run may repropose)
          # Only comments from OWNER / MEMBER / COLLABORATOR are honored so
          # random users can't drive the bot.
+          #
+          # We scan both open PRs AND recently-closed PRs that don't already
+          # carry the blacklist label. This covers the common flow where a
+          # maintainer writes /gallery-agent blacklist and immediately clicks
+          # Close — without this, the next scheduled run wouldn't see the
+          # command (PR is already closed) and would repropose the model.
          gh label create gallery-agent/blacklisted \
            --repo "$REPO" --color ededed \
            --description "gallery-agent must not repropose this model" 2>/dev/null || true

-          prs=$(gh pr list --repo "$REPO" --state open --search "$SEARCH" --json number --jq '.[].number')
+          prs_open=$(gh pr list --repo "$REPO" --state open --search "$SEARCH" \
+            --json number --jq '.[].number')
+          # Closed PRs from the last 14 days that don't yet have the blacklist label.
+          # Bounded window keeps the scan cheap while covering late-applied commands.
+          since=$(date -u -d '14 days ago' +%Y-%m-%d)
+          prs_closed=$(gh pr list --repo "$REPO" --state closed \
+            --search "$SEARCH closed:>=$since -label:gallery-agent/blacklisted" \
+            --json number --jq '.[].number')
+          prs=$(printf '%s\n%s\n' "$prs_open" "$prs_closed" | sort -u | sed '/^$/d')
          for pr in $prs; do
+            state=$(gh pr view "$pr" --repo "$REPO" --json state --jq '.state')
            cmds=$(gh pr view "$pr" --repo "$REPO" --json comments \
              --jq '.comments[] | select(.authorAssociation=="OWNER" or .authorAssociation=="MEMBER" or .authorAssociation=="COLLABORATOR") | .body')
            if echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+blacklist([[:space:]]|$)'; then
-              echo "PR #$pr: blacklist command found"
+              echo "PR #$pr: blacklist command found (state=$state)"
              gh pr edit "$pr" --repo "$REPO" --add-label gallery-agent/blacklisted || true
-              gh pr close "$pr" --repo "$REPO" --comment "Blacklisted via \`/gallery-agent blacklist\`. This model will not be reproposed." || true
-            elif echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+recreate([[:space:]]|$)'; then
+              if [ "$state" = "OPEN" ]; then
+                gh pr close "$pr" --repo "$REPO" --comment "Blacklisted via \`/gallery-agent blacklist\`. This model will not be reproposed." || true
+              fi
+            elif [ "$state" = "OPEN" ] && echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+recreate([[:space:]]|$)'; then
              echo "PR #$pr: recreate command found"
              gh pr close "$pr" --repo "$REPO" --comment "Closed via \`/gallery-agent recreate\`. The next scheduled run will propose this model again." || true
            fi
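
The list merging and command matching in the workflow step above can be tried outside CI. This is a sketch that reuses the same `sort -u | sed` pipeline and the same `grep -E` patterns; `merge_pr_lists` and `classify_gallery_cmd` are hypothetical names introduced for illustration, and the sketch ignores the PR state check that the workflow applies to `recreate`.

```shell
#!/bin/bash
# Illustrative only: merge_pr_lists and classify_gallery_cmd are hypothetical
# names, not functions defined in the workflow.

# Merge two newline-separated PR number lists, dropping duplicates and blank
# lines (the same pipeline the workflow uses to build $prs).
merge_pr_lists() {
    printf '%s\n%s\n' "$1" "$2" | sort -u | sed '/^$/d'
}

# Print which command the comment text contains. blacklist is tested first,
# mirroring the if/elif order in the workflow.
classify_gallery_cmd() {
    local cmds="$1"
    if echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+blacklist([[:space:]]|$)'; then
        echo blacklist
    elif echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+recreate([[:space:]]|$)'; then
        echo recreate
    else
        echo none
    fi
}

merge_pr_lists "101" ""
classify_gallery_cmd "/gallery-agent blacklist"
classify_gallery_cmd "x/gallery-agent blacklisted"
```

The last call prints `none`: the pattern requires whitespace or a line boundary on both sides of the command, so a command embedded in a longer word is not honored.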
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,11 +1,23 @@
 # LocalAI Agent Instructions

-This file is an index to detailed topic guides in the `.agents/` directory. Read the relevant file(s) for the task at hand — you don't need to load all of them.
+This file is the entry point for AI coding assistants (Claude Code, Cursor, Copilot, Codex, Aider, etc.) working on LocalAI. It is an index to detailed topic guides in the `.agents/` directory. Read the relevant file(s) for the task at hand — you don't need to load all of them.
+
+Human contributors: see [CONTRIBUTING.md](CONTRIBUTING.md) for the development workflow.
+
+## Policy for AI-Assisted Contributions
+
+LocalAI follows the Linux kernel project's [guidelines for AI coding assistants](https://docs.kernel.org/process/coding-assistants.html). Before submitting AI-assisted code, read [.agents/ai-coding-assistants.md](.agents/ai-coding-assistants.md). Key rules:
+
+- **No `Signed-off-by` from AI.** Only the human submitter may sign off on the Developer Certificate of Origin.
+- **No `Co-Authored-By: <AI>` trailers.** The human contributor owns the change.
+- **Use an `Assisted-by:` trailer** to attribute AI involvement. Format: `Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]`.
+- **The human submitter is responsible** for reviewing, testing, and understanding every line of generated code.

 ## Topics

 | File | When to read |
 |------|-------------|
+| [.agents/ai-coding-assistants.md](.agents/ai-coding-assistants.md) | Policy for AI-assisted contributions — licensing, DCO, attribution |
 | [.agents/building-and-testing.md](.agents/building-and-testing.md) | Building the project, running tests, Docker builds for specific platforms |
 | [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist |
 | [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -13,6 +13,7 @@ Thank you for your interest in contributing to LocalAI! We appreciate your time
  - [Development Workflow](#development-workflow)
  - [Creating a Pull Request (PR)](#creating-a-pull-request-pr)
 - [Coding Guidelines](#coding-guidelines)
+- [AI Coding Assistants](#ai-coding-assistants)
 - [Testing](#testing)
 - [Documentation](#documentation)
 - [Community and Communication](#community-and-communication)
@@ -185,7 +186,7 @@ Before jumping into a PR for a massive feature or big change, it is preferred to

 This project uses an [`.editorconfig`](.editorconfig) file to define formatting standards (indentation, line endings, charset, etc.). Please configure your editor to respect it.

-For AI-assisted development, see [`CLAUDE.md`](CLAUDE.md) for agent-specific guidelines including build instructions and backend architecture details.
+For AI-assisted development, see [`AGENTS.md`](AGENTS.md) (or the equivalent [`CLAUDE.md`](CLAUDE.md) symlink) for agent-specific guidelines including build instructions and backend architecture details. Contributions produced with AI assistance must follow the rules in the [AI Coding Assistants](#ai-coding-assistants) section below.

 ### General Principles

@@ -211,6 +212,26 @@ For AI-assisted development, see [`CLAUDE.md`](CLAUDE.md) for agent-specific gui
 - Reviewers will check for correctness, test coverage, adherence to these guidelines, and clarity of intent.
 - Be responsive to review feedback and keep discussions constructive.

+## AI Coding Assistants
+
+LocalAI follows the **same guidelines as the Linux kernel project** for AI-assisted contributions: <https://docs.kernel.org/process/coding-assistants.html>.
+
+The full policy for this repository lives in [`.agents/ai-coding-assistants.md`](.agents/ai-coding-assistants.md). Summary:
+
+- **AI agents MUST NOT add `Signed-off-by` tags.** Only humans can certify the Developer Certificate of Origin.
+- **AI agents MUST NOT add `Co-Authored-By` trailers** attributing themselves as co-authors.
+- **Attribute AI involvement with an `Assisted-by` trailer** in the commit message:
+
+  ```
+  Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]
+  ```
+
+  Example: `Assisted-by: Claude:claude-opus-4-7 golangci-lint`
+
+  Basic development tools (git, go, make, editors) should not be listed.
+- **The human submitter is responsible** for reviewing, testing, and fully understanding every line of AI-generated code — including verifying that any referenced APIs, flags, or file paths actually exist in the tree.
+- Contributions must remain compatible with LocalAI's **MIT License**.
+
 ## Testing

 All new features and bug fixes should include test coverage. The project uses [Ginkgo](https://onsi.github.io/ginkgo/) as its test framework.
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@

-IK_LLAMA_VERSION?=8befd92ea5f702494ea9813fe42a52fb015db5fe
+IK_LLAMA_VERSION?=d4824131580b94ffa7b0e91c955e2b237c2fe16e
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/ik-llama-cpp/patches/0002-gemma3-default-rms-norm-eps.patch
+++ b/backend/cpp/ik-llama-cpp/patches/0002-gemma3-default-rms-norm-eps.patch
@@ -1,38 +0,0 @@
-From: LocalAI maintainers <noreply@localai.io>
-Subject: [PATCH] gemma3: default rms norm eps when GGUF metadata key is missing
-
-Some Gemma 3 GGUF files (notably those distributed via the Ollama
-registry) do not embed the `gemma3.attention.layer_norm_rms_epsilon`
-metadata key. ik_llama.cpp currently requires the key to be present and
-fails the entire model load with:
-
-    error loading model hyperparameters:
-    key not found in model: gemma3.attention.layer_norm_rms_epsilon
-
-Ollama's own loader silently falls back to ~1e-6 in the same situation,
-which is the canonical Gemma 3 default (see google/gemma_pytorch
-config.py and the Hugging Face Gemma3Config), so the model still loads
-and works correctly.
-
-Mirror that behavior here: pre-seed the field with the Gemma 3 default
-and mark the metadata key as optional. This unblocks Ollama-converted
-Gemma 3 models without affecting GGUFs that already carry the key.
-
-Refs: ggml-org/llama.cpp#12367, ollama/ollama#10262, mudler/LocalAI#9414
----
- src/llama-hparams.cpp | 3 ++-
- 1 file changed, 2 insertions(+), 1 deletion(-)
-
-diff --git a/src/llama-hparams.cpp b/src/llama-hparams.cpp
---- a/src/llama-hparams.cpp
-+++ b/src/llama-hparams.cpp
-@@ -679,7 +679,8 @@
-                 hparams.rope_freq_scale_train_swa = 1.0f;
-
-                 ml.get_key(LLM_KV_ATTENTION_SLIDING_WINDOW,    hparams.n_swa);
--                ml.get_key(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, hparams.f_norm_rms_eps);
-+                hparams.f_norm_rms_eps = 1e-6f; // Gemma 3 canonical default; some Ollama GGUFs omit the key
-+                ml.get_key(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, hparams.f_norm_rms_eps, false);
-
-                 switch (hparams.n_layer) {
-                     case 26: model.type = e_model::MODEL_2B; break;
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@

-LLAMA_VERSION?=4f02d4733934179386cbc15b3454be26237940bb
+LLAMA_VERSION?=cf8b0dbda9ac0eac30ee33f87bc6702ead1c4664
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/patches/0001-gemma3-default-rms-norm-eps.patch
+++ b/backend/cpp/llama-cpp/patches/0001-gemma3-default-rms-norm-eps.patch
@@ -1,38 +0,0 @@
-From: LocalAI maintainers <noreply@localai.io>
-Subject: [PATCH] gemma3: default rms norm eps when GGUF metadata key is missing
-
-Some Gemma 3 GGUF files (notably those distributed via the Ollama
-registry) do not embed the `gemma3.attention.layer_norm_rms_epsilon`
-metadata key. llama.cpp currently requires the key to be present and
-fails the entire model load with:
-
-    error loading model hyperparameters:
-    key not found in model: gemma3.attention.layer_norm_rms_epsilon
-
-Ollama's own loader silently falls back to ~1e-6 in the same situation,
-which is the canonical Gemma 3 default (see google/gemma_pytorch
-config.py and the Hugging Face Gemma3Config), so the model still loads
-and works correctly.
-
-Mirror that behavior here: pre-seed the field with the Gemma 3 default
-and mark the metadata key as optional. This unblocks Ollama-converted
-Gemma 3 models without affecting GGUFs that already carry the key.
-
-Refs: ggml-org/llama.cpp#12367, ollama/ollama#10262, mudler/LocalAI#9414
----
- src/llama-model.cpp | 3 ++-
- 1 file changed, 2 insertions(+), 1 deletion(-)
-
-diff --git a/src/llama-model.cpp b/src/llama-model.cpp
---- a/src/llama-model.cpp
-+++ b/src/llama-model.cpp
-@@ -1568,7 +1568,8 @@
-
-                 hparams.f_final_logit_softcapping = 0.0f;
-                 ml.get_key(LLM_KV_FINAL_LOGIT_SOFTCAPPING, hparams.f_final_logit_softcapping, false);
--                ml.get_key(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, hparams.f_norm_rms_eps);
-+                hparams.f_norm_rms_eps = 1e-6f; // Gemma 3 canonical default; some Ollama GGUFs omit the key
-+                ml.get_key(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, hparams.f_norm_rms_eps, false);
-
-                 switch (hparams.n_layer) {
-                     case 18: type = LLM_TYPE_270M; break;
--- a/backend/cpp/turboquant/Makefile
+++ b/backend/cpp/turboquant/Makefile
@@ -1,7 +1,7 @@

 # Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
 # Auto-bumped nightly by .github/workflows/bump_deps.yaml.
-TURBOQUANT_VERSION?=45f8a066ed5f5bb38c695cec532f6cef9f4efa9d
+TURBOQUANT_VERSION?=4d24ad87b8ed2ad160809af41930f1e04b83f234
 LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant

 CMAKE_ARGS?=
--- a/backend/cpp/turboquant/patch-grpc-server.sh
+++ b/backend/cpp/turboquant/patch-grpc-server.sh
@@ -1,13 +1,22 @@
 #!/bin/bash
-# Augment the shared backend/cpp/llama-cpp/grpc-server.cpp allow-list of KV-cache
-# types so the gRPC `LoadModel` call accepts the TurboQuant-specific
-# `turbo2` / `turbo3` / `turbo4` cache types.
+# Patch the shared backend/cpp/llama-cpp/grpc-server.cpp *copy* used by the
+# turboquant build to account for two gaps between upstream and the fork:
 #
-# We do this on the *copy* sitting in turboquant-<flavor>-build/, never on the
-# original under backend/cpp/llama-cpp/, so the stock llama-cpp build keeps
-# compiling against vanilla upstream which does not know about GGML_TYPE_TURBO*.
+#   1. Augment the kv_cache_types[] allow-list so `LoadModel` accepts the
+#      fork-specific `turbo2` / `turbo3` / `turbo4` cache types.
+#   2. Replace `get_media_marker()` (added upstream in ggml-org/llama.cpp#21962,
+#      server-side random per-instance marker) with the legacy "<__media__>"
+#      literal. The fork branched before that PR, so server-common.cpp has no
+#      get_media_marker symbol. The fork's mtmd_default_marker() still returns
+#      "<__media__>", and Go-side tooling falls back to that sentinel when the
+#      backend does not expose media_marker, so substituting the literal keeps
+#      behavior identical on the turboquant path.
 #
-# Idempotent: skips the insertion if the marker is already present (so re-runs
+# We patch the *copy* sitting in turboquant-<flavor>-build/, never the original
+# under backend/cpp/llama-cpp/, so the stock llama-cpp build keeps compiling
+# against vanilla upstream.
+#
+# Idempotent: skips each insertion if its marker is already present (so re-runs
 # of the same build dir don't double-insert).

 set -euo pipefail
@@ -25,33 +34,47 @@ if [[ ! -f "$SRC" ]]; then
 fi

 if grep -q 'GGML_TYPE_TURBO2_0' "$SRC"; then
-    echo "==> $SRC already has TurboQuant cache types, skipping"
-    exit 0
+    echo "==> $SRC already has TurboQuant cache types, skipping KV allow-list patch"
+else
+    echo "==> patching $SRC to allow turbo2/turbo3/turbo4 KV-cache types"
+
+    # Insert the three TURBO entries right after the first `    GGML_TYPE_Q5_1,`
+    # line (the kv_cache_types[] allow-list). Using awk because the builder image
+    # does not ship python3, and GNU sed's multi-line `a\` quoting is awkward.
+    awk '
+        /^    GGML_TYPE_Q5_1,$/ && !done {
+            print
+            print "    // turboquant fork extras — added by patch-grpc-server.sh"
+            print "    GGML_TYPE_TURBO2_0,"
+            print "    GGML_TYPE_TURBO3_0,"
+            print "    GGML_TYPE_TURBO4_0,"
+            done = 1
+            next
+        }
+        { print }
+        END {
+            if (!done) {
+                print "patch-grpc-server.sh: anchor `    GGML_TYPE_Q5_1,` not found" > "/dev/stderr"
+                exit 1
+            }
+        }
+    ' "$SRC" > "$SRC.tmp"
+    mv "$SRC.tmp" "$SRC"
+
+    echo "==> KV allow-list patch OK"
 fi

-echo "==> patching $SRC to allow turbo2/turbo3/turbo4 KV-cache types"
+if grep -q 'get_media_marker()' "$SRC"; then
+    echo "==> patching $SRC to replace get_media_marker() with legacy \"<__media__>\" literal"
+    # Only one call site today (ModelMetadata), but replace all occurrences to
+    # stay robust if upstream adds more. Use a temp file to avoid relying on
+    # sed -i portability (the builder image uses GNU sed, but keeping this
+    # consistent with the awk block above).
+    sed 's/get_media_marker()/"<__media__>"/g' "$SRC" > "$SRC.tmp"
+    mv "$SRC.tmp" "$SRC"
+    echo "==> get_media_marker() substitution OK"
+else
+    echo "==> $SRC has no get_media_marker() call, skipping media-marker patch"
+fi

-# Insert the three TURBO entries right after the first `    GGML_TYPE_Q5_1,`
-# line (the kv_cache_types[] allow-list). Using awk because the builder image
-# does not ship python3, and GNU sed's multi-line `a\` quoting is awkward.
-awk '
-    /^    GGML_TYPE_Q5_1,$/ && !done {
-        print
-        print "    // turboquant fork extras — added by patch-grpc-server.sh"
-        print "    GGML_TYPE_TURBO2_0,"
-        print "    GGML_TYPE_TURBO3_0,"
-        print "    GGML_TYPE_TURBO4_0,"
-        done = 1
-        next
-    }
-    { print }
-    END {
-        if (!done) {
-            print "patch-grpc-server.sh: anchor `    GGML_TYPE_Q5_1,` not found" > "/dev/stderr"
-            exit 1
-        }
-    }
-' "$SRC" > "$SRC.tmp"
-mv "$SRC.tmp" "$SRC"
-
-echo "==> patched OK"
+echo "==> all patches applied"
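
The guard-then-insert pattern used by the KV allow-list step can be exercised on a throwaway file. This is a reduced sketch, not the script itself: `demo_patch` and the sample file contents are assumptions, and the sketch removes the temp file on failure, which the original does not do.

```shell
#!/bin/bash
# Sketch only: reproduces the grep guard and awk insertion on a scratch file.
# demo_patch and the sample input are assumptions made for illustration.
demo_patch() {
    local src="$1"
    if grep -q 'GGML_TYPE_TURBO2_0' "$src"; then
        return 0    # marker already present, so a re-run changes nothing
    fi
    awk '
        /^    GGML_TYPE_Q5_1,$/ && !done {
            print
            print "    GGML_TYPE_TURBO2_0,"
            print "    GGML_TYPE_TURBO3_0,"
            print "    GGML_TYPE_TURBO4_0,"
            done = 1
            next
        }
        { print }
        END { if (!done) exit 1 }   # anchor missing: report failure
    ' "$src" > "$src.tmp" || { rm -f "$src.tmp"; return 1; }
    mv "$src.tmp" "$src"
}

tmp=$(mktemp)
printf '    GGML_TYPE_Q5_0,\n    GGML_TYPE_Q5_1,\n};\n' > "$tmp"
demo_patch "$tmp" && demo_patch "$tmp"   # second call is a no-op
grep -c 'GGML_TYPE_TURBO' "$tmp"          # number of inserted lines
rm -f "$tmp"
```

Writing to a temp file and renaming it keeps the source untouched when the anchor is missing, so a stale anchor fails the build instead of producing a half-patched file.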
--- a/backend/cpp/turboquant/patches/0001-ggml-hip-add-f16-turbo-vec-instances.patch
+++ b/backend/cpp/turboquant/patches/0001-ggml-hip-add-f16-turbo-vec-instances.patch
@@ -0,0 +1,47 @@
+From: LocalAI turboquant backend maintainers <noreply@localai.io>
+Subject: ggml-hip: add F16-K + TURBO-V fattn-vec template instances
+
+Upstream commit fa4e8be0a0ce ("fix(cuda): add F16-K + TURBO-V dispatch cases
+in fattn.cu") added three new template instance files under ggml-cuda/:
+
+  - fattn-vec-instance-f16-turbo2_0.cu
+  - fattn-vec-instance-f16-turbo3_0.cu
+  - fattn-vec-instance-f16-turbo4_0.cu
+
+and registered them in ggml/src/ggml-cuda/CMakeLists.txt. The companion
+dispatch cases FATTN_VEC_CASES_ALL_D(GGML_TYPE_F16, GGML_TYPE_TURBO{2,3,4}_0)
+were added to ggml/src/ggml-cuda/fattn.cu, which is shared with the HIP
+build path via hipify.
+
+However, ggml/src/ggml-hip/CMakeLists.txt carries its own explicit list of
+template instance sources (used when GGML_CUDA_FA_ALL_QUANTS is OFF, which
+is the default) and was never updated for the new F16-K + TURBO-V combos.
+The HIP build therefore compiles the dispatch cases (which reference
+ggml_cuda_flash_attn_ext_vec_case<D, F16, TURBO*>) without ever compiling
+the matching template instantiations, causing a link-time failure in the
+-gpu-rocm-hipblas-turboquant CI job.
+
+Add the three new template instance files to ggml-hip's list so the HIP
+build links cleanly. Drop this patch once the fork picks up the
+corresponding upstream sync in ggml-hip/CMakeLists.txt.
+
+--- a/ggml/src/ggml-hip/CMakeLists.txt
+++ b/ggml/src/ggml-hip/CMakeLists.txt
+@@ -85,14 +85,17 @@ else()
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo3_0-turbo3_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo3_0-q8_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-q8_0-turbo3_0.cu
++        ../ggml-cuda/template-instances/fattn-vec-instance-f16-turbo3_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo2_0-turbo2_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo2_0-q8_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-q8_0-turbo2_0.cu
++        ../ggml-cuda/template-instances/fattn-vec-instance-f16-turbo2_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo3_0-turbo2_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo2_0-turbo3_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo4_0-turbo4_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo4_0-q8_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-q8_0-turbo4_0.cu
++        ../ggml-cuda/template-instances/fattn-vec-instance-f16-turbo4_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo4_0-turbo3_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo3_0-turbo4_0.cu
+         ../ggml-cuda/template-instances/fattn-vec-instance-turbo4_0-turbo2_0.cu
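
Drift between the two source lists, the cause of the failure this patch works around, could in principle be detected before a long build. The sketch below assumes both files name the instance sources literally; `missing_instances` is a hypothetical helper that exists neither in the fork nor in LocalAI, and the sample inputs are made up.

```shell
#!/bin/bash
# Sketch under assumptions: both inputs list instance sources by literal name.
# missing_instances is a hypothetical helper, not part of the fork or LocalAI.
missing_instances() {
    local cuda_list="$1" hip_list="$2" a b
    a=$(mktemp); b=$(mktemp)
    { grep -oE 'fattn-vec-instance-[A-Za-z0-9_-]+\.cu' "$cuda_list" || true; } | sort -u > "$a"
    { grep -oE 'fattn-vec-instance-[A-Za-z0-9_-]+\.cu' "$hip_list"  || true; } | sort -u > "$b"
    comm -23 "$a" "$b"   # names present only in the CUDA list
    rm -f "$a" "$b"
}

cuda=$(mktemp); hip=$(mktemp)
printf '%s\n' \
    'template-instances/fattn-vec-instance-q8_0-turbo3_0.cu' \
    'template-instances/fattn-vec-instance-f16-turbo3_0.cu' > "$cuda"
printf '%s\n' \
    '../ggml-cuda/template-instances/fattn-vec-instance-q8_0-turbo3_0.cu' > "$hip"
missing_instances "$cuda" "$hip"
rm -f "$cuda" "$hip"
```

With the sample inputs this prints the single file name that the HIP list lacks. An empty result would suggest the lists agree and the patch can be retired.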
--- a/backend/cpp/turboquant/patches/0001-server-respect-the-ignore-eos-flag.patch
+++ b/backend/cpp/turboquant/patches/0001-server-respect-the-ignore-eos-flag.patch
@@ -1,83 +0,0 @@
-From 660600081fb7b9b769ded5c805a2d39a419f0a0d Mon Sep 17 00:00:00 2001
-From: Yuri Khrustalev <ykhrustalev@users.noreply.github.com>
-Date: Wed, 8 Apr 2026 11:12:15 -0400
-Subject: [PATCH] server: respect the ignore eos flag (#21203)
-
----
- tools/server/server-context.cpp | 3 +++
- tools/server/server-context.h   | 3 +++
- tools/server/server-task.cpp    | 3 ++-
- tools/server/server-task.h      | 1 +
- 4 files changed, 9 insertions(+), 1 deletion(-)
-
-diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
-index 9d3ac538..b31981c5 100644
---- a/tools/server/server-context.cpp
-+++ b/tools/server/server-context.cpp
-@@ -3033,6 +3033,8 @@ server_context_meta server_context::get_meta() const {
-         /* fim_rep_token          */ llama_vocab_fim_rep(impl->vocab),
-         /* fim_sep_token          */ llama_vocab_fim_sep(impl->vocab),
- 
-+        /* logit_bias_eog         */ impl->params_base.sampling.logit_bias_eog,
-+
-         /* model_vocab_type       */ llama_vocab_type(impl->vocab),
-         /* model_vocab_n_tokens   */ llama_vocab_n_tokens(impl->vocab),
-         /* model_n_ctx_train      */ llama_model_n_ctx_train(impl->model),
-@@ -3117,6 +3119,7 @@ std::unique_ptr<server_res_generator> server_routes::handle_completions_impl(
-                     ctx_server.vocab,
-                     params,
-                     meta->slot_n_ctx,
-+                    meta->logit_bias_eog,
-                     data);
-             task.id_slot = json_value(data, "id_slot", -1);
- 
-diff --git a/tools/server/server-context.h b/tools/server/server-context.h
-index d7ce8735..6ea9afc0 100644
---- a/tools/server/server-context.h
-+++ b/tools/server/server-context.h
-@@ -39,6 +39,9 @@ struct server_context_meta {
-     llama_token fim_rep_token;
-     llama_token fim_sep_token;
- 
-+    // sampling
-+    std::vector<llama_logit_bias> logit_bias_eog;
-+
-     // model meta
-     enum llama_vocab_type model_vocab_type;
-     int32_t model_vocab_n_tokens;
-diff --git a/tools/server/server-task.cpp b/tools/server/server-task.cpp
-index 4cc87bc5..856b3f0e 100644
---- a/tools/server/server-task.cpp
-+++ b/tools/server/server-task.cpp
-@@ -239,6 +239,7 @@ task_params server_task::params_from_json_cmpl(
-         const llama_vocab * vocab,
-         const common_params & params_base,
-         const int n_ctx_slot,
-+        const std::vector<llama_logit_bias> & logit_bias_eog,
-         const json & data) {
-     task_params params;
- 
-@@ -562,7 +563,7 @@ task_params server_task::params_from_json_cmpl(
-         if (params.sampling.ignore_eos) {
-             params.sampling.logit_bias.insert(
-                     params.sampling.logit_bias.end(),
--                    defaults.sampling.logit_bias_eog.begin(), defaults.sampling.logit_bias_eog.end());
-+                    logit_bias_eog.begin(), logit_bias_eog.end());
-         }
-     }
- 
-diff --git a/tools/server/server-task.h b/tools/server/server-task.h
-index d855bf08..243e47a8 100644
--- a/tools/server/server-task.h
-+++ b/tools/server/server-task.h
-@@ -209,6 +209,7 @@ struct server_task {
-         const llama_vocab * vocab,
-         const common_params & params_base,
-         const int n_ctx_slot,
-+        const std::vector<llama_logit_bias> & logit_bias_eog,
-         const json & data);
- 
-     // utility function
-- 
-2.43.0
-
--- a/backend/go/stablediffusion-ggml/Makefile
+++ b/backend/go/stablediffusion-ggml/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # stablediffusion.cpp (ggml)
 STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
-STABLEDIFFUSION_GGML_VERSION?=7d33d4b2ddeafa672761a5880ec33bdff452504d
+STABLEDIFFUSION_GGML_VERSION?=44cca3d626d301e2215d5e243277e8f0e65bfa78

 CMAKE_ARGS+=-DGGML_MAX_NAME=128

--- a/backend/go/stablediffusion-ggml/gosd.cpp
+++ b/backend/go/stablediffusion-ggml/gosd.cpp
@@ -1106,6 +1106,11 @@ static int ffmpeg_mux_raw_to_mp4(sd_image_t* frames, int num_frames, int fps, co
            const_cast<char*>("-c:v"), const_cast<char*>("libx264"),
            const_cast<char*>("-pix_fmt"), const_cast<char*>("yuv420p"),
            const_cast<char*>("-movflags"), const_cast<char*>("+faststart"),
+            // Force MP4 container. Distributed LocalAI hands us a staging
+            // path (e.g. /staging/localai-output-NNN.tmp) with a non-standard
+            // extension; relying on filename suffix makes ffmpeg bail with
+            // "Unable to choose an output format".
+            const_cast<char*>("-f"), const_cast<char*>("mp4"),
            const_cast<char*>(dst),
            nullptr
        };
--- a/backend/go/whisper/Makefile
+++ b/backend/go/whisper/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # whisper.cpp version
 WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp
-WHISPER_CPP_VERSION?=166c20b473d5f4d04052e699f992f625ea2a2fdd
+WHISPER_CPP_VERSION?=fc674574ca27cac59a15e5b22a09b9d9ad62aafe
 SO_TARGET?=libgowhisper.so

 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
--- a/backend/index.yaml
+++ b/backend/index.yaml
@@ -587,7 +587,6 @@
  alias: "whisperx"
  capabilities:
    nvidia: "cuda12-whisperx"
-    amd: "rocm-whisperx"
    metal: "metal-whisperx"
    default: "cpu-whisperx"
    nvidia-cuda-13: "cuda13-whisperx"
@@ -1008,6 +1007,20 @@
    nvidia-cuda-12: "cuda12-turboquant-development"
    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-turboquant-development"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-turboquant-development"
+- !!merge <<: *stablediffusionggml
+  name: "stablediffusion-ggml-development"
+  capabilities:
+    default: "cpu-stablediffusion-ggml-development"
+    nvidia: "cuda12-stablediffusion-ggml-development"
+    intel: "intel-sycl-f16-stablediffusion-ggml-development"
+    # amd: "rocm-stablediffusion-ggml-development"
+    vulkan: "vulkan-stablediffusion-ggml-development"
+    nvidia-l4t: "nvidia-l4t-arm64-stablediffusion-ggml-development"
+    metal: "metal-stablediffusion-ggml-development"
+    nvidia-cuda-13: "cuda13-stablediffusion-ggml-development"
+    nvidia-cuda-12: "cuda12-stablediffusion-ggml-development"
+    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-stablediffusion-ggml-development"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-stablediffusion-ggml-development"
 - !!merge <<: *neutts
  name: "cpu-neutts"
  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-neutts"
@@ -2731,7 +2744,6 @@
  name: "whisperx-development"
  capabilities:
    nvidia: "cuda12-whisperx-development"
-    amd: "rocm-whisperx-development"
    metal: "metal-whisperx-development"
    default: "cpu-whisperx-development"
    nvidia-cuda-13: "cuda13-whisperx-development"
@@ -2757,16 +2769,6 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-whisperx"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-whisperx
- !!merge <<: *whisperx
-  name: "rocm-whisperx"
-  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-whisperx"
-  mirrors:
-    - localai/localai-backends:latest-gpu-rocm-hipblas-whisperx
- !!merge <<: *whisperx
-  name: "rocm-whisperx-development"
-  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-whisperx"
-  mirrors:
-    - localai/localai-backends:master-gpu-rocm-hipblas-whisperx
 - !!merge <<: *whisperx
  name: "cuda13-whisperx"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-whisperx"
--- a/backend/python/whisperx/requirements-hipblas.txt
+++ b/backend/python/whisperx/requirements-hipblas.txt
@@ -1,6 +0,0 @@
-# whisperx hard-pins torch~=2.8.0, which is not available in the rocm7.x indexes
-# (they start at torch 2.10). Keep rocm6.4 wheels here — they still load against
-# the rocm7.2.1 runtime via AMD's forward-compatibility window.
--extra-index-url https://download.pytorch.org/whl/rocm6.4
-torch==2.8.0+rocm6.4
-whisperx @ git+https://github.com/m-bain/whisperX.git
--- a/core/backend/llm.go
+++ b/core/backend/llm.go
@@ -40,6 +40,12 @@ type TokenUsage struct {
 	ChatDeltas             []*proto.ChatDelta // per-chunk deltas from C++ autoparser (only set during streaming)
 }

+func needsThinkingProbe(c *config.ModelConfig) bool {
+	return c.TemplateConfig.UseTokenizerTemplate &&
+		(c.ReasoningConfig.DisableReasoning == nil ||
+			c.ReasoningConfig.DisableReasoningTagPrefill == nil)
+}
+
 // HasChatDeltaContent returns true if any chat delta carries content or reasoning text.
 // Used to decide whether to prefer C++ autoparser deltas over Go-side tag extraction.
 func (t TokenUsage) HasChatDeltaContent() bool {
@@ -100,11 +106,9 @@ func ModelInference(ctx context.Context, s string, messages schema.Messages, ima
 	// tokenizer template path is active) and the multimodal media marker (needed
 	// by custom chat templates so markers line up with what mtmd expects).
 	// We probe whenever any of those slots is still empty.
-	needsThinkingProbe := c.TemplateConfig.UseTokenizerTemplate &&
-		c.ReasoningConfig.DisableReasoning == nil &&
-		c.ReasoningConfig.DisableReasoningTagPrefill == nil
+	shouldProbeThinking := needsThinkingProbe(c)
 	needsMarkerProbe := c.MediaMarker == ""
-	if needsThinkingProbe || needsMarkerProbe {
+	if shouldProbeThinking || needsMarkerProbe {
 		modelOpts := grpcModelOpts(*c, o.SystemState.Model.ModelsPath)
 		config.DetectThinkingSupportFromBackend(ctx, c, inferenceModel, modelOpts)
 		// Update the config in the loader so it persists for future requests
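Review note: this hunk changes behavior as well as structure. The removed inline expression probed only when both reasoning settings were unset, while the extracted helper probes when either one is unset. A minimal sketch of the two predicates side by side follows; `reasoningFlags`, `oldNeedsProbe` and `newNeedsProbe` are hypothetical stand-ins for illustration and are not part of the LocalAI code base.

```go
package main

import "fmt"

// reasoningFlags is a hypothetical stand-in for the two optional reasoning
// settings; a nil pointer means "not set by the user".
type reasoningFlags struct {
	DisableReasoning           *bool
	DisableReasoningTagPrefill *bool
}

// oldNeedsProbe mirrors the removed inline expression: both slots must be unset.
func oldNeedsProbe(useTokenizerTemplate bool, r reasoningFlags) bool {
	return useTokenizerTemplate &&
		r.DisableReasoning == nil &&
		r.DisableReasoningTagPrefill == nil
}

// newNeedsProbe mirrors the extracted helper: one unset slot is enough.
func newNeedsProbe(useTokenizerTemplate bool, r reasoningFlags) bool {
	return useTokenizerTemplate &&
		(r.DisableReasoning == nil || r.DisableReasoningTagPrefill == nil)
}

func main() {
	yes := true
	cases := []struct {
		label string
		flags reasoningFlags
	}{
		{"both unset", reasoningFlags{}},
		{"only disable set", reasoningFlags{DisableReasoning: &yes}},
		{"only prefill set", reasoningFlags{DisableReasoningTagPrefill: &yes}},
		{"both set", reasoningFlags{DisableReasoning: &yes, DisableReasoningTagPrefill: &yes}},
	}
	for _, c := range cases {
		fmt.Printf("%-17s old=%-5v new=%v\n", c.label, oldNeedsProbe(true, c.flags), newNeedsProbe(true, c.flags))
	}
}
```

The two predicates differ only in the partially configured cases, which is the situation the new probe test in this change exercises.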
--- a/core/backend/llm_probe_test.go
+++ b/core/backend/llm_probe_test.go
@@ -0,0 +1,29 @@
+package backend
+
+import (
+	"github.com/mudler/LocalAI/core/config"
+
+	"github.com/gpustack/gguf-parser-go/util/ptr"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+var _ = Describe("thinking probe gating", func() {
+	It("probes tokenizer-template models when any reasoning default is still unset", func() {
+		cfg := &config.ModelConfig{
+			TemplateConfig: config.TemplateConfig{UseTokenizerTemplate: true},
+		}
+		Expect(needsThinkingProbe(cfg)).To(BeTrue())
+
+		cfg.ReasoningConfig.DisableReasoning = ptr.To(true)
+		Expect(needsThinkingProbe(cfg)).To(BeTrue())
+
+		cfg.ReasoningConfig.DisableReasoningTagPrefill = ptr.To(true)
+		Expect(needsThinkingProbe(cfg)).To(BeFalse())
+	})
+
+	It("does not probe when tokenizer templates are disabled", func() {
+		cfg := &config.ModelConfig{}
+		Expect(needsThinkingProbe(cfg)).To(BeFalse())
+	})
+})
--- a/core/cli/run.go
+++ b/core/cli/run.go
@@ -507,7 +507,7 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {

 	app, err := application.New(opts...)
 	if err != nil {
-		return fmt.Errorf("failed basic startup tasks with error %s", err.Error())
+		return fmt.Errorf("LocalAI failed to start: %w.\nTroubleshooting steps:\n  1. Check that your models directory exists and is accessible: %s\n  2. Verify model config files are valid YAML: 'local-ai util usecase-heuristic <config>'\n  3. Check available disk space and file permissions\n  4. Run with --log-level=debug for more details\nSee https://localai.io/basics/troubleshooting/ for more help", err, r.ModelsPath)
 	}

 	appHTTP, err := http.API(app)
--- a/core/cli/transcript.go
+++ b/core/cli/transcript.go
@@ -3,7 +3,6 @@ package cli
 import (
 	"context"
 	"encoding/json"
-	"errors"
 	"fmt"
 	"strings"

@@ -60,7 +59,7 @@ func (t *TranscriptCMD) Run(ctx *cliContext.Context) error {

 	c, exists := cl.GetModelConfig(t.Model)
 	if !exists {
-		return errors.New("model not found")
+		return fmt.Errorf("model %q not found. Run 'local-ai models list' to see available models, or install one with 'local-ai models install <model>'. See https://localai.io/models/ for more information", t.Model)
 	}

 	c.Threads = &t.Threads
--- a/core/cli/util.go
+++ b/core/cli/util.go
@@ -74,7 +74,7 @@ func (u *CreateOCIImageCMD) Run(ctx *cliContext.Context) error {

 func (u *GGUFInfoCMD) Run(ctx *cliContext.Context) error {
 	if len(u.Args) == 0 {
-		return fmt.Errorf("no GGUF file provided")
+		return fmt.Errorf("no GGUF file provided. Usage: local-ai util gguf-info <path-to-file.gguf>\nGGUF is a binary format for storing quantized language models. You can download GGUF models from https://huggingface.co or install one with 'local-ai models install <model>'")
 	}
 	// We try to guess only if we don't have a template defined already
 	f, err := gguf.ParseGGUFFile(u.Args[0])
--- a/core/cli/worker.go
+++ b/core/cli/worker.go
@@ -21,6 +21,7 @@ import (
 	"github.com/mudler/LocalAI/core/cli/workerregistry"
 	"github.com/mudler/LocalAI/core/config"
 	"github.com/mudler/LocalAI/core/gallery"
+	"github.com/mudler/LocalAI/core/services/galleryop"
 	"github.com/mudler/LocalAI/core/services/messaging"
 	"github.com/mudler/LocalAI/core/services/nodes"
 	"github.com/mudler/LocalAI/core/services/storage"
@@ -597,12 +598,20 @@ func (s *backendSupervisor) installBackend(req messaging.BackendInstallRequest)
 	// Try to find the backend binary
 	backendPath := s.findBackend(req.Backend)
 	if backendPath == "" {
-		// Backend not found locally — try auto-installing from gallery
-		xlog.Info("Backend not found locally, attempting gallery install", "backend", req.Backend)
-		if err := gallery.InstallBackendFromGallery(
-			context.Background(), galleries, s.systemState, s.ml, req.Backend, nil, false,
-		); err != nil {
-			return "", fmt.Errorf("installing backend from gallery: %w", err)
+		if req.URI != "" {
+			xlog.Info("Backend not found locally, attempting external install", "backend", req.Backend, "uri", req.URI)
+			if err := galleryop.InstallExternalBackend(
+				context.Background(), galleries, s.systemState, s.ml, nil, req.URI, req.Name, req.Alias,
+			); err != nil {
+				return "", fmt.Errorf("installing external backend: %w", err)
+			}
+		} else {
+			xlog.Info("Backend not found locally, attempting gallery install", "backend", req.Backend)
+			if err := gallery.InstallBackendFromGallery(
+				context.Background(), galleries, s.systemState, s.ml, req.Backend, nil, false,
+			); err != nil {
+				return "", fmt.Errorf("installing backend from gallery: %w", err)
+			}
 		}
 		// Re-register after install and retry
 		gallery.RegisterBackends(s.systemState, s.ml)
--- a/core/cli/worker/worker_p2p.go
+++ b/core/cli/worker/worker_p2p.go
@@ -38,7 +38,7 @@ func (r *P2P) Run(ctx *cliContext.Context) error {
 	// Check if the token is set
 	// as we always need it.
 	if r.Token == "" {
-		return fmt.Errorf("Token is required")
+		return fmt.Errorf("a P2P token is required to join the network. Set it via the LOCALAI_TOKEN environment variable or the --token flag. You can generate a token by running 'local-ai run --p2p' on the main node. See https://localai.io/features/distribute/ for more information")
 	}

 	port, err := freeport.GetFreePort()
--- a/core/config/gguf.go
+++ b/core/config/gguf.go
@@ -125,19 +125,7 @@ func DetectThinkingSupportFromBackend(ctx context.Context, cfg *ModelConfig, bac
 			return
 		}

-		cfg.ReasoningConfig.DisableReasoning = ptr.To(!metadata.SupportsThinking)
-
-		// Use the rendered template to detect if thinking token is at the end
-		// This reuses the existing DetectThinkingStartToken function
-		if metadata.RenderedTemplate != "" {
-			thinkingStartToken := reasoning.DetectThinkingStartToken(metadata.RenderedTemplate, &cfg.ReasoningConfig)
-			thinkingForcedOpen := thinkingStartToken != ""
-			cfg.ReasoningConfig.DisableReasoningTagPrefill = ptr.To(!thinkingForcedOpen)
-			xlog.Debug("[gguf] DetectThinkingSupportFromBackend: thinking support detected", "supports_thinking", metadata.SupportsThinking, "thinking_forced_open", thinkingForcedOpen, "thinking_start_token", thinkingStartToken)
-		} else {
-			cfg.ReasoningConfig.DisableReasoningTagPrefill = ptr.To(true)
-			xlog.Debug("[gguf] DetectThinkingSupportFromBackend: thinking support detected", "supports_thinking", metadata.SupportsThinking, "thinking_forced_open", false)
-		}
+		applyDetectedThinkingConfig(cfg, metadata)

 		// Extract tool format markers from autoparser analysis
 		if tf := metadata.GetToolFormat(); tf != nil && tf.FormatType != "" {
@@ -180,3 +168,34 @@ func DetectThinkingSupportFromBackend(ctx context.Context, cfg *ModelConfig, bac
 		}
 	}
 }
+
+func applyDetectedThinkingConfig(cfg *ModelConfig, metadata *pb.ModelMetadataResponse) {
+	if cfg == nil || metadata == nil {
+		return
+	}
+
+	// Respect explicit YAML/user config. Backend probing should only fill defaults
+	// when the reasoning mode has not already been set.
+	if cfg.ReasoningConfig.DisableReasoning == nil {
+		cfg.ReasoningConfig.DisableReasoning = ptr.To(!metadata.SupportsThinking)
+	}
+
+	// Respect explicit prefill config for the same reason. Only infer the
+	// default prefill behavior when the user did not set it.
+	if cfg.ReasoningConfig.DisableReasoningTagPrefill == nil {
+		// Use the rendered template to detect if thinking token is at the end.
+		// This reuses the existing DetectThinkingStartToken function.
+		if metadata.RenderedTemplate != "" {
+			thinkingStartToken := reasoning.DetectThinkingStartToken(metadata.RenderedTemplate, &cfg.ReasoningConfig)
+			thinkingForcedOpen := thinkingStartToken != ""
+			cfg.ReasoningConfig.DisableReasoningTagPrefill = ptr.To(!thinkingForcedOpen)
+			xlog.Debug("[gguf] DetectThinkingSupportFromBackend: thinking support detected", "supports_thinking", metadata.SupportsThinking, "thinking_forced_open", thinkingForcedOpen, "thinking_start_token", thinkingStartToken)
+		} else {
+			cfg.ReasoningConfig.DisableReasoningTagPrefill = ptr.To(true)
+			xlog.Debug("[gguf] DetectThinkingSupportFromBackend: thinking support detected", "supports_thinking", metadata.SupportsThinking, "thinking_forced_open", false)
+		}
+		return
+	}
+
+	xlog.Debug("[gguf] DetectThinkingSupportFromBackend: preserving explicit reasoning config", "supports_thinking", metadata.SupportsThinking, "disable_reasoning", *cfg.ReasoningConfig.DisableReasoning, "disable_reasoning_tag_prefill", *cfg.ReasoningConfig.DisableReasoningTagPrefill)
+}
--- a/core/config/gguf_reasoning_test.go
+++ b/core/config/gguf_reasoning_test.go
@@ -0,0 +1,101 @@
+package config
+
+import (
+	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
+	"github.com/mudler/LocalAI/pkg/reasoning"
+
+	"github.com/gpustack/gguf-parser-go/util/ptr"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+var _ = Describe("GGUF backend metadata reasoning defaults", func() {
+	It("fills reasoning defaults when unset", func() {
+		cfg := &ModelConfig{
+			TemplateConfig: TemplateConfig{UseTokenizerTemplate: true},
+		}
+
+		applyDetectedThinkingConfig(cfg, &pb.ModelMetadataResponse{
+			SupportsThinking: true,
+			RenderedTemplate: "{{ bos_token }}<think>",
+		})
+
+		Expect(cfg.ReasoningConfig.DisableReasoning).ToNot(BeNil())
+		Expect(*cfg.ReasoningConfig.DisableReasoning).To(BeFalse())
+		Expect(cfg.ReasoningConfig.DisableReasoningTagPrefill).ToNot(BeNil())
+		Expect(*cfg.ReasoningConfig.DisableReasoningTagPrefill).To(BeFalse())
+	})
+
+	It("preserves fully explicit reasoning settings", func() {
+		cfg := &ModelConfig{
+			TemplateConfig: TemplateConfig{UseTokenizerTemplate: true},
+			ReasoningConfig: reasoning.Config{
+				DisableReasoning:           ptr.To(true),
+				DisableReasoningTagPrefill: ptr.To(true),
+			},
+		}
+
+		applyDetectedThinkingConfig(cfg, &pb.ModelMetadataResponse{
+			SupportsThinking: true,
+			RenderedTemplate: "{{ bos_token }}<think>",
+		})
+
+		Expect(cfg.ReasoningConfig.DisableReasoning).ToNot(BeNil())
+		Expect(*cfg.ReasoningConfig.DisableReasoning).To(BeTrue())
+		Expect(cfg.ReasoningConfig.DisableReasoningTagPrefill).ToNot(BeNil())
+		Expect(*cfg.ReasoningConfig.DisableReasoningTagPrefill).To(BeTrue())
+	})
+
+	It("preserves explicit disable while still inferring missing prefill", func() {
+		cfg := &ModelConfig{
+			TemplateConfig: TemplateConfig{UseTokenizerTemplate: true},
+			ReasoningConfig: reasoning.Config{
+				DisableReasoning: ptr.To(true),
+			},
+		}
+
+		applyDetectedThinkingConfig(cfg, &pb.ModelMetadataResponse{
+			SupportsThinking: true,
+			RenderedTemplate: "{{ bos_token }}<think>",
+		})
+
+		Expect(cfg.ReasoningConfig.DisableReasoning).ToNot(BeNil())
+		Expect(*cfg.ReasoningConfig.DisableReasoning).To(BeTrue())
+		Expect(cfg.ReasoningConfig.DisableReasoningTagPrefill).ToNot(BeNil())
+		Expect(*cfg.ReasoningConfig.DisableReasoningTagPrefill).To(BeFalse())
+	})
+
+	It("preserves explicit prefill while still inferring missing disable flag", func() {
+		cfg := &ModelConfig{
+			TemplateConfig: TemplateConfig{UseTokenizerTemplate: true},
+			ReasoningConfig: reasoning.Config{
+				DisableReasoningTagPrefill: ptr.To(true),
+			},
+		}
+
+		applyDetectedThinkingConfig(cfg, &pb.ModelMetadataResponse{
+			SupportsThinking: true,
+			RenderedTemplate: "{{ bos_token }}<think>",
+		})
+
+		Expect(cfg.ReasoningConfig.DisableReasoning).ToNot(BeNil())
+		Expect(*cfg.ReasoningConfig.DisableReasoning).To(BeFalse())
+		Expect(cfg.ReasoningConfig.DisableReasoningTagPrefill).ToNot(BeNil())
+		Expect(*cfg.ReasoningConfig.DisableReasoningTagPrefill).To(BeTrue())
+	})
+
+	It("defaults to disabling reasoning when backend does not support thinking", func() {
+		cfg := &ModelConfig{
+			TemplateConfig: TemplateConfig{UseTokenizerTemplate: true},
+		}
+
+		applyDetectedThinkingConfig(cfg, &pb.ModelMetadataResponse{
+			SupportsThinking: false,
+		})
+
+		Expect(cfg.ReasoningConfig.DisableReasoning).ToNot(BeNil())
+		Expect(*cfg.ReasoningConfig.DisableReasoning).To(BeTrue())
+		Expect(cfg.ReasoningConfig.DisableReasoningTagPrefill).ToNot(BeNil())
+		Expect(*cfg.ReasoningConfig.DisableReasoningTagPrefill).To(BeTrue())
+	})
+})
--- a/core/config/model_config_loader.go
+++ b/core/config/model_config_loader.go
@@ -193,9 +193,9 @@ func (bcl *ModelConfigLoader) ReadModelConfig(file string, opts ...ConfigLoaderO
 		bcl.configs[c.Name] = *c
 	} else {
 		if err != nil {
-			return fmt.Errorf("config is not valid: %w", err)
+			return fmt.Errorf("model config %q is not valid: %w. Ensure the YAML file has a valid 'name' field and correct syntax. See https://localai.io/docs/getting-started/customize-model/ for config reference", file, err)
 		}
-		return fmt.Errorf("config is not valid")
+		return fmt.Errorf("model config %q is not valid. Ensure the YAML file has a valid 'name' field and correct syntax. See https://localai.io/docs/getting-started/customize-model/ for config reference", file)
 	}

 	return nil
@@ -373,9 +373,9 @@ func (bcl *ModelConfigLoader) LoadModelConfigsFromPath(path string, opts ...Conf
 		files = append(files, info)
 	}
 	for _, file := range files {
-		// Skip templates, YAML and .keep files
-		if !strings.Contains(file.Name(), ".yaml") && !strings.Contains(file.Name(), ".yml") ||
-			strings.HasPrefix(file.Name(), ".") {
+		// Only load real YAML config files and ignore dotfiles or backup variants
+		ext := strings.ToLower(filepath.Ext(file.Name()))
+		if (ext != ".yaml" && ext != ".yml") || strings.HasPrefix(file.Name(), ".") {
 			continue
 		}

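Review note: the old filter used a substring match, so names such as `foo.yaml.bak` were loaded as configs. The sketch below compares the two checks on a few file names. `isYAMLConfigFile` and `containsYAML` are hypothetical helper names introduced only for this illustration; the loader itself keeps the check inline.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isYAMLConfigFile mirrors the new check: the final extension must be .yaml
// or .yml (case-insensitive) and the name must not start with a dot.
func isYAMLConfigFile(name string) bool {
	ext := strings.ToLower(filepath.Ext(name))
	if ext != ".yaml" && ext != ".yml" {
		return false
	}
	return !strings.HasPrefix(name, ".")
}

// containsYAML mirrors the old substring-based check, for comparison only.
func containsYAML(name string) bool {
	hasYAML := strings.Contains(name, ".yaml") || strings.Contains(name, ".yml")
	return hasYAML && !strings.HasPrefix(name, ".")
}

func main() {
	names := []string{"foo.yaml", "foo.YML", "foo.yaml.bak", "foo.yaml.bak.123", ".hidden.yaml"}
	for _, name := range names {
		fmt.Printf("%-18s old=%-5v new=%v\n", name, containsYAML(name), isYAMLConfigFile(name))
	}
}
```

One side effect worth noting: because the new check lowercases the extension, an upper-case `.YML` file is now accepted, whereas the old case-sensitive substring match skipped it.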
--- a/core/config/model_test.go
+++ b/core/config/model_test.go
@@ -2,6 +2,7 @@ package config

 import (
 	"os"
+	"path/filepath"

 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
@@ -109,5 +110,50 @@ options:
 			Expect(testModel.Options).To(ContainElements("foo", "bar", "baz"))

 		})
+
+		It("Only loads files ending with yaml or yml", func() {
+			tmpdir, err := os.MkdirTemp("", "model-config-loader")
+			Expect(err).ToNot(HaveOccurred())
+			defer os.RemoveAll(tmpdir)
+
+			err = os.WriteFile(filepath.Join(tmpdir, "foo.yaml"), []byte(
+				`name: "foo-model"
+description: "formal config"
+backend: "llama-cpp"
+parameters:
+  model: "foo.gguf"
+`), 0644)
+			Expect(err).ToNot(HaveOccurred())
+
+			err = os.WriteFile(filepath.Join(tmpdir, "foo.yaml.bak"), []byte(
+				`name: "foo-model"
+description: "backup config"
+backend: "llama-cpp"
+parameters:
+  model: "foo-backup.gguf"
+`), 0644)
+			Expect(err).ToNot(HaveOccurred())
+
+			err = os.WriteFile(filepath.Join(tmpdir, "foo.yaml.bak.123"), []byte(
+				`name: "foo-backup-only"
+description: "timestamped backup config"
+backend: "llama-cpp"
+parameters:
+  model: "foo-timestamped.gguf"
+`), 0644)
+			Expect(err).ToNot(HaveOccurred())
+
+			bcl := NewModelConfigLoader(tmpdir)
+			err = bcl.LoadModelConfigsFromPath(tmpdir)
+			Expect(err).ToNot(HaveOccurred())
+
+			configs := bcl.GetAllModelsConfigs()
+			Expect(configs).To(HaveLen(1))
+			Expect(configs[0].Name).To(Equal("foo-model"))
+			Expect(configs[0].Description).To(Equal("formal config"))
+
+			_, exists := bcl.GetModelConfig("foo-backup-only")
+			Expect(exists).To(BeFalse())
+		})
 	})
 })
--- a/core/gallery/backends.go
+++ b/core/gallery/backends.go
@@ -110,7 +110,13 @@ func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery,
 		if err != nil {
 			return err
 		}
-		if backends.Exists(name) {
+		// Only short-circuit if the install is *actually usable*. An orphaned
+		// meta entry whose concrete was removed still shows up in
+		// ListSystemBackends with a RunFile pointing at a path that no longer
+		// exists; returning early there leaves the caller with a broken
+		// alias and the worker fails with "backend not found after install
+		// attempt" on every retry. Re-install in that case.
+		if existing, ok := backends.Get(name); ok && isBackendRunnable(existing) {
 			return nil
 		}
 	}
@@ -375,17 +381,44 @@ func DeleteBackendFromSystem(systemState *system.SystemState, name string) error
 	}

 	if metadata != nil && metadata.MetaBackendFor != "" {
-		metaBackendDirectory := filepath.Join(systemState.Backend.BackendsPath, metadata.MetaBackendFor)
-		xlog.Debug("Deleting meta backend", "backendDirectory", metaBackendDirectory)
-		if _, err := os.Stat(metaBackendDirectory); os.IsNotExist(err) {
-			return fmt.Errorf("meta backend %q not found", metadata.MetaBackendFor)
+		concreteDirectory := filepath.Join(systemState.Backend.BackendsPath, metadata.MetaBackendFor)
+		xlog.Debug("Deleting concrete backend referenced by meta", "concreteDirectory", concreteDirectory)
+		// If the concrete the meta points to is already gone (earlier delete,
+		// partial install, or manual cleanup), keep going and remove the
+		// orphaned meta dir. Previously we returned an error here, which made
+		// the orphaned meta impossible to uninstall from the UI — the delete
+		// kept failing and every subsequent install short-circuited because
+		// the stale meta metadata made ListSystemBackends.Exists(name) true.
+		if _, statErr := os.Stat(concreteDirectory); statErr == nil {
+			os.RemoveAll(concreteDirectory)
+		} else if os.IsNotExist(statErr) {
+			xlog.Warn("Concrete backend referenced by meta not found — removing orphaned meta only",
+				"meta", name, "concrete", metadata.MetaBackendFor)
+		} else {
+			return statErr
 		}
-		os.RemoveAll(metaBackendDirectory)
 	}

 	return os.RemoveAll(backendDirectory)
 }

+// isBackendRunnable reports whether the given backend entry can actually be
+// invoked: its RunFile must be set and must point at an existing,
+// non-directory path on disk. For a meta backend that path is the concrete's
+// run.sh, so a meta whose concrete was removed is reported as not runnable.
+// Used to guard the "already installed" short-circuit so an
+// orphaned meta pointing at a missing concrete triggers a real reinstall
+// rather than being silently skipped.
+func isBackendRunnable(b SystemBackend) bool {
+	if b.RunFile == "" {
+		return false
+	}
+	if fi, err := os.Stat(b.RunFile); err != nil || fi.IsDir() {
+		return false
+	}
+	return true
+}
+
 type SystemBackend struct {
 	Name             string
 	RunFile          string
--- a/core/gallery/backends_test.go
+++ b/core/gallery/backends_test.go
@@ -952,6 +952,58 @@ var _ = Describe("Gallery Backends", func() {
 			err = DeleteBackendFromSystem(systemState, "non-existent")
 			Expect(err).To(HaveOccurred())
 		})
+
+		It("removes an orphaned meta backend whose concrete is missing", func() {
+			// Real scenario from the dev cluster: the concrete got wiped
+			// (partial install, manual cleanup, previous crash) but the meta
+			// directory + metadata.json still points at it. The old code
+			// errored with "meta backend X not found" and left the orphan in
+			// place, making the backend impossible to uninstall.
+			metaName := "meta-backend"
+			concreteName := "concrete-backend-that-vanished"
+			metaPath := filepath.Join(tempDir, metaName)
+			Expect(os.MkdirAll(metaPath, 0750)).To(Succeed())
+
+			meta := BackendMetadata{Name: metaName, MetaBackendFor: concreteName}
+			data, err := json.MarshalIndent(meta, "", "  ")
+			Expect(err).NotTo(HaveOccurred())
+			Expect(os.WriteFile(filepath.Join(metaPath, "metadata.json"), data, 0644)).To(Succeed())
+
+			// Concrete directory intentionally absent.
+			systemState, err := system.GetSystemState(system.WithBackendPath(tempDir))
+			Expect(err).NotTo(HaveOccurred())
+
+			Expect(DeleteBackendFromSystem(systemState, metaName)).To(Succeed())
+			Expect(metaPath).NotTo(BeADirectory())
+		})
+	})
+
+	Describe("InstallBackendFromGallery — orphaned meta reinstall", func() {
+		It("reports an orphaned meta as not runnable when its concrete is missing", func() {
+			// Seed state: meta dir exists with metadata pointing at a
+			// concrete that was removed from disk. ListSystemBackends still
+			// surfaces the meta via its metadata.Name → the old short-circuit
+			// at `if backends.Exists(name) { return nil }` returned silently,
+			// leaving the worker's findBackend() with a dead alias forever.
+			// The fix: require the backend to be runnable before we skip.
+			metaName := "meta-orphan"
+			concreteName := "concrete-gone"
+			metaPath := filepath.Join(tempDir, metaName)
+			Expect(os.MkdirAll(metaPath, 0750)).To(Succeed())
+			meta := BackendMetadata{Name: metaName, MetaBackendFor: concreteName}
+			data, err := json.MarshalIndent(meta, "", "  ")
+			Expect(err).NotTo(HaveOccurred())
+			Expect(os.WriteFile(filepath.Join(metaPath, "metadata.json"), data, 0644)).To(Succeed())
+
+			systemState, err := system.GetSystemState(system.WithBackendPath(tempDir))
+			Expect(err).NotTo(HaveOccurred())
+
+			listed, err := ListSystemBackends(systemState)
+			Expect(err).NotTo(HaveOccurred())
+			b, ok := listed.Get(metaName)
+			Expect(ok).To(BeTrue())
+			Expect(isBackendRunnable(b)).To(BeFalse()) // concrete run.sh absent
+		})
 	})

 	Describe("ListSystemBackends", func() {
--- a/core/http/endpoints/localai/backend_monitor.go
+++ b/core/http/endpoints/localai/backend_monitor.go
@@ -9,19 +9,26 @@ import (
 // BackendMonitorEndpoint returns the status of the specified backend
 // @Summary Backend monitor endpoint
 // @Tags monitoring
-// @Param request body schema.BackendMonitorRequest true "Backend statistics request"
+// @Param model query string true "Name of the model to monitor"
 // @Success 200 {object} proto.StatusResponse "Response"
 // @Router /backend/monitor [get]
 func BackendMonitorEndpoint(bm *monitoring.BackendMonitorService) echo.HandlerFunc {
 	return func(c echo.Context) error {
-
-		input := new(schema.BackendMonitorRequest)
-		// Get input data from the request body
-		if err := c.Bind(input); err != nil {
-			return err
+		model := c.QueryParam("model")
+		// Fall back to binding the request body so pre-existing clients that
+		// sent `{"model": "..."}` with GET keep working.
+		if model == "" {
+			input := new(schema.BackendMonitorRequest)
+			if err := c.Bind(input); err != nil {
+				return err
+			}
+			model = input.Model
+		}
+		if model == "" {
+			return echo.NewHTTPError(400, "model query parameter is required")
 		}

-		resp, err := bm.CheckAndSample(input.Model)
+		resp, err := bm.CheckAndSample(model)
 		if err != nil {
 			return err
 		}
--- a/core/http/endpoints/localai/nodes.go
+++ b/core/http/endpoints/localai/nodes.go
@@ -376,7 +376,7 @@ func InstallBackendOnNodeEndpoint(unloader nodes.NodeCommandSender) echo.Handler
 		if err := c.Bind(&req); err != nil || req.Backend == "" {
 			return c.JSON(http.StatusBadRequest, nodeError(http.StatusBadRequest, "backend name required"))
 		}
-		reply, err := unloader.InstallBackend(nodeID, req.Backend, "", req.BackendGalleries)
+		reply, err := unloader.InstallBackend(nodeID, req.Backend, "", req.BackendGalleries, "", "", "")
 		if err != nil {
 			xlog.Error("Failed to install backend on node", "node", nodeID, "backend", req.Backend, "error", err)
 			return c.JSON(http.StatusInternalServerError, nodeError(http.StatusInternalServerError, "failed to install backend on node"))
--- a/core/http/endpoints/localai/settings.go
+++ b/core/http/endpoints/localai/settings.go
@@ -110,6 +110,27 @@ func UpdateSettingsEndpoint(app *application.Application) echo.HandlerFunc {
 			})
 		}

+		// The UI reads ApiKeys from GET /api/settings, which already returns the
+		// merged env+runtime list. When the user clicks Save, the same merged
+		// list comes back in the POST body. Strip the env-supplied keys from
+		// the incoming list before we persist or re-merge, otherwise each save
+		// duplicates the env keys on top of the previous merge (#9071).
+		if settings.ApiKeys != nil {
+			envKeys := startupConfig.ApiKeys
+			envSet := make(map[string]struct{}, len(envKeys))
+			for _, k := range envKeys {
+				envSet[k] = struct{}{}
+			}
+			runtimeOnly := make([]string, 0, len(*settings.ApiKeys))
+			for _, k := range *settings.ApiKeys {
+				if _, fromEnv := envSet[k]; fromEnv {
+					continue
+				}
+				runtimeOnly = append(runtimeOnly, k)
+			}
+			settings.ApiKeys = &runtimeOnly
+		}
+
 		settingsFile := filepath.Join(appConfig.DynamicConfigsDir, "runtime_settings.json")
 		settingsJSON, err := json.MarshalIndent(settings, "", "  ")
 		if err != nil {
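Review note: the filtering step added above can be read as a small pure function. The sketch below simulates a few GET/POST round trips to show that the persisted list stays stable; `stripEnvKeys` is a hypothetical name, and the merge order (environment keys first) is an assumption made only for this illustration.

```go
package main

import "fmt"

// stripEnvKeys returns the entries of incoming that are not present in
// envKeys, preserving their order. It mirrors the filtering in the hunk above.
func stripEnvKeys(incoming, envKeys []string) []string {
	envSet := make(map[string]struct{}, len(envKeys))
	for _, k := range envKeys {
		envSet[k] = struct{}{}
	}
	runtimeOnly := make([]string, 0, len(incoming))
	for _, k := range incoming {
		if _, fromEnv := envSet[k]; fromEnv {
			continue
		}
		runtimeOnly = append(runtimeOnly, k)
	}
	return runtimeOnly
}

func main() {
	envKeys := []string{"env-key"}
	saved := []string{"ui-key"}

	// Each round models: the UI reads the merged list, then posts it back.
	for round := 1; round <= 3; round++ {
		merged := append(append([]string{}, envKeys...), saved...)
		saved = stripEnvKeys(merged, envKeys)
		fmt.Printf("round %d: merged=%v persisted=%v\n", round, merged, saved)
	}
}
```

A consequence of filtering by value is that a runtime key identical to an environment key is also dropped from the persisted list; it remains effective through the environment copy.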
--- a/core/http/endpoints/openai/chat.go
+++ b/core/http/endpoints/openai/chat.go
@@ -147,6 +147,7 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 		result := ""
 		lastEmittedCount := 0
 		sentInitialRole := false
+		sentReasoning := false
 		hasChatDeltaToolCalls := false
 		hasChatDeltaContent := false

@@ -190,6 +191,7 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 					}},
 					Object: "chat.completion.chunk",
 				}
+				sentReasoning = true
 			}

 			// Stream content deltas (cleaned of reasoning tags) while no tool calls
@@ -363,7 +365,12 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 			functionResults = functions.ParseFunctionCall(cleanedResult, config.FunctionsConfig)
 		}
 		xlog.Debug("[ChatDeltas] final tool call decision", "tool_calls", len(functionResults), "text_content", *textContentToReturn)
-		noActionToRun := len(functionResults) > 0 && functionResults[0].Name == noAction || len(functionResults) == 0
+		// noAction is a sentinel "just answer" pseudo-function — not a real
+		// tool call. Scan the whole slice rather than only index 0 so we
+		// don't drop a real tool call that happens to follow a noAction
+		// entry, and so the default branch isn't entered with only noAction
+		// entries to emit as tool_calls.
+		noActionToRun := !hasRealCall(functionResults, noAction)

 		switch {
 		case noActionToRun:
@@ -377,108 +384,31 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 				usage.TimingPromptProcessing = tokenUsage.TimingPromptProcessing
 			}

-			if sentInitialRole {
-				// Content was already streamed during the callback — just emit usage.
-				delta := &schema.Message{}
-				if reasoning != "" && extractor.Reasoning() == "" {
-					delta.Reasoning = &reasoning
-				}
-				responses <- schema.OpenAIResponse{
-					ID: id, Created: created, Model: req.Model,
-					Choices: []schema.Choice{{Delta: delta, Index: 0}},
-					Object:  "chat.completion.chunk",
-					Usage:   usage,
-				}
-			} else {
-				// Content was NOT streamed — send everything at once (fallback).
-				responses <- schema.OpenAIResponse{
-					ID: id, Created: created, Model: req.Model,
-					Choices: []schema.Choice{{Delta: &schema.Message{Role: "assistant"}, Index: 0}},
-					Object:  "chat.completion.chunk",
-				}
-
-				result, err := handleQuestion(config, functionResults, extractor.CleanedContent(), prompt)
-				if err != nil {
-					xlog.Error("error handling question", "error", err)
-					return err
-				}
-
-				delta := &schema.Message{Content: &result}
-				if reasoning != "" {
-					delta.Reasoning = &reasoning
-				}
-				responses <- schema.OpenAIResponse{
-					ID: id, Created: created, Model: req.Model,
-					Choices: []schema.Choice{{Delta: delta, Index: 0}},
-					Object:  "chat.completion.chunk",
-					Usage:   usage,
+			var result string
+			if !sentInitialRole {
+				var hqErr error
+				result, hqErr = handleQuestion(config, functionResults, extractor.CleanedContent(), prompt)
+				if hqErr != nil {
+					xlog.Error("error handling question", "error", hqErr)
+					return hqErr
 				}
 			}
+			for _, chunk := range buildNoActionFinalChunks(
+				id, req.Model, created,
+				sentInitialRole, sentReasoning,
+				result, reasoning, usage,
+			) {
+				responses <- chunk
+			}

 		default:
-			for i, ss := range functionResults {
-				name, args := ss.Name, ss.Arguments
-				toolCallID := ss.ID
-				if toolCallID == "" {
-					toolCallID = id
-				}
-
-				if i < lastEmittedCount {
-					// Already emitted during streaming by the incremental
-					// JSON/XML parser — skip to avoid duplicate tool calls.
-					continue
-				}
-
-				// Tool call not yet emitted — send name + args (two chunks).
-				initialMessage := schema.OpenAIResponse{
-					ID:      id,
-					Created: created,
-					Model:   req.Model,
-					Choices: []schema.Choice{{
-						Delta: &schema.Message{
-							Role: "assistant",
-							ToolCalls: []schema.ToolCall{
-								{
-									Index: i,
-									ID:    toolCallID,
-									Type:  "function",
-									FunctionCall: schema.FunctionCall{
-										Name: name,
-									},
-								},
-							},
-						},
-						Index:        0,
-						FinishReason: nil,
-					}},
-					Object: "chat.completion.chunk",
-				}
-				responses <- initialMessage
-
-				responses <- schema.OpenAIResponse{
-					ID:      id,
-					Created: created,
-					Model:   req.Model,
-					Choices: []schema.Choice{{
-						Delta: &schema.Message{
-							Role:    "assistant",
-							Content: textContentToReturn,
-							ToolCalls: []schema.ToolCall{
-								{
-									Index: i,
-									ID:    toolCallID,
-									Type:  "function",
-									FunctionCall: schema.FunctionCall{
-										Arguments: args,
-									},
-								},
-							},
-						},
-						Index:        0,
-						FinishReason: nil,
-					}},
-					Object: "chat.completion.chunk",
-				}
+			for _, chunk := range buildDeferredToolCallChunks(
+				id, req.Model, created,
+				functionResults, lastEmittedCount,
+				sentInitialRole, *textContentToReturn,
+				sentReasoning, reasoning,
+			) {
+				responses <- chunk
 			}
 		}

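Reviewer sketch, not part of the patch: the difference between the removed `noActionToRun` expression and the new `!hasRealCall(...)` form shows up on a slice where a real call follows a noAction entry. The `call` type below is a hypothetical stand-in for `functions.FuncCallResults`, reduced to the one field both predicates read, and `"answer"` is the sentinel name used in the tests of this change.

```go
package main

import "fmt"

// call is a hypothetical stand-in for functions.FuncCallResults.
type call struct{ Name string }

// oldNoActionToRun mirrors the removed expression, which inspects only index 0.
func oldNoActionToRun(calls []call, noAction string) bool {
	return len(calls) > 0 && calls[0].Name == noAction || len(calls) == 0
}

// newNoActionToRun mirrors !hasRealCall(calls, noAction): it is true only
// when no entry names something other than the sentinel.
func newNoActionToRun(calls []call, noAction string) bool {
	for _, c := range calls {
		if c.Name != noAction {
			return false
		}
	}
	return true
}

func main() {
	mixed := []call{{Name: "answer"}, {Name: "search"}}
	fmt.Println(oldNoActionToRun(mixed, "answer")) // prints true: the "search" call would be dropped
	fmt.Println(newNoActionToRun(mixed, "answer")) // prints false: the tool-call branch runs
}
```

Both predicates agree on empty slices and on slices that contain only the sentinel; they differ only when a real call sits at an index other than 0.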
--- a/core/http/endpoints/openai/chat_emit.go
+++ b/core/http/endpoints/openai/chat_emit.go
@@ -0,0 +1,233 @@
+package openai
+
+import (
+	"fmt"
+
+	"github.com/mudler/LocalAI/core/schema"
+	"github.com/mudler/LocalAI/pkg/functions"
+)
+
+// hasRealCall reports whether functionResults contains at least one
+// entry whose Name is something other than the noAction sentinel.
+// Used by processTools to decide between the "answer the question"
+// path and the real tool-call flush.
+func hasRealCall(functionResults []functions.FuncCallResults, noAction string) bool {
+	for _, fc := range functionResults {
+		if fc.Name != noAction {
+			return true
+		}
+	}
+	return false
+}
+
+// buildNoActionFinalChunks produces the closing SSE chunks for the
+// noActionToRun branch of processTools (i.e. the model chose the "answer"
+// pseudo-function or emitted no tool calls at all).
+//
+// When content was already streamed (contentAlreadyStreamed=true) the
+// helper emits a single trailing usage chunk, optionally carrying
+// reasoning that was produced but not streamed incrementally. When
+// content was not streamed it emits a role chunk followed by a
+// content+reasoning+usage chunk — the "send everything at once" fallback.
+//
+// Reasoning re-emission is guarded by reasoningAlreadyStreamed, not by
+// probing the extractor's Go-side state: the C++ autoparser delivers
+// reasoning through ProcessChatDeltaReasoning which populates a
+// separate accumulator that extractor.Reasoning() does not expose.
+// Without this guard the callback would stream reasoning incrementally
+// and the final chunk would duplicate it.
+func buildNoActionFinalChunks(
+	id, model string,
+	created int,
+	contentAlreadyStreamed bool,
+	reasoningAlreadyStreamed bool,
+	content string,
+	reasoning string,
+	usage schema.OpenAIUsage,
+) []schema.OpenAIResponse {
+	var out []schema.OpenAIResponse
+
+	if contentAlreadyStreamed {
+		delta := &schema.Message{}
+		if reasoning != "" && !reasoningAlreadyStreamed {
+			r := reasoning
+			delta.Reasoning = &r
+		}
+		out = append(out, schema.OpenAIResponse{
+			ID: id, Created: created, Model: model,
+			Choices: []schema.Choice{{Delta: delta, Index: 0}},
+			Object:  "chat.completion.chunk",
+			Usage:   usage,
+		})
+		return out
+	}
+
+	// Content was not streamed — send role, then content (+reasoning) + usage.
+	out = append(out, schema.OpenAIResponse{
+		ID: id, Created: created, Model: model,
+		Choices: []schema.Choice{{
+			Delta: &schema.Message{Role: "assistant"},
+			Index: 0,
+		}},
+		Object: "chat.completion.chunk",
+	})
+
+	c := content
+	delta := &schema.Message{Content: &c}
+	if reasoning != "" && !reasoningAlreadyStreamed {
+		r := reasoning
+		delta.Reasoning = &r
+	}
+	out = append(out, schema.OpenAIResponse{
+		ID: id, Created: created, Model: model,
+		Choices: []schema.Choice{{Delta: delta, Index: 0}},
+		Object:  "chat.completion.chunk",
+		Usage:   usage,
+	})
+	return out
+}
+
+// buildDeferredToolCallChunks produces the SSE chunks for tool calls that
+// were discovered only during final parsing (i.e. after the streaming
+// callback finished). The caller forwards every returned chunk to the
+// responses channel.
+//
+// Guarantees:
+//   - tool calls with i < lastEmittedCount are skipped (already streamed)
+//   - each emitted call yields two chunks: name-only, then args-only
+//   - no chunk ever carries both non-empty Content and non-empty ToolCalls
+//   - no chunk ever carries both non-empty Reasoning and non-empty ToolCalls
+//   - if !reasoningAlreadyStreamed && reasoningContent != "",
+//     a reasoning chunk is emitted first
+//   - if !contentAlreadyStreamed && textContent != "",
+//     a role chunk followed by a content chunk is emitted (after reasoning)
+//   - chunk order: [reasoning?] [role+content?] (name, args)+
+//   - fallback IDs for empty ss.ID are unique per index so a client can
+//     match tool_result messages back to the right call
+func buildDeferredToolCallChunks(
+	id, model string,
+	created int,
+	functionResults []functions.FuncCallResults,
+	lastEmittedCount int,
+	contentAlreadyStreamed bool,
+	textContent string,
+	reasoningAlreadyStreamed bool,
+	reasoningContent string,
+) []schema.OpenAIResponse {
+	// If every call was already emitted incrementally there's nothing to
+	// flush — and no reason to emit a standalone reasoning/content chunk.
+	hasDeferred := false
+	for i := range functionResults {
+		if i >= lastEmittedCount {
+			hasDeferred = true
+			break
+		}
+	}
+	if !hasDeferred {
+		return nil
+	}
+
+	var out []schema.OpenAIResponse
+
+	// Reasoning first. The callback in processTools emits reasoning
+	// incrementally in its own chunks, but when the C++ autoparser only
+	// surfaces reasoning as a final aggregate the callback never sees it.
+	// Recover it here; reasoningAlreadyStreamed records whether the
+	// callback already sent it, so nothing is duplicated.
+	if !reasoningAlreadyStreamed && reasoningContent != "" {
+		r := reasoningContent
+		out = append(out, schema.OpenAIResponse{
+			ID: id, Created: created, Model: model,
+			Choices: []schema.Choice{{
+				Delta: &schema.Message{Reasoning: &r},
+				Index: 0,
+			}},
+			Object: "chat.completion.chunk",
+		})
+	}
+
+	// Then content, when it wasn't streamed via the callback. Emit role
+	// and content as their own deltas ahead of the tool calls, so that no
+	// delta bundles content alongside tool_calls (see Guarantees above).
+	if !contentAlreadyStreamed && textContent != "" {
+		out = append(out, schema.OpenAIResponse{
+			ID: id, Created: created, Model: model,
+			Choices: []schema.Choice{{
+				Delta: &schema.Message{Role: "assistant"},
+				Index: 0,
+			}},
+			Object: "chat.completion.chunk",
+		})
+		c := textContent
+		out = append(out, schema.OpenAIResponse{
+			ID: id, Created: created, Model: model,
+			Choices: []schema.Choice{{
+				Delta: &schema.Message{Content: &c},
+				Index: 0,
+			}},
+			Object: "chat.completion.chunk",
+		})
+	}
+
+	for i, ss := range functionResults {
+		if i < lastEmittedCount {
+			// Already streamed by the incremental JSON/XML parser during
+			// the token callback — skip to avoid a duplicate emission.
+			continue
+		}
+
+		toolCallID := ss.ID
+		if toolCallID == "" {
+			// Unique per-index fallback so multiple empty-ID calls don't
+			// collide on the same request ID (clients match tool results
+			// back by tool_call_id).
+			toolCallID = fmt.Sprintf("%s-%d", id, i)
+		}
+
+		// Name chunk.
+		out = append(out, schema.OpenAIResponse{
+			ID: id, Created: created, Model: model,
+			Choices: []schema.Choice{{
+				Delta: &schema.Message{
+					Role: "assistant",
+					ToolCalls: []schema.ToolCall{{
+						Index: i,
+						ID:    toolCallID,
+						Type:  "function",
+						FunctionCall: schema.FunctionCall{
+							Name: ss.Name,
+						},
+					}},
+				},
+				Index:        0,
+				FinishReason: nil,
+			}},
+			Object: "chat.completion.chunk",
+		})
+
+		// Args chunk — no Content here. Either it was streamed through
+		// the token callback earlier, or the role+content pair above
+		// already delivered it.
+		out = append(out, schema.OpenAIResponse{
+			ID: id, Created: created, Model: model,
+			Choices: []schema.Choice{{
+				Delta: &schema.Message{
+					Role: "assistant",
+					ToolCalls: []schema.ToolCall{{
+						Index: i,
+						ID:    toolCallID,
+						Type:  "function",
+						FunctionCall: schema.FunctionCall{
+							Arguments: ss.Arguments,
+						},
+					}},
+				},
+				Index:        0,
+				FinishReason: nil,
+			}},
+			Object: "chat.completion.chunk",
+		})
+	}
+
+	return out
+}
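Reviewer sketch, not part of the patch: the ordering guarantee documented on `buildDeferredToolCallChunks`, `[reasoning?] [role+content?] (name, args)+`, can be modelled as a list of chunk kinds. `chunkPlan` is a hypothetical helper that only names the chunks. It builds no `schema.OpenAIResponse` values and assumes the guarantees listed in the doc comment hold.

```go
package main

import "fmt"

// chunkPlan (hypothetical) lists, in order, the kinds of chunks the
// documented guarantees imply for a given set of inputs.
func chunkPlan(numCalls, lastEmitted int, contentStreamed bool, text string,
	reasoningStreamed bool, reasoning string) []string {
	if lastEmitted < 0 {
		lastEmitted = 0
	}
	// Nothing deferred: no tool-call chunks and no leading chunks either.
	if numCalls <= lastEmitted {
		return nil
	}
	var plan []string
	if !reasoningStreamed && reasoning != "" {
		plan = append(plan, "reasoning")
	}
	if !contentStreamed && text != "" {
		plan = append(plan, "role", "content")
	}
	// Each deferred call yields a name chunk followed by an args chunk,
	// keeping its original index.
	for i := lastEmitted; i < numCalls; i++ {
		plan = append(plan, fmt.Sprintf("name[%d]", i), fmt.Sprintf("args[%d]", i))
	}
	return plan
}

func main() {
	// Neither reasoning nor content was streamed; one deferred call.
	fmt.Println(chunkPlan(1, 0, false, "final plan", false, "private thoughts"))
	// prints [reasoning role content name[0] args[0]]
}
```

The five-element result corresponds to the "reasoning, role, content, name, args" case exercised in the accompanying test file.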
--- a/core/http/endpoints/openai/chat_emit_test.go
+++ b/core/http/endpoints/openai/chat_emit_test.go
@@ -0,0 +1,717 @@
+package openai
+
+import (
+	"fmt"
+
+	"github.com/mudler/LocalAI/core/schema"
+	"github.com/mudler/LocalAI/pkg/functions"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+// contentOf extracts the string payload from a chunk's delta.Content,
+// transparently handling both *string and string underlying types so
+// assertions don't have to care which one the helper produced.
+func contentOf(ch schema.OpenAIResponse) string {
+	if len(ch.Choices) == 0 || ch.Choices[0].Delta == nil {
+		return ""
+	}
+	switch v := ch.Choices[0].Delta.Content.(type) {
+	case *string:
+		if v == nil {
+			return ""
+		}
+		return *v
+	case string:
+		return v
+	default:
+		return ""
+	}
+}
+
+// reasoningOf mirrors contentOf for the delta.Reasoning field, which is a
+// *string on schema.Message.
+func reasoningOf(ch schema.OpenAIResponse) string {
+	if len(ch.Choices) == 0 || ch.Choices[0].Delta == nil {
+		return ""
+	}
+	r := ch.Choices[0].Delta.Reasoning
+	if r == nil {
+		return ""
+	}
+	return *r
+}
+
+// toolCallsOf returns the ToolCalls slice of a chunk's delta, or nil.
+func toolCallsOf(ch schema.OpenAIResponse) []schema.ToolCall {
+	if len(ch.Choices) == 0 || ch.Choices[0].Delta == nil {
+		return nil
+	}
+	return ch.Choices[0].Delta.ToolCalls
+}
+
+// expectSpecCompliant enforces the invariants on every chunk:
+//   - Object == "chat.completion.chunk"
+//   - Exactly one Choice with Index==0
+//   - No delta ever carries both non-empty Content and non-empty ToolCalls
+//   - No delta ever carries both non-empty Reasoning and non-empty ToolCalls
+func expectSpecCompliant(chunks []schema.OpenAIResponse) {
+	for i, ch := range chunks {
+		Expect(ch.Object).To(Equal("chat.completion.chunk"), "chunk[%d] Object", i)
+		Expect(ch.Choices).To(HaveLen(1), "chunk[%d] Choices length", i)
+		Expect(ch.Choices[0].Index).To(Equal(0), "chunk[%d] Choices[0].Index", i)
+
+		hasContent := contentOf(ch) != ""
+		hasReasoning := reasoningOf(ch) != ""
+		hasToolCalls := len(toolCallsOf(ch)) > 0
+
+		if hasContent && hasToolCalls {
+			Fail(fmt.Sprintf("chunk[%d] violates spec: Content and ToolCalls in same delta", i))
+		}
+		if hasReasoning && hasToolCalls {
+			Fail(fmt.Sprintf("chunk[%d] violates spec: Reasoning and ToolCalls in same delta", i))
+		}
+	}
+}
+
+// expectMetadata asserts every chunk carries the same id/model/created.
+func expectMetadata(chunks []schema.OpenAIResponse, id, model string, created int) {
+	for i, ch := range chunks {
+		Expect(ch.ID).To(Equal(id), "chunk[%d] ID", i)
+		Expect(ch.Model).To(Equal(model), "chunk[%d] Model", i)
+		Expect(ch.Created).To(Equal(created), "chunk[%d] Created", i)
+	}
+}
+
+var _ = Describe("buildDeferredToolCallChunks", func() {
+	const (
+		testID      = "req"
+		testModel   = "test-model"
+		testCreated = 1700000000
+	)
+
+	Describe("Case A — primary bug: content already streamed, 1 deferred call", func() {
+		It("emits only the tool_call chunks, no Content anywhere", func() {
+			results := []functions.FuncCallResults{
+				{Name: "search", Arguments: `{"q":"x"}`, ID: "tc1"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				true, "Let me search…",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(2), "two chunks: name, args")
+
+			// Name chunk
+			tc0 := toolCallsOf(chunks[0])
+			Expect(tc0).To(HaveLen(1))
+			Expect(tc0[0].Index).To(Equal(0))
+			Expect(tc0[0].ID).To(Equal("tc1"))
+			Expect(tc0[0].FunctionCall.Name).To(Equal("search"))
+			Expect(tc0[0].FunctionCall.Arguments).To(BeEmpty())
+			Expect(contentOf(chunks[0])).To(BeEmpty())
+
+			// Args chunk — MUST NOT carry Content
+			tc1 := toolCallsOf(chunks[1])
+			Expect(tc1).To(HaveLen(1))
+			Expect(tc1[0].FunctionCall.Name).To(BeEmpty())
+			Expect(tc1[0].FunctionCall.Arguments).To(Equal(`{"q":"x"}`))
+			Expect(contentOf(chunks[1])).To(BeEmpty(),
+				"args chunk must not duplicate already-streamed content")
+		})
+	})
+
+	Describe("Case B — autoparser / content not streamed", func() {
+		It("emits role, content, then name+args", func() {
+			results := []functions.FuncCallResults{
+				{Name: "do", Arguments: "{}", ID: "tc1"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				false, "Here is my plan…",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(4), "role, content, name, args")
+
+			// Role chunk
+			Expect(chunks[0].Choices[0].Delta.Role).To(Equal("assistant"))
+			Expect(contentOf(chunks[0])).To(BeEmpty())
+			Expect(toolCallsOf(chunks[0])).To(BeEmpty())
+
+			// Content chunk
+			Expect(contentOf(chunks[1])).To(Equal("Here is my plan…"))
+			Expect(toolCallsOf(chunks[1])).To(BeEmpty())
+
+			// Name + args chunks
+			Expect(toolCallsOf(chunks[2])).To(HaveLen(1))
+			Expect(toolCallsOf(chunks[2])[0].FunctionCall.Name).To(Equal("do"))
+			Expect(toolCallsOf(chunks[3])).To(HaveLen(1))
+			Expect(toolCallsOf(chunks[3])[0].FunctionCall.Arguments).To(Equal("{}"))
+		})
+	})
+
+	Describe("Case C — multiple deferred calls, content already streamed", func() {
+		It("emits (name, args) × 3 with no Content anywhere", func() {
+			results := []functions.FuncCallResults{
+				{Name: "a", Arguments: "{}", ID: "tcA"},
+				{Name: "b", Arguments: "{}", ID: "tcB"},
+				{Name: "c", Arguments: "{}", ID: "tcC"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				true, "some narration",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(6))
+
+			for i := 0; i < 3; i++ {
+				Expect(contentOf(chunks[2*i])).To(BeEmpty(),
+					"call #%d name chunk must not carry Content", i)
+				Expect(contentOf(chunks[2*i+1])).To(BeEmpty(),
+					"call #%d args chunk must not carry Content", i)
+				Expect(toolCallsOf(chunks[2*i])[0].Index).To(Equal(i))
+				Expect(toolCallsOf(chunks[2*i+1])[0].Index).To(Equal(i))
+			}
+			Expect(toolCallsOf(chunks[0])[0].FunctionCall.Name).To(Equal("a"))
+			Expect(toolCallsOf(chunks[2])[0].FunctionCall.Name).To(Equal("b"))
+			Expect(toolCallsOf(chunks[4])[0].FunctionCall.Name).To(Equal("c"))
+		})
+	})
+
+	Describe("Case D — partial incremental emission", func() {
+		It("emits only the deferred tail (call #1), skipping #0", func() {
+			results := []functions.FuncCallResults{
+				{Name: "a", Arguments: "{}", ID: "tc0"},
+				{Name: "b", Arguments: "{}", ID: "tc1"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 1,
+				true, "narration",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(2))
+			Expect(toolCallsOf(chunks[0])[0].Index).To(Equal(1))
+			Expect(toolCallsOf(chunks[0])[0].FunctionCall.Name).To(Equal("b"))
+			Expect(toolCallsOf(chunks[1])[0].Index).To(Equal(1))
+			Expect(toolCallsOf(chunks[1])[0].FunctionCall.Arguments).To(Equal("{}"))
+		})
+	})
+
+	Describe("Case E — all calls already emitted incrementally", func() {
+		It("emits nothing", func() {
+			results := []functions.FuncCallResults{
+				{Name: "a", Arguments: "{}", ID: "tc0"},
+				{Name: "b", Arguments: "{}", ID: "tc1"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 2,
+				true, "narration",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(BeEmpty())
+		})
+	})
+
+	Describe("Case F — content not streamed but textContent empty", func() {
+		It("emits only the tool call chunks, no leading role/content", func() {
+			results := []functions.FuncCallResults{
+				{Name: "x", Arguments: "{}", ID: "tcX"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				false, "",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(2))
+			Expect(toolCallsOf(chunks[0])[0].FunctionCall.Name).To(Equal("x"))
+			Expect(toolCallsOf(chunks[1])[0].FunctionCall.Arguments).To(Equal("{}"))
+		})
+	})
+
+	Describe("Case G — empty ss.ID falls back to a unique per-index ID", func() {
+		It("emits a deterministic per-index fallback", func() {
+			results := []functions.FuncCallResults{
+				{Name: "x", Arguments: "{}", ID: ""},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				true, "narration",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(2))
+			expectedID := fmt.Sprintf("%s-%d", testID, 0)
+			Expect(toolCallsOf(chunks[0])[0].ID).To(Equal(expectedID))
+			Expect(toolCallsOf(chunks[1])[0].ID).To(Equal(expectedID))
+		})
+	})
+
+	Describe("Case G2 — multiple empty IDs get distinct fallbacks", func() {
+		It("avoids the collision bug where every empty-ID call shared the request id", func() {
+			results := []functions.FuncCallResults{
+				{Name: "a", Arguments: "{}", ID: ""},
+				{Name: "b", Arguments: "{}", ID: ""},
+				{Name: "c", Arguments: "{}", ID: ""},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				true, "narration",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(6))
+
+			ids := map[string]int{}
+			for _, ch := range chunks {
+				for _, tc := range toolCallsOf(ch) {
+					ids[tc.ID]++
+				}
+			}
+			// Each call yields a name chunk + args chunk → each distinct ID
+			// should appear in exactly two chunks. Three distinct IDs
+			// overall.
+			Expect(ids).To(HaveLen(3), "three distinct per-index fallback IDs")
+			for id, n := range ids {
+				Expect(n).To(Equal(2), "ID %q should appear in exactly 2 chunks", id)
+			}
+		})
+	})
+
+	Describe("Case H — indices preserved across skip with multiple calls", func() {
+		It("emits Index fields matching functionResults positions", func() {
+			results := []functions.FuncCallResults{
+				{Name: "a", Arguments: "{}", ID: "tc0"},
+				{Name: "b", Arguments: "{}", ID: "tc1"},
+				{Name: "c", Arguments: "{}", ID: "tc2"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 1,
+				true, "narration",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(4))
+
+			Expect(toolCallsOf(chunks[0])[0].Index).To(Equal(1))
+			Expect(toolCallsOf(chunks[1])[0].Index).To(Equal(1))
+			Expect(toolCallsOf(chunks[2])[0].Index).To(Equal(2))
+			Expect(toolCallsOf(chunks[3])[0].Index).To(Equal(2))
+		})
+	})
+
+	Describe("Case I — explicit non-empty ID is preserved", func() {
+		It("does not touch ss.ID when it's already set", func() {
+			results := []functions.FuncCallResults{
+				{Name: "x", Arguments: "{}", ID: "abc123"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				true, "narration",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(2))
+			Expect(toolCallsOf(chunks[0])[0].ID).To(Equal("abc123"))
+			Expect(toolCallsOf(chunks[1])[0].ID).To(Equal("abc123"))
+		})
+	})
+
+	Describe("Case J — chunk-shape sanity", func() {
+		It("splits Name into the first chunk and Arguments into the second", func() {
+			results := []functions.FuncCallResults{
+				{Name: "x", Arguments: `{"k":"v"}`, ID: "tcX"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				true, "narration",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(2))
+
+			Expect(toolCallsOf(chunks[0])[0].FunctionCall.Name).To(Equal("x"))
+			Expect(toolCallsOf(chunks[0])[0].FunctionCall.Arguments).To(BeEmpty())
+
+			Expect(toolCallsOf(chunks[1])[0].FunctionCall.Name).To(BeEmpty())
+			Expect(toolCallsOf(chunks[1])[0].FunctionCall.Arguments).To(Equal(`{"k":"v"}`))
+		})
+	})
+
+	Describe("Case K — metadata propagation", func() {
+		It("stamps every chunk with the same id/model/created", func() {
+			results := []functions.FuncCallResults{
+				{Name: "a", Arguments: "{}", ID: "tcA"},
+				{Name: "b", Arguments: "{}", ID: "tcB"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				false, "hello",
+				true, "",
+			)
+
+			expectSpecCompliant(chunks)
+			expectMetadata(chunks, testID, testModel, testCreated)
+		})
+	})
+
+	Describe("Case L — Choices[0].Index == 0 invariant", func() {
+		It("is upheld across every branch the helper can take", func() {
+			scenarios := []struct {
+				name              string
+				functionResults   []functions.FuncCallResults
+				lastEmittedCount  int
+				contentStreamed   bool
+				text              string
+				reasoningStreamed bool
+				reasoning         string
+			}{
+				{"streamed-content-deferred-call",
+					[]functions.FuncCallResults{{Name: "a", Arguments: "{}"}},
+					0, true, "hi", true, ""},
+				{"unstreamed-content-deferred-call",
+					[]functions.FuncCallResults{{Name: "a", Arguments: "{}"}},
+					0, false, "hello", true, ""},
+				{"unstreamed-reasoning-and-content",
+					[]functions.FuncCallResults{{Name: "a", Arguments: "{}"}},
+					0, false, "hello", false, "thinking…"},
+				{"partial-incremental",
+					[]functions.FuncCallResults{
+						{Name: "a", Arguments: "{}"},
+						{Name: "b", Arguments: "{}"}},
+					1, true, "hi", true, ""},
+			}
+			for _, sc := range scenarios {
+				chunks := buildDeferredToolCallChunks(
+					testID, testModel, testCreated,
+					sc.functionResults, sc.lastEmittedCount,
+					sc.contentStreamed, sc.text,
+					sc.reasoningStreamed, sc.reasoning,
+				)
+				for i, ch := range chunks {
+					Expect(ch.Choices[0].Index).To(Equal(0),
+						"scenario %q chunk[%d] Choices[0].Index", sc.name, i)
+				}
+			}
+		})
+	})
+
+	Describe("Case M — spec compliance across every scenario", func() {
+		It("never mixes Content or Reasoning with ToolCalls in a single delta", func() {
+			scenarios := []struct {
+				name              string
+				functionResults   []functions.FuncCallResults
+				lastEmittedCount  int
+				contentStreamed   bool
+				text              string
+				reasoningStreamed bool
+				reasoning         string
+			}{
+				{"A", []functions.FuncCallResults{{Name: "a", Arguments: "{}", ID: "tc"}},
+					0, true, "already-streamed", true, ""},
+				{"C", []functions.FuncCallResults{
+					{Name: "a", Arguments: "{}", ID: "tc0"},
+					{Name: "b", Arguments: "{}", ID: "tc1"}},
+					0, true, "already-streamed", true, ""},
+				{"B", []functions.FuncCallResults{{Name: "a", Arguments: "{}", ID: "tc"}},
+					0, false, "plan", true, ""},
+				{"Reasoning-deferred", []functions.FuncCallResults{{Name: "a", Arguments: "{}", ID: "tc"}},
+					0, false, "plan", false, "thinking…"},
+			}
+			for _, sc := range scenarios {
+				chunks := buildDeferredToolCallChunks(
+					testID, testModel, testCreated,
+					sc.functionResults, sc.lastEmittedCount,
+					sc.contentStreamed, sc.text,
+					sc.reasoningStreamed, sc.reasoning,
+				)
+				for i, ch := range chunks {
+					hasContent := contentOf(ch) != ""
+					hasReasoning := reasoningOf(ch) != ""
+					hasToolCalls := len(toolCallsOf(ch)) > 0
+					Expect(hasContent && hasToolCalls).To(BeFalse(),
+						"scenario %q chunk[%d] mixes Content with ToolCalls", sc.name, i)
+					Expect(hasReasoning && hasToolCalls).To(BeFalse(),
+						"scenario %q chunk[%d] mixes Reasoning with ToolCalls", sc.name, i)
+				}
+			}
+		})
+	})
+
+	Describe("Case N — empty functionResults", func() {
+		It("emits nothing, including no leading role/content/reasoning", func() {
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				nil, 0,
+				false, "ignored",
+				false, "ignored",
+			)
+			Expect(chunks).To(BeEmpty())
+		})
+	})
+
+	Describe("Case O — content not streamed but all calls already emitted", func() {
+		It("emits nothing, not even a standalone content chunk", func() {
+			results := []functions.FuncCallResults{
+				{Name: "a", Arguments: "{}", ID: "tc0"},
+				{Name: "b", Arguments: "{}", ID: "tc1"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 2,
+				false, "narration",
+				false, "thinking…",
+			)
+			Expect(chunks).To(BeEmpty(),
+				"no tool_calls to trigger on, so no leading role/content/reasoning either")
+		})
+	})
+
+	Describe("Reasoning — autoparser delivered reasoning only at end", func() {
+		It("emits a leading reasoning chunk when !reasoningAlreadyStreamed", func() {
+			results := []functions.FuncCallResults{
+				{Name: "a", Arguments: "{}", ID: "tc"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				true, "streamed content",
+				false, "model's private thoughts",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(3), "reasoning, name, args")
+
+			Expect(reasoningOf(chunks[0])).To(Equal("model's private thoughts"))
+			Expect(contentOf(chunks[0])).To(BeEmpty())
+			Expect(toolCallsOf(chunks[0])).To(BeEmpty())
+
+			// The following two are the tool_call name + args chunks.
+			Expect(toolCallsOf(chunks[1])[0].FunctionCall.Name).To(Equal("a"))
+			Expect(toolCallsOf(chunks[2])[0].FunctionCall.Arguments).To(Equal("{}"))
+		})
+
+		It("emits reasoning before role+content when neither was streamed", func() {
+			results := []functions.FuncCallResults{
+				{Name: "a", Arguments: "{}", ID: "tc"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				false, "final plan",
+				false, "private thoughts",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(5), "reasoning, role, content, name, args")
+
+			Expect(reasoningOf(chunks[0])).To(Equal("private thoughts"))
+			Expect(chunks[1].Choices[0].Delta.Role).To(Equal("assistant"))
+			Expect(contentOf(chunks[2])).To(Equal("final plan"))
+			Expect(toolCallsOf(chunks[3])[0].FunctionCall.Name).To(Equal("a"))
+			Expect(toolCallsOf(chunks[4])[0].FunctionCall.Arguments).To(Equal("{}"))
+		})
+
+		It("does not re-emit reasoning that was already streamed", func() {
+			results := []functions.FuncCallResults{
+				{Name: "a", Arguments: "{}", ID: "tc"},
+			}
+			chunks := buildDeferredToolCallChunks(
+				testID, testModel, testCreated,
+				results, 0,
+				true, "streamed",
+				true, "already-sent reasoning",
+			)
+
+			expectSpecCompliant(chunks)
+			Expect(chunks).To(HaveLen(2), "only name + args; no reasoning re-emission")
+			for _, ch := range chunks {
+				Expect(reasoningOf(ch)).To(BeEmpty())
+			}
+		})
+	})
+})
+
+var _ = Describe("hasRealCall", func() {
+	const noAction = "answer"
+
+	It("returns false for nil and empty slices", func() {
+		Expect(hasRealCall(nil, noAction)).To(BeFalse())
+		Expect(hasRealCall([]functions.FuncCallResults{}, noAction)).To(BeFalse())
+	})
+
+	It("returns false when every entry is the noAction sentinel", func() {
+		results := []functions.FuncCallResults{
+			{Name: noAction, Arguments: `{"message":"hi"}`},
+			{Name: noAction, Arguments: `{"message":"hello"}`},
+		}
+		Expect(hasRealCall(results, noAction)).To(BeFalse())
+	})
+
+	It("returns true when the only entry is a real call", func() {
+		results := []functions.FuncCallResults{
+			{Name: "search", Arguments: "{}"},
+		}
+		Expect(hasRealCall(results, noAction)).To(BeTrue())
+	})
+
+	It("returns true when a real call follows a noAction entry", func() {
+		// This is the regression the follow-up fixes: the old
+		// functionResults[0].Name == noAction check would declare this
+		// noActionToRun and drop the real call entirely.
+		results := []functions.FuncCallResults{
+			{Name: noAction, Arguments: `{"message":"hi"}`},
+			{Name: "search", Arguments: "{}"},
+		}
+		Expect(hasRealCall(results, noAction)).To(BeTrue())
+	})
+
+	It("returns true when a real call precedes a noAction entry", func() {
+		results := []functions.FuncCallResults{
+			{Name: "search", Arguments: "{}"},
+			{Name: noAction, Arguments: `{"message":"hi"}`},
+		}
+		Expect(hasRealCall(results, noAction)).To(BeTrue())
+	})
+})
+
+var _ = Describe("buildNoActionFinalChunks", func() {
+	const (
+		testID      = "req"
+		testModel   = "test-model"
+		testCreated = 1700000000
+	)
+	usage := schema.OpenAIUsage{PromptTokens: 5, CompletionTokens: 7, TotalTokens: 12}
+
+	Describe("Content streamed — trailing usage chunk", func() {
+		It("emits just one chunk with usage, no content, no reasoning when reasoning was streamed", func() {
+			chunks := buildNoActionFinalChunks(
+				testID, testModel, testCreated,
+				true, true,
+				"", "already-streamed-reasoning", usage,
+			)
+
+			Expect(chunks).To(HaveLen(1))
+			Expect(chunks[0].Usage.TotalTokens).To(Equal(12))
+			Expect(contentOf(chunks[0])).To(BeEmpty())
+			Expect(reasoningOf(chunks[0])).To(BeEmpty(),
+				"reasoning must not be re-emitted once it was streamed via the callback")
+		})
+
+		It("emits a trailing reasoning delivery when reasoning came only at end", func() {
+			chunks := buildNoActionFinalChunks(
+				testID, testModel, testCreated,
+				true, false,
+				"", "autoparser final reasoning", usage,
+			)
+
+			Expect(chunks).To(HaveLen(1))
+			Expect(reasoningOf(chunks[0])).To(Equal("autoparser final reasoning"))
+			Expect(contentOf(chunks[0])).To(BeEmpty())
+			Expect(chunks[0].Usage.TotalTokens).To(Equal(12))
+		})
+
+		It("omits reasoning when it's empty regardless of streamed flag", func() {
+			chunks := buildNoActionFinalChunks(
+				testID, testModel, testCreated,
+				true, false,
+				"", "", usage,
+			)
+
+			Expect(chunks).To(HaveLen(1))
+			Expect(reasoningOf(chunks[0])).To(BeEmpty())
+		})
+	})
+
+	Describe("Content not streamed — role, then content+usage", func() {
+		It("emits role chunk then content chunk without reasoning when reasoning was streamed", func() {
+			chunks := buildNoActionFinalChunks(
+				testID, testModel, testCreated,
+				false, true,
+				"the answer", "already-streamed-reasoning", usage,
+			)
+
+			Expect(chunks).To(HaveLen(2))
+			Expect(chunks[0].Choices[0].Delta.Role).To(Equal("assistant"))
+			Expect(contentOf(chunks[0])).To(BeEmpty())
+
+			Expect(contentOf(chunks[1])).To(Equal("the answer"))
+			Expect(reasoningOf(chunks[1])).To(BeEmpty(),
+				"reasoning must not be re-emitted if it was streamed earlier")
+			Expect(chunks[1].Usage.TotalTokens).To(Equal(12))
+		})
+
+		It("emits role, then content+reasoning when reasoning was not streamed", func() {
+			chunks := buildNoActionFinalChunks(
+				testID, testModel, testCreated,
+				false, false,
+				"the answer", "autoparser final reasoning", usage,
+			)
+
+			Expect(chunks).To(HaveLen(2))
+			Expect(chunks[0].Choices[0].Delta.Role).To(Equal("assistant"))
+
+			Expect(contentOf(chunks[1])).To(Equal("the answer"))
+			Expect(reasoningOf(chunks[1])).To(Equal("autoparser final reasoning"))
+			Expect(chunks[1].Usage.TotalTokens).To(Equal(12))
+		})
+
+		It("still emits content even when reasoning is empty", func() {
+			chunks := buildNoActionFinalChunks(
+				testID, testModel, testCreated,
+				false, false,
+				"just an answer", "", usage,
+			)
+
+			Expect(chunks).To(HaveLen(2))
+			Expect(contentOf(chunks[1])).To(Equal("just an answer"))
+			Expect(reasoningOf(chunks[1])).To(BeEmpty())
+		})
+	})
+
+	Describe("Metadata and shape invariants", func() {
+		It("stamps every chunk with the same id/model/created and object", func() {
+			chunks := buildNoActionFinalChunks(
+				testID, testModel, testCreated,
+				false, false,
+				"hi", "reasoning", usage,
+			)
+			for i, ch := range chunks {
+				Expect(ch.ID).To(Equal(testID), "chunk[%d] ID", i)
+				Expect(ch.Model).To(Equal(testModel), "chunk[%d] Model", i)
+				Expect(ch.Created).To(Equal(testCreated), "chunk[%d] Created", i)
+				Expect(ch.Object).To(Equal("chat.completion.chunk"), "chunk[%d] Object", i)
+				Expect(ch.Choices).To(HaveLen(1))
+				Expect(ch.Choices[0].Index).To(Equal(0))
+			}
+		})
+	})
+})
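Review note: the `hasRealCall` specs in this file pin down a simple contract: the result is true if any entry has a name other than the no-action sentinel, wherever it sits in the slice. A minimal sketch consistent with those specs follows. The `funcCallResult` type is a hypothetical stand-in for `functions.FuncCallResults`, and the body is an assumption inferred from the tests, not the function under test.

```go
package main

import "fmt"

// funcCallResult is a hypothetical stand-in for functions.FuncCallResults;
// only the two fields used by the specs are modelled here.
type funcCallResult struct {
	Name      string
	Arguments string
}

// hasRealCall reports whether at least one entry names something other than
// the noAction sentinel. Nil and empty slices yield false.
func hasRealCall(results []funcCallResult, noAction string) bool {
	for _, r := range results {
		if r.Name != noAction {
			return true
		}
	}
	return false
}

func main() {
	mixed := []funcCallResult{
		{Name: "answer", Arguments: `{"message":"hi"}`},
		{Name: "search", Arguments: "{}"},
	}
	// A check on index 0 alone would miss the real call in second position.
	fmt.Println(hasRealCall(mixed, "answer"))
}
```

Scanning the whole slice rather than inspecting only the first element is what covers the regression case of a real call following a no-action entry.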
--- a/core/http/middleware/trace.go
+++ b/core/http/middleware/trace.go
@@ -3,6 +3,7 @@ package middleware
 import (
 	"bytes"
 	"io"
+	"mime"
 	"net/http"
 	"slices"
 	"sync"
@@ -94,7 +95,8 @@ func TraceMiddleware(app *application.Application) echo.MiddlewareFunc {

 			initializeTracing(app.ApplicationConfig().TracingMaxItems)

-			if c.Request().Header.Get("Content-Type") != "application/json" {
+			ct, _, _ := mime.ParseMediaType(c.Request().Header.Get("Content-Type"))
+			if ct != "application/json" {
 				return next(c)
 			}
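Review note: the switch to `mime.ParseMediaType` matters because clients commonly send `Content-Type: application/json; charset=utf-8`, which fails a plain string comparison. The sketch below isolates that check. `isJSONContentType` is a hypothetical helper name used only for illustration, and it discards the parse error in the same way the patched middleware does.

```go
package main

import (
	"fmt"
	"mime"
)

// isJSONContentType reports whether a Content-Type header value names the
// application/json media type. Parameters such as charset are ignored, and
// the media type is compared after ParseMediaType has lowercased it.
func isJSONContentType(header string) bool {
	ct, _, _ := mime.ParseMediaType(header)
	return ct == "application/json"
}

func main() {
	for _, h := range []string{
		"application/json",
		"application/json; charset=utf-8",
		"text/plain",
		"",
	} {
		fmt.Printf("%q -> %v\n", h, isJSONContentType(h))
	}
}
```

An empty or unparsable header yields an empty media type, so such requests still bypass tracing as before.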

--- a/core/http/routes/ui_api.go
+++ b/core/http/routes/ui_api.go
@@ -23,7 +23,6 @@ import (
 	"github.com/mudler/LocalAI/core/gallery"
 	"github.com/mudler/LocalAI/core/http/auth"
 	"github.com/mudler/LocalAI/core/http/endpoints/localai"
-	"github.com/mudler/LocalAI/core/http/middleware"
 	"github.com/mudler/LocalAI/core/p2p"
 	"github.com/mudler/LocalAI/core/services/galleryop"
 	"github.com/mudler/LocalAI/pkg/model"
@@ -1458,24 +1457,5 @@ func RegisterUIAPIRoutes(app *echo.Echo, cl *config.ModelConfigLoader, ml *model
 		app.POST("/api/settings", localai.UpdateSettingsEndpoint(applicationInstance), adminMiddleware)
 	}

-	// Logs API (admin only)
-	app.GET("/api/traces", func(c echo.Context) error {
-		if !appConfig.EnableTracing {
-			return c.JSON(503, map[string]any{
-				"error": "Tracing disabled",
-			})
-		}
-		traces := middleware.GetTraces()
-		return c.JSON(200, map[string]any{
-			"traces": traces,
-		})
-	}, adminMiddleware)
-
-	app.POST("/api/traces/clear", func(c echo.Context) error {
-		middleware.ClearTraces()
-		return c.JSON(200, map[string]any{
-			"message": "Traces cleared",
-		})
-	}, adminMiddleware)
 }

--- a/core/services/messaging/subjects.go
+++ b/core/services/messaging/subjects.go
@@ -124,8 +124,13 @@ func SubjectNodeBackendInstall(nodeID string) string {
 // BackendInstallRequest is the payload for a backend.install NATS request.
 type BackendInstallRequest struct {
 	Backend          string `json:"backend"`
-	ModelID          string `json:"model_id,omitempty"` // unique model identifier — each model gets its own gRPC process
+	ModelID          string `json:"model_id,omitempty"`
 	BackendGalleries string `json:"backend_galleries,omitempty"`
+	// URI is set for external installs (OCI image, URL, or path). When non-empty
+	// the worker routes to InstallExternalBackend instead of the gallery lookup.
+	URI   string `json:"uri,omitempty"`
+	Name  string `json:"name,omitempty"`
+	Alias string `json:"alias,omitempty"`
 }

 // BackendInstallReply is the response from a backend.install NATS request.
--- a/core/services/nodes/managers_distributed.go
+++ b/core/services/nodes/managers_distributed.go
@@ -106,6 +106,13 @@ func (d *DistributedBackendManager) enqueueAndDrainBackendOp(ctx context.Context
 		if node.Status == StatusPending {
 			continue
 		}
+		// Backend lifecycle ops only make sense on backend-type workers.
+		// Agent workers don't subscribe to backend.install/delete/list, so
+		// enqueueing for them guarantees a forever-retrying row that the
+		// reconciler can never drain. Silently skip — they aren't consumers.
+		if node.NodeType != "" && node.NodeType != NodeTypeBackend {
+			continue
+		}
 		if err := d.registry.UpsertPendingBackendOp(ctx, node.ID, backend, op, galleriesJSON); err != nil {
 			xlog.Warn("Failed to enqueue backend op", "op", op, "node", node.Name, "backend", backend, "error", err)
 			result.Nodes = append(result.Nodes, NodeOpStatus{
@@ -286,7 +293,7 @@ func (d *DistributedBackendManager) InstallBackend(ctx context.Context, op *gall
 	backendName := op.GalleryElementName

 	_, err := d.enqueueAndDrainBackendOp(ctx, OpBackendInstall, backendName, galleriesJSON, func(node BackendNode) error {
-		reply, err := d.adapter.InstallBackend(node.ID, backendName, "", string(galleriesJSON))
+		reply, err := d.adapter.InstallBackend(node.ID, backendName, "", string(galleriesJSON), op.ExternalURI, op.ExternalName, op.ExternalAlias)
 		if err != nil {
 			return err
 		}
@@ -304,7 +311,7 @@ func (d *DistributedBackendManager) UpgradeBackend(ctx context.Context, name str
 	galleriesJSON, _ := json.Marshal(d.backendGalleries)

 	_, err := d.enqueueAndDrainBackendOp(ctx, OpBackendUpgrade, name, galleriesJSON, func(node BackendNode) error {
-		reply, err := d.adapter.InstallBackend(node.ID, name, "", string(galleriesJSON))
+		reply, err := d.adapter.InstallBackend(node.ID, name, "", string(galleriesJSON), "", "", "")
 		if err != nil {
 			return err
 		}
--- a/core/services/nodes/reconciler.go
+++ b/core/services/nodes/reconciler.go
@@ -3,12 +3,14 @@ package nodes
 import (
 	"context"
 	"encoding/json"
+	"errors"
 	"fmt"
 	"time"

 	"github.com/mudler/LocalAI/core/services/advisorylock"
 	grpcclient "github.com/mudler/LocalAI/pkg/grpc"
 	"github.com/mudler/xlog"
+	"github.com/nats-io/nats.go"
 	"gorm.io/gorm"
 )

@@ -186,7 +188,7 @@ func (rc *ReplicaReconciler) drainPendingBackendOps(ctx context.Context) {
 		case OpBackendDelete:
 			_, applyErr = rc.adapter.DeleteBackend(op.NodeID, op.Backend)
 		case OpBackendInstall, OpBackendUpgrade:
-			reply, err := rc.adapter.InstallBackend(op.NodeID, op.Backend, "", string(op.Galleries))
+			reply, err := rc.adapter.InstallBackend(op.NodeID, op.Backend, "", string(op.Galleries), "", "", "")
 			if err != nil {
 				applyErr = err
 			} else if !reply.Success {
@@ -206,12 +208,47 @@ func (rc *ReplicaReconciler) drainPendingBackendOps(ctx context.Context) {
 			}
 			continue
 		}
+
+		// ErrNoResponders means the node has no active NATS subscription for
+		// this subject. Either its connection dropped, or it's the wrong
+		// node type entirely. Mark unhealthy so the health monitor's
+		// heartbeat-only pass doesn't immediately flip it back — and so
+		// ListDuePendingBackendOps (which filters by status=healthy) stops
+		// picking the row until the node genuinely recovers.
+		if errors.Is(applyErr, nats.ErrNoResponders) {
+			xlog.Warn("Reconciler: no NATS responders — marking node unhealthy",
+				"op", op.Op, "backend", op.Backend, "node", op.NodeID)
+			_ = rc.registry.MarkUnhealthy(ctx, op.NodeID)
+		}
+
+		// Dead-letter cap: after maxAttempts the row is the reconciler
+		// equivalent of a poison message. Delete it loudly so the queue
+		// doesn't churn NATS every tick forever — operators can re-issue
+		// the op from the UI if they still want it applied.
+		if op.Attempts+1 >= maxPendingBackendOpAttempts {
+			xlog.Error("Reconciler: abandoning pending backend op after max attempts",
+				"op", op.Op, "backend", op.Backend, "node", op.NodeID,
+				"attempts", op.Attempts+1, "last_error", applyErr)
+			if err := rc.registry.DeletePendingBackendOp(ctx, op.ID); err != nil {
+				xlog.Warn("Reconciler: failed to delete abandoned op row", "id", op.ID, "error", err)
+			}
+			continue
+		}
+
 		_ = rc.registry.RecordPendingBackendOpFailure(ctx, op.ID, applyErr.Error())
 		xlog.Warn("Reconciler: pending backend op retry failed",
 			"op", op.Op, "backend", op.Backend, "node", op.NodeID, "attempts", op.Attempts+1, "error", applyErr)
 	}
 }

+// maxPendingBackendOpAttempts caps how many times the reconciler retries a
+// failing row before dead-lettering it. Ten attempts at exponential backoff
+// (30s → 15m cap) is >1h of wall-clock patience — well past any transient
+// worker restart or network blip. Poisoned rows beyond that are almost
+// certainly structural (wrong node type, non-existent gallery entry) and no
+// amount of further retrying will help.
+const maxPendingBackendOpAttempts = 10
+
 // probeLoadedModels gRPC-health-checks model addresses that the DB says are
 // loaded. If a model's backend process is gone (OOM, crash, manual restart)
 // we remove the row so ghosts don't linger. Only probes rows older than
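Review note: the ">1h" figure in the `maxPendingBackendOpAttempts` comment can be sanity-checked with a small calculation. The sketch assumes the delay doubles after each failure, starting at 30s and clamped at 15m. The doubling factor is an assumption, since the diff only states the start and the cap, and `totalBackoff` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// totalBackoff sums the waits between `attempts` tries, assuming each wait
// doubles from base and is clamped at ceiling. There is one wait fewer than
// there are attempts, because nothing is waited after the final failure.
func totalBackoff(attempts int, base, ceiling time.Duration) time.Duration {
	var total time.Duration
	delay := base
	for i := 1; i < attempts; i++ {
		total += delay
		delay *= 2
		if delay > ceiling {
			delay = ceiling
		}
	}
	return total
}

func main() {
	// Waits: 30s, 1m, 2m, 4m, 8m, then 15m four times.
	fmt.Println(totalBackoff(10, 30*time.Second, 15*time.Minute))
}
```

Under these assumptions the nine waits add up to 1h15m30s, which agrees with the comment's claim of more than an hour before a row is dead-lettered.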
--- a/core/services/nodes/reconciler_test.go
+++ b/core/services/nodes/reconciler_test.go
@@ -373,4 +373,30 @@ var _ = Describe("ReplicaReconciler — state reconciliation", func() {
 			Expect(row.NextRetryAt).To(BeTemporally(">", before))
 		})
 	})
+
+	Describe("NewNodeRegistry malformed-row pruning", func() {
+		It("drops queue rows for agent nodes and empty backend names on startup", func() {
+			agent := &BackendNode{Name: "agent-1", NodeType: NodeTypeAgent, Address: "x"}
+			Expect(registry.Register(context.Background(), agent, true)).To(Succeed())
+			backend := &BackendNode{Name: "backend-1", NodeType: NodeTypeBackend, Address: "y"}
+			Expect(registry.Register(context.Background(), backend, true)).To(Succeed())
+
+			// Three rows: one for a valid backend node (should survive),
+			// one for an agent node (pruned), one for an empty backend name
+			// on the valid node (pruned).
+			Expect(registry.UpsertPendingBackendOp(context.Background(), backend.ID, "foo", OpBackendInstall, nil)).To(Succeed())
+			Expect(registry.UpsertPendingBackendOp(context.Background(), agent.ID, "foo", OpBackendInstall, nil)).To(Succeed())
+			Expect(registry.UpsertPendingBackendOp(context.Background(), backend.ID, "", OpBackendInstall, nil)).To(Succeed())
+
+			// Re-instantiating the registry runs the cleanup migration.
+			_, err := NewNodeRegistry(db)
+			Expect(err).ToNot(HaveOccurred())
+
+			var rows []PendingBackendOp
+			Expect(db.Find(&rows).Error).To(Succeed())
+			Expect(rows).To(HaveLen(1))
+			Expect(rows[0].NodeID).To(Equal(backend.ID))
+			Expect(rows[0].Backend).To(Equal("foo"))
+		})
+	})
 })
--- a/core/services/nodes/registry.go
+++ b/core/services/nodes/registry.go
@@ -148,6 +148,30 @@ func NewNodeRegistry(db *gorm.DB) (*NodeRegistry, error) {
 	}); err != nil {
 		return nil, fmt.Errorf("migrating node tables: %w", err)
 	}
+
+	// One-shot cleanup of queue rows that can never drain: ops targeted at
+	// agent workers (wrong subscription set), at non-existent nodes, or with
+	// an empty backend name. The guard in enqueueAndDrainBackendOp prevents
+	// new ones from being written, but rows persisted by earlier versions
+	// keep the reconciler busy retrying a permanently-failing NATS request
+	// every 30s. Guarded by the same migration advisory lock so only one
+	// frontend runs it.
+	_ = advisorylock.WithLockCtx(context.Background(), db, advisorylock.KeySchemaMigrate, func() error {
+		res := db.Exec(`
+			DELETE FROM pending_backend_ops
+			WHERE backend = ''
+			   OR node_id NOT IN (SELECT id FROM backend_nodes WHERE node_type = ? OR node_type = '')
+		`, NodeTypeBackend)
+		if res.Error != nil {
+			xlog.Warn("Failed to prune malformed pending_backend_ops rows", "error", res.Error)
+			return res.Error
+		}
+		if res.RowsAffected > 0 {
+			xlog.Info("Pruned pending_backend_ops rows (wrong node type, missing node, or empty backend)", "count", res.RowsAffected)
+		}
+		return nil
+	})
+
 	return &NodeRegistry{db: db}, nil
 }

--- a/core/services/nodes/router.go
+++ b/core/services/nodes/router.go
@@ -504,7 +504,7 @@ func (r *SmartRouter) installBackendOnNode(ctx context.Context, node *BackendNod
 		return "", fmt.Errorf("no NATS connection for backend installation")
 	}

-	reply, err := r.unloader.InstallBackend(node.ID, backendType, modelID, r.galleriesJSON)
+	reply, err := r.unloader.InstallBackend(node.ID, backendType, modelID, r.galleriesJSON, "", "", "")
 	if err != nil {
 		return "", err
 	}
--- a/core/services/nodes/router_test.go
+++ b/core/services/nodes/router_test.go
@@ -244,7 +244,7 @@ type fakeUnloader struct {
 	unloadErr    error
 }

-func (f *fakeUnloader) InstallBackend(_, _, _, _ string) (*messaging.BackendInstallReply, error) {
+func (f *fakeUnloader) InstallBackend(_, _, _, _, _, _, _ string) (*messaging.BackendInstallReply, error) {
 	return f.installReply, f.installErr
 }

--- a/core/services/nodes/unloader.go
+++ b/core/services/nodes/unloader.go
@@ -17,7 +17,7 @@ type backendStopRequest struct {
 // NodeCommandSender abstracts NATS-based commands to worker nodes.
 // Used by HTTP endpoint handlers to avoid coupling to the concrete RemoteUnloaderAdapter.
 type NodeCommandSender interface {
-	InstallBackend(nodeID, backendType, modelID, galleriesJSON string) (*messaging.BackendInstallReply, error)
+	InstallBackend(nodeID, backendType, modelID, galleriesJSON, uri, name, alias string) (*messaging.BackendInstallReply, error)
 	DeleteBackend(nodeID, backendName string) (*messaging.BackendDeleteReply, error)
 	ListBackends(nodeID string) (*messaging.BackendListReply, error)
 	StopBackend(nodeID, backend string) error
@@ -72,7 +72,7 @@ func (a *RemoteUnloaderAdapter) UnloadRemoteModel(modelName string) error {
 // The worker installs the backend from gallery (if not already installed),
 // starts the gRPC process, and replies when ready.
 // Timeout: 5 minutes (gallery install can take a while).
-func (a *RemoteUnloaderAdapter) InstallBackend(nodeID, backendType, modelID, galleriesJSON string) (*messaging.BackendInstallReply, error) {
+func (a *RemoteUnloaderAdapter) InstallBackend(nodeID, backendType, modelID, galleriesJSON, uri, name, alias string) (*messaging.BackendInstallReply, error) {
 	subject := messaging.SubjectNodeBackendInstall(nodeID)
 	xlog.Info("Sending NATS backend.install", "nodeID", nodeID, "backend", backendType, "modelID", modelID)

@@ -80,6 +80,9 @@ func (a *RemoteUnloaderAdapter) InstallBackend(nodeID, backendType, modelID, gal
 		Backend:          backendType,
 		ModelID:          modelID,
 		BackendGalleries: galleriesJSON,
+		URI:              uri,
+		Name:             name,
+		Alias:            alias,
 	}, 5*time.Minute)
 }

--- a/docs/content/features/backend-monitor.md
+++ b/docs/content/features/backend-monitor.md
@@ -14,11 +14,13 @@ LocalAI provides endpoints to monitor and manage running backends. The `/backend

 ### Request

-The request body is JSON:
+The model to monitor is passed as a query parameter:

-| Parameter | Type     | Required | Description                    |
-|-----------|----------|----------|--------------------------------|
-| `model`   | `string` | Yes      | Name of the model to monitor   |
+| Parameter | Type     | Required | Location | Description                    |
+|-----------|----------|----------|----------|--------------------------------|
+| `model`   | `string` | Yes      | query    | Name of the model to monitor   |
+
+For backwards compatibility, a JSON body with the same field is still accepted when the `model` query parameter is not set, but new clients should use the query parameter.

 ### Response

@@ -42,9 +44,7 @@ If the gRPC status call fails, the endpoint falls back to local process metrics:
 ### Usage

 ```bash
-curl http://localhost:8080/backend/monitor \
-  -H "Content-Type: application/json" \
-  -d '{"model": "my-model"}'
+curl "http://localhost:8080/backend/monitor?model=my-model"
 ```

 ### Example response
--- a/docs/content/reference/_index.md
+++ b/docs/content/reference/_index.md
@@ -130,6 +130,19 @@ Reference for system information commands and diagnostics.

 ---

+### 🤖 [AI Coding Assistants](ai-coding-assistants.md)
+Policy for AI-assisted contributions — licensing, DCO, and attribution.
+
+**Key topics:**
+- Aligned with the Linux kernel's AI assistants policy
+- Signed-off-by and DCO rules
+- `Assisted-by` commit trailer format
+- Scope and responsibility of the human submitter
+
+**Recommended for:** Contributors using AI coding assistants (Claude, Copilot, Cursor, Codex, etc.)
+
+---
+
 ## Quick Links

 | Task | Documentation |
@@ -138,6 +151,7 @@ Reference for system information commands and diagnostics.
 | CLI commands | [CLI Reference](cli-reference.md) |
 | Check compatibility | [Compatibility Table](compatibility-table.md) |
 | System diagnostics | [System Info](system-info.md) |
+| Contribute with AI assistance | [AI Coding Assistants](ai-coding-assistants.md) |

 ---

--- a/docs/content/reference/ai-coding-assistants.md
+++ b/docs/content/reference/ai-coding-assistants.md
@@ -0,0 +1,79 @@
+
+++
+disableToc = false
+title = "AI Coding Assistants"
+weight = 28
+++
+
+This document provides guidance for AI tools and developers using AI assistance when contributing to LocalAI.
+
+**LocalAI follows the same guidelines as the Linux kernel project for AI-assisted contributions.** See the upstream policy here: <https://docs.kernel.org/process/coding-assistants.html>. The rules below mirror that policy, adapted to LocalAI's license and project layout.
+
+AI tools helping with LocalAI development should follow the standard project development process:
+
+- [CONTRIBUTING.md](https://github.com/mudler/LocalAI/blob/master/CONTRIBUTING.md) — development workflow, commit conventions, and PR guidelines
+- [AGENTS.md](https://github.com/mudler/LocalAI/blob/master/AGENTS.md) — the agent entry point with links to all detailed topic guides
+- [.agents/ai-coding-assistants.md](https://github.com/mudler/LocalAI/blob/master/.agents/ai-coding-assistants.md) — the full policy source of truth
+
+## Licensing and Legal Requirements
+
+All contributions must comply with LocalAI's licensing requirements:
+
+- LocalAI is licensed under the **MIT License**
+- New source files should use the SPDX license identifier `MIT` where applicable to the file type
+- Contributions must be compatible with the MIT License and must not introduce code under incompatible licenses (e.g., GPL) without an explicit discussion with maintainers
+
+## Signed-off-by and Developer Certificate of Origin
+
+**AI agents MUST NOT add `Signed-off-by` tags.** Only humans can legally certify the Developer Certificate of Origin (DCO). The human submitter is responsible for:
+
+- Reviewing all AI-generated code
+- Ensuring compliance with licensing requirements
+- Adding their own `Signed-off-by` tag (when the project requires DCO) to certify the contribution
+- Taking full responsibility for the contribution
+
+AI agents MUST NOT add `Co-Authored-By` trailers for themselves either. A human reviewer owns the contribution; the AI's involvement is recorded via `Assisted-by` (see below).
+
+## Attribution
+
+When AI tools contribute to LocalAI development, proper attribution helps track the evolving role of AI in the development process. Contributions should include an `Assisted-by` tag in the commit message trailer in the following format:
+
+```
+Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]
+```
+
+Where:
+
+- `AGENT_NAME` — name of the AI tool or framework (e.g., `Claude`, `Copilot`, `Cursor`)
+- `MODEL_VERSION` — specific model version used (e.g., `claude-opus-4-7`, `gpt-5`)
+- `[TOOL1] [TOOL2]` — optional specialized analysis tools invoked by the agent (e.g., `golangci-lint`, `staticcheck`, `go vet`)
+
+Basic development tools (git, go, make, editors) should **not** be listed.
+
+### Example
+
+```
+fix(llama-cpp): handle empty tool call arguments
+
+Previously the parser panicked when the model returned a tool call with
+an empty arguments object. Fall back to an empty JSON object in that
+case so downstream consumers receive a valid payload.
+
+Assisted-by: Claude:claude-opus-4-7 golangci-lint
+Signed-off-by: Jane Developer <jane@example.com>
+```
+
+## Scope and Responsibility
+
+Using an AI assistant does not reduce the contributor's responsibility. The human submitter must:
+
+- Understand every line that lands in the PR
+- Verify that generated code compiles, passes tests, and follows the project style
+- Confirm that any referenced APIs, flags, or file paths actually exist in the current tree (AI models may hallucinate identifiers)
+- Not submit AI output verbatim without review
+
+Reviewers may ask for clarification on any change regardless of how it was produced. "An AI wrote it" is not an acceptable answer to a design question.
+
+{{% notice note %}}
+This policy is a living document. If you're unsure how to apply it to a specific contribution, open an issue or ask in the [Discord channel](https://discord.gg/uJAeKSAGDy) before submitting.
+{{% /notice %}}
--- a/docs/content/reference/compatibility-table.md
+++ b/docs/content/reference/compatibility-table.md
@@ -33,7 +33,7 @@ LocalAI will attempt to automatically load models which are not explicitly confi
 |---------|-------------|-------------|
 | [whisper.cpp](https://github.com/ggml-org/whisper.cpp) | OpenAI Whisper in C/C++ | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
 | [faster-whisper](https://github.com/SYSTRAN/faster-whisper) | Fast Whisper with CTranslate2 | CUDA 12/13, ROCm, Intel, Metal |
-| [WhisperX](https://github.com/m-bain/whisperX) | Word-level timestamps and speaker diarization | CPU, CUDA 12/13, ROCm, Metal |
+| [WhisperX](https://github.com/m-bain/whisperX) | Word-level timestamps and speaker diarization | CPU, CUDA 12/13, Metal |
 | [moonshine](https://github.com/moonshine-ai/moonshine) | Ultra-fast transcription for low-end devices | CPU, CUDA 12/13, Metal |
 | [voxtral](https://github.com/mudler/voxtral.c) | Voxtral Realtime 4B speech-to-text in C | CPU, Metal |
 | [Qwen3-ASR](https://github.com/QwenLM/Qwen3-ASR) | Qwen3 automatic speech recognition | CPU, CUDA 12/13, ROCm, Intel, Metal, Jetson L4T |
--- a/gallery/index.yaml
+++ b/gallery/index.yaml
@@ -1,4 +1,151 @@
 ---
+- name: "qwen3.5-9b-glm5.1-distill-v1"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1-GGUF
+  description: |
+    # 🪐 Qwen3.5-9B-GLM5.1-Distill-v1
+
+    ## 📌 Model Overview
+
+    **Model Name:** `Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1`
+    **Base Model:** Qwen3.5-9B
+    **Training Type:** Supervised Fine-Tuning (SFT, Distillation)
+    **Parameter Scale:** 9B
+    **Training Framework:** Unsloth
+
+    This model is a distilled variant of **Qwen3.5-9B**, trained on high-quality reasoning data derived from **GLM-5.1**.
+
+    The primary goals are to:
+
+      - Improve **structured reasoning ability**
+      - Enhance **instruction-following consistency**
+      - Activate **latent knowledge via better reasoning structure**
+
+    ## 📊 Training Data
+
+    ### Main Dataset
+
+      - `Jackrong/GLM-5.1-Reasoning-1M-Cleaned`
+      - Cleaned from the original `Kassadin88/GLM-5.1-1000000x` dataset.
+      - Generated from a **GLM-5.1 teacher model**
+      - Approximately **700x** the scale of `Qwen3.5-reasoning-700x`
+      - Training used a **filtered subset**, not the full source dataset.
+
+    ### Auxiliary Dataset
+
+      - `Jackrong/Qwen3.5-reasoning-700x`
+
+    ...
+  license: "apache-2.0"
+  tags:
+    - llm
+    - gguf
+    - qwen
+    - instruction-tuned
+    - reasoning
+  icon: https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/BnSg_x99v9bG9T5-8sKa1.png
+  overrides:
+    backend: llama-cpp
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    mmproj: llama-cpp/mmproj/Qwen3.5-9B-GLM5.1-Distill-v1-GGUF/mmproj.gguf
+    options:
+      - use_jinja:true
+    parameters:
+      min_p: 0
+      model: llama-cpp/models/Qwen3.5-9B-GLM5.1-Distill-v1-GGUF/Qwen3.5-9B-GLM5.1-Distill-v1-Q4_K_M.gguf
+      presence_penalty: 1.5
+      repeat_penalty: 1
+      temperature: 0.7
+      top_k: 20
+      top_p: 0.8
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/Qwen3.5-9B-GLM5.1-Distill-v1-GGUF/Qwen3.5-9B-GLM5.1-Distill-v1-Q4_K_M.gguf
+      sha256: f6f1d2b8efb2339ce9d4dd0f0329d2f2e4cf765eda49aa3f6df8f629f871a151
+      uri: https://huggingface.co/Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1-GGUF/resolve/main/Qwen3.5-9B-GLM5.1-Distill-v1-Q4_K_M.gguf
+    - filename: llama-cpp/mmproj/Qwen3.5-9B-GLM5.1-Distill-v1-GGUF/mmproj.gguf
+      sha256: e42c1c2ed0eaf6ea88a6ba10b26b4adf00a96a8c3d1803534a4c41060ad9e86b
+      uri: https://huggingface.co/Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1-GGUF/resolve/main/mmproj.gguf
+- name: "supergemma4-26b-uncensored-v2"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/Jiunsong/supergemma4-26b-uncensored-gguf-v2
+  description: |
+    Hugging Face |
+    GitHub |
+    Launch Blog |
+    Documentation
+
+    License: Apache 2.0 | Authors: Google DeepMind
+
+    Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages.
+
+    Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: **E2B**, **E4B**, **26B A4B**, and **31B**. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI.
+
+    Gemma 4 introduces key **capability and architectural advancements**:
+
+    * **Reasoning** – All models in the family are designed as highly capable reasoners, with configurable thinking modes.
+
+    ...
+  license: "gemma"
+  tags:
+    - llm
+    - gguf
+  icon: https://ai.google.dev/gemma/images/gemma4_banner.png
+  overrides:
+    backend: llama-cpp
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    options:
+      - use_jinja:true
+    parameters:
+      model: llama-cpp/models/supergemma4-26b-uncensored-gguf-v2/supergemma4-26b-uncensored-fast-v2-Q4_K_M.gguf
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/supergemma4-26b-uncensored-gguf-v2/supergemma4-26b-uncensored-fast-v2-Q4_K_M.gguf
+      sha256: e773b0a209d48524f9d485bca0818247f75d7ddde7cce951367a7e441fb59137
+      uri: https://huggingface.co/Jiunsong/supergemma4-26b-uncensored-gguf-v2/resolve/main/supergemma4-26b-uncensored-fast-v2-Q4_K_M.gguf
+- name: "qwopus-glm-18b-merged"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/Jackrong/Qwopus-GLM-18B-Merged-GGUF
+  description: "# \U0001FA90 Qwen3.5-9B-GLM5.1-Distill-v1\n\n## \U0001F4CC Model Overview\n\n**Model Name:** `Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1`\n**Base Model:** Qwen3.5-9B\n**Training Type:** Supervised Fine-Tuning (SFT, Distillation)\n**Parameter Scale:** 9B\n**Training Framework:** Unsloth\n\nThis model is a distilled variant of **Qwen3.5-9B**, trained on high-quality reasoning data derived from **GLM-5.1**.\n\nThe primary goals are to:\n\n  - Improve **structured reasoning ability**\n  - Enhance **instruction-following consistency**\n  - Activate **latent knowledge via better reasoning structure**\n\n## \U0001F4CA Training Data\n\n### Main Dataset\n\n  - `Jackrong/GLM-5.1-Reasoning-1M-Cleaned`\n  - Cleaned from the original `Kassadin88/GLM-5.1-1000000x` dataset.\n  - Generated from a **GLM-5.1 teacher model**\n  - Approximately **700x** the scale of `Qwen3.5-reasoning-700x`\n  - Training used a **filtered subset**, not the full source dataset.\n\n### Auxiliary Dataset\n\n  - `Jackrong/Qwen3.5-reasoning-700x`\n\n...\n"
+  license: "apache-2.0"
+  tags:
+    - llm
+    - gguf
+    - reasoning
+  icon: https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/BnSg_x99v9bG9T5-8sKa1.png
+  overrides:
+    backend: llama-cpp
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    options:
+      - use_jinja:true
+    parameters:
+      model: llama-cpp/models/Qwopus-GLM-18B-Merged-GGUF/Qwopus-GLM-18B-Healed-Q4_K_M.gguf
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/Qwopus-GLM-18B-Merged-GGUF/Qwopus-GLM-18B-Healed-Q4_K_M.gguf
+      sha256: 13bd039f95c9ea46ef1d75905faa7be6ca4e47a5af9d4cf62e298a738a5b195f
+      uri: https://huggingface.co/Jackrong/Qwopus-GLM-18B-Merged-GGUF/resolve/main/Qwopus-GLM-18B-Healed-Q4_K_M.gguf
 - name: "qwen3.6-35b-a3b-apex"
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
  urls:
@@ -887,6 +1034,8 @@
    - gpu
  overrides:
    backend: neutts
+    parameters:
+      model: neuphonic/neutts-air
    known_usecases:
      - tts
 - name: vllm-omni-z-image-turbo
@@ -15186,14 +15335,16 @@
    - gpu
  overrides:
    parameters:
-      model: wan2.1-t2v-1.3B-Q8_0.gguf
+      model: wan2.1_t2v_1.3b-q8_0.gguf
  files:
-    - filename: "wan2.1-t2v-1.3B-Q8_0.gguf"
-      uri: "huggingface://calcuis/wan-gguf/wan2.1-t2v-1.3B-Q8_0.gguf"
+    - filename: "wan2.1_t2v_1.3b-q8_0.gguf"
+      sha256: "8f10260cc26498fee303851ee1c2047918934125731b9b78d4babfce4ec27458"
+      uri: "huggingface://calcuis/wan-gguf/wan2.1_t2v_1.3b-q8_0.gguf"
    - filename: "wan_2.1_vae.safetensors"
      uri: "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors"
    - filename: "umt5-xxl-encoder-Q8_0.gguf"
      uri: "huggingface://city96/umt5-xxl-encoder-gguf/umt5-xxl-encoder-Q8_0.gguf"
+      sha256: 2521d4de0bf9e1cc6549866463ceae85e4ec3239bc6063f7488810be39033bbc
 - name: wan-2.1-i2v-14b-480p-ggml
  license: apache-2.0
  url: "github:mudler/LocalAI/gallery/wan-ggml.yaml@master"
@@ -15214,11 +15365,103 @@
      model: wan2.1-i2v-14b-480p-Q4_K_M.gguf
    options:
      - "clip_vision_path:clip_vision_h.safetensors"
+      - "diffusion_model"
+      - "vae_decode_only:false"
+      - "sampler:euler"
+      - "flow_shift:3.0"
+      - "t5xxl_path:umt5-xxl-encoder-Q8_0.gguf"
+      - "vae_path:wan_2.1_vae.safetensors"
  files:
    - filename: "wan2.1-i2v-14b-480p-Q4_K_M.gguf"
+      sha256: "d91f7139acadb42ea05cdf97b311e5099f714f11fbe4d90916500e2f53cbba82"
      uri: "huggingface://city96/Wan2.1-I2V-14B-480P-gguf/wan2.1-i2v-14b-480p-Q4_K_M.gguf"
    - filename: "wan_2.1_vae.safetensors"
      uri: "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors"
+    - filename: "umt5-xxl-encoder-Q8_0.gguf"
+      uri: "huggingface://city96/umt5-xxl-encoder-gguf/umt5-xxl-encoder-Q8_0.gguf"
+      sha256: 2521d4de0bf9e1cc6549866463ceae85e4ec3239bc6063f7488810be39033bbc
+    - filename: "clip_vision_h.safetensors"
+      uri: "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/clip_vision/clip_vision_h.safetensors"
+- name: wan-2.1-flf2v-14b-720p-ggml
+  license: apache-2.0
+  url: "github:mudler/LocalAI/gallery/wan-ggml.yaml@master"
+  description: |
+    Wan 2.1 FLF2V 14B 720P — first-last-frame-to-video diffusion, GGUF Q4_K_M.
+    Takes a start and end reference image and interpolates a 33-frame clip
+    between them. Unlike the plain I2V variant this model feeds the end
+    frame through clip_vision as well, so it conditions semantically (not
+    just in pixel space) on both endpoints. That makes it the right choice
+    for seamless loops (start_image == end_image) and clean narrative cuts.
+    Native 720p but also accepts 480p resolutions; shares the same VAE, umt5-xxl
+    text encoder, and clip_vision_h as the I2V 14B entries.
+  urls:
+    - https://huggingface.co/city96/Wan2.1-FLF2V-14B-720P-gguf
+  tags:
+    - image-to-video
+    - first-last-frame-to-video
+    - wan
+    - video-generation
+    - cpu
+    - gpu
+  overrides:
+    parameters:
+      model: wan2.1-flf2v-14b-720p-Q4_K_M.gguf
+    options:
+      - "clip_vision_path:clip_vision_h.safetensors"
+      - "diffusion_model"
+      - "vae_decode_only:false"
+      - "sampler:euler"
+      - "flow_shift:3.0"
+      - "t5xxl_path:umt5-xxl-encoder-Q8_0.gguf"
+      - "vae_path:wan_2.1_vae.safetensors"
+  files:
+    - filename: "wan2.1-flf2v-14b-720p-Q4_K_M.gguf"
+      sha256: "7652d7d8b0795009ff21ed83d806af762aae8a8faa8640dd07b3a67e4dfab445"
+      uri: "huggingface://city96/Wan2.1-FLF2V-14B-720P-gguf/wan2.1-flf2v-14b-720p-Q4_K_M.gguf"
+    - filename: "wan_2.1_vae.safetensors"
+      uri: "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors"
+    - filename: "umt5-xxl-encoder-Q8_0.gguf"
+      uri: "huggingface://city96/umt5-xxl-encoder-gguf/umt5-xxl-encoder-Q8_0.gguf"
+      sha256: 2521d4de0bf9e1cc6549866463ceae85e4ec3239bc6063f7488810be39033bbc
+    - filename: "clip_vision_h.safetensors"
+      uri: "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/clip_vision/clip_vision_h.safetensors"
+- name: wan-2.1-i2v-14b-720p-ggml
+  license: apache-2.0
+  url: "github:mudler/LocalAI/gallery/wan-ggml.yaml@master"
+  description: |
+    Wan 2.1 I2V 14B 720P — image-to-video diffusion, GGUF Q4_K_M.
+    Native 720p sibling of the 480p I2V model: animates a single
+    reference image into a 33-frame clip at up to 1280x720. Trained
+    purely as image-to-video (no first-last-frame interpolation path),
+    so motion is freer and better suited to single-anchor animation
+    than repurposing the FLF2V 720P variant for i2v. Shares the same
+    VAE, umt5-xxl text encoder, and clip_vision_h as the I2V 14B 480P
+    and FLF2V 14B 720P entries.
+  urls:
+    - https://huggingface.co/city96/Wan2.1-I2V-14B-720P-gguf
+  tags:
+    - image-to-video
+    - wan
+    - video-generation
+    - cpu
+    - gpu
+  overrides:
+    parameters:
+      model: wan2.1-i2v-14b-720p-Q4_K_M.gguf
+    options:
+      - "clip_vision_path:clip_vision_h.safetensors"
+      - "diffusion_model"
+      - "vae_decode_only:false"
+      - "sampler:euler"
+      - "flow_shift:3.0"
+      - "t5xxl_path:umt5-xxl-encoder-Q8_0.gguf"
+      - "vae_path:wan_2.1_vae.safetensors"
+  files:
+    - filename: "wan2.1-i2v-14b-720p-Q4_K_M.gguf"
+      sha256: "ffecd91e4b636d8e3e43f3fa388218158ba447109547bde777c6d67ef4fe42a4"
+      uri: "huggingface://city96/Wan2.1-I2V-14B-720P-gguf/wan2.1-i2v-14b-720p-Q4_K_M.gguf"
+    - filename: "wan_2.1_vae.safetensors"
+      uri: "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors"
    - filename: "umt5-xxl-encoder-Q8_0.gguf"
      uri: "huggingface://city96/umt5-xxl-encoder-gguf/umt5-xxl-encoder-Q8_0.gguf"
    - filename: "clip_vision_h.safetensors"
--- a/gallery/wan-ggml.yaml
+++ b/gallery/wan-ggml.yaml
@@ -9,11 +9,6 @@ config_file: |
    - "diffusion_model"
    - "vae_decode_only:false"
    - "sampler:euler"
-    - "scheduler:discrete"
    - "flow_shift:3.0"
-    - "diffusion_flash_attn:true"
-    - "offload_params_to_cpu:true"
-    - "keep_vae_on_cpu:true"
-    - "keep_clip_on_cpu:true"
    - "t5xxl_path:umt5-xxl-encoder-Q8_0.gguf"
    - "vae_path:wan_2.1_vae.safetensors"
--- a/go.mod
+++ b/go.mod
@@ -8,13 +8,13 @@ require (
 	github.com/Masterminds/sprig/v3 v3.3.0
 	github.com/alecthomas/kong v1.14.0
 	github.com/anthropics/anthropic-sdk-go v1.27.0
-	github.com/aws/aws-sdk-go-v2 v1.41.5
-	github.com/aws/aws-sdk-go-v2/config v1.32.14
-	github.com/aws/aws-sdk-go-v2/credentials v1.19.14
-	github.com/aws/aws-sdk-go-v2/service/s3 v1.97.1
+	github.com/aws/aws-sdk-go-v2 v1.41.6
+	github.com/aws/aws-sdk-go-v2/config v1.32.16
+	github.com/aws/aws-sdk-go-v2/credentials v1.19.15
+	github.com/aws/aws-sdk-go-v2/service/s3 v1.99.1
 	github.com/charmbracelet/glamour v1.0.0
-	github.com/containerd/containerd v1.7.30
-	github.com/coreos/go-oidc/v3 v3.17.0
+	github.com/containerd/containerd v1.7.31
+	github.com/coreos/go-oidc/v3 v3.18.0
 	github.com/dhowden/tag v0.0.0-20240417053706-3d75831295e8
 	github.com/ebitengine/purego v0.10.0
 	github.com/emirpasic/gods/v2 v2.0.0-alpha
@@ -35,7 +35,7 @@ require (
 	github.com/lithammer/fuzzysearch v1.1.8
 	github.com/mholt/archiver/v3 v3.5.1
 	github.com/microcosm-cc/bluemonday v1.0.27
-	github.com/modelcontextprotocol/go-sdk v1.4.1
+	github.com/modelcontextprotocol/go-sdk v1.5.0
 	github.com/mudler/cogito v0.9.5-0.20260315222927-63abdec7189b
 	github.com/mudler/edgevpn v0.31.1
 	github.com/mudler/go-processmanager v0.1.0
@@ -75,24 +75,23 @@ require (
 )

 require (
-	github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.7.7 // indirect
-	github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.18.21 // indirect
-	github.com/aws/aws-sdk-go-v2/internal/configsources v1.4.21 // indirect
-	github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.7.21 // indirect
-	github.com/aws/aws-sdk-go-v2/internal/ini v1.8.6 // indirect
-	github.com/aws/aws-sdk-go-v2/internal/v4a v1.4.21 // indirect
-	github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.13.7 // indirect
-	github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.9.12 // indirect
-	github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.13.21 // indirect
-	github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.19.20 // indirect
-	github.com/aws/aws-sdk-go-v2/service/signin v1.0.9 // indirect
-	github.com/aws/aws-sdk-go-v2/service/sso v1.30.15 // indirect
-	github.com/aws/aws-sdk-go-v2/service/ssooidc v1.35.19 // indirect
-	github.com/aws/aws-sdk-go-v2/service/sts v1.41.10 // indirect
-	github.com/aws/smithy-go v1.24.2 // indirect
+	github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.7.9 // indirect
+	github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.18.22 // indirect
+	github.com/aws/aws-sdk-go-v2/internal/configsources v1.4.22 // indirect
+	github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.7.22 // indirect
+	github.com/aws/aws-sdk-go-v2/internal/v4a v1.4.23 // indirect
+	github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.13.8 // indirect
+	github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.9.14 // indirect
+	github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.13.22 // indirect
+	github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.19.22 // indirect
+	github.com/aws/aws-sdk-go-v2/service/signin v1.0.10 // indirect
+	github.com/aws/aws-sdk-go-v2/service/sso v1.30.16 // indirect
+	github.com/aws/aws-sdk-go-v2/service/ssooidc v1.35.20 // indirect
+	github.com/aws/aws-sdk-go-v2/service/sts v1.42.0 // indirect
+	github.com/aws/smithy-go v1.25.0 // indirect
 	github.com/bahlo/generic-list-go v0.2.0 // indirect
 	github.com/buger/jsonparser v1.1.1 // indirect
-	github.com/go-jose/go-jose/v4 v4.1.3 // indirect
+	github.com/go-jose/go-jose/v4 v4.1.4 // indirect
 	github.com/jinzhu/inflection v1.0.0 // indirect
 	github.com/jinzhu/now v1.1.5 // indirect
 	github.com/mattn/go-sqlite3 v1.14.24 // indirect
--- a/go.sum
+++ b/go.sum
@@ -70,44 +70,42 @@ github.com/anthropics/anthropic-sdk-go v1.27.0 h1:0CWbmBq5ofGAjF2H6lefCNRbnaUMGi
 github.com/anthropics/anthropic-sdk-go v1.27.0/go.mod h1:qUKmaW+uuPB64iy1l+4kOSvaLqPXnHTTBKH6RVZ7q5Q=
 github.com/armon/go-socks5 v0.0.0-20160902184237-e75332964ef5 h1:0CwZNZbxp69SHPdPJAN/hZIm0C4OItdklCFmMRWYpio=
 github.com/armon/go-socks5 v0.0.0-20160902184237-e75332964ef5/go.mod h1:wHh0iHkYZB8zMSxRWpUBQtwG5a7fFgvEO+odwuTv2gs=
-github.com/aws/aws-sdk-go-v2 v1.41.5 h1:dj5kopbwUsVUVFgO4Fi5BIT3t4WyqIDjGKCangnV/yY=
-github.com/aws/aws-sdk-go-v2 v1.41.5/go.mod h1:mwsPRE8ceUUpiTgF7QmQIJ7lgsKUPQOUl3o72QBrE1o=
-github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.7.7 h1:3kGOqnh1pPeddVa/E37XNTaWJ8W6vrbYV9lJEkCnhuY=
-github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.7.7/go.mod h1:lyw7GFp3qENLh7kwzf7iMzAxDn+NzjXEAGjKS2UOKqI=
-github.com/aws/aws-sdk-go-v2/config v1.32.14 h1:opVIRo/ZbbI8OIqSOKmpFaY7IwfFUOCCXBsUpJOwDdI=
-github.com/aws/aws-sdk-go-v2/config v1.32.14/go.mod h1:U4/V0uKxh0Tl5sxmCBZ3AecYny4UNlVmObYjKuuaiOo=
-github.com/aws/aws-sdk-go-v2/credentials v1.19.14 h1:n+UcGWAIZHkXzYt87uMFBv/l8THYELoX6gVcUvgl6fI=
-github.com/aws/aws-sdk-go-v2/credentials v1.19.14/go.mod h1:cJKuyWB59Mqi0jM3nFYQRmnHVQIcgoxjEMAbLkpr62w=
-github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.18.21 h1:NUS3K4BTDArQqNu2ih7yeDLaS3bmHD0YndtA6UP884g=
-github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.18.21/go.mod h1:YWNWJQNjKigKY1RHVJCuupeWDrrHjRqHm0N9rdrWzYI=
-github.com/aws/aws-sdk-go-v2/internal/configsources v1.4.21 h1:Rgg6wvjjtX8bNHcvi9OnXWwcE0a2vGpbwmtICOsvcf4=
-github.com/aws/aws-sdk-go-v2/internal/configsources v1.4.21/go.mod h1:A/kJFst/nm//cyqonihbdpQZwiUhhzpqTsdbhDdRF9c=
-github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.7.21 h1:PEgGVtPoB6NTpPrBgqSE5hE/o47Ij9qk/SEZFbUOe9A=
-github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.7.21/go.mod h1:p+hz+PRAYlY3zcpJhPwXlLC4C+kqn70WIHwnzAfs6ps=
-github.com/aws/aws-sdk-go-v2/internal/ini v1.8.6 h1:qYQ4pzQ2Oz6WpQ8T3HvGHnZydA72MnLuFK9tJwmrbHw=
-github.com/aws/aws-sdk-go-v2/internal/ini v1.8.6/go.mod h1:O3h0IK87yXci+kg6flUKzJnWeziQUKciKrLjcatSNcY=
-github.com/aws/aws-sdk-go-v2/internal/v4a v1.4.21 h1:SwGMTMLIlvDNyhMteQ6r8IJSBPlRdXX5d4idhIGbkXA=
-github.com/aws/aws-sdk-go-v2/internal/v4a v1.4.21/go.mod h1:UUxgWxofmOdAMuqEsSppbDtGKLfR04HGsD0HXzvhI1k=
-github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.13.7 h1:5EniKhLZe4xzL7a+fU3C2tfUN4nWIqlLesfrjkuPFTY=
-github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.13.7/go.mod h1:x0nZssQ3qZSnIcePWLvcoFisRXJzcTVvYpAAdYX8+GI=
-github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.9.12 h1:qtJZ70afD3ISKWnoX3xB0J2otEqu3LqicRcDBqsj0hQ=
-github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.9.12/go.mod h1:v2pNpJbRNl4vEUWEh5ytQok0zACAKfdmKS51Hotc3pQ=
-github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.13.21 h1:c31//R3xgIJMSC8S6hEVq+38DcvUlgFY0FM6mSI5oto=
-github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.13.21/go.mod h1:r6+pf23ouCB718FUxaqzZdbpYFyDtehyZcmP5KL9FkA=
-github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.19.20 h1:siU1A6xjUZ2N8zjTHSXFhB9L/2OY8Dqs0xXiLjF30jA=
-github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.19.20/go.mod h1:4TLZCmVJDM3FOu5P5TJP0zOlu9zWgDWU7aUxWbr+rcw=
-github.com/aws/aws-sdk-go-v2/service/s3 v1.97.1 h1:csi9NLpFZXb9fxY7rS1xVzgPRGMt7MSNWeQ6eo247kE=
-github.com/aws/aws-sdk-go-v2/service/s3 v1.97.1/go.mod h1:qXVal5H0ChqXP63t6jze5LmFalc7+ZE7wOdLtZ0LCP0=
-github.com/aws/aws-sdk-go-v2/service/signin v1.0.9 h1:QKZH0S178gCmFEgst8hN0mCX1KxLgHBKKY/CLqwP8lg=
-github.com/aws/aws-sdk-go-v2/service/signin v1.0.9/go.mod h1:7yuQJoT+OoH8aqIxw9vwF+8KpvLZ8AWmvmUWHsGQZvI=
-github.com/aws/aws-sdk-go-v2/service/sso v1.30.15 h1:lFd1+ZSEYJZYvv9d6kXzhkZu07si3f+GQ1AaYwa2LUM=
-github.com/aws/aws-sdk-go-v2/service/sso v1.30.15/go.mod h1:WSvS1NLr7JaPunCXqpJnWk1Bjo7IxzZXrZi1QQCkuqM=
-github.com/aws/aws-sdk-go-v2/service/ssooidc v1.35.19 h1:dzztQ1YmfPrxdrOiuZRMF6fuOwWlWpD2StNLTceKpys=
-github.com/aws/aws-sdk-go-v2/service/ssooidc v1.35.19/go.mod h1:YO8TrYtFdl5w/4vmjL8zaBSsiNp3w0L1FfKVKenZT7w=
-github.com/aws/aws-sdk-go-v2/service/sts v1.41.10 h1:p8ogvvLugcR/zLBXTXrTkj0RYBUdErbMnAFFp12Lm/U=
-github.com/aws/aws-sdk-go-v2/service/sts v1.41.10/go.mod h1:60dv0eZJfeVXfbT1tFJinbHrDfSJ2GZl4Q//OSSNAVw=
-github.com/aws/smithy-go v1.24.2 h1:FzA3bu/nt/vDvmnkg+R8Xl46gmzEDam6mZ1hzmwXFng=
-github.com/aws/smithy-go v1.24.2/go.mod h1:YE2RhdIuDbA5E5bTdciG9KrW3+TiEONeUWCqxX9i1Fc=
+github.com/aws/aws-sdk-go-v2 v1.41.6 h1:1AX0AthnBQzMx1vbmir3Y4WsnJgiydmnJjiLu+LvXOg=
+github.com/aws/aws-sdk-go-v2 v1.41.6/go.mod h1:dy0UzBIfwSeot4grGvY1AqFWN5zgziMmWGzysDnHFcQ=
+github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.7.9 h1:adBsCIIpLbLmYnkQU+nAChU5yhVTvu5PerROm+/Kq2A=
+github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.7.9/go.mod h1:uOYhgfgThm/ZyAuJGNQ5YgNyOlYfqnGpTHXvk3cpykg=
+github.com/aws/aws-sdk-go-v2/config v1.32.16 h1:Q0iQ7quUgJP0F/SCRTieScnaMdXr9h/2+wze1u3cNeM=
+github.com/aws/aws-sdk-go-v2/config v1.32.16/go.mod h1:duCCnJEFqpt2RC6no1iK6q+8HpwOAkiUua0pY507dQc=
+github.com/aws/aws-sdk-go-v2/credentials v1.19.15 h1:fyvgWTszojq8hEnMi8PPBTvZdTtEVmAVyo+NFLHBhH4=
+github.com/aws/aws-sdk-go-v2/credentials v1.19.15/go.mod h1:gJiYyMOjNg8OEdRWOf3CrFQxM2a98qmrtjx1zuiQfB8=
+github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.18.22 h1:IOGsJ1xVWhsi+ZO7/NW8OuZZBtMJLZbk4P5HDjJO0jQ=
+github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.18.22/go.mod h1:b+hYdbU+jGKfXE8kKM6g1+h+L/Go3vMvzlxBsiuGsxg=
+github.com/aws/aws-sdk-go-v2/internal/configsources v1.4.22 h1:GmLa5Kw1ESqtFpXsx5MmC84QWa/ZrLZvlJGa2y+4kcQ=
+github.com/aws/aws-sdk-go-v2/internal/configsources v1.4.22/go.mod h1:6sW9iWm9DK9YRpRGga/qzrzNLgKpT2cIxb7Vo2eNOp0=
+github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.7.22 h1:dY4kWZiSaXIzxnKlj17nHnBcXXBfac6UlsAx2qL6XrU=
+github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.7.22/go.mod h1:KIpEUx0JuRZLO7U6cbV204cWAEco2iC3l061IxlwLtI=
+github.com/aws/aws-sdk-go-v2/internal/v4a v1.4.23 h1:FPXsW9+gMuIeKmz7j6ENWcWtBGTe1kH8r9thNt5Uxx4=
+github.com/aws/aws-sdk-go-v2/internal/v4a v1.4.23/go.mod h1:7J8iGMdRKk6lw2C+cMIphgAnT8uTwBwNOsGkyOCm80U=
+github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.13.8 h1:HtOTYcbVcGABLOVuPYaIihj6IlkqubBwFj10K5fxRek=
+github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.13.8/go.mod h1:VsK9abqQeGlzPgUr+isNWzPlK2vKe9INMLWnY65f5Xs=
+github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.9.14 h1:xnvDEnw+pnj5mctWiYuFbigrEzSm35x7k4KS/ZkCANg=
+github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.9.14/go.mod h1:yS5rNogD8e0Wu9+l3MUwr6eENBzEeGejvINpN5PAYfY=
+github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.13.22 h1:PUmZeJU6Y1Lbvt9WFuJ0ugUK2xn6hIWUBBbKuOWF30s=
+github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.13.22/go.mod h1:nO6egFBoAaoXze24a2C0NjQCvdpk8OueRoYimvEB9jo=
+github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.19.22 h1:SE+aQ4DEqG53RRCAIHlCf//B2ycxGH7jFkpnAh/kKPM=
+github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.19.22/go.mod h1:ES3ynECd7fYeJIL6+oax+uIEljmfps0S70BaQzbMd/o=
+github.com/aws/aws-sdk-go-v2/service/s3 v1.99.1 h1:kU/eBN5+MWNo/LcbNa4hWDdN76hdcd7hocU5kvu7IsU=
+github.com/aws/aws-sdk-go-v2/service/s3 v1.99.1/go.mod h1:Fw9aqhJicIVee1VytBBjH+l+5ov6/PhbtIK/u3rt/ls=
+github.com/aws/aws-sdk-go-v2/service/signin v1.0.10 h1:a1Fq/KXn75wSzoJaPQTgZO0wHGqE9mjFnylnqEPTchA=
+github.com/aws/aws-sdk-go-v2/service/signin v1.0.10/go.mod h1:p6+MXNxW7IA6dMgHfTAzljuwSKD0NCm/4lbS4t6+7vI=
+github.com/aws/aws-sdk-go-v2/service/sso v1.30.16 h1:x6bKbmDhsgSZwv6q19wY/u3rLk/3FGjJWyqKcIRufpE=
+github.com/aws/aws-sdk-go-v2/service/sso v1.30.16/go.mod h1:CudnEVKRtLn0+3uMV0yEXZ+YZOKnAtUJ5DmDhilVnIw=
+github.com/aws/aws-sdk-go-v2/service/ssooidc v1.35.20 h1:oK/njaL8GtyEihkWMD4k3VgHCT64RQKkZwh0DG5j8ak=
+github.com/aws/aws-sdk-go-v2/service/ssooidc v1.35.20/go.mod h1:JHs8/y1f3zY7U5WcuzoJ/yAYGYtNIVPKLIbp61euvmg=
+github.com/aws/aws-sdk-go-v2/service/sts v1.42.0 h1:ks8KBcZPh3PYISr5dAiXCM5/Thcuxk8l+PG4+A0exds=
+github.com/aws/aws-sdk-go-v2/service/sts v1.42.0/go.mod h1:pFw33T0WLvXU3rw1WBkpMlkgIn54eCB5FYLhjDc9Foo=
+github.com/aws/smithy-go v1.25.0 h1:Sz/XJ64rwuiKtB6j98nDIPyYrV1nVNJ4YU74gttcl5U=
+github.com/aws/smithy-go v1.25.0/go.mod h1:YE2RhdIuDbA5E5bTdciG9KrW3+TiEONeUWCqxX9i1Fc=
 github.com/aymanbagabas/go-osc52/v2 v2.0.1 h1:HwpRHbFMcZLEVr42D4p7XBqjyuxQH5SMiErDT4WkJ2k=
 github.com/aymanbagabas/go-osc52/v2 v2.0.1/go.mod h1:uYgXzlJ7ZpABp8OJ+exZzJJhRNQ2ASbcXHWsFqH8hp8=
 github.com/aymanbagabas/go-udiff v0.2.0 h1:TK0fH4MteXUDspT88n8CKzvK0X9O2xu9yQjWpi6yML8=
@@ -198,8 +196,8 @@ github.com/cloudflare/circl v1.6.1/go.mod h1:uddAzsPgqdMAYatqJ0lsjX1oECcQLIlRpzZ
 github.com/cncf/udpa/go v0.0.0-20191209042840-269d4d468f6f/go.mod h1:M8M6+tZqaGXZJjfX53e64911xZQV5JYwmTeXPW+k8Sc=
 github.com/containerd/cgroups v1.1.0 h1:v8rEWFl6EoqHB+swVNjVoCJE8o3jX7e8nqBGPLaDFBM=
 github.com/containerd/cgroups v1.1.0/go.mod h1:6ppBcbh/NOOUU+dMKrykgaBnK9lCIBxHqJDGwsa1mIw=
-github.com/containerd/containerd v1.7.30 h1:/2vezDpLDVGGmkUXmlNPLCCNKHJ5BbC5tJB5JNzQhqE=
-github.com/containerd/containerd v1.7.30/go.mod h1:fek494vwJClULlTpExsmOyKCMUAbuVjlFsJQc4/j44M=
+github.com/containerd/containerd v1.7.31 h1:jn3IMuTV4Bb1Uwb0MFPW2ASJAD3W1lh6QqqZHIZwDh4=
+github.com/containerd/containerd v1.7.31/go.mod h1:jdwD6s/BhV4XVJGrvtziNPVA+83n66TwptVaPKprq4E=
 github.com/containerd/continuity v0.4.4 h1:/fNVfTJ7wIl/YPMHjf+5H32uFhl63JucB34PlCpMKII=
 github.com/containerd/continuity v0.4.4/go.mod h1:/lNJvtJKUQStBzpVQ1+rasXO1LAWtUQssk28EZvJ3nE=
 github.com/containerd/errdefs v1.0.0 h1:tg5yIfIlQIrxYtu9ajqY42W3lpS19XqdxRQeEwYG8PI=
@@ -212,8 +210,8 @@ github.com/containerd/platforms v0.2.1 h1:zvwtM3rz2YHPQsF2CHYM8+KtB5dvhISiXh5ZpS
 github.com/containerd/platforms v0.2.1/go.mod h1:XHCb+2/hzowdiut9rkudds9bE5yJ7npe7dG/wG+uFPw=
 github.com/containerd/stargz-snapshotter/estargz v0.18.2 h1:yXkZFYIzz3eoLwlTUZKz2iQ4MrckBxJjkmD16ynUTrw=
 github.com/containerd/stargz-snapshotter/estargz v0.18.2/go.mod h1:XyVU5tcJ3PRpkA9XS2T5us6Eg35yM0214Y+wvrZTBrY=
-github.com/coreos/go-oidc/v3 v3.17.0 h1:hWBGaQfbi0iVviX4ibC7bk8OKT5qNr4klBaCHVNvehc=
-github.com/coreos/go-oidc/v3 v3.17.0/go.mod h1:wqPbKFrVnE90vty060SB40FCJ8fTHTxSwyXJqZH+sI8=
+github.com/coreos/go-oidc/v3 v3.18.0 h1:V9orjXynvu5wiC9SemFTWnG4F45v403aIcjWo0d41+A=
+github.com/coreos/go-oidc/v3 v3.18.0/go.mod h1:DYCf24+ncYi+XkIH97GY1+dqoRlbaSI26KVTCI9SrY4=
 github.com/coreos/go-systemd v0.0.0-20181012123002-c6f51f82210d/go.mod h1:F5haX7vjVVG0kc13fIWeqUViNPyEJxv/OmvnBo0Yme4=
 github.com/coreos/go-systemd/v22 v22.5.0/go.mod h1:Y58oyj3AT4RCenI/lSvhwexgC+NSVTIJ3seZv2GcEnc=
 github.com/cpuguy83/dockercfg v0.3.2 h1:DlJTyZGBDlXqUZ2Dk2Q3xHs/FtnooJJVaad2S9GKorA=
@@ -336,8 +334,8 @@ github.com/go-gl/gl v0.0.0-20231021071112-07e5d0ea2e71 h1:5BVwOaUSBTlVZowGO6VZGw
 github.com/go-gl/gl v0.0.0-20231021071112-07e5d0ea2e71/go.mod h1:9YTyiznxEY1fVinfM7RvRcjRHbw2xLBJ3AAGIT0I4Nw=
 github.com/go-gl/glfw/v3.3/glfw v0.0.0-20240506104042-037f3cc74f2a h1:vxnBhFDDT+xzxf1jTJKMKZw3H0swfWk9RpWbBbDK5+0=
 github.com/go-gl/glfw/v3.3/glfw v0.0.0-20240506104042-037f3cc74f2a/go.mod h1:tQ2UAYgL5IevRw8kRxooKSPJfGvJ9fJQFa0TUsXzTg8=
-github.com/go-jose/go-jose/v4 v4.1.3 h1:CVLmWDhDVRa6Mi/IgCgaopNosCaHz7zrMeF9MlZRkrs=
-github.com/go-jose/go-jose/v4 v4.1.3/go.mod h1:x4oUasVrzR7071A4TnHLGSPpNOm2a21K9Kf04k1rs08=
+github.com/go-jose/go-jose/v4 v4.1.4 h1:moDMcTHmvE6Groj34emNPLs/qtYXRVcd6S7NHbHz3kA=
+github.com/go-jose/go-jose/v4 v4.1.4/go.mod h1:x4oUasVrzR7071A4TnHLGSPpNOm2a21K9Kf04k1rs08=
 github.com/go-logr/logr v1.2.2/go.mod h1:jdQByPbusPIv2/zmleS9BjJVeZ6kBagPoEUsqbVz/1A=
 github.com/go-logr/logr v1.4.3 h1:CjnDlHq8ikf6E492q6eKboGOC0T8CDaOvkHCIg8idEI=
 github.com/go-logr/logr v1.4.3/go.mod h1:9T104GzyrTigFIr8wt5mBrctHMim0Nb2HLGrmQ40KvY=
@@ -385,8 +383,8 @@ github.com/gofrs/flock v0.13.0/go.mod h1:jxeyy9R1auM5S6JYDBhDt+E2TCo7DkratH4Pgi8
 github.com/gogo/protobuf v1.1.1/go.mod h1:r8qH/GZQm5c6nD/R0oafs1akxWv10x8SbQlK7atdtwQ=
 github.com/gogo/protobuf v1.3.2 h1:Ov1cvc58UF3b5XjBnZv7+opcTcQFZebYjWzi34vdm4Q=
 github.com/gogo/protobuf v1.3.2/go.mod h1:P1XiOD3dCwIKUDQYPy72D8LYyHL2YPYrpS2s69NZV8Q=
-github.com/golang-jwt/jwt/v5 v5.3.0 h1:pv4AsKCKKZuqlgs5sUmn4x8UlGa0kEVt/puTpKx9vvo=
-github.com/golang-jwt/jwt/v5 v5.3.0/go.mod h1:fxCRLWMO43lRc8nhHWY6LGqRcf+1gQWArsqaEUEa5bE=
+github.com/golang-jwt/jwt/v5 v5.3.1 h1:kYf81DTWFe7t+1VvL7eS+jKFVWaUnK9cB1qbwn63YCY=
+github.com/golang-jwt/jwt/v5 v5.3.1/go.mod h1:fxCRLWMO43lRc8nhHWY6LGqRcf+1gQWArsqaEUEa5bE=
 github.com/golang/glog v0.0.0-20160126235308-23def4e6c14b/go.mod h1:SBH7ygxi8pfUlaOkMMuAQtPIUF8ecWP5IEl/CR7VP2Q=
 github.com/golang/groupcache v0.0.0-20200121045136-8c9f03a8e57e/go.mod h1:cIg4eruTrX1D+g88fzRXU5OdNfaM+9IcxsU14FzY7Hc=
 github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da/go.mod h1:cIg4eruTrX1D+g88fzRXU5OdNfaM+9IcxsU14FzY7Hc=
@@ -691,8 +689,8 @@ github.com/moby/sys/userns v0.1.0 h1:tVLXkFOxVu9A64/yh59slHVv9ahO9UIev4JZusOLG/g
 github.com/moby/sys/userns v0.1.0/go.mod h1:IHUYgu/kao6N8YZlp9Cf444ySSvCmDlmzUcYfDHOl28=
 github.com/moby/term v0.5.2 h1:6qk3FJAFDs6i/q3W/pQ97SX192qKfZgGjCQqfCJkgzQ=
 github.com/moby/term v0.5.2/go.mod h1:d3djjFCrjnB+fl8NJux+EJzu0msscUP+f8it8hPkFLc=
-github.com/modelcontextprotocol/go-sdk v1.4.1 h1:M4x9GyIPj+HoIlHNGpK2hq5o3BFhC+78PkEaldQRphc=
-github.com/modelcontextprotocol/go-sdk v1.4.1/go.mod h1:Bo/mS87hPQqHSRkMv4dQq1XCu6zv4INdXnFZabkNU6s=
+github.com/modelcontextprotocol/go-sdk v1.5.0 h1:CHU0FIX9kpueNkxuYtfYQn1Z0slhFzBZuq+x6IiblIU=
+github.com/modelcontextprotocol/go-sdk v1.5.0/go.mod h1:gggDIhoemhWs3BGkGwd1umzEXCEMMvAnhTrnbXJKKKA=
 github.com/modern-go/concurrent v0.0.0-20180228061459-e0a39a4cb421/go.mod h1:6dJC0mAP4ikYIbvyc7fijjWJddQyLn8Ig3JB5CqoB9Q=
 github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd h1:TRLaZ9cD/w8PVh93nsPXa1VrQ6jlwL5oN8l14QlcNfg=
 github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd/go.mod h1:6dJC0mAP4ikYIbvyc7fijjWJddQyLn8Ig3JB5CqoB9Q=
--- a/pkg/system/capabilities_test.go
+++ b/pkg/system/capabilities_test.go
@@ -159,7 +159,6 @@ var _ = Describe("CapabilityFilterDisabled", func() {
 		os.Setenv(capabilityEnv, "disable")
 		s := &SystemState{}
 		Expect(s.IsBackendCompatible("cuda12-whisperx", "quay.io/nvidia-cuda-12")).To(BeTrue())
-		Expect(s.IsBackendCompatible("rocm-whisperx", "quay.io/rocm")).To(BeTrue())
 		Expect(s.IsBackendCompatible("metal-whisperx", "quay.io/metal-darwin")).To(BeTrue())
 		Expect(s.IsBackendCompatible("intel-whisperx", "quay.io/intel-sycl")).To(BeTrue())
 		Expect(s.IsBackendCompatible("cpu-whisperx", "quay.io/cpu")).To(BeTrue())
--- a/swagger/docs.go
+++ b/swagger/docs.go
@@ -985,13 +985,11 @@ const docTemplate = `{
                "summary": "Backend monitor endpoint",
                "parameters": [
                    {
-                        "description": "Backend statistics request",
-                        "name": "request",
-                        "in": "body",
-                        "required": true,
-                        "schema": {
-                            "$ref": "#/definitions/schema.BackendMonitorRequest"
-                        }
+                        "type": "string",
+                        "description": "Name of the model to monitor",
+                        "name": "model",
+                        "in": "query",
+                        "required": true
                    }
                ],
                "responses": {
@@ -2408,6 +2406,23 @@ const docTemplate = `{
                }
            }
        },
+        "gallery.NodeDriftInfo": {
+            "type": "object",
+            "properties": {
+                "digest": {
+                    "type": "string"
+                },
+                "node_id": {
+                    "type": "string"
+                },
+                "node_name": {
+                    "type": "string"
+                },
+                "version": {
+                    "type": "string"
+                }
+            }
+        },
        "gallery.UpgradeInfo": {
            "type": "object",
            "properties": {
@@ -2425,6 +2440,13 @@ const docTemplate = `{
                },
                "installed_version": {
                    "type": "string"
+                },
+                "node_drift": {
+                    "description": "NodeDrift lists nodes whose installed version or digest differs from\nthe cluster majority. Non-empty means the cluster has diverged and an\nupgrade will realign it. Empty in single-node mode.",
+                    "type": "array",
+                    "items": {
+                        "$ref": "#/definitions/gallery.NodeDriftInfo"
+                    }
                }
            }
        },
--- a/swagger/swagger.json
+++ b/swagger/swagger.json
@@ -982,13 +982,11 @@
                "summary": "Backend monitor endpoint",
                "parameters": [
                    {
-                        "description": "Backend statistics request",
-                        "name": "request",
-                        "in": "body",
-                        "required": true,
-                        "schema": {
-                            "$ref": "#/definitions/schema.BackendMonitorRequest"
-                        }
+                        "type": "string",
+                        "description": "Name of the model to monitor",
+                        "name": "model",
+                        "in": "query",
+                        "required": true
                    }
                ],
                "responses": {
@@ -2405,6 +2403,23 @@
                }
            }
        },
+        "gallery.NodeDriftInfo": {
+            "type": "object",
+            "properties": {
+                "digest": {
+                    "type": "string"
+                },
+                "node_id": {
+                    "type": "string"
+                },
+                "node_name": {
+                    "type": "string"
+                },
+                "version": {
+                    "type": "string"
+                }
+            }
+        },
        "gallery.UpgradeInfo": {
            "type": "object",
            "properties": {
@@ -2422,6 +2437,13 @@
                },
                "installed_version": {
                    "type": "string"
+                },
+                "node_drift": {
+                    "description": "NodeDrift lists nodes whose installed version or digest differs from\nthe cluster majority. Non-empty means the cluster has diverged and an\nupgrade will realign it. Empty in single-node mode.",
+                    "type": "array",
+                    "items": {
+                        "$ref": "#/definitions/gallery.NodeDriftInfo"
+                    }
                }
            }
        },
--- a/swagger/swagger.yaml
+++ b/swagger/swagger.yaml
@@ -157,6 +157,17 @@ definitions:
          type: string
        type: array
    type: object
+  gallery.NodeDriftInfo:
+    properties:
+      digest:
+        type: string
+      node_id:
+        type: string
+      node_name:
+        type: string
+      version:
+        type: string
+    type: object
  gallery.UpgradeInfo:
    properties:
      available_digest:
@@ -169,6 +180,14 @@ definitions:
        type: string
      installed_version:
        type: string
+      node_drift:
+        description: |-
+          NodeDrift lists nodes whose installed version or digest differs from
+          the cluster majority. Non-empty means the cluster has diverged and an
+          upgrade will realign it. Empty in single-node mode.
+        items:
+          $ref: '#/definitions/gallery.NodeDriftInfo'
+        type: array
    type: object
  galleryop.OpStatus:
    properties:
@@ -2363,12 +2382,11 @@ paths:
  /backend/monitor:
    get:
      parameters:
-      - description: Backend statistics request
-        in: body
-        name: request
+      - description: Name of the model to monitor
+        in: query
+        name: model
        required: true
-        schema:
-          $ref: '#/definitions/schema.BackendMonitorRequest'
+        type: string
      responses:
        "200":
          description: Response
--- a/tests/e2e/distributed/node_lifecycle_test.go
+++ b/tests/e2e/distributed/node_lifecycle_test.go
@@ -57,7 +57,7 @@ var _ = Describe("Node Backend Lifecycle (NATS-driven)", Label("Distributed"), f
 			FlushNATS(infra.NC)

 			adapter := nodes.NewRemoteUnloaderAdapter(registry, infra.NC)
-			installReply, err := adapter.InstallBackend(node.ID, "llama-cpp", "", "")
+			installReply, err := adapter.InstallBackend(node.ID, "llama-cpp", "", "", "", "", "")
 			Expect(err).ToNot(HaveOccurred())
 			Expect(installReply.Success).To(BeTrue())
 		})
@@ -78,7 +78,7 @@ var _ = Describe("Node Backend Lifecycle (NATS-driven)", Label("Distributed"), f
 			FlushNATS(infra.NC)

 			adapter := nodes.NewRemoteUnloaderAdapter(registry, infra.NC)
-			installReply, err := adapter.InstallBackend(node.ID, "nonexistent", "", "")
+			installReply, err := adapter.InstallBackend(node.ID, "nonexistent", "", "", "", "", "")
 			Expect(err).ToNot(HaveOccurred())
 			Expect(installReply.Success).To(BeFalse())
 			Expect(installReply.Error).To(ContainSubstring("backend not found"))